Data Lake

A Data Lake is a central repository for data coming from various sources, where the emphasis is on storing data rapidly and at low cost, at the expense of a well-defined structure.

A wide variety of data can be stored in a Data Lake: structured data (such as the rows and columns of a classical RDBMS), semi-structured data (CSV, XML and JSON files), and unstructured data (images, videos, emails, web pages…).

In a Data Lake, the data is stored in its raw format, untouched, which keeps it flexible for later use. Data Lakes are, in general, a solid basis for data preparation, reporting, visualization, in-depth analysis, data science and machine learning.
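
The idea of storing data untouched and applying structure only when it is consumed can be illustrated with a minimal sketch. It assumes a local directory standing in for the lake's storage; the `raw/<source>/` layout, the function names `ingest_raw`, `read_orders` and `demo`, and the sample records are all hypothetical and chosen only for illustration.

```python
from __future__ import annotations

import json
import shutil
import tempfile
from pathlib import Path


def ingest_raw(lake_root: Path, source: str, filename: str, payload: bytes) -> Path:
    """Store the payload byte-for-byte under <lake_root>/raw/<source>/ (hypothetical layout)."""
    target_dir = lake_root / "raw" / source
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    target.write_bytes(payload)  # no parsing, no validation: the data stays untouched
    return target


def read_orders(lake_root: Path) -> list[dict]:
    """Apply a structure at read time only, keeping the fields this consumer needs."""
    rows = []
    for path in sorted((lake_root / "raw" / "orders").glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        rows.append({"id": int(record["id"]), "amount": float(record["amount"])})
    return rows


def demo() -> list[dict]:
    """Ingest a few heterogeneous files into a temporary lake, then read the orders back."""
    lake_root = Path(tempfile.mkdtemp(prefix="lake_"))
    try:
        # Semi-structured records whose fields differ from one file to the next.
        ingest_raw(lake_root, "orders", "order-1.json", b'{"id": "1", "amount": "19.90", "note": "gift"}')
        ingest_raw(lake_root, "orders", "order-2.json", b'{"id": "2", "amount": "5"}')
        # Unstructured data lands next to it without any special handling.
        ingest_raw(lake_root, "images", "logo.png", b"\x89PNG\r\n\x1a\n")
        return read_orders(lake_root)
    finally:
        shutil.rmtree(lake_root)  # clean up the temporary directory


if __name__ == "__main__":
    print(demo())
```

Ingestion here is cheap because nothing is parsed on write. The cost of interpreting the data is deferred to each reader, which is the trade-off described above.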

Learn more
Wikipedia

Related articles

CDP part 6: end-to-end data lakehouse ingestion pipeline with CDP

Categories: Big Data, Data Engineering, Learning | Tags: NiFi, Business intelligence, Data Engineering, Iceberg, Spark, Big Data, Cloudera, CDP, Data Analytics, Data Lake, Data Warehouse

In this hands-on lab session we demonstrate how to build an end-to-end big data solution with Cloudera Data Platform (CDP) Public Cloud, using the infrastructure we have deployed and configured over…

By Tobias CHAVARRIA

Jul 24, 2023

Data platform requirements and expectations

Categories: Big Data, Infrastructure | Tags: Data Engineering, Data Governance, Data Analytics, Data Hub, Data Lake, Data lakehouse, Data Science

A big data platform is a complex and sophisticated system that enables organizations to store, process, and analyze large volumes of data from a variety of sources. It is composed of several…

By David WORMS

Mar 23, 2023

Ceph object storage within a Kubernetes cluster with Rook

Categories: Big Data, Data Governance, Learning | Tags: Amazon S3, Big Data, Ceph, Cluster, Data Lake, Kubernetes, Storage

Ceph is a distributed all-in-one storage system. Reliable and mature, its first stable version was released in 2012 and has since then been the reference for open source storage. Ceph’s main perk is…

By Luka BIGOT

Aug 4, 2022

MinIO object storage within a Kubernetes cluster

Categories: Big Data, Data Governance, Learning | Tags: Amazon S3, Big Data, Cluster, Data Lake, Kubernetes, Storage

MinIO is a popular object storage solution. Often recommended for its simple setup and ease of use, it is not only a great way to get started with object storage: it also provides excellent…

By Luka BIGOT

Jul 9, 2022

Architecture of object-based storage and S3 standard specifications

Categories: Big Data, Data Governance | Tags: Database, API, Amazon S3, Big Data, Data Lake, Storage

Object storage has been growing in popularity among data storage architectures. Compared to file systems and block storage, object storage faces no limitations when handling petabytes of data. By…

By Luka BIGOT

Jun 20, 2022

Comparison of database architectures: data warehouse, data lake and data lakehouse

Categories: Big Data, Data Engineering | Tags: Data Governance, Infrastructure, Iceberg, Parquet, Spark, Data Lake, Data lakehouse, Data Warehouse, File Format

Database architectures have experienced constant innovation, evolving with the appearance of new use cases, technical constraints, and requirements. From the three database structures we are comparing…

By Gonzalo ETSE

May 17, 2022

An overview of Cloudera Data Platform (CDP)

Categories: Big Data, Cloud Computing, Data Engineering | Tags: SDX, Big Data, Cloud, Cloudera, CDP, CDH, Data Analytics, Data Hub, Data Lake, Data lakehouse, Data Warehouse

Cloudera Data Platform (CDP) is a cloud computing platform for businesses. It provides integrated and multifunctional self-service tools in order to analyze and centralize data. It brings security and…

By Alexander HOFFMANN

Jul 19, 2021

Self-Paced training from Databricks: a guide to self-enablement on Big Data & AI

Categories: Data Engineering, Learning | Tags: Cloud, Data Lake, Databricks, Delta Lake, MLflow

Self-paced training courses are offered by Databricks as part of their Academy program. The price is $2000 USD for unlimited access to the training courses for a period of 1 year, but also free for customers…

By Anna KNYAZEVA

May 26, 2021

Storage size and generation time in popular file formats

Categories: Data Engineering, Data Science | Tags: Avro, HDFS, Hive, ORC, Parquet, Big Data, Data Lake, File Format, JavaScript Object Notation (JSON)

Choosing an appropriate file format is essential, whether your data transits on the wire or is stored at rest. Each file format comes with its own advantages and disadvantages. We covered them in a…

By Barthelemy NGOM

Mar 22, 2021

Connecting to ADLS Gen2 from Hadoop (HDP) and Nifi (HDF)

Categories: Big Data, Cloud Computing, Data Engineering | Tags: NiFi, Hadoop, HDFS, Authentication, Authorization, Azure, Azure Data Lake Storage (ADLS), OAuth2

As data projects built in the Cloud are becoming more and more frequent, a common use case is to interact with Cloud storage from an existing on-premises Big Data platform. Microsoft Azure recently…

By Gauthier LEONARD

Nov 5, 2020

Download datasets into HDFS and Hive

Categories: Big Data, Data Engineering | Tags: Business intelligence, Data Engineering, Data structures, Database, Hadoop, HDFS, Hive, Big Data, Data Analytics, Data Lake, Data lakehouse, Data Warehouse

Nowadays, the analysis of large amounts of data is becoming more and more possible thanks to Big Data technologies (Hadoop, Spark, …). This explains the explosion of the data volume and the…

By Aida NGOM

Jul 31, 2020

Snowflake, the Data Warehouse for the Cloud, introduction and tutorial

Categories: Business Intelligence, Cloud Computing | Tags: Cloud, Data Lake, Data Science, Data Warehouse, Snowflake

Snowflake is a SaaS-based data-warehousing platform that centralizes, in the cloud, the storage and processing of structured and semi-structured data. The increasing generation of data produced over…

By Jules HAMELIN-BOYER

Apr 7, 2020

Cloudera CDP and Cloud migration of your Data Warehouse

Categories: Big Data, Cloud Computing | Tags: Azure, Cloudera, Data Hub, Data Lake, Data Warehouse

While one of our customers is anticipating a move to the Cloud, and with the recent announcement of Cloudera CDP availability in mid-September during the Strata conference, it seems like the appropriate…

By David WORMS

Dec 16, 2019

Innovation, project vs product culture in Data Science

Categories: Data Science, Data Governance | Tags: DevOps, Agile, Scrum

Data Science carries the jobs of tomorrow. It is closely linked to the understanding of the business use cases, the behaviors and the insights that will be extracted from existing data. The stakes are…

By David WORMS

Oct 8, 2019

Data Lake ingestion best practices

Categories: Big Data, Data Engineering | Tags: NiFi, Data Governance, HDF, Operation, Avro, Hive, ORC, Spark, Data Lake, File Format, Protocol Buffers, Registry, Schema

Creating a Data Lake requires rigor and experience. Here are some good practices around data ingestion both for batch and stream architectures that we recommend and implement with our customers…

By David WORMS

Jun 18, 2018

Oracle and Hive, how data are published?

Categories: Big Data | Tags: Oracle, Hive, Sqoop, Data Lake

In the past few days, I’ve published 3 related articles: a first one covering the option to integrate Oracle and Hadoop, a second one explaining how to install and use the Oracle SQL Connector with…

By David WORMS

Jul 6, 2013

Introduction to OpenLineage

Categories: Big Data, Data Governance, Infrastructure | Tags: Data Engineering, Infrastructure, Atlas, Data Lake, Data lakehouse, Data Warehouse, Data lineage

OpenLineage is an open-source specification for data lineage. The specification is complemented by Marquez, its reference implementation. Since its launch in late 2020, OpenLineage has been a presence…

By Christophe PARREIRA

Dec 19, 2023
