Data Engineering

Data is the energy that fuels digital transformation. Developers consume it in their applications. Data Analysts explore, query, and share it. Data Scientists feed their algorithms with it. Data Engineers are responsible for building the value chain that covers the collection, cleansing, enrichment, and publication of data.

Handling scale, guaranteeing data security and integrity, tolerating failures, processing data in batches or as continuous streams, validating schemas, publishing APIs, and selecting the formats, models, and databases appropriate to how the data is exposed are all responsibilities of the Data Engineer. The trust and the success of those who consume and exploit the data follow from this work.

Tags: Streaming, IoT, Governance, Spark, Data Science, Kafka, Data Mining, ETL, Batch, Flink, Data Quality

Articles related to Data Engineering

CDP part 6: end-to-end data lakehouse ingestion pipeline with CDP

Categories: Big Data, Data Engineering, Learning | Tags: NiFi, Business intelligence, Data Engineering, Iceberg, Spark, Big Data, Cloudera, CDP, Data Analytics, Data Lake, Data Warehouse

In this hands-on lab session we demonstrate how to build an end-to-end big data solution with Cloudera Data Platform (CDP) Public Cloud, using the infrastructure we have deployed and configured over…

By Tobias CHAVARRIA

Jul 24, 2023

CDP part 1: introduction to end-to-end data lakehouse architecture with CDP

Categories: Cloud Computing, Data Engineering, Infrastructure | Tags: Data Engineering, Hortonworks, Iceberg, AWS, Azure, Big Data, Cloud, Cloudera, CDP, Cloudera Manager, Data Warehouse

Cloudera Data Platform (CDP) is a hybrid data platform for big data transformation, machine learning and data analytics. In this series we describe how to build and use an end-to-end big data…

By Stephan BAUM

Jun 8, 2023

Keycloak deployment in EC2

Categories: Cloud Computing, Data Engineering, Infrastructure | Tags: Security, EC2, Authentication, AWS, Docker, Keycloak, SSL/TLS, SSO

Why use Keycloak? Keycloak is an open-source identity provider (IdP) using single sign-on (SSO). An IdP is a tool to create, maintain, and manage identity information for principals and to provide…

By Stephan BAUM

Mar 14, 2023

Big data infrastructure internship

Categories: Big Data, Data Engineering, DevOps & SRE, Infrastructure | Tags: Infrastructure, Hadoop, Big Data, Cluster, Internship, Kubernetes, TDP

Job description: Big Data and distributed computing are at the core of Adaltas. We accompany our partners in the deployment, maintenance, and optimization of some of the largest clusters in France…

By Stephan BAUM

Dec 2, 2022

Comparison of database architectures: data warehouse, data lake and data lakehouse

Categories: Big Data, Data Engineering | Tags: Data Governance, Infrastructure, Iceberg, Parquet, Spark, Data Lake, Data lakehouse, Data Warehouse, File Format

Database architectures have experienced constant innovation, evolving with the appearance of new use cases, technical constraints, and requirements. From the three database structures we are comparing…

By Gonzalo ETSE

May 17, 2022

Databricks logs collection with Azure Monitor at a Workspace Scale

Categories: Cloud Computing, Data Engineering, Adaltas Summit 2021 | Tags: Metrics, Monitoring, Spark, Azure, Databricks, Log4j

Databricks is an optimized data analytics platform based on Apache Spark. Monitoring the Databricks platform is crucial to ensure data quality and job performance, and to address security issues by limiting access to…

By Claire PLAYE

May 10, 2022

An overview of Cloudera Data Platform (CDP)

Categories: Big Data, Cloud Computing, Data Engineering | Tags: SDX, Big Data, Cloud, Cloudera, CDP, CDH, Data Analytics, Data Hub, Data Lake, Data lakehouse, Data Warehouse

Cloudera Data Platform (CDP) is a cloud computing platform for businesses. It provides integrated and multifunctional self-service tools to analyze and centralize data. It brings security and…

By Alexander HOFFMANN

Jul 19, 2021

Self-Paced training from Databricks: a guide to self-enablement on Big Data & AI

Categories: Data Engineering, Learning | Tags: Cloud, Data Lake, Databricks, Delta Lake, MLflow

Databricks offers self-paced trainings within its Academy program. The price is $2,000 USD for unlimited access to the training courses for one year; access is also free for customers…

By Anna KNYAZEVA

May 26, 2021

Find your way into data related Microsoft Azure certifications

Categories: Cloud Computing, Data Engineering | Tags: Data Governance, Azure, Data Science

Microsoft Azure has certification paths for many technical job roles such as developer, Data Engineer, Data Scientist, and solution architect, among others. Each of these certifications consists of…

By Barthelemy NGOM

Apr 14, 2021

Apache Liminal: when MLOps meets GitOps

Categories: Big Data, Containers Orchestration, Data Engineering, Data Science, Tech Radar | Tags: Data Engineering, CI/CD, Data Science, Deep Learning, Deployment, Docker, GitOps, Kubernetes, Machine Learning, MLOps, Open source, Python, TensorFlow

Apache Liminal is open-source software offering a solution to deploy end-to-end Machine Learning pipelines. It centralizes all the steps needed to construct Machine Learning…

By Aargan COINTEPAS

Mar 31, 2021

Storage size and generation time in popular file formats

Categories: Data Engineering, Data Science | Tags: Avro, HDFS, Hive, ORC, Parquet, Big Data, Data Lake, File Format, JavaScript Object Notation (JSON)

Choosing an appropriate file format is essential, whether your data transits on the wire or is stored at rest. Each file format comes with its own advantages and disadvantages. We covered them in a…

By Barthelemy NGOM

Mar 22, 2021

TensorFlow Extended (TFX): the components and their functionalities

Categories: Big Data, Data Engineering, Data Science, Learning | Tags: Beam, Data Engineering, Pipeline, CI/CD, Data Science, Deep Learning, Deployment, Machine Learning, MLOps, Open source, Python, TensorFlow

Putting Machine Learning (ML) and Deep Learning (DL) models in production certainly is a difficult task. It has been recognized as more failure-prone and time-consuming than the modeling itself, yet…

Connecting to ADLS Gen2 from Hadoop (HDP) and NiFi (HDF)

Categories: Big Data, Cloud Computing, Data Engineering | Tags: NiFi, Hadoop, HDFS, Authentication, Authorization, Azure, Azure Data Lake Storage (ADLS), OAuth2

As data projects built in the Cloud are becoming more and more frequent, a common use case is to interact with Cloud storage from an existing on-premises Big Data platform. Microsoft Azure recently…

By Gauthier LEONARD

Nov 5, 2020

Experiment tracking with MLflow on Databricks Community Edition

Categories: Data Engineering, Data Science, Learning | Tags: Spark, Databricks, Deep Learning, Delta Lake, Machine Learning, MLflow, Notebook, Python, Scikit-learn

Introduction to Databricks Community Edition and MLflow: Every day, the number of tools helping Data Scientists build models faster increases. Consequently, the need to manage the results and the…

By Petra KAFERLE DEVISSCHERE

Sep 10, 2020

Download datasets into HDFS and Hive

Categories: Big Data, Data Engineering | Tags: Business intelligence, Data Engineering, Data structures, Database, Hadoop, HDFS, Hive, Big Data, Data Analytics, Data Lake, Data lakehouse, Data Warehouse

Introduction: Nowadays, the analysis of large amounts of data is becoming more and more feasible thanks to Big Data technologies (Hadoop, Spark,…). This explains the explosion of data volume and the…

By Aida NGOM

Jul 31, 2020

Comparison of different file formats in Big Data

Categories: Big Data, Data Engineering | Tags: Business intelligence, Data structures, Avro, HDFS, ORC, Parquet, Batch processing, Big Data, CSV, JavaScript Object Notation (JSON), Kubernetes, Protocol Buffers

In data processing, there are different file formats to store your data sets. Each format has its own pros and cons depending upon the use cases and exists to serve one or several purposes…

By Aida NGOM

Jul 23, 2020
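As a quick illustration of the kind of comparison developed in the article above, here is a minimal sketch in Python that writes the same small, illustrative dataset in CSV, JSON, and Parquet and prints the resulting file sizes. It assumes pandas is installed, with pyarrow available for Parquet support; actual sizes depend on the data and compression settings.

```python
import os
import pandas as pd

# A small illustrative dataset; real comparisons need realistic volumes.
df = pd.DataFrame({
    "id": range(100_000),
    "value": [x * 0.5 for x in range(100_000)],
})

df.to_csv("data.csv", index=False)          # plain text, row-oriented
df.to_json("data.json", orient="records")   # verbose, self-describing
df.to_parquet("data.parquet")               # columnar, compressed (needs pyarrow)

for path in ("data.csv", "data.json", "data.parquet"):
    print(path, os.path.getsize(path), "bytes")
```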

Importing data to Databricks: external tables and Delta Lake

Categories: Data Engineering, Data Science, Learning | Tags: Parquet, AWS, Amazon S3, Azure Data Lake Storage (ADLS), Databricks, Delta Lake, Python

During a Machine Learning project we need to keep track of the training data we are using. This is important for audit purposes and for assessing the performance of the models, developed at a later…
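As a small illustration of the Delta Lake workflow the article introduces, here is a minimal sketch writing and re-reading a Delta table with PySpark. The storage path is hypothetical, and Delta support is assumed from the runtime (a Databricks cluster or Spark with the delta-spark package configured).

```python
from pyspark.sql import SparkSession

# On Databricks, a `spark` session already exists in notebooks.
spark = SparkSession.builder.appName("delta-example").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# Persist the training data as a Delta table (hypothetical path).
df.write.format("delta").mode("overwrite").save("/mnt/datalake/training_data")

# Read it back later, e.g. for an audit or model evaluation.
loaded = spark.read.format("delta").load("/mnt/datalake/training_data")
loaded.show()
```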

Optimization of Spark applications in Hadoop YARN

Categories: Data Engineering, Learning | Tags: Tuning, Hadoop, Spark, Python

Apache Spark is an in-memory data processing tool widely used in companies to deal with Big Data issues. Running a Spark application in production requires user-defined resources. This article…

By Ferdinand DE BAECQUE

Mar 30, 2020
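To give a flavor of the user-defined resources the article discusses, here is a minimal sketch of a PySpark session configured for YARN. The values are illustrative, not a recommendation, and the setup assumes HADOOP_CONF_DIR points to a valid cluster configuration.

```python
from pyspark.sql import SparkSession

# Illustrative resource settings; tune them to your cluster and workload.
spark = (
    SparkSession.builder
    .appName("resource-tuning-example")
    .master("yarn")
    .config("spark.executor.instances", "4")   # number of executors
    .config("spark.executor.cores", "2")       # cores per executor
    .config("spark.executor.memory", "4g")     # heap memory per executor
    .config("spark.driver.memory", "2g")       # driver heap memory
    .getOrCreate()
)
```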

MLflow tutorial: an open source Machine Learning (ML) platform

Categories: Data Engineering, Data Science, Learning | Tags: AWS, Azure, Databricks, Deep Learning, Deployment, Machine Learning, MLflow, MLOps, Python, Scikit-learn

Introduction and principles of MLflow: With increasingly cheap computing power and storage, and at the same time increasing data collection in all walks of life, many companies have integrated Data Science…
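For a flavor of the tracking API the tutorial covers, here is a minimal sketch logging a parameter and a metric with MLflow; the run name, parameter, and metric values are illustrative.

```python
import mlflow

# Record one training run; results land in the local ./mlruns store by default.
with mlflow.start_run(run_name="example"):
    mlflow.log_param("alpha", 0.5)    # an illustrative hyperparameter
    mlflow.log_metric("rmse", 0.27)   # an illustrative evaluation result
```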

Logstash pipelines remote configuration and self-indexing

Categories: Data Engineering, Infrastructure | Tags: Docker, Elasticsearch, Kibana, Logstash, Log4j

Logstash is a powerful data collection engine that integrates into the Elastic Stack (Elasticsearch - Logstash - Kibana). The goal of this article is to show you how to deploy a fully managed Logstash…

By Paul-Adrien CORDONNIER

Dec 13, 2019

Internship Data Science & Data Engineer - ML in production and streaming data ingestion

Categories: Data Engineering, Data Science | Tags: DevOps, Flink, Hadoop, HBase, Kafka, Spark, Internship, Kubernetes, Python

Context: The exponential evolution of data has turned the industry upside down by redefining data storage, processing, and data ingestion pipelines. Mastering these methods considerably facilitates…

By David WORMS

Nov 26, 2019

Insert rows in BigQuery tables with complex columns

Categories: Cloud Computing, Data Engineering | Tags: GCP, BigQuery, Schema, SQL

Google’s BigQuery is a cloud data warehousing system designed to process enormous volumes of data, with several features available. Out of all those features, let’s talk about the support of Struct…

By César BEREZOWSKI

Nov 22, 2019
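As a taste of what the article covers, here is a minimal sketch inserting a row with a nested STRUCT (RECORD) column through the BigQuery streaming API using the google-cloud-bigquery client. The project, dataset, table, and schema are hypothetical, and valid GCP credentials are assumed.

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.users"  # hypothetical table

# A STRUCT column is expressed as a nested JSON object.
rows = [{
    "name": "Ada",
    "address": {
        "city": "Paris",
        "zip": "75001",
    },
}]

# Returns a list of per-row errors; empty means the insert succeeded.
errors = client.insert_rows_json(table_id, rows)
print(errors or "rows inserted")
```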

Machine Learning model deployment

Categories: Big Data, Data Engineering, Data Science, DevOps & SRE | Tags: DevOps, Operation, AI, Cloud, Machine Learning, MLOps, On-premises, Schema

“Enterprise Machine Learning requires looking at the big picture […] from a data engineering and a data platform perspective,” lectured Justin Norman during the talk on the deployment of Machine…

By Oskar RYNKIEWICZ

Sep 30, 2019

Spark Streaming part 4: clustering with Spark MLlib

Categories: Data Engineering, Data Science, Learning | Tags: Spark, Apache Spark Streaming, Big Data, Clustering, Machine Learning, Scala, Streaming

Spark MLlib is Apache Spark’s library offering scalable implementations of various supervised and unsupervised Machine Learning algorithms. Thus, the Spark framework can serve as a platform for…

By Oskar RYNKIEWICZ

Jun 27, 2019
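For a flavor of the MLlib API the article builds on, here is a minimal batch sketch fitting KMeans on a toy DataFrame; the data and the number of clusters are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("kmeans-example").getOrCreate()

# Two obvious groups of points, just to exercise the API.
df = spark.createDataFrame(
    [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (8.5, 9.0)], ["x", "y"]
)

# MLlib expects a single vector column of features.
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

model = KMeans(k=2, seed=42).fit(features)
print(model.clusterCenters())
```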

Spark Streaming part 3: DevOps, tools and tests for Spark applications

Categories: Big Data, Data Engineering, DevOps & SRE | Tags: DevOps, Learning and tutorial, Spark, Apache Spark Streaming

Whenever services are unavailable, businesses experience large financial losses. Spark Streaming applications can break, like any other software application. A streaming application operates on data…

By Oskar RYNKIEWICZ

May 31, 2019

Spark Streaming part 2: run Spark Structured Streaming pipelines in Hadoop

Categories: Data Engineering, Learning | Tags: Spark, Apache Spark Streaming, Python, Streaming

Spark can process streaming data on a multi-node Hadoop cluster, relying on HDFS for storage and YARN for job scheduling. Thus, Spark Structured Streaming integrates well with Big Data…

By Oskar RYNKIEWICZ

May 28, 2019

Spark Streaming part 1: build data pipelines with Spark Structured Streaming

Categories: Data Engineering, Learning | Tags: Kafka, Spark, Apache Spark Streaming, Big Data, Streaming

Spark Structured Streaming is a new engine introduced with Apache Spark 2 used for processing streaming data. It is built on top of the existing Spark SQL engine and the Spark DataFrame. The…

By Oskar RYNKIEWICZ

Apr 18, 2019
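To give a flavor of the engine introduced above, here is a minimal sketch of a Structured Streaming pipeline reading from Kafka and echoing messages to the console. The broker address and topic are hypothetical, and the spark-sql-kafka connector package is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("streaming-example").getOrCreate()

# Kafka source: each record arrives with binary key/value columns.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "events")                     # hypothetical topic
    .load()
    .select(col("value").cast("string"))
)

# Console sink, useful for experimenting; blocks until stopped.
query = events.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```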

Publish Spark SQL DataFrame and RDD with Spark Thrift Server

Categories: Data Engineering | Tags: Thrift, JDBC, Hadoop, Hive, Spark, SQL

The distributed and in-memory nature of the Spark engine makes it an excellent candidate to expose data to clients that expect low latencies. Dashboards, notebooks, BI studios, KPI-based reports…

By Oskar RYNKIEWICZ

Mar 25, 2019
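Since the Spark Thrift Server exposes a HiveServer2-compatible interface, a client library such as PyHive can query the published data. Below is a minimal sketch; the host, port, and table name are hypothetical, and the default (unauthenticated) connection mode is assumed.

```python
from pyhive import hive

# Connect to the Thrift Server's HiveServer2-compatible endpoint.
conn = hive.connect(host="thrift-server.example.com", port=10000)
cursor = conn.cursor()

# Query a DataFrame previously registered as a table (hypothetical name).
cursor.execute("SELECT * FROM published_dataframe LIMIT 10")
for row in cursor.fetchall():
    print(row)
```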

Apache Flink: past, present and future

Categories: Data Engineering | Tags: Pipeline, Flink, Kubernetes, Machine Learning, SQL, Streaming

Apache Flink is a little gem which deserves a lot more attention. Let’s dive into Flink’s past, its current state, and the future it is heading toward by following the keynotes and presentations at Flink…

By César BEREZOWSKI

Nov 5, 2018

Data Lake ingestion best practices

Categories: Big Data, Data Engineering | Tags: NiFi, Data Governance, HDF, Operation, Avro, Hive, ORC, Spark, Data Lake, File Format, Protocol Buffers, Registry, Schema

Creating a Data Lake requires rigor and experience. Here are some good practices around data ingestion, both for batch and stream architectures, that we recommend and implement with our customers…

By David WORMS

Jun 18, 2018

Apache Beam: a unified programming model for data processing pipelines

Categories: Data Engineering, DataWorks Summit 2018 | Tags: Apex, Beam, Pipeline, Flink, Spark

In this article, we will review the concepts, the history, and the future of Apache Beam, which may well become the new standard for data processing pipeline definition. At DataWorks Summit 2018 in…

By Gauthier LEONARD

May 24, 2018

What's new in Apache Spark 2.3?

Categories: Data Engineering, DataWorks Summit 2018 | Tags: Arrow, PySpark, Tuning, ORC, Spark, Spark MLlib, Data Science, Docker, Kubernetes, pandas, Streaming

Let’s dive into the new features offered by the 2.3 distribution of Apache Spark. This article is a composition of the following talks seen at the DataWorks Summit 2018 and additional research: Apache…

By César BEREZOWSKI

May 23, 2018

Execute Python in an Oozie workflow

Categories: Data Engineering | Tags: Oozie, Elasticsearch, Python, REST

Oozie workflows allow you to use multiple actions to execute code; however, doing so with Python can be a bit tricky. Let’s see how to do it. I’ve recently designed a workflow that would interact…

By César BEREZOWSKI

Mar 6, 2018

Oracle DB synchronization to Hadoop with CDC

Categories: Data Engineering | Tags: CDC, GoldenGate, Oracle, Hive, Sqoop, Data Warehouse

This note is the result of a discussion about the synchronization of data written in a database to a warehouse stored in Hadoop. Thanks to Claude Daub from GFI, who wrote it and who authorizes us to…

By David WORMS

Jul 13, 2017

EclairJS - Putting a Spark in Web Apps

Categories: Data Engineering, Front End | Tags: Jupyter, Spark, JavaScript

Presentation by David Fallside from IBM, images extracted from the presentation. Introduction: Web App development has moved from Java to NodeJS and JavaScript. It provides a simple and rich…

By David WORMS

Jul 17, 2016

Splitting HDFS files into multiple Hive tables

Categories: Data Engineering | Tags: Flume, Pig, HDFS, Hive, Oozie, SQL

I am going to show how to split a CSV file stored inside HDFS into multiple Hive tables based on the content of each record. The context is simple. We are using Flume to collect logs from all over our…

By David WORMS

Sep 15, 2013

Testing the Oracle SQL Connector for Hadoop HDFS

Categories: Data Engineering | Tags: Database, File system, Oracle, HDFS, CDH, SQL

Using Oracle SQL Connector for HDFS, you can use Oracle Database to access and analyze data residing in HDFS files or a Hive table. You can also query and join data in HDFS or a Hive table with other…

By David WORMS

Jul 15, 2013

Options to connect and integrate Hadoop with Oracle

Categories: Data Engineering | Tags: Database, Java, Oracle, R, RDBMS, Avro, HDFS, Hive, MapReduce, Sqoop, NoSQL, SQL

I will list the different tools and libraries available to us developers in order to integrate Oracle and Hadoop. The Oracle SQL Connector for HDFS described below is covered in a follow-up article…

By David WORMS

May 15, 2013

Two Hive UDAF to convert an aggregation to a map

Categories: Data Engineering | Tags: Java, HBase, Hive, File Format

I am publishing two new Hive UDAF to help with maps in Apache Hive. The source code is available on GitHub in two Java classes, “UDAFToMap” and “UDAFToOrderedMap”, or you can download the jar file. The…

By David WORMS

Mar 6, 2012

Timeseries storage in Hadoop and Hive

Categories: Data Engineering | Tags: CRM, timeseries, Tuning, Hadoop, HDFS, Hive, File Format

In the next few weeks, we will be exploring the storage and analytics of a large generated dataset. This dataset is composed of CRM tables associated with one time series table of about 7,000 billion rows…

By David WORMS

Jan 10, 2012

Canada - Morocco - France

We are a team passionate about Open Source, Big Data, and related technologies such as the Cloud, Data Engineering, Data Science, and DevOps…

We provide our customers with recognized expertise on how to use technology to turn their use cases into projects running in production, how to reduce costs, and how to accelerate the delivery of new features.

If you appreciate the quality of our publications, we invite you to contact us with a view to working together.

Support Ukraine