Articles published in 2020
Build your open source Big Data distribution with Hadoop, HBase, Spark, Hive & Zeppelin
Categories: Big Data, Infrastructure | Tags: Maven, Debug, Java, Hadoop, HBase, Hive, Spark, CDP, Git, GitHub, HDP, Release and features, TDP, Unit tests
The Hadoop ecosystem gave birth to many popular projects including HBase, Spark and Hive. While technologies like Kubernetes and S3-compatible object storage are growing in popularity, HDFS and YARN…
Dec 18, 2020
Faster model development with H2O AutoML and Flow
Categories: Data Science, Learning | Tags: PySpark, Automation, JDBC, R, Avro, Hadoop, HDFS, Hive, ORC, Parquet, Cloud, CSV, H2O, Machine Learning, MLOps, On-premises, Open source, Python, Scala
Building Machine Learning (ML) models is a time-consuming process. It requires expertise in statistics, ML algorithms, and programming. On top of that, it also requires the ability to translate a…
Dec 10, 2020
OAuth2 and OpenID Connect for microservices and public applications (Part 2)
Categories: Containers Orchestration, Cyber Security | Tags: Go Lang, JAMstack, LDAP, Micro Services, Security, CNCF, CoffeeScript, JavaScript Object Notation (JSON), Kubernetes, Node.js, OAuth2, OpenID Connect
When using OAuth2 and OpenID Connect, it is important to understand how the authorization flow takes place, who shall call the Authorization Server, and how to store the tokens. Moreover, microservices and…
By David WORMS
Nov 20, 2020
OAuth2 and OpenID Connect, a gentle and working introduction (Part 1)
Categories: Containers Orchestration, Cyber Security | Tags: Go Lang, JAMstack, LDAP, Active Directory, Security, CNCF, Kubernetes, OAuth2, OpenID Connect, Storage
Understanding OAuth2, OpenID and OpenID Connect (OIDC), how they relate, how the communications are established, and how to architect your application with the given access, refresh and ID tokens…
By David WORMS
Nov 17, 2020
Connecting to ADLS Gen2 from Hadoop (HDP) and NiFi (HDF)
Categories: Big Data, Cloud Computing, Data Engineering | Tags: NiFi, HDF, Hadoop, HDFS, MapReduce, Authentication, Authorization, Azure, Azure Data Lake Storage (ADLS), Big Data, Cloud, Data Lake, HDP, OAuth2
As data projects built in the Cloud are becoming more and more frequent, a common use case is to interact with Cloud storage from an existing on-premises Big Data platform. Microsoft Azure recently…
Nov 5, 2020
Rebuilding HDP Hive: patch, test and build
Categories: Big Data, Infrastructure | Tags: Maven, Debug, Java, Hadoop, Hive, CDP, Git, GitHub, HDP, Release and features, TDP, Unit tests
The Hortonworks HDP distribution will soon be deprecated in favor of Cloudera’s CDP. One of our clients wanted a new Apache Hive feature backported into HDP 2.6.0. We thought it was a good opportunity…
Oct 6, 2020
Data versioning and reproducible ML with DVC and MLflow
Categories: Data Science, DevOps & SRE, Events | Tags: Data Engineering, Operation, Spark, Databricks, Delta Lake, Git, Machine Learning, MLflow, Registry, Storage
Our talk on data versioning and reproducible Machine Learning, proposed to the Data + AI Summit (formerly known as Spark+AI), has been accepted. The summit will take place online on November 17-19…
Sep 30, 2020
Experiment tracking with MLflow on Databricks Community Edition
Categories: Data Engineering, Data Science, Learning | Tags: Spark, Databricks, Deep Learning, Delta Lake, Machine Learning, MLflow, Notebook, Python, Scikit-learn
Every day, the number of tools helping Data Scientists build models faster increases. Consequently, the need to manage the results and the…
Sep 10, 2020
Version your datasets with Data Version Control (DVC) and Git
Categories: Data Science, DevOps & SRE | Tags: DevOps, Infrastructure, Operation, Data Hub, Databricks, Git, GitHub, GitLab, GitOps, SCM
Using a Version Control System such as Git for source code is a good practice and an industry standard. Considering that projects focus more and more on data, shouldn’t we have a similar approach such…
By Grégor JOUET
Sep 3, 2020
Plugin architecture in JavaScript and Node.js with Plug and Play
Categories: Front End, Node.js | Tags: Asynchronous, DevOps, Packaging, Programming, Agile, IaC, JavaScript, Open source, Release and features
Plug and Play helps library and application authors to introduce a plugin architecture into their code. It simplifies complex code execution with well-defined interception points, also called hooks…
By David WORMS
Aug 28, 2020
Installing Hadoop from source: build, patch and run
Categories: Big Data, Infrastructure | Tags: Maven, Debug, Java, LXD, Hadoop, HDFS, CDP, Docker, HDP, TDP, Unit tests
Commercial Apache Hadoop distributions have come and gone. The two leaders, Cloudera and Hortonworks, have merged: HDP is no more and CDH is now CDP. MapR has been acquired by HPE and IBM BigInsights…
Aug 4, 2020
Download datasets into HDFS and Hive
Categories: Big Data, Data Engineering | Tags: Business intelligence, Data Engineering, Data structures, Database, Hadoop, HDFS, Hive, Big Data, Data Analytics, Data Lake, Data lakehouse, Data Warehouse
Nowadays, the analysis of large amounts of data is becoming more and more possible thanks to Big Data technologies (Hadoop, Spark,…). This explains the explosion of the data volume and the…
By Aida NGOM
Jul 31, 2020
Comparison of different file formats in Big Data
Categories: Big Data, Data Engineering | Tags: Business intelligence, Data structures, Database, MongoDB, Avro, Hadoop, HDFS, Hive, Kafka, MapReduce, ORC, Parquet, Spark, Batch processing, Big Data, CSV, Data Analytics, JavaScript Object Notation (JSON), Kubernetes, Protocol Buffers, XML
In data processing, there are different types of file formats to store your data sets. Each format has its own pros and cons depending upon the use cases and exists to serve one or several purposes…
By Aida NGOM
Jul 23, 2020
Automate a Spark routine workflow from GitLab to GCP
Categories: Big Data, Cloud Computing, Containers Orchestration | Tags: Data Engineering, DevOps, Learning and tutorial, Airflow, Spark, CI/CD, Cloud, Git, GitLab, GitOps, GCE, GCP, Terraform, IAM, Unit tests
A workflow consists of automating a succession of tasks to be carried out without human intervention. It is an important and widespread concept which particularly applies to operational environments…
Jun 16, 2020
Importing data to Databricks: external tables and Delta Lake
Categories: Data Engineering, Data Science, Learning | Tags: Parquet, AWS, Amazon S3, Azure, Azure Data Lake Storage (ADLS), Databricks, Delta Lake, Machine Learning, Python
During a Machine Learning project we need to keep track of the training data we are using. This is important for audit purposes and for assessing the performance of the models developed at a later…
May 21, 2020
Introducing Apache Airflow on AWS
Categories: Big Data, Cloud Computing, Containers Orchestration | Tags: PySpark, Data Engineering, DevOps, Learning and tutorial, Tools, Airflow, Hive, Oozie, Spark, AWS, Amazon S3, Docker, Docker Compose, Python
Apache Airflow offers a potential solution to the growing challenge of managing an increasingly complex landscape of data management tools, scripts and analytics processes. It is an open-source…
May 5, 2020
Expose a Rook-based Ceph cluster outside of Kubernetes
Categories: Containers Orchestration | Tags: Debug, Rook, SSH, Big Data, Ceph, Docker, Kubernetes
We recently deployed an LXD-based Hadoop cluster and we wanted to be able to apply size quotas on some filesystems (e.g. service logs, user homes). Quota is a built-in feature of the Linux kernel used…
Apr 16, 2020
Snowflake, the Data Warehouse for the Cloud, introduction and tutorial
Categories: Business Intelligence, Cloud Computing | Tags: AWS, Azure, Cloud, Data Lake, Data Science, Data Warehouse, GCP, Snowflake
Snowflake is a SaaS-based data-warehousing platform that centralizes, in the cloud, the storage and processing of structured and semi-structured data. The increasing generation of data produced over…
Apr 7, 2020
Optimization of Spark applications in Hadoop YARN
Categories: Data Engineering, Learning | Tags: Mesos, Tuning, Hadoop, Spark, YARN, Big Data, Clustering, Kubernetes, Python
Apache Spark is an in-memory data processing tool widely used in companies to deal with Big Data issues. Running a Spark application in production requires user-defined resources. This article…
Mar 30, 2020
MLflow tutorial: an open source Machine Learning (ML) platform
Categories: Data Engineering, Data Science, Learning | Tags: Arch Linux, R, MXNet, Spark MLlib, AWS, Azure, Databricks, Deep Learning, Deployment, H2O, Keras, Machine Learning, MLflow, MLOps, Python, PyTorch, Scikit-learn, TensorFlow, XGBoost
With computing power and storage becoming increasingly cheap, and data collection increasing in all walks of life, many companies have integrated Data Science…
Mar 23, 2020
Introduction to Ludwig and how to deploy a Deep Learning model via Flask
Categories: Data Science, Tech Radar | Tags: CLI, Learning and tutorial, Server, API, Deep Learning, File Format, Ludwig Deep Learning Toolbox, Machine Learning, Python
Over the past decade, Machine Learning and Deep Learning models have proven to be very effective in performing a wide variety of tasks such as fraud detection, product recommendation, autonomous…
Mar 2, 2020
Install and debug Kubernetes inside LXD
Categories: Containers Orchestration | Tags: Debug, Linux, LXD, Docker, Kubernetes, Node
We recently deployed a Kubernetes cluster with the need to maintain cluster isolation on our bare-metal nodes across our infrastructure. We knew that Virtual Machines would provide the required…
Feb 4, 2020
Policy enforcing with Open Policy Agent
Categories: Cyber Security, Data Governance | Tags: Go Lang, Tools, Kafka, Ranger, Authorization, Big Data, Cloud, Docker, Kubernetes, REST, SSL/TLS
Open Policy Agent is an open-source multi-purpose policy engine. Its main goal is to unify policy enforcement across the cloud native stack. The project was created by Styra and it is currently…
Jan 22, 2020