Articles published in 2020
Build your open source Big Data distribution with Hadoop, HBase, Spark, Hive & Zeppelin
Categories: Big Data, Infrastructure | Tags: Maven, Debug, Java, Hadoop, HBase, Hive, Spark, CDP, Git, GitHub, HDP, Release and features, TDP, Unit tests
The Hadoop ecosystem gave birth to many popular projects including HBase, Spark and Hive. While technologies like Kubernetes and S3-compatible object storage are growing in popularity, HDFS and YARN…
Dec 18, 2020
Faster model development with H2O AutoML and Flow
Categories: Data Science, Learning | Tags: PySpark, Automation, JDBC, R, Avro, Hadoop, HDFS, Hive, ORC, Parquet, Cloud, CSV, H2O, Machine Learning, MLOps, On-premises, Open source, Python, Scala
Building Machine Learning (ML) models is a time-consuming process. It requires expertise in statistics, ML algorithms, and programming. On top of that, it also requires the ability to translate a…
Dec 10, 2020
OAuth2 and OpenID Connect for microservices and public applications (Part 2)
Categories: Containers Orchestration, Cyber Security | Tags: Go Lang, JAMstack, LDAP, Micro Services, Security, CNCF, CoffeeScript, JavaScript Object Notation (JSON), Kubernetes, Node.js, OAuth2, OpenID Connect
When using OAuth2 and OpenID Connect, it is important to understand how the authorization flow takes place, who shall call the Authorization Server, and how to store the tokens. Moreover, microservices and…
By David WORMS
Nov 20, 2020
OAuth2 and OpenID Connect, a gentle and working introduction (Part 1)
Categories: Containers Orchestration, Cyber Security | Tags: Go Lang, JAMstack, LDAP, Active Directory, Security, CNCF, Kubernetes, OAuth2, OpenID Connect, Storage
Understanding OAuth2, OpenID and OpenID Connect (OIDC), how they relate, how the communications are established, and how to architect your application with the given access, refresh and ID tokens…
By David WORMS
Nov 17, 2020
Connecting to ADLS Gen2 from Hadoop (HDP) and NiFi (HDF)
Categories: Big Data, Cloud Computing, Data Engineering | Tags: NiFi, HDF, Hadoop, HDFS, MapReduce, Authentication, Authorization, Azure, Azure Data Lake Storage (ADLS), Big Data, Cloud, Data Lake, HDP, OAuth2
As data projects built in the Cloud are becoming more and more frequent, a common use case is to interact with Cloud storage from an existing on-premises Big Data platform. Microsoft Azure recently…
Nov 5, 2020
Rebuilding HDP Hive: patch, test and build
Categories: Big Data, Infrastructure | Tags: Maven, Debug, Java, Hadoop, Hive, CDP, Git, GitHub, HDP, Release and features, TDP, Unit tests
The Hortonworks HDP distribution will soon be deprecated in favor of Cloudera’s CDP. One of our clients wanted a new Apache Hive feature backported into HDP 2.6.0. We thought it was a good opportunity…
Oct 6, 2020
Data versioning and reproducible ML with DVC and MLflow
Categories: Data Science, DevOps & SRE, Events | Tags: Data Engineering, Operation, Spark, Databricks, Delta Lake, Git, Machine Learning, MLflow, Registry, Storage
Our talk on data versioning and reproducible Machine Learning, proposed to the Data + AI Summit (formerly known as Spark+AI), has been accepted. The summit will take place online on November 17-19…
Sep 30, 2020
Experiment tracking with MLflow on Databricks Community Edition
Categories: Data Engineering, Data Science, Learning | Tags: Spark, Databricks, Deep Learning, Delta Lake, Machine Learning, MLflow, Notebook, Python, Scikit-learn
Every day, the number of tools helping Data Scientists build models faster increases. Consequently, the need to manage the results and the…
Sep 10, 2020
Version your datasets with Data Version Control (DVC) and Git
Categories: Data Science, DevOps & SRE | Tags: DevOps, Infrastructure, Operation, Data Hub, Databricks, Git, GitHub, GitLab, GitOps, SCM
Using a Version Control System such as Git for source code is a good practice and an industry standard. Considering that projects focus more and more on data, shouldn’t we have a similar approach such…
By Grégor JOUET
Sep 3, 2020
Plugin architecture in JavaScript and Node.js with Plug and Play
Categories: Front End, Node.js | Tags: Asynchronous, DevOps, Packaging, Programming, Agile, IaC, JavaScript, Open source, Release and features
Plug and Play helps library and application authors to introduce a plugin architecture into their code. It simplifies complex code execution with well-defined interception points, also called hooks…
By David WORMS
Aug 28, 2020
Installing Hadoop from source: build, patch and run
Categories: Big Data, Infrastructure | Tags: Maven, Debug, Java, LXD, Hadoop, HDFS, CDP, Docker, HDP, TDP, Unit tests
Commercial Apache Hadoop distributions have come and gone. The two leaders, Cloudera and Hortonworks, have merged: HDP is no more and CDH is now CDP. MapR has been acquired by HPE and IBM BigInsights…
Aug 4, 2020
Download datasets into HDFS and Hive
Categories: Big Data, Data Engineering | Tags: Business intelligence, Data Engineering, Data structures, Database, Hadoop, HDFS, Hive, Big Data, Data Analytics, Data Lake, Data lakehouse, Data Warehouse
Nowadays, the analysis of large amounts of data is becoming more and more possible thanks to Big Data technologies (Hadoop, Spark,…). This explains the explosion of the data volume and the…
By Aida NGOM
Jul 31, 2020
Comparison of different file formats in Big Data
Categories: Big Data, Data Engineering | Tags: Business intelligence, Data structures, Database, MongoDB, Avro, Hadoop, HDFS, Hive, Kafka, MapReduce, ORC, Parquet, Spark, Batch processing, Big Data, CSV, Data Analytics, JavaScript Object Notation (JSON), Kubernetes, Protocol Buffers, XML
In data processing, there are different types of file formats to store your data sets. Each format has its own pros and cons depending upon the use cases and exists to serve one or several purposes…
By Aida NGOM
Jul 23, 2020
Automate a Spark routine workflow from GitLab to GCP
Categories: Big Data, Cloud Computing, Containers Orchestration | Tags: Data Engineering, DevOps, Learning and tutorial, Airflow, Spark, CI/CD, Cloud, Git, GitLab, GitOps, GCE, GCP, Terraform, IAM, Unit tests
A workflow consists of automating a succession of tasks to be carried out without human intervention. It is an important and widespread concept which particularly applies to operational environments…
Jun 16, 2020
Importing data to Databricks: external tables and Delta Lake
Categories: Data Engineering, Data Science, Learning | Tags: Parquet, AWS, Amazon S3, Azure, Azure Data Lake Storage (ADLS), Databricks, Delta Lake, Machine Learning, Python
During a Machine Learning project we need to keep track of the training data we are using. This is important for audit purposes and for assessing the performance of the models developed at a later…
May 21, 2020
Introducing Apache Airflow on AWS
Categories: Big Data, Cloud Computing, Containers Orchestration | Tags: PySpark, Data Engineering, DevOps, Learning and tutorial, Tools, Airflow, Hive, Oozie, Spark, AWS, Amazon S3, Docker, Docker Compose, Python
Apache Airflow offers a potential solution to the growing challenge of managing an increasingly complex landscape of data management tools, scripts and analytics processes. It is an open-source…
May 5, 2020
Expose a Rook-based Ceph cluster outside of Kubernetes
Categories: Containers Orchestration | Tags: Debug, Rook, SSH, Big Data, Ceph, Docker, Kubernetes
We recently deployed an LXD-based Hadoop cluster and we wanted to be able to apply size quotas on some filesystems (e.g. service logs, user homes). Quota is a built-in feature of the Linux kernel used…
Apr 16, 2020
Snowflake, the Data Warehouse for the Cloud, introduction and tutorial
Categories: Business Intelligence, Cloud Computing | Tags: AWS, Azure, Cloud, Data Lake, Data Science, Data Warehouse, GCP, Snowflake
Snowflake is a SaaS-based data-warehousing platform that centralizes, in the cloud, the storage and processing of structured and semi-structured data. The increasing generation of data produced over…
Apr 7, 2020
Optimization of Spark applications in Hadoop YARN
Categories: Data Engineering, Learning | Tags: Mesos, Tuning, Hadoop, Spark, YARN, Big Data, Clustering, Kubernetes, Python
Apache Spark is an in-memory data processing tool widely used in companies to deal with Big Data issues. Running a Spark application in production requires user-defined resources. This article…
Mar 30, 2020
MLflow tutorial: an open source Machine Learning (ML) platform
Categories: Data Engineering, Data Science, Learning | Tags: Arch Linux, R, MXNet, Spark MLlib, AWS, Azure, Databricks, Deep Learning, Deployment, H2O, Keras, Machine Learning, MLflow, MLOps, Python, PyTorch, Scikit-learn, TensorFlow, XGBoost
With computing power and storage becoming increasingly cheap, and data collection increasing in all walks of life, many companies have integrated Data Science…
Mar 23, 2020
Introduction to Ludwig and how to deploy a Deep Learning model via Flask
Categories: Data Science, Tech Radar | Tags: CLI, Learning and tutorial, Server, API, Deep Learning, File Format, Ludwig Deep Learning Toolbox, Machine Learning, Python
Over the past decade, Machine Learning and Deep Learning models have proven to be very effective in performing a wide variety of tasks such as fraud detection, product recommendation, autonomous…
Mar 2, 2020
Install and debug Kubernetes inside LXD
Categories: Containers Orchestration | Tags: Debug, Linux, LXD, Docker, Kubernetes, Node
We recently deployed a Kubernetes cluster with the need to maintain cluster isolation on our bare-metal nodes across our infrastructure. We knew that Virtual Machines would provide the required…
Feb 4, 2020
Policy enforcing with Open Policy Agent
Categories: Cyber Security, Data Governance | Tags: Go Lang, Tools, Kafka, Ranger, Authorization, Big Data, Cloud, Docker, Kubernetes, REST, SSL/TLS
Open Policy Agent is an open-source multi-purpose policy engine. Its main goal is to unify policy enforcement across the cloud native stack. The project was created by Styra and it is currently…
Jan 22, 2020