Articles published in 2019

Applying Deep Reinforcement Learning to Poker

Categories: Data Science | Tags: Algorithm, Gaming, Q-learning, Deep Learning, Machine Learning, Neural Network, Python

We will cover the subject of Deep Reinforcement Learning, more specifically the Deep Q Learning algorithm introduced by DeepMind, and then we’ll apply a version of this algorithm to the game of Poker…

By Oscar BLAZEJEWSKI

Jan 9, 2019

Self-sovereign identities with verifiable claims

Categories: Data Governance | Tags: Authentication, Blockchain, Cloud, GitHub, GitLab, IAM, Ledger

Towards a trusted, personal, persistent, and portable digital identity for all. Digital identity issues: Self-sovereign identities are an attempt to solve a couple of issues. The first is the…

By Nabil MELLAL

Jan 23, 2019

Publish Spark SQL DataFrame and RDD with Spark Thrift Server

Categories: Data Engineering | Tags: Thrift, JDBC, Hadoop, Hive, Spark, Python, SQL

The distributed and in-memory nature of the Spark engine makes it an excellent candidate to expose data to clients that expect low latencies. Dashboards, notebooks, BI studios, KPI-based reports…

By Oskar RYNKIEWICZ

Mar 25, 2019

Introduction to Cloudera Data Science Workbench

Categories: Data Science | Tags: Tuning, Azure, Azure Data Catalog, Azure Data Factory, Cloud, Cloudera, Data Hub, Docker, Git, Kubernetes, Machine Learning, MLOps, Notebook

Cloudera Data Science Workbench is a platform that allows Data Scientists to create, manage, run and schedule data science workflows from their browser. Thus it enables them to focus on their main…

By Mehdi ELALAMI

Feb 28, 2019

Multihoming on Hadoop

Categories: Infrastructure | Tags: Hadoop, HDFS, Kerberos, Network

Multihoming, which means having multiple networks attached to one node, is one of the main components to manage the heterogeneous network usage of an Apache Hadoop cluster. This article is an…

By Joris RUMMENS

Mar 5, 2019

Gatsby.js, React and GraphQL for documentation websites

Categories: Adaltas Summit 2018, Front End | Tags: Gatsby, HTTP, JAMstack, React.js, SEO, API, GitOps, GraphQL, IaC, JavaScript, Markdown, Node.js

In the last few months, I have started to redesign some of our Open Source project websites. This includes the websites of the Node.js CSV project, the Node.js HBase client and the Nikita project, our…

By David WORMS

Apr 1, 2019

First Class Functions in Python

Categories: Hack, Learning | Tags: Programming, Python

I recently watched a talk by Dave Cheney about first class functions in Go. Python supports first class functions too, so can we use them in the same ways? Absolutely. I have been using Python for a…

By Arthur BUSSER

Apr 15, 2019

Recover from an EFI failure on a dedicated server

Categories: Hack | Tags: Infrastructure, Linux, Cloud

A few weeks ago, before upgrading our Ubuntu systems, we sort of messed around with our EFI partitions and the impacted servers never came back online on system reboot after the upgrade. Provisioning…

By Grégor JOUET

Apr 16, 2019

Installing Kubernetes on CentOS 7

Categories: Containers Orchestration | Tags: CentOS, cgroups, DevOps, Infrastructure, Namespaces, Red Hat, VM, Ceph, CNCF, Docker, Kubernetes

This article explains how to install a Kubernetes cluster. I will dive into what each step does so you can build a thorough understanding of what is going on. This article is based on my talk from the…

By Arthur BUSSER

Jan 29, 2019

Google Cloud Summit Paris Notes

Categories: Events | Tags: AWS, Azure, Cloud, GCP, Kubernetes, On-premises

Google organized its yearly Summit edition 2019 in Paris on the 18th of June. This year’s event was the biggest yet in Paris, which reflects Google’s commitment to position itself in the French market…

By Tariq SAHNOUNI

Jun 26, 2019

Auto-scaling Druid with Kubernetes

Categories: Big Data, Business Intelligence, Containers Orchestration | Tags: Helm, Metrics, OLAP, Operation, Container Orchestration, EC2, Druid, Cloud, CNCF, Data Analytics, Kubernetes, Prometheus, Python

Apache Druid is an open-source analytics data store which could leverage the auto-scaling abilities of Kubernetes due to its distributed nature and its reliance on memory. I was inspired by the talk…

By Leo SCHOUKROUN

Jul 16, 2019

TensorFlow installation on Docker

Categories: Containers Orchestration, Data Science, Learning | Tags: CPU, Linux, AI, Deep Learning, Docker, Jupyter, Python, TensorFlow

TensorFlow is Open Source software from Google for numerical computation using a graph representation: vertices (nodes) represent mathematical operations; edges represent N-dimensional data array…

By Pierre SAUVAGE

Aug 5, 2019

Druid and Hive integration

Categories: Big Data, Business Intelligence, Tech Radar | Tags: Learning and tutorial, LLAP, OLAP, Druid, Hive, Data Analytics, GitLab, PostgreSQL, SQL

This article covers the integration between Hive Interactive (LLAP) and Druid. One can see it as a complement to the Ultra-fast OLAP Analytics with Apache Hive and Druid article. Tools description…

By Pierre SAUVAGE

Jun 17, 2019

Running Apache Hive 3, new features and tips and tricks

Categories: Big Data, Business Intelligence, DataWorks Summit 2019 | Tags: JDBC, LLAP, Active Directory, Druid, Hadoop, Hive, Kafka, Cloudera, Data Warehouse, PostgreSQL, Python, Release and features, Storage

Apache Hive 3 brings a bunch of new and nice features to the data warehouse. Unfortunately, like many major FOSS releases, it comes with a few bugs and not much documentation. It has been available since…

By Gauthier LEONARD

Jul 25, 2019

Mount Aladdin eToken in Firefox on Archlinux

Categories: Hack | Tags: Arch Linux, Cyber Security, Firefox, Security, Smart card, 2FA

Given you’re on Archlinux and have an Aladdin eToken, let’s see how we can mount it in Firefox for web authentication. An Aladdin eToken is a cryptographic device (token, smart card) that stores…

By César BEREZOWSKI

Jul 12, 2019

Innovation, project vs product culture in Data Science

Categories: Data Science, Data Governance | Tags: DevOps, Agile, Data Lake, Registry, Schema, Scrum, TCO

Data Science carries the jobs of tomorrow. It is closely linked to the understanding of the business use cases, the behaviors and the insights that will be extracted from existing data. The stakes are…

By David WORMS

Oct 8, 2019

Notes on the Cloudera Open Source licensing model

Categories: Big Data | Tags: CDSW, License, Cloudera, CDH, Cloudera Manager, HDP, Open source

Following the publication of its Open Source licensing strategy on July 10, 2019 in an article called “our Commitment to Open Source Software”, Cloudera broadcast a webinar yesterday October 2…

By David WORMS

Oct 25, 2019

Kerberos and Spnego authentication on Windows with Firefox

Categories: Cyber Security | Tags: Cryptography, DevOps, Firefox, HTTP, Big Data, FreeIPA, Kerberos, Network

In Greek mythology, Kerberos, also called Cerberus, guards the gates of the Underworld to prevent the dead from leaving. He is commonly described as a three-headed dog with a serpent’s tail, mane of…

By David WORMS

Nov 4, 2019

Rook with Ceph doesn't provision my Persistent Volume Claims!

Categories: DevOps & SRE | Tags: PVC, Linux, Rook, Ubuntu, Ceph, Cluster, Internship, Kubernetes, PostgreSQL, Redis, Storage

Ceph installation inside Kubernetes can be provisioned using Rook. Currently doing an internship at Adaltas, I was in charge of participating in the setup of a Kubernetes (k8s) cluster. To avoid…

By Eyal CHOJNOWSKI

Sep 9, 2019

Internship Data Science & Data Engineer - ML in production and streaming data ingestion

Categories: Data Engineering, Data Science | Tags: Automation, DevOps, Learning and tutorial, Flink, Hadoop, HBase, Kafka, Spark, Big Data, Container, Elasticsearch, Internship, Kubernetes, NoSQL, Python

Context: The exponential evolution of data has turned the industry upside down by redefining data storage, processing and data ingestion pipelines. Mastering these methods considerably facilitates…

By David WORMS

Nov 26, 2019

Users and RBAC authorizations in Kubernetes

Categories: Containers Orchestration, Data Governance | Tags: Cyber Security, RBAC, Authentication, Authorization, Kubernetes, SSL/TLS

Having your Kubernetes cluster up and running is just the start of your journey and you now need to operate. To secure its access, user identities must be declared along with authentication and…

By Robert Walid SOARES

Aug 7, 2019

InfraOps & DevOps Internship - build a Big Data & Kubernetes PaaS

Categories: Big Data, Containers Orchestration | Tags: Automation, Data Engineering, DevOps, Learning and tutorial, LXD, Hadoop, Kafka, Spark, Ceph, Git, IaC, Internship, Kubernetes, NoSQL

Context: The acquisition of a high-capacity cluster is in line with Adaltas’ desire to build a PaaS-type offering to use and to provide Big Data and container orchestration platforms. The platforms are…

By David WORMS

Nov 26, 2019

Insert rows in BigQuery tables with complex columns

Categories: Cloud Computing, Data Engineering | Tags: Business intelligence, Learning and tutorial, Big Data, GCP, BigQuery, Schema, SQL

Google’s BigQuery is a cloud data warehousing system designed to process enormous volumes of data with several features available. Out of all those features, let’s talk about the support of Struct…

By César BEREZOWSKI

Nov 22, 2019

Hadoop Ozone part 2: tutorial and getting started of its features

Categories: Infrastructure | Tags: CLI, HTTP, Learning and tutorial, HDFS, Ozone, Amazon S3, Big Data, Cloud, Cluster, Kerberos, Release and features, REST

The releases of Hadoop Ozone come with a handy docker-compose file to try out Ozone. The instructions below provide details on how to use it. You can also use the Katacoda training sandbox which…

By Paul-Adrien CORDONNIER

Dec 3, 2019

Hadoop Ozone part 3: advanced replication strategy with Copyset

Categories: Infrastructure | Tags: HDFS, Ozone, Amazon S3, Big Data, Cloud, Cluster, Kubernetes, Node, Release and features

Hadoop Ozone provides a way of setting a ReplicationType for every write you make on the cluster. Right now HDFS and Ratis are supported, but more advanced replication strategies can be achieved. In this…

By Paul-Adrien CORDONNIER

Dec 3, 2019

Avoid Bottlenecks in distributed Deep Learning pipelines with Horovod

Categories: Data Science | Tags: Algorithm, CPU, GPU, Pipeline, Tuning, Deep Learning, Horovod, Keras, Machine Learning, TCO, TensorFlow

The Deep Learning training process can be greatly sped up using a cluster of GPUs. When dealing with huge amounts of data, distributed computing quickly becomes a challenge. A common obstacle which…

By Grégor JOUET

Nov 15, 2019

Should you move your Big Data and Data Lake to the Cloud

Categories: Big Data, Cloud Computing | Tags: DevOps, Hadoop, Kafka, Knox, Spark, AWS, Amazon S3, Azure, Azure Data Lake Storage (ADLS), Azure Data Catalog, Azure Data Factory, Cloud, CDP, Data Hub, Databricks, GCP, Kubernetes, Redis

Should you follow the trend and migrate your data, workflows and infrastructure to GCP, AWS and Azure? During the Strata Data Conference in New York, a general focus was put on moving customer’s Big…

By Joris RUMMENS

Dec 9, 2019

Logstash pipelines remote configuration and self-indexing

Categories: Data Engineering, Infrastructure | Tags: DevOps, Pipeline, Container, Docker, Elasticsearch, Kibana, Logstash, Log4j

Logstash is a powerful data collection engine that integrates into the Elastic Stack (Elasticsearch - Logstash - Kibana). The goal of this article is to show you how to deploy a fully managed Logstash…

By Paul-Adrien CORDONNIER

Dec 13, 2019

Cloudera CDP and Cloud migration of your Data Warehouse

Categories: Big Data, Cloud Computing | Tags: EC2, Atlas, Knox, Ranger, Spark, AWS, Amazon S3, Azure, Azure Data Lake Storage (ADLS), Cloudera, Data Hub, Data Lake, Data Warehouse, FreeIPA, Keycloak

While one of our customers is anticipating a move to the Cloud and with the recent announcement of Cloudera CDP availability mid-September during the Strata conference, it seems like the appropriate…

By David WORMS

Dec 16, 2019

Machine Learning model deployment

Categories: Big Data, Data Engineering, Data Science, DevOps & SRE | Tags: C++, DevOps, Java, Monitoring, Operation, AI, Hadoop, Kafka, Spark, YARN, Cloud, Container, Deep Learning, Docker, Kubernetes, Machine Learning, MLflow, MLOps, Neural Network, On-premises, Python, Schema, TensorFlow, XGBoost

“Enterprise Machine Learning requires looking at the big picture […] from a data engineering and a data platform perspective,” lectured Justin Norman during the talk on the deployment of Machine…

By Oskar RYNKIEWICZ

Sep 30, 2019

Hadoop Ozone part 1: an introduction of the new filesystem

Categories: Infrastructure | Tags: Container Storage Interface (CSI), HDFS, Hive, MapReduce, Ozone, Spark, Amazon S3, Big Data, Cloud, Cluster, Kubernetes, Release and features

Hadoop Ozone is an object store for Hadoop. It is designed to scale to billions of objects of varying sizes. It is currently in development. The roadmap is available on the project wiki. This article…

By Paul-Adrien CORDONNIER

Dec 3, 2019

Apache Knox made easy!

Categories: Big Data, Cyber Security, Adaltas Summit 2018 | Tags: Ambari, Shiro, Solr, JDBC, LDAP, Active Directory, Hadoop, Hive, Knox, Ranger, Kerberos, Log4j, REST, SSL/TLS, SSO

Apache Knox is the secure entry point of a Hadoop cluster, but can it also be the entry point for my REST applications? Apache Knox overview: Apache Knox is an application gateway for interacting in a…

By Michael HATOUM

Feb 4, 2019

Spark Streaming part 2: run Spark Structured Streaming pipelines in Hadoop

Categories: Data Engineering, Learning | Tags: Data Governance, Hadoop, Spark, Apache Spark Streaming, Big Data, Consensus, File Format, Python, Streaming, TCO

Spark can process streaming data on a multi-node Hadoop cluster relying on HDFS for the storage and YARN for the scheduling of jobs. Thus, Spark Structured Streaming integrates well with Big Data…

By Oskar RYNKIEWICZ

May 28, 2019

Spark Streaming part 1: build data pipelines with Spark Structured Streaming

Categories: Data Engineering, Learning | Tags: PySpark, Kafka, Spark, Apache Spark Streaming, Big Data, SQL, Streaming

Spark Structured Streaming is a new engine introduced with Apache Spark 2 used for processing streaming data. It is built on top of the existing Spark SQL engine and the Spark DataFrame. The…

By Oskar RYNKIEWICZ

Apr 18, 2019

Spark Streaming part 3: DevOps, tools and tests for Spark applications

Categories: Big Data, Data Engineering, DevOps & SRE | Tags: DevOps, Learning and tutorial, Spark, Apache Spark Streaming, IaC, Log4j, Python, Scala, Streaming, Unit tests

Whenever services are unavailable, businesses experience large financial losses. Spark Streaming applications can break, like any other software application. A streaming application operates on data…

By Oskar RYNKIEWICZ

May 31, 2019

Spark Streaming part 4: clustering with Spark MLlib

Categories: Data Engineering, Data Science, Learning | Tags: Spark, Apache Spark Streaming, Big Data, Clustering, Machine Learning, Scala, Streaming

Spark MLlib is an Apache Spark library offering scalable implementations of various supervised and unsupervised Machine Learning algorithms. Thus, the Spark framework can serve as a platform for…

By Oskar RYNKIEWICZ

Jun 27, 2019