Articles published in 2019
Spark Streaming part 4: clustering with Spark MLlib
Categories: Data Engineering, Data Science, Learning | Tags: Spark, Apache Spark Streaming, Big Data, Clustering, Machine Learning, Scala, Streaming
Spark MLlib is an Apache Spark library offering scalable implementations of various supervised and unsupervised Machine Learning algorithms. Thus, the Spark framework can serve as a platform for…
Jun 27, 2019
Spark Streaming part 3: DevOps, tools and tests for Spark applications
Categories: Big Data, Data Engineering, DevOps & SRE | Tags: DevOps, Learning and tutorial, Spark, Apache Spark Streaming, IaC, Log4j, Python, Scala, Streaming, Unit tests
Whenever services are unavailable, businesses experience large financial losses. Spark Streaming applications can break, like any other software application. A streaming application operates on data…
May 31, 2019
Spark Streaming part 2: run Spark Structured Streaming pipelines in Hadoop
Categories: Data Engineering, Learning | Tags: Data Governance, Hadoop, Spark, Apache Spark Streaming, Big Data, Consensus, File Format, Python, Streaming, TCO
Spark can process streaming data on a multi-node Hadoop cluster relying on HDFS for the storage and YARN for the scheduling of jobs. Thus, Spark Structured Streaming integrates well with Big Data…
May 28, 2019
Spark Streaming part 1: build data pipelines with Spark Structured Streaming
Categories: Data Engineering, Learning | Tags: PySpark, Kafka, Spark, Apache Spark Streaming, Big Data, SQL, Streaming
Spark Structured Streaming is a new engine introduced with Apache Spark 2 used for processing streaming data. It is built on top of the existing Spark SQL engine and the Spark DataFrame. The…
Apr 18, 2019
Cloudera CDP and Cloud migration of your Data Warehouse
Categories: Big Data, Cloud Computing | Tags: EC2, Atlas, Knox, Ranger, Spark, AWS, Amazon S3, Azure, Azure Data Lake Storage (ADLS), Cloudera, Data Hub, Data Lake, Data Warehouse, FreeIPA, Keycloak
While one of our customers is anticipating a move to the Cloud, and with the recent announcement of Cloudera CDP availability in mid-September during the Strata conference, it seems like the appropriate…
By David WORMS
Dec 16, 2019
Logstash pipelines remote configuration and self-indexing
Categories: Data Engineering, Infrastructure | Tags: DevOps, Pipeline, Container, Docker, Elasticsearch, Kibana, Logstash, Log4j
Logstash is a powerful data collection engine that integrates into the Elastic Stack (Elasticsearch - Logstash - Kibana). The goal of this article is to show you how to deploy a fully managed Logstash…
Dec 13, 2019
Should you move your Big Data and Data Lake to the Cloud
Categories: Big Data, Cloud Computing | Tags: DevOps, Hadoop, Kafka, Knox, Spark, AWS, Amazon S3, Azure, Azure Data Lake Storage (ADLS), Azure Data Catalog, Azure Data Factory, Cloud, CDP, Data Hub, Databricks, GCP, Kubernetes, Redis
Should you follow the trend and migrate your data, workflows and infrastructure to GCP, AWS and Azure? During the Strata Data Conference in New York, a general focus was put on moving customer’s Big…
Dec 9, 2019
Hadoop Ozone part 3: advanced replication strategy with Copyset
Categories: Infrastructure | Tags: HDFS, Ozone, Amazon S3, Big Data, Cloud, Cluster, Kubernetes, Node, Release and features
Hadoop Ozone provides a way of setting a ReplicationType for every write you make on the cluster. Right now, HDFS and Ratis are supported, but more advanced replication strategies can be achieved. In this…
Dec 3, 2019
Hadoop Ozone part 2: tutorial and getting started of its features
Categories: Infrastructure | Tags: CLI, HTTP, Learning and tutorial, HDFS, Ozone, Amazon S3, Big Data, Cloud, Cluster, Kerberos, Release and features, REST
The releases of Hadoop Ozone come with a handy docker-compose file to try out Ozone. The instructions below provide details on how to use it. You can also use the Katacoda training sandbox which…
Dec 3, 2019
Hadoop Ozone part 1: an introduction of the new filesystem
Categories: Infrastructure | Tags: Container Storage Interface (CSI), HDFS, Hive, MapReduce, Ozone, Spark, Amazon S3, Big Data, Cloud, Cluster, Kubernetes, Release and features
Hadoop Ozone is an object store for Hadoop. It is designed to scale to billions of objects of varying sizes. It is currently in development. The roadmap is available on the project wiki. This article…
Dec 3, 2019
InfraOps & DevOps Internship - build a Big Data & Kubernetes PaaS
Categories: Big Data, Containers Orchestration | Tags: Automation, Data Engineering, DevOps, Learning and tutorial, LXD, Hadoop, Kafka, Spark, Ceph, Git, IaC, Internship, Kubernetes, NoSQL
Context The acquisition of a high-capacity cluster is in line with Adaltas’ desire to build a PaaS-type offering to use and to provide Big Data and container orchestration platforms. The platforms are…
By David WORMS
Nov 26, 2019
Internship Data Science & Data Engineer - ML in production and streaming data ingestion
Categories: Data Engineering, Data Science | Tags: Automation, DevOps, Learning and tutorial, Flink, Hadoop, HBase, Kafka, Spark, Big Data, Container, Elasticsearch, Internship, Kubernetes, NoSQL, Python
Context The exponential evolution of data has turned the industry upside down by redefining data storage, processing and data ingestion pipelines. Mastering these methods considerably facilitates…
By David WORMS
Nov 26, 2019
Insert rows in BigQuery tables with complex columns
Categories: Cloud Computing, Data Engineering | Tags: Business intelligence, Learning and tutorial, Big Data, GCP, BigQuery, Schema, SQL
Google’s BigQuery is a cloud data warehousing system designed to process enormous volumes of data with several features available. Out of all those features, let’s talk about the support of Struct…
Nov 22, 2019
Avoid Bottlenecks in distributed Deep Learning pipelines with Horovod
Categories: Data Science | Tags: Algorithm, CPU, GPU, Pipeline, Tuning, Deep Learning, Horovod, Keras, Machine Learning, TCO, TensorFlow
The Deep Learning training process can be greatly sped up using a cluster of GPUs. When dealing with huge amounts of data, distributed computing quickly becomes a challenge. A common obstacle which…
By Grégor JOUET
Nov 15, 2019
Kerberos and Spnego authentication on Windows with Firefox
Categories: Cyber Security | Tags: Cryptography, DevOps, Firefox, HTTP, Big Data, FreeIPA, Kerberos, Network
In Greek mythology, Kerberos, also called Cerberus, guards the gates of the Underworld to prevent the dead from leaving. He is commonly described as a three-headed dog with a serpent’s tail, a mane of…
By David WORMS
Nov 4, 2019
Notes on the Cloudera Open Source licensing model
Categories: Big Data | Tags: CDSW, License, Cloudera, CDH, Cloudera Manager, HDP, Open source
Following the publication of its Open Source licensing strategy on July 10, 2019 in an article called “our Commitment to Open Source Software”, Cloudera broadcast a webinar yesterday, October 2…
By David WORMS
Oct 25, 2019
Innovation, project vs product culture in Data Science
Categories: Data Science, Data Governance | Tags: DevOps, Agile, Data Lake, Registry, Schema, Scrum, TCO
Data Science carries the jobs of tomorrow. It is closely linked to the understanding of the business use cases, the behaviors and the insights that will be extracted from existing data. The stakes are…
By David WORMS
Oct 8, 2019
Machine Learning model deployment
Categories: Big Data, Data Engineering, Data Science, DevOps & SRE | Tags: C++, DevOps, Java, Monitoring, Operation, AI, Hadoop, Kafka, Spark, YARN, Cloud, Container, Deep Learning, Docker, Kubernetes, Machine Learning, MLflow, MLOps, Neural Network, On-premises, Python, Schema, TensorFlow, XGBoost
“Enterprise Machine Learning requires looking at the big picture […] from a data engineering and a data platform perspective,” lectured Justin Norman during the talk on the deployment of Machine…
Sep 30, 2019
Rook with Ceph doesn't provision my Persistent Volume Claims!
Categories: DevOps & SRE | Tags: PVC, Linux, Rook, Ubuntu, Ceph, Cluster, Internship, Kubernetes, PostgreSQL, Redis, Storage
Ceph installation inside Kubernetes can be provisioned using Rook. Currently doing an internship at Adaltas, I was in charge of participating in the setup of a Kubernetes (k8s) cluster. To avoid…
Sep 9, 2019
Users and RBAC authorizations in Kubernetes
Categories: Containers Orchestration, Data Governance | Tags: Cyber Security, RBAC, Authentication, Authorization, Kubernetes, SSL/TLS
Having your Kubernetes cluster up and running is just the start of your journey and you now need to operate. To secure its access, user identities must be declared along with authentication and…
Aug 7, 2019
TensorFlow installation on Docker
Categories: Containers Orchestration, Data Science, Learning | Tags: CPU, Jupyter, Linux, AI, Deep Learning, Docker, Python, TensorFlow
TensorFlow is Open Source software from Google for numerical computation using a graph representation: vertices (nodes) represent mathematical operations; edges represent N-dimensional data array…
Aug 5, 2019
Running Apache Hive 3, new features and tips and tricks
Categories: Big Data, Business Intelligence, DataWorks Summit 2019 | Tags: JDBC, LLAP, Active Directory, Druid, Hadoop, Hive, Kafka, Cloudera, Data Warehouse, PostgreSQL, Python, Release and features, Storage
Apache Hive 3 brings a bunch of new and nice features to the data warehouse. Unfortunately, like many major FOSS releases, it comes with a few bugs and not much documentation. It has been available since…
Jul 25, 2019
Auto-scaling Druid with Kubernetes
Categories: Big Data, Business Intelligence, Containers Orchestration | Tags: Helm, Metrics, OLAP, Operation, Container Orchestration, EC2, Druid, Cloud, CNCF, Data Analytics, Kubernetes, Prometheus, Python
Apache Druid is an open-source analytics data store which could leverage the auto-scaling abilities of Kubernetes due to its distributed nature and its reliance on memory. I was inspired by the talk…
Jul 16, 2019
Mount Aladdin eToken in Firefox on Archlinux
Categories: Hack | Tags: Arch Linux, Cyber Security, Firefox, Security, Smart card, 2FA
Given you’re on Archlinux and have an Aladdin eToken, let’s see how we can mount it in Firefox for web authentication. An Aladdin eToken is a cryptographic device (token, smart card) that stores…
Jul 12, 2019
Google Cloud Summit Paris Notes
Categories: Events | Tags: AWS, Azure, Cloud, GCP, Kubernetes, On-premises
Google organized the 2019 edition of its yearly Summit in Paris on the 18th of June. This year’s event was the biggest yet in Paris, which reflects Google’s commitment to position itself in the French market…
Jun 26, 2019
Druid and Hive integration
Categories: Big Data, Business Intelligence, Tech Radar | Tags: Learning and tutorial, LLAP, OLAP, Druid, Hive, Data Analytics, GitLab, PostgreSQL, SQL
This article covers the integration between Hive Interactive (LLAP) and Druid. One can see it as a complement to the Ultra-fast OLAP Analytics with Apache Hive and Druid article. Tools description…
Jun 17, 2019
Recover from an EFI failure on a dedicated server
Categories: Hack | Tags: Infrastructure, Linux, Cloud
A few weeks ago, before upgrading our Ubuntu systems, we sort of messed around with our EFI partitions and the impacted servers never came back online on system reboot after the upgrade. Provisioning…
By Grégor JOUET
Apr 16, 2019
First Class Functions in Python
Categories: Hack, Learning | Tags: Programming, Python
I recently watched a talk by Dave Cheney about first class functions in Go. Python supports first class functions too, so can we use them in the same ways? Absolutely. I have been using Python for a…
Apr 15, 2019
Gatsby.js, React and GraphQL for documentation websites
Categories: Adaltas Summit 2018, Front End | Tags: Gatsby, HTTP, JAMstack, React.js, SEO, API, GitOps, GraphQL, IaC, JavaScript, Markdown, Node.js
In the last few months, I have started to redesign some of our Open Source project websites. This includes the websites of the Node.js CSV project, the Node.js HBase client and the Nikita project, our…
By David WORMS
Apr 1, 2019
Publish Spark SQL DataFrame and RDD with Spark Thrift Server
Categories: Data Engineering | Tags: Thrift, JDBC, Hadoop, Hive, Spark, Python, SQL
The distributed and in-memory nature of the Spark engine makes it an excellent candidate to expose data to clients which expect low latencies. Dashboards, notebooks, BI studios, KPIs-based reports…
Mar 25, 2019
Multihoming on Hadoop
Categories: Infrastructure | Tags: Hadoop, HDFS, Kerberos, Network
Multihoming, which means having multiple networks attached to one node, is one of the main components to manage the heterogeneous network usage of an Apache Hadoop cluster. This article is an…
Mar 5, 2019
Introduction to Cloudera Data Science Workbench
Categories: Data Science | Tags: Tuning, Azure, Azure Data Catalog, Azure Data Factory, Cloud, Cloudera, Data Hub, Docker, Git, Kubernetes, Machine Learning, MLOps, Notebook
Cloudera Data Science Workbench is a platform that allows Data Scientists to create, manage, run and schedule data science workflows from their browser. Thus it enables them to focus on their main…
Feb 28, 2019
Apache Knox made easy!
Categories: Big Data, Cyber Security, Adaltas Summit 2018 | Tags: Ambari, Shiro, Solr, JDBC, LDAP, Active Directory, Hadoop, Hive, Knox, Ranger, Kerberos, Log4j, REST, SSL/TLS, SSO
Apache Knox is the secure entry point of a Hadoop cluster, but can it also be the entry point for my REST applications? Apache Knox overview Apache Knox is an application gateway for interacting in a…
Feb 4, 2019
Installing Kubernetes on CentOS 7
Categories: Containers Orchestration | Tags: CentOS, cgroups, DevOps, Infrastructure, Namespaces, Red Hat, VM, Ceph, CNCF, Docker, Kubernetes
This article explains how to install a Kubernetes cluster. I will dive into what each step does so you can build a thorough understanding of what is going on. This article is based on my talk from the…
Jan 29, 2019
Self-sovereign identities with verifiable claims
Categories: Data Governance | Tags: Authentication, Blockchain, Cloud, GitHub, GitLab, IAM, Ledger
Towards a trusted, personal, persistent, and portable digital identity for all. Digital identity issues Self-sovereign identities are an attempt to solve a couple of issues. The first is the…
By Nabil MELLAL
Jan 23, 2019
Applying Deep Reinforcement Learning to Poker
Categories: Data Science | Tags: Algorithm, Gaming, Q-learning, Deep Learning, Machine Learning, Neural Network, Python
We will cover the subject of Deep Reinforcement Learning, more specifically the Deep Q Learning algorithm introduced by DeepMind, and then we’ll apply a version of this algorithm to the game of Poker…
Jan 9, 2019