Articles published in 2018
LXD: The Missing Piece
Categories: Containers Orchestration | Tags: CPU, Linux, LXD, VM, Docker, Kubernetes
LXD stands for Linux Container Daemon. Yet another container technology. But LXD is very different. It stands apart from the pack. It is not necessarily better nor much faster nor more secure! But it…
Dec 28, 2018
Monitoring a production Hadoop cluster with Kubernetes
Categories: DevOps & SRE | Tags: Thrift, Grafana, Shinken, Hadoop, Knox, Cluster, Docker, Elasticsearch, Kubernetes, Node, Node.js, Prometheus, Python
Monitoring a production-grade Hadoop cluster is a real challenge and needs to evolve constantly. The software we use today is based on Nagios. Very efficient when it comes to the simplest…
Dec 21, 2018
CodaLab – Data Science competitions
Categories: Data Science, Adaltas Summit 2018, Learning | Tags: Database, Infrastructure, Machine Learning, MySQL, Node.js, Python
CodaLab Competition is a platform for code execution in the field of Data Science. It is a web interface on which a user can submit code or results and compare themselves to others. Let’s see how it…
Dec 17, 2018
Native modules for Node.js with N-API
Categories: Adaltas Summit 2018, Front End | Tags: C++, NPM, JavaScript, Kerberos, Node.js
How to create native modules for Node.js? How to use N-API, the future of native addon development? Writing C/C++ addons is a useful and powerful feature of the Node.js runtime. Let’s explore them…
Dec 12, 2018
Microsoft introduces Cloud Native Application Bundles
Categories: Containers Orchestration | Tags: CLI, Helm, Packaging, Docker, Kubernetes
At DockerCon EU 2018 in Barcelona, Matt Butcher, Principal Engineer at Microsoft and inventor of Helm, introduced CNAB, Cloud Native Application Bundles, a packaging format for distributed…
Dec 4, 2018
Jumbo, the Hadoop cluster bootstrapper
Categories: Infrastructure | Tags: Ambari, Automation, Ansible, Cluster, Vagrant, HDP, IaC, Python, REST, SCM
Introducing Jumbo, a Hadoop cluster bootstrapper for developers. Jumbo helps you deploy development environments for Big Data technologies. It takes a few minutes to get a custom virtualized Hadoop…
Nov 29, 2018
Main advantages of GraphQL as an alternative to REST
Categories: Front End | Tags: gRPC, API, GraphQL, JavaScript Object Notation (JSON), Node.js, Registry, REST
GraphQL is based on a simple idea: moving the assembly of a request from the server to the client. The client sees the overall strongly-typed schema instead of multiple REST endpoints and builds…
By David WORMS
Nov 27, 2018
Hadoop cluster takeover with Apache Ambari
Categories: Big Data, DevOps & SRE, Adaltas Summit 2018 | Tags: Ambari, Automation, iptables, Nikita, Systemd, Cluster, HDP, IaC, Kerberos, Node, Node.js, REST, SCM
We recently migrated a large production Hadoop cluster from a “manual” automated install to Apache Ambari. We called this the Ambari Takeover. This is a risky process and we will detail why this…
Nov 15, 2018
Node.js CSV version 4 - re-writing and performance
Categories: Node.js | Tags: CLI, Data Engineering, Refactoring, CSV, Release and features
Today, we release a new major version of the Node.js CSV parser project. Version 4 is a complete rewrite of the project focusing on performance. It also comes with new functionalities as well as…
By David WORMS
Nov 19, 2018
Managing User Identities on Big Data Clusters
Categories: Cyber Security, Data Governance | Tags: LDAP, Active Directory, Ansible, FreeIPA, IaC, IAM, Kerberos
Securing a Big Data Cluster involves integrating or deploying specific services to store users. Some users are cluster-specific while others are available across all clusters. It is not always easy to…
By David WORMS
Nov 8, 2018
Apache Flink: past, present and future
Categories: Data Engineering | Tags: Consistency, Micro Services, Pipeline, Flink, Batch processing, Kubernetes, Ledger, Machine Learning, Scikit-learn, SQL, Storage, Streaming
Apache Flink is a little gem which deserves a lot more attention. Let’s dive into Flink’s past, its current state and the future it is heading to by following the keynotes and presentations at Flink…
Nov 5, 2018
One week to discuss technology in a Moroccan riad
Categories: Adaltas Summit 2018, Learning | Tags: CDSW, Gatsby, React.js, Flink, Hadoop, Knox, Data Science, Deep Learning, Kubernetes, Node.js
Adaltas is organizing its first conference this year, from October 22 to 26. On the agenda for these 5 days: discussing technology in one of the most beautiful riads of Marrakech. Mix the…
By David WORMS
Oct 11, 2018
Nvidia and AI on the edge
Categories: Data Science | Tags: Caffe, GPU, NVIDIA, AI, Deep Learning, Edge computing, Keras, PyTorch, TCO, TensorFlow
In the last four years, corporations have been investing a lot in AI and particularly in Deep Learning and Edge Computing. While the theory has taken huge steps forward and new algorithms are invented…
By Yliess HATI
Oct 10, 2018
Deploying a secured Flink cluster on Kubernetes
Categories: Big Data | Tags: Encryption, Flink, HDFS, Kafka, Elasticsearch, Kerberos, SSL/TLS
When deploying secured Flink applications inside Kubernetes, you are faced with two choices. Assuming your Kubernetes is secure, you may rely on the underlying platform or rely on Flink native…
By David WORMS
Oct 8, 2018
KVM machines for Vagrant on Archlinux
Categories: DevOps & SRE | Tags: Arch Linux, KVM, Linux, Virtualization, VM, Vagrant
Vagrant supports different providers to manage virtualization. In a Linux environment, you can dramatically improve VM performance by using the libvirt provider and the KVM hypervisor. This tutorial…
Sep 19, 2018
Lando: Deep Learning used to summarize conversations
Categories: Data Science, Learning | Tags: CockroachDB, FoundationDB, Micro Services, NATS, Open API, React.js, Speech to text, Swagger, Vue.js, Kafka, Deep Learning, GitLab, IaC, Internship, JWT, Kubernetes, Neural Network, Node.js, Python
Lando is an application to summarize conversations using Speech To Text to translate the written record of a meeting into text and Deep Learning techniques to summarize contents. It allows users to…
By Yliess HATI
Sep 18, 2018
Clusters and workloads migration from Hadoop 2 to Hadoop 3
Categories: Big Data, Infrastructure | Tags: Slider, Erasure Coding, Operation, Rolling Upgrade, SLA, Hadoop, HBase, HDFS, Oozie, Spark, YARN, Docker, TCO
Hadoop 2 to Hadoop 3 migration is a hot subject. How to upgrade your clusters, which features present in the new release may solve current problems and bring new opportunities, how are your current…
Jul 25, 2018
Deep learning on YARN: running Tensorflow and friends on Hadoop cluster
Categories: Data Science | Tags: GPU, Hadoop, MXNet, Spark, Spark MLlib, YARN, Deep Learning, PyTorch, TensorFlow, XGBoost
With the arrival of Hadoop 3, YARN offers more flexibility in resource management. It is now possible to perform Deep Learning analysis on GPUs with specific development environments, leveraging…
Jul 24, 2018
Curing the Kafka blindness with the UI manager
Categories: Big Data | Tags: Ambari, Hortonworks, HDF, JMX, UI, Kafka, Ranger, HDP
Today it’s really difficult for developers, operators and managers to visualize and monitor what happens in a Kafka cluster. This article covers a new graphical interface to oversee Kafka. It was…
Jun 20, 2018
A CoreOS development cluster with Vagrant and VirtualBox
Categories: Hack, Infrastructure | Tags: Arch Linux, CoreOS, Linux, VirtualBox, Clustering, Consensus, etcd, Vagrant
Following CoreOS’s instructions on how to set up a development environment in VirtualBox did not work out well for me. Here are the steps I followed to get Container Linux up and running with Vagrant…
Jun 20, 2018
Guide to Keybase encrypted directories
Categories: Cyber Security, Hack | Tags: Cryptography, Encryption, File system, Keybase, PGP, Authorization
This is a guide to using Keybase’s encrypted directories to store and share files. Keybase is a group, file and chat application whose goal is to bring public key crypto based on PGP to everyone in…
Jun 18, 2018
Data Lake ingestion best practices
Categories: Big Data, Data Engineering | Tags: NiFi, Data Governance, HDF, Operation, Avro, Hive, ORC, Spark, Data Lake, File Format, Protocol Buffers, Registry, Schema
Creating a Data Lake requires rigor and experience. Here are some good practices around data ingestion both for batch and stream architectures that we recommend and implement with our customers…
By David WORMS
Jun 18, 2018
Apache Hadoop YARN 3.0 – State of the union
Categories: Big Data, DataWorks Summit 2018 | Tags: GPU, Hortonworks, Hadoop, HDFS, MapReduce, YARN, Cloudera, Data Science, Docker, Release and features
This article covers the “Apache Hadoop YARN: state of the union” talk given by Wangda Tan from Hortonworks during the DataWorks Summit 2018. What is Apache YARN? As a reminder, YARN is one of the two…
May 31, 2018
Accelerating query processing with materialized views in Apache Hive
Categories: Business Intelligence, DataWorks Summit 2018 | Tags: Calcite, OLAP, Druid, Hive, Release and features, SQL
The new materialized view feature is coming in Apache Hive 3.0. Jesus Camacho Rodriguez from Hortonworks gave a talk, “Accelerating query processing with materialized views in Apache Hive”, about it…
May 31, 2018
YARN and GPU Distribution for Machine Learning
Categories: Data Science, DataWorks Summit 2018 | Tags: arXiv, GPU, Grafana, MXNet, YARN, Docker, Machine Learning, Neural Network, Storage, TensorFlow
This article goes over the fundamental principles of Machine Learning and what tools are currently used to run machine learning algorithms. We will then see how a resource manager such as YARN can be…
By Grégor JOUET
May 30, 2018
TensorFlow on Spark 2.3: The Best of Both Worlds
Categories: Data Science, DataWorks Summit 2018 | Tags: Mesos, C++, CPU, GPU, Tuning, Spark, YARN, JavaScript, Keras, Kubernetes, Machine Learning, Python, TensorFlow
The integration of TensorFlow with Spark has a lot of potential and creates new opportunities. This article is based on a talk given at the DataWorks Summit 2018 in Berlin. It was about the new…
By Yliess HATI
May 29, 2018
Apache Metron in the Real World
Categories: Cyber Security, DataWorks Summit 2018 | Tags: Algorithm, NiFi, Solr, Storm, pcap, RDBMS, HDFS, Kafka, Metron, Spark, Data Science, Elasticsearch, SQL
Apache Metron is a storage and analytic platform specialized in cyber security. This talk was about demonstrating the usages and capabilities of Apache Metron in the real world. The presentation was…
May 29, 2018
Running Enterprise Workloads in the Cloud with Cloudbreak
Categories: Big Data, Cloud Computing, DataWorks Summit 2018 | Tags: Cloudbreak, Operation, Hadoop, AWS, Azure, GCP, HDP, OpenStack
This article is based on Peter Darvasi and Richard Doktorics’ talk Running Enterprise Workloads in the Cloud at the DataWorks Summit 2018 in Berlin. It presents Hortonworks’ automated deployment tool…
May 28, 2018
Omid: Scalable and highly available transaction processing for Apache Phoenix
Categories: Big Data, DataWorks Summit 2018 | Tags: Omid, Phoenix, Transaction, ACID, HBase, SQL
Apache Omid provides a transactional layer on top of key/value NoSQL databases. In practice, it is usually used on top of Apache HBase. Credit to Ohad Shacham for his talk and his work for Apache…
May 24, 2018
Apache Beam: a unified programming model for data processing pipelines
Categories: Data Engineering, DataWorks Summit 2018 | Tags: Apex, Beam, Java, Pipeline, Flink, Spark, Batch processing, Python, Streaming, TCO
In this article, we will review the concepts, the history and the future of Apache Beam, which may well become the new standard for defining data processing pipelines. At DataWorks Summit 2018 in…
May 24, 2018
Present and future of Hadoop workflow scheduling: Oozie 5.x
Categories: Big Data, DataWorks Summit 2018 | Tags: Hadoop, Hive, Oozie, Sqoop, CDH, HDP, Python, REST
During the DataWorks Summit Europe 2018 in Berlin, I had the opportunity to attend a breakout session on Apache Oozie. It covers the new features released in Oozie 5.0, including future features of…
May 23, 2018
What's new in Apache Spark 2.3?
Categories: Data Engineering, DataWorks Summit 2018 | Tags: Arrow, PySpark, Tuning, ORC, Spark, Spark MLlib, Data Science, Docker, Kubernetes, pandas, Python, Streaming
Let’s dive into the new features offered by the 2.3 distribution of Apache Spark. This article is a composition of the following talks seen at the DataWorks Summit 2018 and additional research: Apache…
May 23, 2018
Essential questions about Time Series
Categories: Big Data | Tags: Grafana, Druid, HBase, Hive, ORC, Data Science, Elasticsearch, IOT
Today, the bulk of Big Data is temporal. We see it in the media and among our customers: smart meters, banking transactions, smart factories, connected vehicles … IoT and Big Data go hand in hand. We…
By David WORMS
Mar 18, 2018
Execute Python in an Oozie workflow
Categories: Data Engineering | Tags: Oozie, Elasticsearch, Python, REST
Oozie workflows allow you to use multiple actions to execute code. However, doing so with Python can be a bit tricky. Let’s see how to do that. I’ve recently designed a workflow that would interact…
Mar 6, 2018
Publishing guidelines
Categories: DevOps & SRE | Tags: Arch Linux, KVM, VM, GitLab, Vagrant, Markdown
This is as much a set of guidelines targeting everyone publishing content on the web as rules for reviewers to ensure no validation is forgotten before submitting for publication. It mostly targets…
By David WORMS
Feb 28, 2018
Ambari - How to blueprint
Categories: Big Data, DevOps & SRE | Tags: Ambari, Automation, DevOps, Operation, Ranger, CDH, HDP, IaC, PostgreSQL, REST
As infrastructure engineers at Adaltas, we deploy Hadoop clusters. A lot of them. Let’s see how to automate this process with REST requests. While really handy for deploying one or two clusters, the…
Jan 17, 2018