All our articles

Innovation, project vs product culture in Data Science

Categories: Data Science, Data Governance | Tags: DevOps, Agile, Scrum

Data Science carries the jobs of tomorrow. It is closely linked to the understanding of the business use cases, the behaviors and the insights that will be extracted from existing data. The stakes are…

By David WORMS

Oct 8, 2019

Machine Learning model deployment

Categories: Big Data, Data Engineering, Data Science, DevOps & SRE | Tags: AI, Cloud, DevOps, Machine Learning, On-premise, Operation, Schema

“Enterprise Machine Learning requires looking at the big picture … from a data engineering and a data platform perspective,” lectured Justin Norman during the talk on the deployment of Machine…

By Oskar RYNKIEWICZ

Sep 30, 2019

Rook with Ceph doesn't provision my Persistent Volume Claims!

Categories: DevOps & SRE | Tags: Kubernetes, PVC, Linux, Rook, Ubuntu, Ceph

Ceph installation inside Kubernetes can be provisioned using Rook. Currently doing an internship at Adaltas, I was in charge of participating in the setup of a Kubernetes (k8s) cluster. To avoid…

By Eyal CHOJNOWSKI

Sep 9, 2019

Users and RBAC authorizations in Kubernetes

Categories: Containers Orchestration, Data Governance | Tags: Authentication, Authorization, Cyber Security, Kubernetes, RBAC, SSL/TLS

Having your Kubernetes cluster up and running is just the start of your journey and you now need to operate it. To secure its access, user identities must be declared along with authentication and…

By Robert Walid SOARES

Aug 7, 2019

TensorFlow installation on Docker

Categories: Containers Orchestration, Data Science, Learning | Tags: AI, CPU, Deep Learning, Docker, Jupyter, Linux, TensorFlow

TensorFlow is Open Source software from Google for numerical computation using a graph representation: vertices (nodes) represent mathematical operations; edges represent N-dimensional data array…

By Pierre SAUVAGE

Aug 5, 2019

Running Apache Hive 3, new features and tips and tricks

Categories: Big Data, Business Intelligence, DataWorks Summit 2019 | Tags: Druid, Hive, Kafka, Cloudera, Data Warehouse, JDBC, LLAP, Active Directory, Release and features, Hadoop

Apache Hive 3 brings a bunch of new and nice features to the data warehouse. Unfortunately, like many major FOSS releases, it comes with a few bugs and not much documentation. It has been available since…

By Gauthier LEONARD

Jul 25, 2019

Auto-scaling Druid with Kubernetes

Categories: Big Data, Business Intelligence, Containers Orchestration | Tags: EC2, Druid, Cloud, CNCF, Container Orchestration, Data Analytics, Helm, Kubernetes, Metrics, OLAP, Operation, Prometheus, Python

Apache Druid is an open-source analytics data store which could leverage the auto-scaling abilities of Kubernetes due to its distributed nature and its reliance on memory. I was inspired by the talk…

By Schoukroun LEO

Jul 16, 2019

Mount Aladdin eToken in Firefox on Archlinux

Categories: Hack | Tags: 2FA, Arch Linux, Cyber Security, Firefox, Security, Smart card

Given you’re on Archlinux and have an Aladdin eToken, let’s see how we can mount it in Firefox for web authentication. An Aladdin eToken is a cryptographic device (token, smart card) that stores…

By César BEREZOWSKI

Jul 12, 2019

Spark Streaming part 4: clustering with Spark MLlib

Categories: Data Engineering, Data Science, Learning | Tags: Spark, Apache Spark Streaming, Big Data, Clustering, Machine Learning, Scala, Streaming

Spark MLlib is an Apache Spark library offering scalable implementations of various supervised and unsupervised Machine Learning algorithms. Thus, the Spark framework can serve as a platform for…

By Oskar RYNKIEWICZ

Jul 11, 2019

Google Cloud Summit Paris Notes

Categories: Events | Tags: AWS, Cloud, GCP, Kubernetes, Azure, On-premise

Google organized its yearly Summit edition 2019 in Paris on the 18th of June. This year’s event was the biggest yet in Paris, which reflects Google’s commitment to position itself in the French market…

By Tariq SAHNOUNI

Jun 26, 2019

Spark Streaming part 3: DevOps, tools and tests for Spark applications

Categories: Big Data, Data Engineering, DevOps & SRE | Tags: Spark, Apache Spark Streaming, DevOps, Learning and tutorial

Whenever services are unavailable, businesses experience large financial losses. Spark Streaming applications can break, like any other software application. A streaming application operates on data…

By Oskar RYNKIEWICZ

Jun 19, 2019

Druid and Hive integration

Categories: Big Data, Business Intelligence, Tech Radar | Tags: Druid, Hive, Data Analytics, Learning and tutorial, LLAP, OLAP, SQL

This article covers the integration between Hive Interactive (LLAP) and Druid. One can see it as a complement to the Ultra-fast OLAP Analytics with Apache Hive and Druid article. Tools description…

By Pierre SAUVAGE

Jun 17, 2019

Spark Streaming part 2: run Spark Structured Streaming pipelines in Hadoop

Categories: Data Engineering, Learning | Tags: Spark, Apache Spark Streaming, Big Data, File Format, Data Governance, Python, Streaming, Hadoop

Spark can process streaming data on a multi-node Hadoop cluster relying on HDFS for the storage and YARN for the scheduling of jobs. Thus, Spark Structured Streaming integrates well with Big Data…

By Oskar RYNKIEWICZ

May 28, 2019

Spark Streaming part 1: build data pipelines with Spark Structured Streaming

Categories: Data Engineering, Learning | Tags: Kafka, Spark, Apache Spark Streaming, Big Data, Streaming

Spark Structured Streaming is a new engine introduced with Apache Spark 2 used for processing streaming data. It is built on top of the existing Spark SQL engine and the Spark DataFrame. The…

By Oskar RYNKIEWICZ

Apr 18, 2019

Recover from an EFI failure on a dedicated server

Categories: Hack | Tags: Cloud, Infrastructure, Linux

A few weeks ago, before upgrading our Ubuntu systems, we sort of messed around with our EFI partitions and the impacted servers never came back online on system reboot after the upgrade. Provisioning…

By Grégor JOUET

Apr 16, 2019

First Class Functions in Python

Categories: Hack, Learning | Tags: Programming, Python

I recently watched a talk by Dave Cheney about first class functions in Go. Python supports first class functions too, so can we use them in the same ways? Absolutely. I have been using Python for a…

By Arthur BUSSER

Apr 15, 2019

Gatsby.js, React and GraphQL for documentation websites

Categories: Adaltas Summit 2018, Front End | Tags: API, Gatsby, GraphQL, HTTP, JAMstack, JavaScript, Markdown, Node.js, React.js, SEO

In the last few months, I have started to redesign some of our Open Source project websites. This includes the websites of the Node.js CSV project, the Node.js HBase client and the Nikita project, our…

By David WORMS

Apr 1, 2019

Publish Spark SQL DataFrame and RDD with Spark Thrift Server

Categories: Data Engineering | Tags: Hive, Spark, Thrift, JDBC, Hadoop, SQL

The distributed and in-memory nature of the Spark engine makes it an excellent candidate to expose data to clients which expect low latencies. Dashboards, notebooks, BI studios, KPIs-based reports…

By Oskar RYNKIEWICZ

Mar 25, 2019

Multihoming on Hadoop

Categories: Infrastructure | Tags: HDFS, Kerberos, Network, Hadoop

Multihoming, which means having multiple networks attached to one node, is one of the main components to manage the heterogeneous network usage of an Apache Hadoop cluster. This article is an…

By Joris RUMMENS

Mar 5, 2019

Introduction to Cloudera Data Science Workbench

Categories: Data Science | Tags: Cloud, Cloudera, Docker, Git, Kubernetes, Machine Learning, Azure, Notebook, Tuning

Cloudera Data Science Workbench is a platform that allows Data Scientists to create, manage, run and schedule data science workflows from their browser. Thus it enables them to focus on their main…

By Mehdi ELALAMI

Feb 28, 2019

Apache Knox made easy!

Categories: Big Data, Cyber Security, Adaltas Summit 2018 | Tags: Ambari, Hive, Knox, Ranger, Shiro, Solr, JDBC, Kerberos, LDAP, Active Directory, REST, SSL/TLS, Hadoop, SSO

Apache Knox is the secure entry point of a Hadoop cluster, but can it also be the entry point for my REST applications? Apache Knox overview Apache Knox is an application gateway for interacting in a…

By Michael HATOUM

Feb 4, 2019

Installing Kubernetes on CentOS 7

Categories: Containers Orchestration | Tags: CentOS, cgroups, CNCF, DevOps, Docker, Infrastructure, Kubernetes, Namespaces, Red Hat, VM, Ceph

This article explains how to install a Kubernetes cluster. I will dive into what each step does so you can build a thorough understanding of what is going on. This article is based on my talk from the…

By Arthur BUSSER

Jan 29, 2019

Self-sovereign identities with verifiable claims

Categories: Data Governance | Tags: Authentication, Blockchain, Cloud, Identity, Ledger

Towards a trusted, personal, persistent, and portable digital identity for all. Digital identity issues Self-sovereign identities are an attempt to solve a couple of issues. The first is the…

By Nabil MELLAL

Jan 23, 2019

Applying Deep Reinforcement Learning to Poker

Categories: Data Science | Tags: Algorithms, Deep Learning, Gaming, Machine Learning, Python, Q-learning, Neural Network

We will cover the subject of Deep Reinforcement Learning, more specifically the Deep Q Learning algorithm introduced by DeepMind, and then we’ll apply a version of this algorithm to the game of Poker…

By Oscar BLAZEJEWSKI

Jan 9, 2019

LXD: The Missing Piece

Categories: Containers Orchestration | Tags: CPU, Docker, Kubernetes, Linux, LXD, VM

LXD stands for Linux Container Daemon. Yet another container technology. But LXD is very different. It stands apart from the pack. It is not necessarily better nor much faster nor more secure! But it…

By Tariq SAHNOUNI

Dec 28, 2018

Monitoring a production Hadoop cluster with Kubernetes

Categories: DevOps & SRE | Tags: Knox, Thrift, Docker, Elasticsearch, Grafana, Kubernetes, Node.js, Prometheus, Python, Shinken, Hadoop

Monitoring a production grade Hadoop cluster is a real challenge and needs to be constantly evolving. The software we use today is based on Nagios. Very efficient when it comes to the simplest…

By Paul-Adrien CORDONNIER

Dec 21, 2018

CodaLab – Data Science competitions

Categories: Data Science, Adaltas Summit 2018, Learning | Tags: Database, Infrastructure, Machine Learning, MySQL, Node.js, Python

CodaLab Competition is a platform for code execution in the field of Data Science. It is a web interface on which a user can submit code or results and compare themselves to others. Let’s see how it…

By Robert Walid SOARES

Dec 17, 2018

Native modules for Node.js with N-API

Categories: Adaltas Summit 2018, Front End | Tags: C++, JavaScript, Kerberos, Node.js, NPM

How to create native modules for Node.js? How to use N-API, the future of native addons development? Writing C/C++ addons is a useful and powerful feature of the Node.js runtime. Let’s explore them…

By Xavier HERMAND

Dec 12, 2018

Microsoft introduces Cloud Native Application Bundles

Categories: Containers Orchestration | Tags: CLI, Docker, Helm, Kubernetes, Packaging

At DockerCon EU 2018 in Barcelona, Matt Butcher, Principal Engineer at Microsoft and inventor of Helm, introduced CNAB, Cloud Native Application Bundles, a packaging format for distributed…

By Arthur BUSSER

Dec 4, 2018

Jumbo, the Hadoop cluster bootstrapper

Categories: Infrastructure | Tags: Ansible, Ambari, Automation, HDP, REST, Vagrant

Introducing Jumbo, a Hadoop cluster bootstrapper for developers. Jumbo helps you deploy development environments for Big Data technologies. It takes a few minutes to get a custom virtualized Hadoop…

By Gauthier LEONARD

Nov 29, 2018

Main advantages of GraphQL as an alternative to REST

Categories: Front End | Tags: API, GraphQL, GRPC, JSON, Node.js, Registry, REST

GraphQL is based on a simple idea, moving the assembly of a request from the server to the client. The client sees the overall strongly-typed schema instead of multiple REST endpoints and he builds…

By David WORMS

Nov 27, 2018

Node.js CSV version 4 - re-writing and performance

Categories: Node.js | Tags: CLI, CSV, Data Engineering, Refactoring, Release and features

Today, we release a new major version of the Node.js CSV parser project. Version 4 is a complete rewrite of the project focusing on performance. It also comes with new functionalities as well as…

By David WORMS

Nov 19, 2018

Hadoop cluster takeover with Apache Ambari

Categories: Big Data, DevOps & SRE, Adaltas Summit 2018 | Tags: Ambari, Automation, HDP, iptables, Kerberos, Nikita, Node.js, REST, Systemd

We recently migrated a large production Hadoop cluster from a “manual” automated install to Apache Ambari. We called this the Ambari Takeover. This is a risky process and we will detail why this…

By Schoukroun LEO

Nov 15, 2018

Managing User Identities on Big Data Clusters

Categories: Cyber Security, Data Governance | Tags: Ansible, FreeIPA, Identity, Kerberos, LDAP, Active Directory

Securing a Big Data Cluster involves integrating or deploying specific services to store users. Some users are cluster-specific while others are available across all clusters. It is not always easy to…

By David WORMS

Nov 8, 2018

Apache Flink: past, present and future

Categories: Data Engineering | Tags: Flink, Batch processing, Consistency, Kubernetes, Ledger, Machine Learning, Micro Services, Pipeline, Streaming, SQL

Apache Flink is a little gem which deserves a lot more attention. Let’s dive into Flink’s past, its current state and the future it is heading to by following the keynotes and presentations at Flink…

By César BEREZOWSKI

Nov 5, 2018

One week to discuss technology in a Moroccan riad

Categories: Adaltas Summit 2018, Learning | Tags: Flink, Knox, CDSW, Deep Learning, Gatsby, Kubernetes, Node.js, React.js, Hadoop

Adaltas is organizing its first conference this year, between the 22nd and the 26th of October. On the agenda for these 5 days: discussing technology in one of the most beautiful riads of Marrakech. Mix the…

By David WORMS

Oct 11, 2018

Nvidia and AI on the edge

Categories: Data Science | Tags: AI, Caffe, Deep Learning, Edge computing, GPU, Keras, NVIDIA, PyTorch, TensorFlow

In the last four years, corporations have been investing a lot in AI and particularly in Deep Learning and Edge Computing. While the theory has taken huge steps forward and new algorithms are invented…

By Yliess HATI

Oct 10, 2018

Deploying a secured Flink cluster on Kubernetes

Categories: Big Data | Tags: Flink, HDFS, Kafka, Elasticsearch, Encryption, Kerberos, SSL/TLS

When deploying secured Flink applications inside Kubernetes, you are faced with two choices. Assuming your Kubernetes is secure, you may rely on the underlying platform or rely on Flink native…

By David WORMS

Oct 8, 2018

KVM machines for Vagrant on Archlinux

Categories: DevOps & SRE | Tags: Arch Linux, KVM, Linux, Vagrant, Virtualization, VM

Vagrant supports different providers to manage virtualization. In a Linux environment, you can dramatically improve VM performance by using the libvirt provider and the KVM hypervisor. This tutorial…

By Gauthier LEONARD

Sep 19, 2018

Lando: Deep Learning used to summarize conversations

Categories: Data Science, Learning | Tags: Deep Learning, Kubernetes, Micro Services, Node.js, Open API, Neural Network

Lando is an application to summarize conversations using Speech To Text to translate the written record of a meeting into text and Deep Learning techniques to summarize contents. It allows users to…

By Yliess HATI

Sep 18, 2018

Clusters and workloads migration from Hadoop 2 to Hadoop 3

Categories: Big Data, Infrastructure | Tags: HBase, HDFS, Oozie, Slider, Spark, YARN, Docker, Erasure Coding, Operation, Rolling Upgrade, SLA, Hadoop

Hadoop 2 to Hadoop 3 migration is a hot subject. How to upgrade your clusters, which features present in the new release may solve current problems and bring new opportunities, how are your current…

By Lucas BAKALIAN

Jul 25, 2018

Deep learning on YARN: running Tensorflow and friends on Hadoop cluster

Categories: Data Science | Tags: Spark, Spark MLlib, YARN, Deep Learning, GPU, PyTorch, TensorFlow, XGBoost, Hadoop

With the arrival of Hadoop 3, YARN offers more flexibility in resource management. It is now possible to perform Deep Learning analysis on GPUs with specific development environments, leveraging…

By Louis BIANCHERIN

Jul 24, 2018

Curing the Kafka blindness with the UI manager

Categories: Big Data | Tags: Ambari, Kafka, Ranger, Hortonworks, HDP, HDF, JMX, UI

Today it’s really difficult for developers, operators and managers to visualize and monitor what happens in a Kafka cluster. This article covers a new graphical interface to oversee Kafka. It was…

By Lucas BAKALIAN

Jun 20, 2018

A CoreOS development cluster with Vagrant and VirtualBox

Categories: Hack, Infrastructure | Tags: Arch Linux, Clustering, CoreOS, Linux, Vagrant, VirtualBox, etcd

Following CoreOS’s instructions on how to set up a development environment in VirtualBox did not work out well for me. Here are the steps I followed to get Container Linux up and running with Vagrant…

By Arthur BUSSER

Jun 20, 2018

Guide to Keybase encrypted directories

Categories: Cyber Security, Hack | Tags: Authorization, Cryptography, Encryption, File system, Keybase, PGP

This is a guide to using Keybase’s encrypted directories to store and share files. Keybase is a group, file and chat application whose goal is to bring public key crypto based on PGP to everyone in…

By Arthur BUSSER

Jun 18, 2018

Data Lake ingestion best practices

Categories: Big Data, Data Engineering | Tags: Avro, Hive, NiFi, ORC, Spark, Data Lake, File Format, Data Governance, HDF, Operation, Protocol Buffers, Registry, Schema

Creating a Data Lake requires rigor and experience. Here are some good practices around data ingestion both for batch and stream architectures that we recommend and implement with our customers…

By David WORMS

Jun 18, 2018

Apache Hadoop YARN 3.0 – State of the union

Categories: Big Data, DataWorks Summit 2018 | Tags: HDFS, MapReduce, YARN, Cloudera, Docker, GPU, Hortonworks, Release and features, Hadoop

This article covers the “Apache Hadoop YARN: state of the union” talk held by Wangda Tan from Hortonworks during the Dataworks Summit 2018. What is Apache YARN? As a reminder, YARN is one of the two…

By Lucas BAKALIAN

May 31, 2018

Accelerating query processing with materialized views in Apache Hive

Categories: Business Intelligence, DataWorks Summit 2018 | Tags: Calcite, Druid, Hive, OLAP, Release and features, SQL

The new materialized view feature is coming in Apache Hive 3.0. Jesus Camacho Rodriguez from Hortonworks held a talk “Accelerating query processing with materialized views in Apache Hive” about it…

By Paul-Adrien CORDONNIER

May 31, 2018

YARN and GPU Distribution for Machine Learning

Categories: Data Science, DataWorks Summit 2018 | Tags: YARN, GPU, Machine Learning, Storage, Neural Network

This article goes over the fundamental principles of Machine Learning and what tools are currently used to run machine learning algorithms. We will then see how a resource manager such as YARN can be…

By Grégor JOUET

May 30, 2018

TensorFlow on Spark 2.3: The Best of Both Worlds

Categories: Data Science, DataWorks Summit 2018 | Tags: Mesos, Spark, YARN, C++, CPU, GPU, JavaScript, Keras, Kubernetes, Machine Learning, Python, TensorFlow, Tuning

The integration of TensorFlow with Spark has a lot of potential and creates new opportunities. This article is based on a talk seen at the DataWorks Summit 2018 in Berlin. It was about the new…

By Yliess HATI

May 29, 2018

Apache Metron in the Real World

Categories: Cyber Security, DataWorks Summit 2018 | Tags: Algorithms, HDFS, Kafka, NiFi, Solr, Spark, Storm, Elasticsearch, pcap, RDBMS, Metron, SQL

Apache Metron is a storage and analytic platform specialized in cyber security. This talk was about demonstrating the usages and capabilities of Apache Metron in the real world. The presentation was…

By Michael HATOUM

May 29, 2018

Running Enterprise Workloads in the Cloud with Cloudbreak

Categories: Big Data, Cloud Computing, DataWorks Summit 2018 | Tags: AWS, Cloudbreak, GCP, HDP, Azure, OpenStack, Operation, Hadoop

This article is based on Peter Darvasi and Richard Doktorics’ talk Running Enterprise Workloads in the Cloud at the DataWorks Summit 2018 in Berlin. It presents Hortonworks’ automated deployment tool…

By Joris RUMMENS

May 28, 2018

Apache Beam: a unified programming model for data processing pipelines

Categories: Data Engineering, DataWorks Summit 2018 | Tags: Apex, Beam, Flink, Spark, Batch processing, Java, Pipeline, Python, Streaming

In this article, we will review the concepts, the history and the future of Apache Beam, which may well become the new standard for defining data processing pipelines. At Dataworks Summit 2018 in…

By Gauthier LEONARD

May 24, 2018

Omid: Scalable and highly available transaction processing for Apache Phoenix

Categories: Big Data, DataWorks Summit 2018 | Tags: ACID, HBase, Omid, Phoenix, Transaction, SQL

Apache Omid provides a transactional layer on top of key/value NoSQL databases. In practice, it is usually used on top of Apache HBase. Credits to Ohad Shacham for his talk and his work for Apache…

By Xavier HERMAND

May 24, 2018

Present and future of Hadoop workflow scheduling: Oozie 5.x

Categories: Big Data, DataWorks Summit 2018 | Tags: Hive, Oozie, Sqoop, CDH, HDP, REST, Hadoop

During the DataWorks Summit Europe 2018 in Berlin, I had the opportunity to attend a breakout session on Apache Oozie. It covers the new features released in Oozie 5.0, including future features of…

By Schoukroun LEO

May 23, 2018

What's new in Apache Spark 2.3?

Categories: Data Engineering, DataWorks Summit 2018 | Tags: Arrow, ORC, Spark, Spark MLlib, PySpark, Docker, Kubernetes, Streaming, Tuning, pandas

Let’s dive into the new features offered by the 2.3 distribution of Apache Spark. This article is a composition of the following talks seen at the DataWorks Summit 2018 and additional research: Apache…

By César BEREZOWSKI

May 23, 2018

Essential questions about Time Series

Categories: Big Data | Tags: Druid, HBase, Hive, ORC, Elasticsearch, Grafana, IOT

Today, the bulk of Big Data is temporal. We see it in the media and among our customers: smart meters, banking transactions, smart factories, connected vehicles … IoT and Big Data go hand in hand. We…

By David WORMS

Mar 19, 2018

Execute Python in an Oozie workflow

Categories: Data Engineering | Tags: Oozie, Elasticsearch, Python, REST

Oozie workflows allow you to use multiple actions to execute code. However, doing so with Python can be a bit tricky. Let’s see how to do that. I’ve recently designed a workflow that would interact…

By César BEREZOWSKI

Mar 6, 2018

Publishing guidelines

Categories: DevOps & SRE | Tags: Arch Linux, KVM, Markdown, Vagrant, VM

This is as much a set of guidelines targeting everyone publishing content on the web as rules for reviewers to ensure no validation is forgotten before submitting for publication. It mostly targets…

By David WORMS

Feb 26, 2018

Ambari - How to blueprint

Categories: Big Data, DevOps & SRE | Tags: Ambari, Ranger, Automation, CDH, DevOps, HDP, Operation, REST

As infrastructure engineers at Adaltas, we deploy Hadoop clusters. A lot of them. Let’s see how to automate this process with REST requests. While really handy for deploying one or two clusters, the…

By Joris RUMMENS

Jan 17, 2018

Notes after Katacoda Training on Kubernetes Container Orchestration

Categories: Containers Orchestration, Learning | Tags: Helm, Ingress, Kubeadm, Kubernetes, CNI, Micro Services, Minikube, SSL/TLS, YAML

A few weeks ago, I dedicated two days to following the tutorials available on Katacoda, the interactive learning platform for Kubernetes or any other container orchestration platform. I’m sharing my…

By David WORMS

Dec 14, 2017

Scaling massive, real-time data pipelines with Go

Categories: Open Source Summit Europe 2017, Learning | Tags: Algorithms, Data structures, Go, Network, Pipeline, Protocols

Last week at the Open Source Summit in Prague, Jean de Klerk held a talk called Scaling massive, real-time data pipelines with Go. This article goes over the main points of the talk, detailing the…

By Arthur BUSSER

Nov 21, 2017

Mesos Introduction

Categories: Containers Orchestration, Open Source Summit Europe 2017 | Tags: Mesos, Container, Container Orchestration, CUDA, Docker, GPU

Apache Mesos is an open source cluster management project designed to implement and optimize distributed systems. Mesos enables the management and sharing of resources in a fine and dynamic way…

By Louis BIANCHERIN

Nov 15, 2017

Micro Services

Categories: Cloud Computing, Containers Orchestration, Open Source Summit Europe 2017 | Tags: Mesos, CNCF, DNS, Encryption, GRPC, Istio, Kubernetes, Linkerd, Micro Services, MITM, Proxy, Service Mesh, SSL/TLS, SPOF

Back in the day, applications were monolithic and we could use an IP address to access a service. With virtual machines (VM), multiple hosts started to appear on the same machine with multiple apps…

By David WORMS

Nov 14, 2017

Lightweight containerization with Tupperware

Categories: Containers Orchestration, Open Source Summit Europe 2017, Infrastructure | Tags: Zookeeper, Btrfs, Cloud, LXD, Red Hat, Systemd

In this article, I will present lightweight containerization set up by Facebook called Tupperware. What is Tupperware Tupperware is a homemade framework written and used internally at Facebook…

By Lucas BAKALIAN

Nov 3, 2017

Kubernetes Storage Primitives for Stateful Workloads

Categories: Cloud Computing, Containers Orchestration, Open Source Summit Europe 2017 | Tags: Docker, Kubernetes, Container Storage Interface (CSI), PVC, Azure, Storage, GCE

This article is based on the presentation “Introduction to Kubernetes Storage Primitives for Stateful Workloads” from the OSS Convention Prague 2017 by the {Code} team. So, let’s start, what is…

By Pierre SAUVAGE

Oct 28, 2017

Apache Thrift vs REST

Categories: DevOps & SRE, Open Source Summit Europe 2017 | Tags: Thrift, GRPC, HTTP, JSON, REST

Adaltas recently attended the Open Source Summit Europe 2017 in Prague. I had the opportunity to follow a presentation made by Randy Abernethy and Jens Geyer of RM-X, a cloud native consulting company…

By Schoukroun LEO

Oct 28, 2017

Nobody* puts Java in a Container

Categories: Containers Orchestration, Open Source Summit Europe 2017, Infrastructure | Tags: cgroups, Docker, Java, JRE, JVM, Namespaces

This talk was about the issues of putting Java in a container and how, in its latest version, the JDK is now more aware of the container it is running in. The presentation is led by @joerg_schad…

By Paul-Adrien CORDONNIER

Oct 28, 2017

From Dockerfile to Ansible Containers

Categories: Containers Orchestration, DevOps & SRE, Open Source Summit Europe 2017 | Tags: Ansible, Docker, Docker Compose, pip, Shell, YAML

This talk was an introduction to the Dockerfile format and to Ansible container’s tool and then a comparison of both. It was held by Tomas Tomecek from Red Hat’s containerization team. The Dockerfile…

By César BEREZOWSKI

Oct 25, 2017

Multi-Repo, Multi-Node Gating at Massive Scale

Categories: Cloud Computing, DevOps & SRE, Open Source Summit Europe 2017 | Tags: Ansible, CI/CD, Infrastructure, Jenkins, OpenStack, Red Hat, Zuul

This is a recap and personal review of Monty Taylor’s presentation of OpenStack’s Continuous Integration tool Zuul at the OpenSource Summit 2017 in Prague (not to be confused with Netflix’s Zuul project…

By Joris RUMMENS

Oct 24, 2017

Kubernetes 1.8

Categories: Containers Orchestration, Open Source Summit Europe 2017 | Tags: containerd, Kubernetes, CRD, Network, OCI, RBAC, Release and features

The 1.8 release of Kubernetes brings a lot of new things. With 2500+ pull requests, 2000+ commits, 400+ committers, Kubernetes added 39 new features in this version. This is the richest release in terms…

By Younes YASSINE

Oct 24, 2017

Yahoo's Vespa Engine

Categories: Tech Radar | Tags: Database, Elasticsearch, Search Engine, Tools

Vespa is Yahoo’s fully autonomous and self-sufficient big data processing and serving engine. It aims at serving results of queries on huge amounts of data in real time. An example of this would be…

By Arthur BUSSER

Oct 16, 2017

Cloudera Sessions Paris 2017

Categories: Big Data, Events | Tags: Altus, EC2, Cloudera, CDH, CDSW, SDX, Azure, PaaS

Adaltas was at the Cloudera Sessions on October 5, where Cloudera showcased their new products and offerings. Below you’ll find a summary of what we witnessed. Note: the information was aggregated in…

By César BEREZOWSKI

Oct 16, 2017

MariaDB integration with Hadoop

Categories: Infrastructure | Tags: Hive, Database, HA, MariaDB, Hadoop

During a workshop with one of our customers, Adaltas identified a potential risk in using MariaDB’s High Availability (HA) strategy. Since the customer selected Cloudera’s CDH 5 distribution, the…

By David WORMS

Jul 31, 2017

Oracle DB synchronization to Hadoop with CDC

Categories: Data Engineering | Tags: Hive, Sqoop, CDC, Data Warehouse, GoldenGate, Oracle

This note is the result of a discussion about the synchronization of data written in a database to a warehouse stored in Hadoop. Thanks to Claude Daub from GFI who wrote it and who authorizes us to…

By David WORMS

Jul 31, 2017

Managing authorizations with Apache Sentry

Categories: Data Governance | Tags: Ansible, CDH, Hue, Database, Deployment, LDAP, Nikita, Sentry

Apache Sentry is a system for enforcing fine-grained, role-based authorization to data and metadata stored on a Hadoop cluster. With this article, we will show you how we are using Apache Sentry at…

By Axel JACQIN

Jul 24, 2017

Exposing Kafka on two different networks

Categories: Infrastructure | Tags: Kafka, Cloudera, CDH, Cyber Security, Network, VLAN

A Big Data setup usually requires you to have multiple network interfaces. Let’s see how to set up Kafka on more than one of them. Kafka is an open-source stream processing software platform system…

By César BEREZOWSKI

Jul 22, 2017

Change Ambari's topbar color

Categories: Big Data, Hack | Tags: Ambari, Front-end

We recently had a client that has multiple environments (Production, Integration, Testing, …) running on HDP and managed using one Ambari instance per cluster. One of the questions that came up was…

By César BEREZOWSKI

Jul 9, 2017

MiNiFi: Data at Scales & the Values of Starting Small

Categories: Big Data, DevOps & SRE, Infrastructure | Tags: MiNiFi, NiFi, Cloudera, C++, HDP, HDF, IOT

This conference briefly presented Apache NiFi and explained where MiNiFi came from: basically it’s a minimal NiFi agent to deploy on small devices to bring data to a cluster’s NiFi pipeline (e.g. IoT…

By César BEREZOWSKI

Jul 8, 2017

Advanced multi-tenant Hadoop and Zookeeper protection

Categories: Big Data, Infrastructure | Tags: Zookeeper, Clustering, DoS, iptables, Operation, Scalability

Zookeeper is a critical component to Hadoop’s high availability operation. It protects itself by limiting the maximum number of connections (maxConns = 400). However, Zookeeper does not protect…

By Pierre SAUVAGE

Jul 5, 2017

HDP cluster monitoring

Categories: Big Data, DevOps & SRE, Infrastructure | Tags: Alert, Ambari, HDP, Metrics, Monitoring, REST

With the current growth of Big Data technologies, more and more companies are building their own clusters in the hope of getting some value from their data. One main concern while building these infrastructures…

By Joris RUMMENS

Jul 5, 2017

Hive Metastore HA with DBTokenStore: Failed to initialize master key

Categories: Big Data, DevOps & SRE | Tags: Hive, Bug, Infrastructure

This article describes my little adventure around a startup error with the Hive Metastore. It should be reproducible with any secure installation, meaning with Kerberos, with high availability enabled…

By David WORMS

Jul 21, 2016

EclairJS - Putting a Spark in Web Apps

Categories: Data Engineering, Front End | Tags: Spark, JavaScript, Jupyter

Presentation by David Fallside from IBM, images extracted from the presentation. Introduction: Web Apps development has moved from Java to Node.js and JavaScript. It provides a simple and rich…

By David WORMS

Jul 17, 2016

Apache Apex with Apache SAMOA

Categories: Data Science, Events, Tech Radar | Tags: Apex, Flink, Samoa, Storm, Machine Learning, Tools, Hadoop

Traditional Machine Learning: batch oriented; supervised (most common); training and scoring; one-time model building. Data set: training (model building), holdout (parameter tuning), test (accuracy). Online…

By Pierre SAUVAGE

Jul 17, 2016

Apache Apex: next gen Big Data analytics

Categories: Data Science, Events, Tech Radar | Tags: Apex, Flink, Kafka, Storm, Data Science, Machine Learning, Tools, Hadoop

Below is a compilation of my notes taken during the presentation of Apache Apex by Thomas Weise from DataTorrent, the company behind Apex. Introduction Apache Apex is an in-memory distributed parallel…

By César BEREZOWSKI

Jul 17, 2016

Get in control of your workflows with Apache Airflow

Categories: Big Data, Tech Radar | Tags: Airflow, Cloud, DevOps, Python

Below is a compilation of my notes taken during the presentation of Apache Airflow by Christian Trebing from BlueYonder. Introduction Use case: how to handle data coming in regularly from customers…

By César BEREZOWSKI

Jul 17, 2016

Hive, Calcite and Druid

Categories: Big Data | Tags: Analytics, Druid, Hive, Database, Hadoop

BI/OLAP requires interactive visualization of complex data streams: real-time bidding events, user activity streams, voice call logs, network traffic flows, firewall events, application KPIs. Traditional…

By David WORMS

Jul 14, 2016

Network Namespace without Docker

Categories: Hack | Tags: DNS, Docker, Linux, Namespaces, Network, VLAN

Let’s imagine the following use case: I am connected to several networks (wlan0, eth0, usb0). I want to choose which network I’m gonna use when I launch apps. My app doesn’t allow me to choose a…

By Pierre SAUVAGE

Jul 6, 2016

Red Hat Storage Gluster and its integration with Hadoop

Categories: Big Data | Tags: HDFS, GlusterFS, Red Hat, Storage, Hadoop

I had the opportunity to be introduced to Red Hat Storage and Gluster in a joint presentation by Red Hat France and the company StartX. I have compiled my notes here, at least partially. I will…

By David WORMS

Jul 3, 2015

A simple connect middleware to transpile CoffeeScript files

Categories: Hack, Node.js | Tags: CoffeeScript, Node.js, Tools

This new module called connect-coffee-script is a Connect middleware used to serve JavaScript files written in CoffeeScript. This middleware is to be used by connect or any Connect compatible…

By David WORMS

Jul 4, 2014

Tutorial for creating and publishing a new Node.js module

Categories: Front End | Tags: CoffeeScript, GitHub, JavaScript, Learning and tutorial, License, Mocha, Node.js, NPM, Travis CI, Unit tests

In this tutorial, I provide complete instructions for creating a new Node.js module, writing the code in coffee-script, publishing it on GitHub, sharing it with other Node.js fellows through NPM…

By David WORMS

Dec 3, 2013

Crawl your website including login form with PhantomJS

Categories: Front End | Tags: CoffeeScript, JavaScript, Mocha, Node.js, Unit tests

With PhantomJS, we start a headless WebKit and pilot it with our own scripts. Said differently, we write a script in JavaScript or CoffeeScript which controls an Internet browser and manipulates the…

By David WORMS

Nov 27, 2013

Catch 'uncaughtException' error in your mocha test

Categories: Node.js | Tags: DevOps, JavaScript, Mocha, Unit tests

This isn’t the first time I faced this situation. Today, I finally found the time and energy to look for a solution. In your mocha test, let’s say you need to test an expected “uncaughtException…

By David WORMS

Oct 27, 2013

Remote connection with SSH

Categories: Cyber Security | Tags: Automation, HTTP, SSH

While teaching Big Data and Hadoop, a student asked me about SSH and how to use it. I’ll discuss the protocol and the tools to benefit from it. Lately, I have been automating the deployment of Hadoop clusters…

By David WORMS

Oct 2, 2013

Components for CDH and HDP

Categories: Big Data | Tags: Flume, Hive, Oozie, Sqoop, Zookeeper, Cloudera, CDH, Hortonworks, HDP, Hadoop

I was interested in comparing the different components distributed by Cloudera and Hortonworks. This also gives us an idea of the versions packaged by the two distributions. At the time of this writing…

By David WORMS

Sep 22, 2013

Splitting HDFS files into multiple hive tables

Categories: Data Engineering | Tags: Flume, HDFS, Hive, Oozie, Pig, SQL

I am going to show how to split a CSV file stored inside HDFS as multiple Hive tables based on the content of each record. The context is simple. We are using Flume to collect logs from all over our…

By David WORMS

Sep 15, 2013

About the new BSD license and its difference with other BSD licenses

Categories: Data Governance | Tags: License, Open source

As a non-restrictive Open Source license, the “new BSD license” is a commonly used license across the Node.js community. However, this is only one of the BSD licenses available alongside the original “BSD…

By David WORMS

Aug 8, 2013

Kerberos and delegation tokens security with WebHDFS

Categories: Cyber Security | Tags: HDFS, Big Data, HTTP, Kerberos

WebHDFS is an HTTP REST server bundled with the latest version of Hadoop. What interests me in this article is to dig into security with the Kerberos and delegation tokens functionalities. I will cover…

By David WORMS

Jul 25, 2013

Testing the Oracle SQL Connector for Hadoop HDFS

Categories: Data Engineering | Tags: HDFS, CDH, Database, File system, Oracle, SQL

Using Oracle SQL Connector for HDFS, you can use Oracle Database to access and analyze data residing in HDFS files or a Hive table. You can also query and join data in HDFS or a Hive table with other…

By David WORMS

Jul 15, 2013

Maven 3 behind a proxy

Categories: Hack | Tags: Maven, Java, Proxy

Maven 3 isn’t so different from its previous version 2. You will migrate most of your projects quite easily between the two versions. That wasn’t the case a few years ago between versions 1 and…

By David WORMS

Jul 11, 2013

Node CSV version 0.2.7

Categories: Hack | Tags: CoffeeScript, CSV, Node.js, Pipeline

While releasing version 0.2.7 of the CSV parser for Node.js, I stop here to drop a few lines about what has made it into this release. We are now using the latest CoffeeScript, which is version 1.4.…

By David WORMS

Jul 9, 2013

State of the Hadoop open-source ecosystem in early 2013

Categories: Big Data | Tags: Flume, Kafka, Mahout, Mesos, Phoenix, Pig, File Format, Hadoop

Hadoop is already a large ecosystem and my guess is that 2013 will be the year where it grows even larger. There are some pieces that we no longer need to present. ZooKeeper, HBase, Hive, Pig, Flume…

By David WORMS

Jul 8, 2013

Oracle and Hive, how data are published?

Categories: Big Data | Tags: Hive, Sqoop, Data Lake, Oracle

In the past few days, I’ve published 3 related articles: a first one covering the option to integrate Oracle and Hadoop, a second one explaining how to install and use the Oracle SQL Connector with…

By David WORMS

Jul 6, 2013

Oracle to Apache Hive with the Oracle SQL Connector

Categories: Business Intelligence | Tags: HDFS, Hive, Network, Oracle

In a previous article published last week, I introduced the choices available to connect Oracle and Hadoop. In a follow up article, I covered the Oracle SQL Connector, its installation and integration…

By David WORMS

May 27, 2013

Options to connect and integrate Hadoop with Oracle

Categories: Data Engineering | Tags: Avro, HDFS, Hive, MapReduce, Sqoop, Database, Java, NoSQL, Oracle, R, RDBMS, SQL

I will list the different tools and libraries available to us developers in order to integrate Oracle and Hadoop. The Oracle SQL Connector for HDFS described below is covered in a follow up article…

By David WORMS

May 15, 2013

The state of Hadoop distributions

Categories: Big Data | Tags: Cloudera, Hortonworks, Intel, Oracle, Hadoop

Apache Hadoop is of course made available for download on its official webpage. However, downloading and installing the several components that make a Hadoop cluster is not an easy task and is a…

By David WORMS

May 11, 2013

Apache Hive Essentials How-to by Darren Lee

Categories: Business Intelligence, Learning | Tags: Hive, File Format, UDF, Hadoop, SQL

Recently, I’ve been asked to review a new book on Apache Hive called “Apache Hive Essentials How-to” written by Darren Lee and published by Packt Publishing. In short, I sincerely recommend it. I…

By David WORMS

Apr 23, 2013

Hadoop development cluster of virtual machines with static IP using VirtualBox

Categories: Infrastructure | Tags: Ambari, Cloudera, Hortonworks, Network, Red Hat, VirtualBox, VM, VMware

A few days ago, I explained how to set up a cluster of virtual machines with static IPs and Internet access suitable to host your Hadoop cluster locally for development. At the time I made use of VMware…

By David WORMS

Mar 14, 2013

Definitions of machine learning algorithms present in Apache Mahout

Categories: Data Science | Tags: Algorithms, Mahout, Classification, Clustering, Machine Learning, Hadoop

Apache Mahout is a machine learning library built for scalability. Its core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop…

By David WORMS

Mar 8, 2013

Virtual machines with static IP for your Hadoop development cluster

Categories: Infrastructure | Tags: Ambari, Cloudera, Hortonworks, Network, Red Hat, VirtualBox, VM, VMware

While I am about to install and test Ambari, this article is the occasion to illustrate how I set up my development environment with multiple virtual machines. Ambari, the deployment and monitoring…

By David WORMS

Feb 27, 2013

Merging multiple files in Hadoop

Categories: Hack | Tags: HDFS, File system, Hadoop

This is a command I used to concatenate the files stored in Hadoop HDFS matching a globbing expression into a single file. It uses the “getmerge” utility but, contrary to “getmerge”, the final…

By David WORMS

Jan 12, 2013

E-commerce electronic cigarettes: first impressions with Prestashop

Categories: Tech Radar | Tags: HTML, Java, Node.js

Last year, I had to select and integrate an e-commerce software for the website CigarHit selling electronic cigarettes. Considering that the last e-commerce integration I made dated from 2005, I took…

By David WORMS

Jul 25, 2012

Node CSV version 0.2.1

Categories: Node.js | Tags: CoffeeScript, CSV, Release and features, Streaming

After the announcement of version 0.2.0 of the Node.js CSV parser at the beginning of October, we are releasing today a new version 0.2.1. This is mostly a bug fix release with enhanced…

By David WORMS

Jul 24, 2012

Node CSV version 0.1 and future developments

Categories: Node.js | Tags: CoffeeScript, CSV, Markdown, Release and features, Streaming

The Node CSV parser has just reached version 0.1, which closes the 0.0.x releases. Started almost 2 years ago, the project has received a tremendous amount of participation in the form of bug reports…

By David WORMS

Jul 21, 2012

Convert .flac music files to .mp3 on osx

Categories: Hack | Tags: File Format, OS X

As an OS X user for years now, one should know by now that iTunes doesn’t support the FLAC format. We are now in 2012, and I’ve been waiting for this to happen for years. Losing patience, dark…

By David WORMS

Jul 20, 2012

Hadoop and R with RHadoop

Categories: Business Intelligence, Data Science | Tags: HBase, HDFS, MapReduce, Thrift, Data Analytics, Learning and tutorial, R, Hadoop

RHadoop is a bridge between R, a language and environment to statistically explore data sets, and Hadoop, a framework that allows for the distributed processing of large data sets across clusters of…

By David WORMS

Jul 19, 2012

Asynchronous array iteration in Node.js with Each

Categories: Node.js | Tags: Asynchronous, CoffeeScript, JavaScript, Release and features

Control flow in Node.js is the sort of library that almost every developer has created and published on their own. These libraries usually aim at reducing spaghetti code made of deep callbacks. I…

By David WORMS

Jul 18, 2012

Installing and using MADlib with PostgreSQL on OSX

Categories: Data Science | Tags: Database, Greenplum, PostgreSQL, Statistics, SQL

We cover basic installation and usage of PostgreSQL and MADlib on OSX and Ubuntu. Instructions for other environments should be similar. PostgreSQL is an Open Source database with enterprise…

By David WORMS

Jul 7, 2012

Node CSV version 0.2 with streaming API

Categories: Node.js | Tags: CSV, Data Engineering, Markdown, Node.js, Streaming

The Node CSV parser in its version 0.2 has just been released. This version is a major enhancement as it aligns the parser with Node.js best practices with respect to streams. The CSV parser behaves…

By David WORMS

Jul 2, 2012

HDFS and Hive storage - comparing file formats and compression methods

Categories: Big Data | Tags: Analytics, HBase, HDFS, Hive, ORC, Parquet, File Format

A few days ago, we conducted a test in order to compare various Hive file formats and compression methods. Among those file formats, some are native to HDFS and apply to all Hadoop users. The…

By David WORMS

Mar 13, 2012

Two Hive UDAF to convert an aggregation to a map

Categories: Data Engineering | Tags: Analytics, HDFS, Hive, ORC, Parquet

I am publishing two new Hive UDAF to help with maps in Apache Hive. The source code is available on GitHub in two Java classes: “UDAFToMap” and “UDAFToOrderedMap” or you can download the jar file. The…

By David WORMS

Mar 6, 2012

Java versus JS fun, a quote from the Node.js mailing list

Categories: Node.js | Tags: Java, JavaScript, Node.js

I just read that one on the mailing list. I found it relevant enough to share it with those who did not subscribe to it: First Lothar Pfeiler: I still wonder, if it’s cool to have such a big…

By David WORMS

Feb 23, 2012

A fresh look at testing Node.js projects: Mocha, Should and Travis

Categories: DevOps & SRE, Node.js | Tags: CI/CD, DevOps, JavaScript, Mocha, Node.js, Unit tests

Today, I finally decided to spend some time around Travis. It’s been a few weeks since that little green image on top of many GitHub homepages has been buzzing me. Well, to be totally honest, this isn…

By David WORMS

Feb 19, 2012

Coffee script, how do I debug that damn js line?

Categories: Hack, Node.js | Tags: CoffeeScript, Debug, JavaScript, Node.js

Update April 12th, 2012: Pull request adding error reporting to CoffeeScript with line mapping Chances are that, if you code in CoffeeScript, you often find yourself facing a JavaScript exception…

By David WORMS

Feb 15, 2012

Announcing Mecano, a set of functions for system deployment

Categories: DevOps & SRE, Node.js | Tags: Automation, CoffeeScript, DevOps, Infrastructure, JavaScript, Node.js, Open source

Update July 2016: Mecano is now renamed Nikita. We are releasing Node Mecano on GitHub, which gathers common functions used while deploying systems. The idea was to group those functions into a…

By David WORMS

Feb 12, 2012

OS module on steroids with the SIGAR Node binding

Categories: Node.js | Tags: C++, CPU, File system, Metrics, Monitoring, Network

Today we are announcing the first release of the Node binding to the SIGAR library. Visit the project website or the source code repository on GitHub. SIGAR is a cross platform interface for gathering…

By David WORMS

Jan 11, 2012

Timeseries storage in Hadoop and Hive

Categories: Data Engineering | Tags: HDFS, Hive, CRM, File Format, timeseries, Tuning, Hadoop

In the next few weeks, we will be exploring the storage and analytics of a large generated dataset. This dataset is composed of CRM tables associated with one time series table of about 7,000 billiard rows…

By David WORMS

Jan 10, 2012

How Node CSV parser may save your weekend

Categories: Hack | Tags: Bash, CSV, Hack, Node.js

Last Friday, an hour before the doors of my customer closed for the weekend, a co-worker came to me. He had just finished exporting 9 CSV files from an Oracle database which he wanted to import into…

By David WORMS

Dec 13, 2011

Node.js is now integrated to the Microsoft Azure platform

Categories: Cloud Computing, Tech Radar | Tags: Cloud, Linux, Azure, Node.js

Node is now a first-class citizen in the Microsoft Azure cloud environment alongside .NET, Java and PHP. This integration is the logical consequence of Microsoft’s involvement in the development of…

By David WORMS

Dec 11, 2011

Hadoop and HBase installation on OSX in pseudo-distributed mode

Categories: Big Data, Learning | Tags: HBase, Big Data, Hue, Deployment, Infrastructure, Hadoop

The operating system chosen is OSX but the procedure is not so different for any Unix environment because most of the software is downloaded from the Internet, uncompressed and set manually. Only a…

By David WORMS

Dec 1, 2010

Storage and massive processing with Hadoop

Categories: Big Data | Tags: HDFS, Nutch, Cloudera, Google, Hadoop

Apache Hadoop is a system for building shared storage and processing infrastructures for large volumes of data (multiple terabytes or petabytes). Hadoop clusters are used by a wide range of projects…

By David WORMS

Nov 26, 2010

Node HBase, a NodeJs client for Apache HBase

Categories: Big Data, Node.js | Tags: HBase, Big Data, Node.js, REST

HBase is a “column family” database from the Hadoop ecosystem built on the model of Google BigTable. HBase can accommodate very large volumes of data (tera or peta) while maintaining high…

By David WORMS

Nov 1, 2010

MapReduce introduction

Categories: Big Data | Tags: MapReduce, Big Data, Java, JavaScript

Information systems have more and more data to store and process. Companies like Google, Facebook, Twitter and many others store astronomical amounts of information from their customers and must be…

By David WORMS

Jun 26, 2010

Node.js, JavaScript on the server side

Categories: Front End, Node.js | Tags: HTTP, JavaScript, Node.js, Server

While waiting for the Next Big Language (NBL), it has now been 3 years or more since I began predicting to my customers a bright future for JavaScript as a programming language for server…

By David WORMS

Jun 12, 2010

We are a team of Open Source enthusiasts doing consulting in Big Data, Cloud, DevOps, Data Engineering, Data Science…

We provide our customers with accurate insights on how to leverage technologies to convert their use cases to projects in production, how to reduce their costs and shorten their time to market.

If you enjoy reading our publications and have an interest in what we do, contact us and we will be thrilled to cooperate with you.