Jumbo - Presentation and future of the Hadoop cluster bootstrapper
Introduction
As experienced Data Engineers, you have probably deployed dozens of Hadoop clusters on your computer or in the cloud, and you know how time-consuming it is to manually adapt the scripts you use for provisioning. Jumbo was made to bootstrap those scripts in minutes based on your needs.
- Speaker: Gauthier Leonard
- Duration: 1h15
- Format: talk
Presentation
Jumbo is an Open Source project hosted on GitHub, developed at Adaltas by two interns gaining experience with the Hadoop ecosystem. It is a CLI tool written in Python. It offers an abstraction layer that lets any user, experienced with Big Data technologies or not, describe the cluster to be provisioned. It then generates scripts and leverages trusted DevOps tools to provision the cluster.
In its latest version, Jumbo is able to create and provision virtual clusters with the HDP (Hortonworks Data Platform) stack and to Kerberise them, using Vagrant (with VirtualBox or KVM), Ansible and Ambari. Future versions will allow deploying other Hadoop stacks (e.g. CDH - Cloudera's Distribution including Apache Hadoop) and other Big Data technologies (e.g. Elasticsearch).
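To give an idea of what such an abstraction layer looks like, here is a minimal sketch, not Jumbo's actual code: a declarative cluster description (all host names, IPs and groups below are hypothetical) is turned into an INI-style Ansible inventory, the kind of artifact a provisioning tool can generate from it.

```python
# Illustrative sketch only (not Jumbo's real data model or API):
# a cluster described as plain data, from which an Ansible-style
# inventory is generated. Hosts, IPs and groups are made up.

cluster = {
    "name": "demo",
    "nodes": [
        {"host": "master01", "ip": "10.10.10.11", "groups": ["ambari", "namenode"]},
        {"host": "worker01", "ip": "10.10.10.21", "groups": ["datanode"]},
        {"host": "worker02", "ip": "10.10.10.22", "groups": ["datanode"]},
    ],
}

def to_inventory(cluster):
    """Generate an INI-style Ansible inventory from the cluster description."""
    groups = {}
    for node in cluster["nodes"]:
        for group in node["groups"]:
            groups.setdefault(group, []).append(
                f'{node["host"]} ansible_host={node["ip"]}'
            )
    sections = []
    for group, hosts in sorted(groups.items()):
        sections.append(f"[{group}]\n" + "\n".join(hosts))
    return "\n\n".join(sections)

print(to_inventory(cluster))
```

The point of the sketch is the separation of concerns: the user only writes the declarative description at the top; everything below it is generated.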
In the talk, we will go through the concepts that Jumbo uses to generate deployment scripts and how it leverages DevOps tools under the hood. We will also take a look at what's to come for Jumbo and how you can get involved. The talk will be followed by a demo/tutorial of Jumbo.
I invite you to bring your laptop so that you can see the magic in action. To follow the demo, you will need Vagrant, VirtualBox or KVM, and Python 3 installed on your computer!
Author
I am Gauthier Leonard, a Data Engineer working at Adaltas since September 2018. I was an intern in the very same company, where I developed Jumbo with my colleague Xavier Hermand.
I am currently on a mission for Stago, a leader in blood analysis equipment manufacturing, as the Big Data referent on a Data Lake project that is just starting. The project involves two Hortonworks Big Data stacks: HDP (Data Platform) and HDF (DataFlow).
I like designing coherent and optimized Big Data architectures, although I still have a lot to learn in that field. I am also a grammar Nazi when it comes to coding.