Apache Hadoop
Hadoop is a massively scalable platform commonly used to process big data workloads. At its core, it is composed of a distributed file system (HDFS) and a resource manager (YARN).
Hadoop provides a high level of durability and availability while still being able to process computational analytical workloads in parallel. The combination of availability, durability, and scalability of processing makes Hadoop a natural fit for Big Data workloads.
- Learn more
- Official website
Related articles

Storage and massive processing with Hadoop
Categories: Big Data | Tags: Hadoop, HDFS, Storage
Apache Hadoop is a system for building shared storage and processing infrastructures for large volumes of data (multiple terabytes or petabytes). Hadoop clusters are used by a wide range of projects…
By David WORMS
Nov 26, 2010

Hadoop and HBase installation on OSX in pseudo-distributed mode
Categories: Big Data, Learning | Tags: Hue, Infrastructure, Hadoop, HBase, Big Data, Deployment
The operating system chosen is OSX but the procedure is not so different for any Unix environment because most of the software is downloaded from the Internet, uncompressed and set manually. Only a…
By David WORMS
Dec 1, 2010

Timeseries storage in Hadoop and Hive
Categories: Data Engineering | Tags: CRM, timeseries, Tuning, Hadoop, HDFS, Hive, File Format
In the next few weeks, we will be exploring the storage and analytics of a large generated dataset. This dataset is composed of CRM tables associated with one time series table of about 7,000 billiard rows…
By David WORMS
Jan 10, 2012

Hadoop and R with RHadoop
Categories: Business Intelligence, Data Science | Tags: Thrift, Learning and tutorial, R, Hadoop, HBase, HDFS, MapReduce, Data Analytics
RHadoop is a bridge between R, a language and environment to statistically explore data sets, and Hadoop, a framework that allows for the distributed processing of large data sets across clusters of…
By David WORMS
Jul 19, 2012

Merging multiple files in Hadoop
Categories: Hack | Tags: File system, Hadoop, HDFS
This is a command I used to concatenate the files stored in Hadoop HDFS matching a globbing expression into a single file. It uses the “getmerge” utility but, contrary to “getmerge”, the final…
By David WORMS
Jan 12, 2013

Definitions of machine learning algorithms present in Apache Mahout
Categories: Data Science | Tags: Algorithm, Classification, Hadoop, Mahout, Clustering, Machine Learning
Apache Mahout is a machine learning library built for scalability. Its core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop…
By David WORMS
Mar 8, 2013

Apache Hive Essentials How-to by Darren Lee
Categories: Business Intelligence, Learning | Tags: UDF, Hadoop, Hive, File Format, SQL
Recently, I’ve been asked to review a new book on Apache Hive called “Apache Hive Essentials How-to” (edit: the second edition is now available) written by Darren Lee and published by Packt Publishing…
By David WORMS
Apr 23, 2013

The state of Hadoop distributions
Categories: Big Data | Tags: Hortonworks, Intel, Oracle, Hadoop, Cloudera
Apache Hadoop is of course made available for download on its official webpage. However, downloading and installing the several components that make up a Hadoop cluster is not an easy task and is a…
By David WORMS
May 11, 2013

State of the Hadoop open-source ecosystem in early 2013
Categories: Big Data | Tags: Flume, Mesos, Phoenix, Pig, Hadoop, Kafka, Mahout, Data Science
Hadoop is already a large ecosystem and my guess is that 2013 will be the year when it grows even larger. There are some pieces that we no longer need to present: ZooKeeper, HBase, Hive, Pig, Flume…
By David WORMS
Jul 8, 2013

Components for CDH and HDP
Categories: Big Data | Tags: Flume, Hortonworks, Hadoop, Hive, Oozie, Sqoop, Zookeeper, Cloudera, CDH, HDP
I was interested in comparing the different components distributed by Cloudera and Hortonworks. This also gives us an idea of the versions packaged by the two distributions. At the time of this writing…
By David WORMS
Sep 22, 2013

Red Hat Storage Gluster and its integration with Hadoop
Categories: Big Data | Tags: GlusterFS, Red Hat, Hadoop, HDFS, Storage
I had the opportunity to be introduced to Red Hat Storage and Gluster in a joint presentation by Red Hat France and the company StartX. I have compiled my notes here, at least partially. I will…
By David WORMS
Jul 3, 2015

Apache Apex: next gen Big Data analytics
Categories: Data Science, Events, Tech Radar | Tags: Apex, Storm, Tools, Flink, Hadoop, Kafka, Data Science, Machine Learning
Below is a compilation of my notes taken during the presentation of Apache Apex by Thomas Weise from DataTorrent, the company behind Apex. Introduction: Apache Apex is an in-memory distributed parallel…
Jul 17, 2016

Apache Apex with Apache SAMOA
Categories: Data Science, Events, Tech Radar | Tags: Apex, Samoa, Storm, Tools, Flink, Hadoop, Machine Learning
Traditional Machine Learning: batch oriented; supervised (most common); training and scoring; one-time model building. Data set: training (model building), holdout (parameter tuning), test (accuracy). Online…
Jul 17, 2016

Hive, Calcite and Druid
Categories: Big Data | Tags: Business intelligence, Database, Druid, Hadoop, Hive
BI/OLAP requires interactive visualization of complex data streams: real-time bidding events, user activity streams, voice call logs, network traffic flows, firewall events, application KPIs. Traditional…
By David WORMS
Jul 14, 2016

MariaDB integration with Hadoop
Categories: Infrastructure | Tags: Database, HA, MariaDB, Hadoop, Hive
During a workshop with one of our customers, Adaltas identified a potential risk in using MariaDB’s High Availability (HA) strategy. Since the customer selected Cloudera’s CDH 5 distribution, the…
By David WORMS
Jul 31, 2017

Present and future of Hadoop workflow scheduling: Oozie 5.x
Categories: Big Data, DataWorks Summit 2018 | Tags: Hadoop, Hive, Oozie, Sqoop, CDH, HDP, REST
During the DataWorks Summit Europe 2018 in Berlin, I had the opportunity to attend a breakout session on Apache Oozie. It covers the new features released in Oozie 5.0, including future features of…
May 23, 2018

Apache Hadoop YARN 3.0 – State of the union
Categories: Big Data, DataWorks Summit 2018 | Tags: GPU, Hortonworks, Hadoop, HDFS, MapReduce, YARN, Cloudera, Data Science, Docker, Release and features
This article covers the “Apache Hadoop YARN: state of the union” talk held by Wangda Tan from Hortonworks during the DataWorks Summit 2018. What is Apache YARN? As a reminder, YARN is one of the two…
May 31, 2018

Deep learning on YARN: running Tensorflow and friends on Hadoop cluster
Categories: Data Science | Tags: GPU, Hadoop, MXNet, Spark, Spark MLlib, YARN, Deep Learning, PyTorch, TensorFlow, XGBoost
With the arrival of Hadoop 3, YARN offers more flexibility in resource management. It is now possible to perform Deep Learning analysis on GPUs with specific development environments, leveraging…
Jul 24, 2018

Clusters and workloads migration from Hadoop 2 to Hadoop 3
Categories: Big Data, Infrastructure | Tags: Slider, Erasure Coding, Rolling Upgrade, HDFS, Spark, YARN, Docker
Hadoop 2 to Hadoop 3 migration is a hot subject. How to upgrade your clusters, which features present in the new release may solve current problems and bring new opportunities, how are your current…
Jul 25, 2018

Running Enterprise Workloads in the Cloud with Cloudbreak
Categories: Big Data, Cloud Computing, DataWorks Summit 2018 | Tags: Cloudbreak, Operation, Hadoop, AWS, Azure, GCP, HDP, OpenStack
This article is based on Peter Darvasi and Richard Doktorics’ talk Running Enterprise Workloads in the Cloud at the DataWorks Summit 2018 in Berlin. It presents Hortonworks’ automated deployment tool…
May 28, 2018

One week to discuss technology in a Moroccan riad
Categories: Adaltas Summit 2018, Learning | Tags: CDSW, Gatsby, React.js, Flink, Hadoop, Knox, Data Science, Deep Learning, Kubernetes, Node.js
This year, Adaltas is organising its first conference, from the 22nd to the 26th of October. On the agenda of these 5 days of conference: discuss technology in one of the most beautiful riads of Marrakech. Mix the…
By David WORMS
Oct 11, 2018

Monitoring a production Hadoop cluster with Kubernetes
Categories: DevOps & SRE | Tags: Thrift, Shinken, Hadoop, Knox, Cluster, Docker, Elasticsearch, Grafana, Kubernetes, Node, Node.js, Prometheus, Python
Monitoring a production-grade Hadoop cluster is a real challenge, and the monitoring itself needs to evolve constantly. The software we use today is based on Nagios. Very efficient when it comes to the simplest…
Dec 21, 2018

Apache Knox made easy!
Categories: Big Data, Cyber Security, Adaltas Summit 2018 | Tags: LDAP, Active Directory, Knox, Ranger, Kerberos, REST
Apache Knox is the secure entry point of a Hadoop cluster, but can it also be the entry point for my REST applications? Apache Knox overview: Apache Knox is an application gateway for interacting in a…
Feb 4, 2019

Multihoming on Hadoop
Categories: Infrastructure | Tags: Hadoop, HDFS, Kerberos, Network
Multihoming, which means having multiple networks attached to one node, is one of the main components to manage the heterogeneous network usage of an Apache Hadoop cluster. This article is an…
Mar 5, 2019

Publish Spark SQL DataFrame and RDD with Spark Thrift Server
Categories: Data Engineering | Tags: Thrift, JDBC, Hadoop, Hive, Spark, SQL
The distributed and in-memory nature of the Spark engine makes it an excellent candidate to expose data to clients which expect low latencies. Dashboards, notebooks, BI studios, KPIs-based reports…
Mar 25, 2019

Running Apache Hive 3, new features and tips and tricks
Categories: Big Data, Business Intelligence, DataWorks Summit 2019 | Tags: JDBC, LLAP, Druid, Hadoop, Hive, Kafka, Release and features
Apache Hive 3 brings a bunch of new and nice features to the data warehouse. Unfortunately, like many major FOSS releases, it comes with a few bugs and not much documentation. It has been available since…
Jul 25, 2019

Machine Learning model deployment
Categories: Big Data, Data Engineering, Data Science, DevOps & SRE | Tags: DevOps, Operation, AI, Cloud, Machine Learning, MLOps, On-premises, Schema