DataWorks Summit 2018: A few days speaking Hadoop

The Adaltas crew went to the DataWorks Summit 2018 held in Berlin on the 18th and 19th of April 2018. On this occasion, we compiled a series of articles about the conferences that have marked us most.

Among the various topics covered through those 2 days, the new version 3 of Hadoop and its subprojects, HDFS and YARN, was probably the hotest. Also, the various talks related to the governance of the platform and the Data Lakes illustrate the maturity achieved by the ecosystem.

Adaltas - DataWorks Summit 2018: A few days speaking Hadoop

Articles related to the summit

Apache Hadoop YARN 3.0 – State of the union

Apache Hadoop YARN 3.0 – State of the union

Categories: Big Data, DataWorks Summit 2018 | Tags: HDFS, MapReduce, YARN, Cloudera, Docker, GPU, Hortonworks, Release and features, Hadoop

This article covers the ”Apache Hadoop YARN: state of the union” talk held by Wangda Tan from Hortonworks during the Dataworks Summit 2018. What is Apache YARN? As a reminder, YARN is one of the two…

By Lucas BAKALIAN

May 31, 2018

Accelerating query processing with materialized views in Apache Hive

Accelerating query processing with materialized views in Apache Hive

Categories: Business Intelligence, DataWorks Summit 2018 | Tags: Calcite, Druid, Hive, OLAP, Release and features, SQL

The new materialized view feature is coming in Apache Hive 3.0. Jesus Camacho Rodriguez from Hortonworks held a talk ”Accelerating query processing with materialized views in Apache Hive” about it…

By Paul-Adrien CORDONNIER

May 31, 2018

YARN and GPU Distribution for Machine Learning

YARN and GPU Distribution for Machine Learning

Categories: Data Science, DataWorks Summit 2018 | Tags: YARN, arXiv, Docker, GPU, Graphana, Machine Learning, TensorFlow, Neural Network, Storage

This article goes over the fundamental principles of Machine Learning and what tools are currently used to run machine learning algorithms. We will then see how a resource manager such as YARN can be…

By Grégor JOUET

May 30, 2018

TensorFlow on Spark 2.3: The Best of Both Worlds

TensorFlow on Spark 2.3: The Best of Both Worlds

Categories: Data Science, DataWorks Summit 2018 | Tags: Mesos, Spark, YARN, C++, CPU, GPU, JavaScript, Keras, Kubernetes, Machine Learning, TensorFlow, Tuning, Python

The integration of TensorFlow With Spark has a lot of potential and creates new opportunities. This article is based on a conference seen at the DataWorks Summit 2018 in Berlin. It was about the new…

By Yliess HATI

May 29, 2018

Apache Metron in the Real World

Apache Metron in the Real World

Categories: Cyber Security, DataWorks Summit 2018 | Tags: Algorithm, HDFS, Kafka, NiFi, Solr, Spark, Storm, Elasticsearch, pcap, RDBMS, Metron, SQL

Apache Metron is a storage and analytic platform specialized in cyber security. This talk was about demonstrating the usages and capabilities of Apache Metron in the real world. The presentation was…

By Michael HATOUM

May 29, 2018

Running Enterprise Workloads in the Cloud with Cloudbreak

Running Enterprise Workloads in the Cloud with Cloudbreak

Categories: Big Data, Cloud Computing, DataWorks Summit 2018 | Tags: AWS, Cloudbreak, GCP, HDP, Azure, OpenStack, Operation, Hadoop

This article is based on Peter Darvasi and Richard Doktorics’ talk Running Enterprise Workloads in the Cloud at the DataWorks Summit 2018 in Berlin. It presents Hortonworks’ automated deployment tool…

By Joris RUMMENS

May 28, 2018

Omid: Scalable and highly available transaction processing for Apache Phoenix

Omid: Scalable and highly available transaction processing for Apache Phoenix

Categories: Big Data, DataWorks Summit 2018 | Tags: ACID, Omid, Phoenix, Transaction, HBase, SQL

Apache Omid provides a transactional layer on top of key/value NoSQL databases. In practice, it is usually used on top of Apache HBase. Credits to Ohad Shacham for his talk and his work for Apache…

By Xavier HERMAND

May 24, 2018

Apache Beam: a unified programming model for data processing pipelines

Apache Beam: a unified programming model for data processing pipelines

Categories: Data Engineering, DataWorks Summit 2018 | Tags: Apex, Beam, Flink, Spark, Batch processing, Java, Pipeline, Streaming, Python, TCO

In this article, we will review the concepts, the history and the future of Apache Beam, that may well become the new standard for data processing pipelines definition. At Dataworks Summit 2018 in…

By Gauthier LEONARD

May 24, 2018

Present and future of Hadoop workflow scheduling: Oozie 5.x

Present and future of Hadoop workflow scheduling: Oozie 5.x

Categories: Big Data, DataWorks Summit 2018 | Tags: Hive, Oozie, Sqoop, CDH, HDP, REST, Hadoop, Python

During the DataWorks Summit Europe 2018 in Berlin, I had the opportunity to attend a breakout session on Apache Oozie. It covers the new features released in Oozie 5.0, including future features of…

By Leo SCHOUKROUN

May 23, 2018

What's new in Apache Spark 2.3?

What's new in Apache Spark 2.3?

Categories: Data Engineering, DataWorks Summit 2018 | Tags: Arrow, ORC, Spark, Spark MLlib, PySpark, Docker, Kubernetes, Streaming, Tuning, pandas, Python

Let’s dive into the new features offered by the 2.3 distribution of Apache Spark. This article is a composition of the following talks seen at the DataWorks Summit 2018 and additional research: Apache…

By César BEREZOWSKI

May 23, 2018

Canada - Morocco - France

International locations

10 rue de la Kasbah
2393 Rabbat
Canada

We are a team of Open Source enthusiasts doing consulting in Big Data, Cloud, DevOps, Data Engineering, Data Science…

We provide our customers with accurate insights on how to leverage technologies to convert their use cases to projects in production, how to reduce their costs and increase the time to market.

If you enjoy reading our publications and have an interest in what we do, contact us and we will be thrilled to cooperate with you.