Apache Oozie

Apache Oozie is an open source Java web application available under the Apache License 2.0. It is a job scheduling system designed to manage and run Hadoop stack jobs in a distributed storage environment.

An Oozie workflow is a set of actions organized in a Directed Acyclic Graph (DAG). The control nodes determine the order of tasks as well as the workflow's start and end rules, while the action nodes trigger the execution of tasks. Oozie comes pre-loaded with a variety of Hadoop ecosystem actions (including Apache MapReduce and Apache Pig), as well as system-specific jobs (such as shell scripts).
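The split between control nodes and action nodes can be illustrated with a small Python sketch. This is not Oozie's API or its XML workflow format: the classes `Start`, `Action`, `End` and `Kill`, the function `run_workflow`, and the node names are all hypothetical, chosen only to show how transitions route a run through the graph.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union


@dataclass
class Start:
    """Control node: entry point of the workflow."""
    to: str


@dataclass
class Action:
    """Action node: runs a task, then follows 'ok' or 'error'."""
    run: Callable[[], bool]
    ok: str
    error: str


@dataclass
class End:
    """Control node: successful termination."""


@dataclass
class Kill:
    """Control node: failed termination."""
    message: str


Node = Union[Start, Action, End, Kill]


def run_workflow(nodes: Dict[str, Node], start: str = "start") -> List[str]:
    """Follow transitions from the start node and return the visited names."""
    visited: List[str] = []
    name = start
    while True:
        if name in visited:
            # A workflow is acyclic, so visiting a node twice is an error.
            raise ValueError(f"cycle detected at {name!r}")
        visited.append(name)
        node = nodes[name]
        if isinstance(node, Start):
            name = node.to
        elif isinstance(node, Action):
            name = node.ok if node.run() else node.error
        else:  # End or Kill: the workflow stops here
            return visited


# Hypothetical workflow: the second action fails, so the run ends in "fail".
workflow: Dict[str, Node] = {
    "start": Start(to="clean-logs"),
    "clean-logs": Action(run=lambda: True, ok="load-table", error="fail"),
    "load-table": Action(run=lambda: False, ok="end", error="fail"),
    "end": End(),
    "fail": Kill(message="workflow failed"),
}

path = run_workflow(workflow)
print(path)  # ['start', 'clean-logs', 'load-table', 'fail']
```

In this sketch the actions only report success or failure. The control nodes and the `ok`/`error` transitions decide what runs next.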

The Oozie Coordinator allows you to run Oozie workflows regularly at a given time, according to data availability, or when an event occurs. A workflow is launched when these conditions are met.
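As a rough illustration of combining a time-based schedule with a data availability condition, the Python sketch below lists the scheduled times whose input data is already present. It does not use Oozie's coordinator syntax. The functions `nominal_times` and `ready_runs`, the half-open time window, and the sample dates are assumptions made for this example.

```python
from datetime import datetime, timedelta
from typing import Iterator, List, Set


def nominal_times(start: datetime, end: datetime, every: timedelta) -> Iterator[datetime]:
    """Yield each scheduled time in the half-open window [start, end)."""
    t = start
    while t < end:
        yield t
        t += every


def ready_runs(start: datetime, end: datetime, every: timedelta,
               available: Set[datetime]) -> List[datetime]:
    """Return the scheduled times whose input data is currently available."""
    return [t for t in nominal_times(start, end, every) if t in available]


# Hypothetical hourly schedule over four hours; the 02:00 input is missing.
start = datetime(2013, 9, 15, 0)
end = datetime(2013, 9, 15, 4)
available = {datetime(2013, 9, 15, h) for h in (0, 1, 3)}

ready = ready_runs(start, end, timedelta(hours=1), available)
print([t.strftime("%H:%M") for t in ready])  # ['00:00', '01:00', '03:00']
```

In this sketch the 02:00 run is not dropped. It simply stays out of the ready list until its input appears in `available`.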

An Oozie bundle is a collection of coordinator and workflow jobs whose lifecycle you manage as a single unit.

Learn more: Official website
Related tags: Apache Airflow; Argo Workflows

Splitting HDFS files into multiple Hive tables

Categories: Data Engineering | Tags: Flume, Pig, HDFS, Hive, Oozie, SQL

I am going to show how to split a CSV file stored inside HDFS into multiple Hive tables based on the content of each record. The context is simple. We are using Flume to collect logs from all over our…

By David WORMS

Sep 15, 2013

Components for CDH and HDP

Categories: Big Data | Tags: Flume, Hortonworks, Hadoop, Hive, Oozie, Sqoop, Zookeeper, Cloudera, CDH, HDP

I was interested in comparing the different components distributed by Cloudera and Hortonworks. This also gives us an idea of the versions packaged by the two distributions. At the time of this writing…

By David WORMS

Sep 22, 2013

Execute Python in an Oozie workflow

Categories: Data Engineering | Tags: Oozie, Elasticsearch, Python, REST

Oozie workflows allow you to use multiple actions to execute code. However, doing so with Python can be a bit tricky, so let’s see how to do that. I’ve recently designed a workflow that would interact…

By César BEREZOWSKI

Mar 6, 2018

Present and future of Hadoop workflow scheduling: Oozie 5.x

Categories: Big Data, DataWorks Summit 2018 | Tags: Hadoop, Hive, Oozie, Sqoop, CDH, HDP, REST

During the DataWorks Summit Europe 2018 in Berlin, I had the opportunity to attend a breakout session on Apache Oozie. It covers the new features released in Oozie 5.0, including future features of…

By Leo SCHOUKROUN

May 23, 2018

Clusters and workloads migration from Hadoop 2 to Hadoop 3

Categories: Big Data, Infrastructure | Tags: Slider, Erasure Coding, Rolling Upgrade, HDFS, Spark, YARN, Docker

Hadoop 2 to Hadoop 3 migration is a hot subject. How to upgrade your clusters, which features present in the new release may solve current problems and bring new opportunities, how are your current…

By Lucas BAKALIAN

Jul 25, 2018

Introducing Apache Airflow on AWS

Categories: Big Data, Cloud Computing, Containers Orchestration | Tags: PySpark, Learning and tutorial, Airflow, Oozie, Spark, AWS, Docker, Python

Apache Airflow offers a potential solution to the growing challenge of managing an increasingly complex landscape of data management tools, scripts and analytics processes. It is an open-source…

By Aargan COINTEPAS

May 5, 2020

Internship in Big Data infrastructure with TDP

Categories: Infrastructure, Learning | Tags: Cyber Security, DevOps, Java, Hadoop, IaC, Internship, TDP

Job Description: Big Data and distributed computing are at Adaltas’ core. We support our partners in the deployment, maintenance and optimization of some of France’s largest clusters. Adaltas is also an…

By Daniel HARTY

Oct 25, 2021