Execute Python in an Oozie workflow
Mar 6, 2018
- Categories
- Data Engineering
- Tags
- Oozie
- Elasticsearch
- Python
- REST
Oozie workflows let you chain multiple actions to execute code. Doing so with Python can be a bit tricky, however, so let’s see how to do it.
I’ve recently designed a workflow that interacts with ElasticSearch. The workflow is made of the following sequential actions:
- Create an index.
- Inject a data set.
- Set an alias on success.
- Delete the index on failure.
There are multiple ways to interact with ElasticSearch, such as the Java binary transport or the REST API, and most languages offer wrapper libraries. To get the job done in Oozie, we defined multiple requirements:
- Code must be portable, meaning it ships with all its dependencies, because the cluster is offline (not connected to the Internet).
- Code must be easy to understand and written in a widely used language to avoid technical debt.
- Prioritize a dynamic language comfortable with JSON and REST manipulation.
- The application must accept multiple CLI entry points, at least one for each Oozie action.
The original idea was to use Bash. However, parsing ElasticSearch’s JSON responses would have been a pain. So we chose Python.
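To illustrate the "one CLI entry point per Oozie action" requirement, here is a minimal sketch of what such an entry point could look like using argparse from the standard library. The option names (--es-url, --index) and the URL are assumptions made for the example, not the options of the actual project, and the ElasticSearch call is replaced by a print so the sketch runs anywhere.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line of a hypothetical create_index entry point."""
    parser = argparse.ArgumentParser(
        description="Create an ElasticSearch index (illustrative sketch)."
    )
    # Both option names are assumptions made for this example.
    parser.add_argument("--es-url", required=True, help="ElasticSearch URL")
    parser.add_argument("--index", required=True, help="Name of the index")
    return parser.parse_args(argv)


def main(argv=None):
    """Entry point; returns a process exit code that Oozie can act on."""
    args = parse_args(argv)
    # A real script would call into a shared helper module here.
    print("would create index %s on %s" % (args.index, args.es_url))
    return 0


# Demo call with explicit arguments instead of sys.argv.
exit_code = main(["--es-url", "https://elastic.host:9200", "--index", "my_index"])
```

Each action of the workflow can get its own small script of this shape. A non-zero exit code is what makes the Shell action follow its error transition.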
ElasticSearch & Python
A bit off-topic but good to know: Python is really well equipped to deal with ElasticSearch.
The elasticsearch-py library maintains support for ElasticSearch from version 2.x to 6.x (luckily, we’re on 2.x!) and is very easy to understand and use.
Here’s a sample opening a secure connection and creating an index:
from elasticsearch import Elasticsearch

# Open a secure (HTTPS) connection with basic authentication
client = Elasticsearch(["https://user:pwd@elastic.host:port"])
# Create the index and check that ElasticSearch acknowledged the request
response = client.indices.create(index="my_index")
if response.get("acknowledged") is True:
    print("my_index created !")
else:
    print("Uh oh, there was an error...")
    print(response)
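The alias and rollback steps of the workflow can be sketched in the same spirit. The snippet below is an illustration, not the project's actual code: the function names set_alias and rollback are made up, and a small stub stands in for the real client so the example runs without a cluster. With a real elasticsearch-py client, the calls used are client.indices.put_alias and client.indices.delete.

```python
def set_alias(client, index, alias):
    """Point `alias` at `index`; True if ElasticSearch acknowledged it."""
    response = client.indices.put_alias(index=index, name=alias)
    return bool(response.get("acknowledged"))


def rollback(client, index):
    """Delete `index` after a failed load; True if acknowledged."""
    response = client.indices.delete(index=index)
    return bool(response.get("acknowledged"))


class _StubIndices(object):
    """Stand-in for client.indices so the sketch runs without a cluster."""

    def __init__(self):
        self.calls = []

    def put_alias(self, index, name):
        self.calls.append(("put_alias", index, name))
        return {"acknowledged": True}

    def delete(self, index):
        self.calls.append(("delete", index))
        return {"acknowledged": True}


class _StubClient(object):
    """Stand-in for an Elasticsearch client exposing only `.indices`."""

    def __init__(self):
        self.indices = _StubIndices()


stub = _StubClient()
alias_ok = set_alias(stub, "my_index", "my_alias")
rollback_ok = rollback(stub, "my_index")
```

Passing the client as a parameter keeps each function easy to test and lets every CLI entry point share the same connection code.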
Package Python code
Once the code is ready, we need to package it with all the dependencies. The workflow must be independent of any Internet access. Only the Python interpreter needs to be present on the nodes, which is the case out of the box on our target operating system, CentOS 7.
Python offers a lot of possibilities for packaging (Wheel, Egg (deprecated in favor of Wheel), Zip…), along with associated resources and HOWTOs. However, choosing the right packaging strategy is challenging for a newcomer. Fortunately, Python natively supports packaging a code directory into a zip for later execution. The generated archive behaves a bit like a .py file.
We also need to download the dependencies locally and include them in the package.
Let’s say we have a project structured as follows:
my_python_project/
├── EsUtil.py
├── create_index.py
├── set_alias.py
└── rollback.py
Here’s how we package it:
cd my_python_project
# Locally install the dependencies
pip install -t ./ [dependency list]
# Compress everything
zip --recurse-paths --quiet -9 ../my_python_dist.zip ./*
And finally we’d execute it like this:
PYTHONPATH=/path/to/my_python_dist.zip python -m [filename without extension] [args]
PYTHONPATH=/path/to/my_python_dist.zip python -m create_index [args]
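Putting a zip on PYTHONPATH works because Python can import modules directly from zip archives found on its module search path. The sketch below demonstrates the mechanism in isolation by building a throwaway archive and importing from it. The names demo_dist.zip and demo_module are made up for the example. One caveat worth checking against your dependency list: compiled extension modules generally cannot be imported from a zip, so this approach suits pure-Python libraries best.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_bundle(directory):
    """Create a tiny zip bundle containing a single Python module."""
    bundle_path = os.path.join(directory, "demo_dist.zip")
    with zipfile.ZipFile(bundle_path, "w") as bundle:
        bundle.writestr(
            "demo_module.py",
            "def greet():\n    return 'hello from the zip'\n",
        )
    return bundle_path


workdir = tempfile.mkdtemp()
bundle_path = build_bundle(workdir)

# Equivalent to PYTHONPATH=/path/to/demo_dist.zip on the command line.
sys.path.insert(0, bundle_path)

# The module is loaded straight from the archive, without extraction.
demo_module = importlib.import_module("demo_module")
message = demo_module.greet()
print(message)
```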
Oozie workflow
Now that we have a valid Python package with our scripts, we must integrate it with our Oozie workflow.
There’s no such thing as a Python action in Oozie. We’ll use the closest and most flexible one, the Shell action.
As for any other action, Oozie prepares a container, injects the files you specify, and executes a command.
Here’s what the action would look like:
<action name="python-action">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${clusterJobtracker}</job-tracker>
        <name-node>${clusterNamenode}</name-node>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${jobQueue}</value>
            </property>
        </configuration>
        <exec>python</exec> <!-- python2 if necessary -->
        <argument>-m</argument>
        <argument>create_index</argument>
        <argument>arg2</argument>
        <argument>arg3</argument>
        <env-var>PYTHONPATH=pyBundle</env-var>
        <file>my_python_dist.zip#pyBundle</file>
    </shell>
    <ok to="end"/>
    <error to="end"/>
</action>
- The configuration element specifies the user YARN queue in which the Oozie container runs.
- The env-var element points the PYTHONPATH environment variable at the Python bundle.
- The file element injects the Python bundle into the Oozie Shell action container.
Of course, we need to have Python installed on the YARN nodes (usually it’s shipped with the Linux distro underneath, but it’s a best practice to install one of your choice, using something like Anaconda).
A neat feature of the Oozie Shell action is the <capture-output/> tag. If it’s set, Oozie captures any line of the output that is formatted as property=value and lets you reuse it in another action of the workflow with the following syntax: ${wf:actionData('python-action')['property']}.
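On the Python side, producing output that Oozie can capture only requires printing one property=value pair per line on standard output. Here is a minimal sketch. The helper name emit_properties and the property names index_name and doc_count are assumptions made for the example.

```python
import sys


def emit_properties(properties, stream=None):
    """Write each entry as a `key=value` line, one per line."""
    if stream is None:
        stream = sys.stdout
    # Sort the keys so the output is deterministic.
    for key in sorted(properties):
        stream.write("%s=%s\n" % (key, properties[key]))
    stream.flush()


# Hypothetical values a create_index script might want to hand over.
emit_properties({"index_name": "my_index", "doc_count": 42})
```

With output capture enabled on the action, a later action could then read the first value with ${wf:actionData('python-action')['index_name']}. It is safer to keep any other logging on standard error so that it does not end up mixed with the captured properties.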
References
- ElasticSearch Python lib: https://elasticsearch-py.readthedocs.io/en/master/
- Oozie Shell action: https://oozie.apache.org/docs/4.2.0/DG_ShellActionExtension.html