Apache Oozie – Shell Script Example

Recently I needed the ability to let a user submit jobs that required passing arguments to a shell script. While it’s easy enough to submit a job using a Web UI like HUE, I wanted to tie it into a homegrown SaaS solution we were developing that lets developers load datasets into a database for testing.

Since I was already using Hadoop and Sqoop to store and load the datasets, and I didn’t want to reinvent the wheel, I decided to use the Oozie instance I had already installed to handle some of the Hadoop ETL jobs.

I started off by creating a working directory on HDFS.

hdfs dfs -mkdir -p /user/me/oozie-scripts/OozieTest

Next I created a simple shell script that takes two parameters. For testing, I decided to use curl to retrieve a CSV from Google Finance and then copy it to HDFS. Keep in mind that any application you call in your shell script needs to be installed on your data nodes, since the action can run on any of them.

#!/bin/bash
# $1 = URL to fetch, $2 = output path (used both locally and on HDFS)
/usr/bin/curl "$1" -o "$2"
/usr/bin/hdfs dfs -copyFromLocal "$2" "$2"
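
Before pushing the script to HDFS, it’s worth a quick smoke test on a node that has both curl and the hdfs client available. The URL and output path here are just throwaway examples:

shell> chmod +x GetCSVData.sh
shell> ./GetCSVData.sh "http://www.google.com/finance/historical?q=NYSE%3ADATA&output=csv" /tmp/test.csv
shell> head -3 /tmp/test.csv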

Now I copied the script to the working directory on HDFS.

shell> hdfs dfs -copyFromLocal GetCSVData.sh /user/me/oozie-scripts/OozieTest
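
A quick listing confirms the script landed where the workflow expects to find it:

shell> hdfs dfs -ls /user/me/oozie-scripts/OozieTest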

Next I created a simple workflow.xml template to handle the Oozie job. Oozie requires the file to be named workflow.xml; it defines the actions and the parameters for those actions.

<workflow-app name="GetCSVData" xmlns="uri:oozie:workflow:0.4">
  <start to="GetCSVData"/>
  <action name="GetCSVData">
    <shell xmlns="uri:oozie:shell-action:0.1">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>GetCSVData.sh</exec>
      <argument>${url}</argument>
      <argument>${output}</argument>
      <file>/user/me/oozie-scripts/OozieTest/GetCSVData.sh#GetCSVData.sh</file>
    </shell>
    <ok to="end"/>
    <error to="kill"/>
  </action>
  <kill name="kill">
    <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>

The important part of this is <shell xmlns="uri:oozie:shell-action:0.1">, which defines the type of action and its requirements:

The job tracker (or YARN resource manager) that runs the job: <job-tracker>${jobTracker}</job-tracker>

The name node where everything is stored: <name-node>${nameNode}</name-node>

The name of the shell script to execute: <exec>GetCSVData.sh</exec>

The first argument to pass to the shell script: <argument>${url}</argument>

The second argument to pass to the shell script: <argument>${output}</argument>

The location of the shell script on HDFS (the part after the # is the name the file is given in the task’s working directory): <file>/user/me/oozie-scripts/OozieTest/GetCSVData.sh#GetCSVData.sh</file>
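
One step that’s easy to miss: the workflow.xml itself has to live in that same HDFS application path, alongside the script. Depending on your Oozie version, you can also validate the XML before uploading it:

shell> oozie validate workflow.xml
shell> hdfs dfs -copyFromLocal workflow.xml /user/me/oozie-scripts/OozieTest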

Now we need to create a properties file named "oozietest.properties" for submitting the Oozie job. This fills in all of the variables in the workflow.xml.

oozie.wf.application.path=hdfs://localhost:8020/user/me/oozie-scripts/OozieTest
jobTracker=localhost:8032
nameNode=hdfs://localhost:8020
url=http://www.google.com/finance/historical?q=NYSE%3ADATA&ei=TH0mVsrWBce7iwLE86_ABw&output=csv
output=/tmp/DATA.csv

The oozie.wf.application.path is the working directory on HDFS that holds the workflow.xml, whereas the rest are key-value pairs that fill in the ${...} variables.
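
Since url and output are ordinary properties, you can also override them per run from the command line instead of editing the file each time; the ticker and output path below are just examples:

shell> oozie job -oozie http://localhost:11000/oozie -config oozietest.properties \
         -Durl="http://www.google.com/finance/historical?q=NYSE%3AHDP&output=csv" \
         -Doutput=/tmp/HDP.csv -run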

Now all we need to do is submit the job.

shell> oozie job -oozie http://localhost:11000/oozie -config oozietest.properties -run

This should generate a job ID, which we can use to check the status of the job we submitted.

shell> oozie job -oozie http://localhost:11000/oozie -info [JOB_ID]
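
If something goes wrong, the same command with -log pulls the job’s logs, and -kill stops a workflow that is stuck; both are standard oozie CLI flags:

shell> oozie job -oozie http://localhost:11000/oozie -log [JOB_ID]
shell> oozie job -oozie http://localhost:11000/oozie -kill [JOB_ID]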

For more information, check out the Apache Oozie documentation.