Apache Oozie – Shell Script Example.

Recently I needed to let users submit jobs that pass arguments to a shell script. While it’s easy enough to submit a job using a web UI like Hue, I wanted to tie it to a homegrown SaaS solution that we were developing to let developers load datasets into a database for testing.

Since I was already using Hadoop and Sqoop to store and load the datasets, and I didn’t want to reinvent the wheel, I decided to use Oozie, which I had already installed to handle some of the Hadoop ETL jobs.

I started off by creating a working directory on HDFS.

shell> hdfs dfs -mkdir -p /user/me/oozie-scripts/OozieTest

Next, I created a simple shell script, GetCSVData.sh, that takes two parameters. For testing, I decided to use curl to retrieve a CSV from google.com and then copy it to HDFS. Keep in mind that any application that you use in your shell script needs to be installed on your data nodes.

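#!/bin/bash
# $1 = URL, $2 = output path. No -i here: it would write the HTTP headers into the CSV.
/usr/bin/curl -sS "$1" -o "$2"
/usr/bin/hdfs dfs -copyFromLocal "$2" "$2"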

Next, I copied the script to the working directory on HDFS.

shell> hdfs dfs -copyFromLocal GetCSVData.sh /user/me/oozie-scripts/OozieTest

Next, I created a simple workflow.xml template to handle the Oozie job. Oozie expects the file to be named workflow.xml when the application path points to a directory. It defines the actions and the parameters for those actions.

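<workflow-app name="GetCSVData" xmlns="uri:oozie:workflow:0.4">
    <start to="GetCSVData"/>
    <action name="GetCSVData">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>GetCSVData.sh</exec>
            <argument>${url}</argument>
            <argument>${output}</argument>
            <file>/user/me/oozie-scripts/OozieTest/GetCSVData.sh#GetCSVData.sh</file>
        </shell>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill"><message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message></kill>
    <end name="end"/>
</workflow-app>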

The important part of this is <shell xmlns="uri:oozie:shell-action:0.1">, which defines the type of action and the requirements for it.

This sets which job tracker runs the job: <job-tracker>${jobTracker}</job-tracker>

The name node where everything is stored: <name-node>${nameNode}</name-node>

The name of the shell script to be executed: <exec>GetCSVData.sh</exec>

The first argument to pass to the shell script: <argument>${url}</argument>

The second argument to pass to the shell script: <argument>${output}</argument>

The location of the shell script on HDFS: <file>/user/me/oozie-scripts/OozieTest/GetCSVData.sh#GetCSVData.sh</file>. The part after the # is the name the script is given in the action’s working directory.

Now we need to create a properties file named oozietest.properties for submitting the Oozie job. It fills in all the variables used in workflow.xml.


The oozie.wf.application.path property is the working directory on HDFS that contains workflow.xml, whereas the rest are key-value pairs that supply the values for the workflow variables.
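Here is a minimal sketch of what oozietest.properties might look like, written as a shell heredoc. The host names, ports, URL, and output path are placeholder assumptions, not values from my setup, so substitute the ones for your cluster.

```shell
#!/bin/bash
# Write a hypothetical oozietest.properties into the current directory.
# The quoted 'EOF' keeps ${nameNode} literal so that Oozie, not the shell,
# expands it at submission time.
cat > oozietest.properties <<'EOF'
# Assumed cluster endpoints; replace with your NameNode and JobTracker.
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
# HDFS directory that contains workflow.xml.
oozie.wf.application.path=${nameNode}/user/me/oozie-scripts/OozieTest
# Values substituted for ${url} and ${output} in workflow.xml (placeholders).
url=http://www.google.com/
output=/tmp/google.csv
EOF
```

Because the script uses its second argument as both the local file name and the HDFS destination, the output value needs to be a path that is valid in both places.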

Now all we need to do is submit the job.

shell> oozie job -oozie http://localhost:11000/oozie -config oozietest.properties -run

This should generate a job ID, which we can use to check the status of the job we submitted.

shell> oozie job -oozie http://localhost:11000/oozie -info [JOB_ID]
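If you want to script the two commands above, something along these lines might work. It is only a sketch: the function names are my own, and it assumes the CLI prints the new job as a line of the form "job: <JOB_ID>" and that the -info output contains a "Status : ..." line, so check both against the output of your Oozie version.

```shell
#!/bin/bash
# Assumed Oozie server URL; matches the submit command used earlier.
OOZIE_URL="${OOZIE_URL:-http://localhost:11000/oozie}"

# Pull the job ID out of the line printed by "oozie job ... -run",
# assumed to have the form "job: <JOB_ID>".
extract_job_id() {
    sed -n 's/^job: *//p'
}

# Pull the job status out of "oozie job -info" output, assumed to
# contain a line such as "Status        : RUNNING".
extract_status() {
    sed -n 's/^Status *: *//p' | head -n 1
}

# Submit the workflow, then poll until it leaves the PREP/RUNNING states.
submit_and_wait() {
    local job_id status
    job_id=$(oozie job -oozie "$OOZIE_URL" -config oozietest.properties -run | extract_job_id)
    echo "Submitted $job_id"
    while true; do
        status=$(oozie job -oozie "$OOZIE_URL" -info "$job_id" | extract_status)
        echo "Status: $status"
        case "$status" in
            PREP|RUNNING) sleep 10 ;;
            *) break ;;
        esac
    done
}
```

After sourcing the file, calling submit_and_wait from the directory that holds oozietest.properties submits the job and prints its status every ten seconds until it finishes or fails.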

For more information, check out the Apache Oozie documentation.

Connecting Tableau to DataStax Cassandra with Cassandra CQL ODBC.

Recently, I did some testing with Tableau Desktop connecting to DataStax Cassandra using their newly released DataStax ODBC driver. Before the release of the DataStax ODBC driver, the only way to connect Tableau Desktop to DataStax was the DataStax Enterprise Connector (a.k.a. Hive Thrift Server).

While Hive is a great analytic tool, it is somewhat slow. When you are trying to load data for a report, which should be relatively quick, Hive can take a little time to get up to speed. I’m sure everyone who has used Hive knows what I’m talking about.

I would highly recommend downloading and installing the DataStax ODBC driver.

There is a nice little how-to on the DataStax blog at http://www.datastax.com/download-drivers.

Connecting Tableau to Google Cloud SQL

Before connecting Tableau to your Google Cloud SQL instance, make sure that you have assigned an IP address to the instance. You will also need to grant the network where Tableau is running access to the instance.

First, I recommend that you use an external source, such as freegeoip.net or hostip.info, to determine your IP address. This helps avoid network address translation issues.

Now that you have your IP address, it is time to configure your Google Cloud SQL instance. To grant access to your Tableau application, you need to do the following.

  1. Go to the Google Developers Console and select a project by clicking on the project name.
  2. In the sidebar on the left, click Storage > Cloud SQL.
  3. Find the instance to which you want to grant access and click the instance name. Click Edit.
  4. In the IPv4 Address section, select Assign an IPv4 address to my Cloud SQL instance to assign one to the instance.

Note: There are charges when you assign an IPv4 address. For more information, see the pricing page.

Once you have assigned the IP address to your instance, you will need to allow the IP address of your Tableau machine to access the instance by doing the following.


In the Allowed Networks area, click the blue-bordered button with the plus sign. In the text box titled Network, add the IP address that you obtained earlier.

Now, if you haven’t already, I would recommend that you create a read-only user with access to the schema that you want to query.

To connect Tableau Desktop 9.0 to your Google Cloud SQL instance, you need to configure a MySQL connection.
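As a rough sketch of that read-only user, the following shell function prints the MySQL statements you could run against the instance. The schema name, user name, and password are hypothetical placeholders; replace them with your own.

```shell
#!/bin/bash
# Hypothetical names; replace with your own schema, user, and password.
DB_NAME="reporting"
DB_USER="tableau_ro"
DB_PASS="change-me"

# Print the SQL that creates a user limited to SELECT on one schema.
# The '%' host lets the user connect from any address that the
# instance's allowed networks admit.
readonly_user_sql() {
    cat <<EOF
CREATE USER '${DB_USER}'@'%' IDENTIFIED BY '${DB_PASS}';
GRANT SELECT ON ${DB_NAME}.* TO '${DB_USER}'@'%';
EOF
}
```

You could pipe the output into the mysql client while connected as an administrative user, for example: readonly_user_sql | mysql -h [INSTANCE_IP] -u root -p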


1.) Open Tableau Desktop.
2.) On the left-hand side, under Connect, click “More Servers.”
3.) Click “MySQL.”
4.) Fill in the Server text box with the IP address that was assigned to your Google Cloud SQL instance. The port should be 3306, which is the default. Then fill in the username and password and click OK.


Congratulations! You’re all connected and ready to start building reports.


Elasticsearch is a distributed, RESTful search and analytics engine built on top of Apache Lucene for high performance.

Elasticsearch features include:

Real-Time Data Indexing
High Availability
Full Text Search
Document Orientation

The flow of data never stops, so the question is how quickly that data can become available. Elasticsearch indexes data in near real time, so new documents become searchable shortly after they arrive and can feed real-time analytics.

Elasticsearch is horizontally scalable. An organization can simply add nodes to expand the cluster capacity. Elasticsearch is resilient in detecting and removing failed nodes and rebalancing itself to ensure that an organization’s data is safe and accessible.

Elasticsearch can also host multiple indices that can be queried independently or as a group. Elasticsearch structures its data as JSON documents. All fields are indexed by default, and all the indices can be used in a single query.

Clients can connect to Elasticsearch by using a standard HTTP REST library. This gives any programming language the ability to connect to Elasticsearch.
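To illustrate, here is a small sketch that talks to Elasticsearch with nothing but curl. The host, the index names, and the function names are assumptions for the example, and the _doc path applies to recent Elasticsearch releases; older releases used a custom type name in that position.

```shell
#!/bin/bash
# Assumed local node address; Elasticsearch listens on port 9200 by default.
ES_HOST="${ES_HOST:-http://localhost:9200}"

# Build the URL for a path on the cluster, e.g. es_url blog/_doc/1
es_url() {
    printf '%s/%s\n' "$ES_HOST" "$1"
}

# Index a JSON document: es_index <index> <id> <json>
es_index() {
    curl -sS -X PUT "$(es_url "$1/_doc/$2")" \
         -H 'Content-Type: application/json' -d "$3"
}

# Search one index, or several given as a comma-separated list:
# es_search <index[,index...]> <query-string>
es_search() {
    curl -sS -G "$(es_url "$1/_search")" --data-urlencode "q=$2"
}
```

For example, es_index blog 1 '{"title":"Hello"}' stores a document in a hypothetical blog index, and es_search blog,news 'title:hello' queries two indices as a group.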

Elasticsearch has been used to query 24 billion records in 900 ms. It’s currently being used by companies such as GitHub, Foursquare, and Xing.

website: http://www.elasticsearch.org/