MongoDB script for counting records in every collection across all databases

Here is a quick script I wrote for a co-worker.

// Host and port of the MongoDB instance to scan
var host = "localhost";
var port = 27000;
var dbslist = db.adminCommand('listDatabases');

for (var d = 0; d < dbslist.databases.length; d++) {
    // Connect to each database in turn
    var db = connect(host + ":" + port + "/" + dbslist.databases[d].name);
    var collections = db.getCollectionNames();
    for (var i = 0; i < collections.length; i++) {
        var name = collections[i];
        // Skip the system.* collections
        if (name.substr(0, 6) != 'system') {
            print("\t" + dbslist.databases[d].name + "." + name + ' = ' + db[name].count() + ' records');
        }
    }
}
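A minimal way to run this, assuming it is saved as count_collections.js (the filename is just an example), is to pass it to the legacy mongo shell:

mongo --host localhost --port 27000 count_collections.js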


Apache Oozie – Shell Script Example.

Recently I needed the ability to let users submit jobs that required passing arguments to a shell script. While it’s easy enough to submit a job using a Web UI like HUE, I wanted to tie this into a homegrown SaaS solution we were developing to allow developers to load datasets into a database for testing.

Since I was already using Hadoop and Sqoop to store and load the datasets, and I didn’t want to reinvent the wheel, I decided to use the Oozie instance I had already installed to handle some of the Hadoop ETL jobs.

I started off by creating a working directory on HDFS.

hdfs dfs -mkdir -p /user/me/oozie-scripts/OozieTest
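A quick listing confirms the directory is there; this is just a sanity check:

shell> hdfs dfs -ls /user/me/oozie-scripts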

Next I created a simple shell script that takes two parameters. For testing, I decided to use curl to retrieve a CSV from google.com and then copy it to HDFS. Keep in mind that any application you use in your shell script needs to be installed on your data nodes.

#!/bin/bash
# Download the file from the given URL, then copy it to the same path on HDFS
/usr/bin/curl -i "$1" -o "$2"
/usr/bin/hdfs dfs -copyFromLocal "$2" "$2"
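Before handing it to Oozie, it’s worth a quick local test; the URL and output path below are only placeholders:

shell> chmod +x GetCSVData.sh
shell> ./GetCSVData.sh "https://example.com/sample.csv" /tmp/sample.csv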

Now I copied the script to the working directory on HDFS.

shell> hdfs dfs -copyFromLocal GetCSVData.sh /user/me/oozie-scripts/OozieTest

Next I created a simple workflow.xml template to handle the Oozie job. Oozie requires the file to be named workflow.xml. It defines the actions and the parameters for those actions.

<workflow-app name="GetCSVData" xmlns="uri:oozie:workflow:0.4">
    <start to="GetCSVData"/>
    <action name="GetCSVData">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>GetCSVData.sh</exec>
            <argument>${url}</argument>
            <argument>${output}</argument>
            <file>/user/me/oozie-scripts/OozieTest/GetCSVData.sh#GetCSVData.sh</file>
        </shell>
        <ok to="end"/>
        <error to="kill"/>
    </action>
    <kill name="kill">
        <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>

The important part of this is <shell xmlns="uri:oozie:shell-action:0.1">, which defines the type of action and its requirements.

This sets which job tracker runs the job: <job-tracker>${jobTracker}</job-tracker>

The name node where everything is stored: <name-node>${nameNode}</name-node>

The name of the shell script to be executed: <exec>GetCSVData.sh</exec>

The first argument to pass to the shell script: <argument>${url}</argument>

The second argument to pass to the shell script: <argument>${output}</argument>

The location of the shell script on HDFS: <file>/user/me/oozie-scripts/OozieTest/GetCSVData.sh#GetCSVData.sh</file>
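One step that is easy to miss: the workflow.xml itself also has to live in the working directory on HDFS, since that is where Oozie looks for it:

shell> hdfs dfs -copyFromLocal workflow.xml /user/me/oozie-scripts/OozieTest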

Now we need to create a properties file named "oozietest.properties" for submitting the Oozie job. This fills in all the variables for the workflow.xml.

oozie.wf.application.path=hdfs://localhost:8020/user/me/oozie-scripts/OozieTest
jobTracker=localhost:8032
nameNode=hdfs://localhost:8020
url=http://www.google.com/finance/historical?q=NYSE%3ADATA&ei=TH0mVsrWBce7iwLE86_ABw&output=csv
output=/tmp/DATA.csv

The oozie.wf.application.path is the working directory on HDFS that contains the workflow.xml, whereas the rest are key-value pairs that fill in the variables.

Now all we need to do is submit the job.

shell> oozie job -oozie http://localhost:11000/oozie -config oozietest.properties -run

This should generate a job ID, which we can use to check the status of the job we submitted.

shell> oozie job -oozie http://localhost:11000/oozie -info [JOB_ID]
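A couple of other subcommands come in handy here: -log streams the job’s log and -kill stops it. Once the job succeeds, the output should show up on HDFS at the path set in the properties file:

shell> oozie job -oozie http://localhost:11000/oozie -log [JOB_ID]
shell> oozie job -oozie http://localhost:11000/oozie -kill [JOB_ID]
shell> hdfs dfs -cat /tmp/DATA.csv | head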

For more information, check out the Apache Oozie documentation.

Disable 70-persistent-net.rules generation on CentOS 6 VM

If you’re like me, you probably have an environment running on some virtual platform, and like everyone else you have built a template to spin up Linux systems. One of the things we kept running into lately was the “70-persistent-net.rules” file, which associates MAC addresses with network interfaces.

The easiest way I have found to disable this is the following. It’s not pretty, but it works.

rm /etc/udev/rules.d/70-persistent-net.rules

echo "#" > /lib/udev/rules.d/75-persistent-net-generator.rules
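If you bake this into your template preparation, wrapping the same two commands in a small cleanup script (just one way to do it) keeps you from forgetting a step:

#!/bin/bash
# Drop any generated rules and stub out the generator so new ones aren't written
rm -f /etc/udev/rules.d/70-persistent-net.rules
echo "#" > /lib/udev/rules.d/75-persistent-net-generator.rules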

Happy hacking.

Connecting Tableau to DataStax Cassandra with Cassandra CQL ODBC.

Recently, I did some testing with Tableau Desktop connecting to DataStax Cassandra using their newly released DataStax ODBC driver. Before the release of the DataStax ODBC driver, the only way to connect Tableau Desktop to DataStax was the DataStax Enterprise Connector (a.k.a. Hive Thrift Server).

While Hive is a great analytic tool, it is somewhat slow. When you are trying to load data for a report, something that should be relatively quick, Hive can take a little time to get up to speed. I’m sure anyone who has used Hive knows what I’m talking about.

I would highly recommend downloading and installing the DataStax ODBC driver.

There is a nice little how-to on the DataStax blog at http://www.datastax.com/download-drivers.