Configuring MySQL as a Hive Metastore

Hive uses embedded Derby to store its metadata by default. If you have ever tried having more than one person, or multiple services such as the Hive Web UI or the Hive service, connect at the same time, it doesn't work well and can cause problems, because the embedded Derby metastore only allows a single connection. To fix this I recommend using MySQL as the metastore database.

 

Install a MySQL server and client either on the same server as Hive or a remote server.

Log into MySQL and create the metastore database and user permissions.

 

shell> mysql -u root -p

 

mysql> CREATE DATABASE hive_metastore;

mysql> CREATE USER 'hive'@'[localhost|remotehost]' IDENTIFIED BY 'metastore';

mysql> GRANT ALL PRIVILEGES ON hive_metastore.* TO 'hive'@'[localhost|remotehost]';
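After the grant, it's worth flushing privileges and double-checking them; a quick sketch (substitute whichever host you used above for 'localhost'):

```sql
-- Make sure the grant is active (older MySQL versions may need this)
FLUSH PRIVILEGES;

-- Verify what the new metastore user can actually do
SHOW GRANTS FOR 'hive'@'localhost';
```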

 

Now download the MySQL JDBC driver from http://www.mysql.com/downloads/connector/j/ and install it into your Hive library path "install_path/hive/lib".

 

Now configure your hive-site.xml file, which should be at "install_path/hive/conf/hive-site.xml".

 

<?xml version="1.0"?>

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

 

<property>

<name>javax.jdo.option.ConnectionURL</name>

<value>jdbc:mysql://[localhost|remotehost]/hive_metastore</value>

<description>JDBC connect string for a JDBC metastore</description>

</property>

 

<property>

<name>javax.jdo.option.ConnectionDriverName</name>

<value>com.mysql.jdbc.Driver</value>

<description>Driver class name for a JDBC metastore</description>

</property>

 

<property>

<name>javax.jdo.option.ConnectionUserName</name>

<value>hive</value>

<description>username to use against metastore database</description>

</property>

 

<property>

<name>javax.jdo.option.ConnectionPassword</name>

<value>metastore</value>

<description>password to use against metastore database</description>

</property>

 

<property>

<name>datanucleus.autoCreateSchema</name>

<value>false</value>

<description>Creates the necessary schema on startup if one doesn't exist. Set this to false after creating it once.</description>

</property>

 

</configuration>

More information at: https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin

Hadoop, Hive, Oh my.

Sorry, I haven't really written anything in the last couple of months, but after a year or more of reading about Hadoop and poking at it, I have actually gotten the chance to do something with it.

A year or so ago I helped build a MySQL database to analyze user data. A large amount of user data: billions of rows of CSV files. Like any good DBA and/or DB developer would, we started importing and normalizing the data so we could aggregate it into a relational database. This worked great, except a few problems started to show up.

The first is that the bigger your tables get, the slower the database gets; even with table partitioning and several other tricks of the trade, it gets slow. Let's be honest here: a single query summing up billions of rows is going to take a little while no matter what the database is, unless it's vertically partitioned or you are using some type of shard architecture.

The second problem we had was that we were expected to store the original CSV files for up to two years, for business reasons, and no matter what the SAN storage vendors tell you, SAN storage is not as cheap as they would like you to think it is, even if you compress the files. On top of that, there was the fact that if we needed to reload the files for some reason, such as the 'powers that be' deciding they needed this other column now, we would have to decompress the files and reload the data.

Then there's the last problem, which most of us know all too well: little or no budget. Well, there goes most of the commercial products; okay, all of them.

Now, this is where I saw my big chance to bring up Hadoop. First, billions of rows and columns: check. Second, cheap commodity hardware: check. And last but not least, open source: check.

The next thing I realized is that we have way more people who know SQL than know Java or Python, which are the programming languages of choice for writing MapReduce jobs. Good thing I am not the only one to have this problem. Facebook developers have written an application that sits on top of Hadoop called Hive. Hive is great: it allows users who are already familiar with SQL to write similar queries in HiveQL, and Hive then transforms those queries into MapReduce jobs.
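As a rough illustration (the table and column names here are made up, not from our actual dataset), a HiveQL aggregation looks almost identical to its SQL counterpart:

```sql
-- Hypothetical example: map an external table over raw CSV files in HDFS
CREATE EXTERNAL TABLE user_events (
  user_id BIGINT,
  event_type STRING,
  event_time STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/user_events';

-- A familiar-looking aggregate, which Hive compiles into MapReduce jobs
SELECT event_type, COUNT(*) AS events
FROM user_events
GROUP BY event_type;
```

The big difference from a traditional database is that the table is just a schema laid over the files; nothing gets imported or normalized first.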

It was proof-of-concept time. I was able to spin up a 14-VM Hadoop cluster in about two days, and copy over a sizable amount of test data in another. I spent a couple of weeks playing around with HiveQL and converting a few SQL queries over to HiveQL, with the result being that I am able to process five days of data in the same time it was taking to process one day. Not bad for 14 VMs.

So stay tuned for more blog entries on Hadoop, Hive, and the crew.

Using Mysqldump to Backup Your Stored Procedures, Functions and Triggers

Now that MySQL has stored procedures, functions and triggers, I have found myself needing to back them up or make copies to move to a different MySQL server. It's actually pretty easy to do with mysqldump. Here is the command line for it.

shell> mysqldump --routines --no-create-info --no-data --no-create-db --skip-opt <database> > outputfile.sql

Then if you need to restore it.

shell> mysql <database> < outputfile.sql
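To confirm the routines made it across on the destination server, you can list them (using 'database' as a placeholder for your database name):

```sql
-- List stored procedures and functions in the restored database
SHOW PROCEDURE STATUS WHERE Db = 'database';
SHOW FUNCTION STATUS WHERE Db = 'database';

-- Triggers are attached to tables, so check them separately
SHOW TRIGGERS FROM `database`;
```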

more info: http://dev.mysql.com/doc/refman/5.5/en/mysqldump.html

Slacking over the Holidays

So December was an interesting month for me, with a few changes in my personal career and the holidays. I took a week off, which of course led to me spending two weeks catching up on work. To top it off, my boss, or someone above him, decided that we need to take a look at our service architecture. So of course I am now in research mode, which should hopefully lead to some good blog posts. Stay tuned.

Connecting PHP to An Oracle Instance on RedHat or CentOS 5

Here lately it seems that everyone wants to connect to Oracle, but I have to admit this was the first time someone asked me to get PHP to talk to Oracle. It was a lot less painful than I thought it would be, so here is what I did.

Along with the standard PHP RPMs, you need to install a couple of additional RPMs from Oracle. These are oracle-instantclient-basic and oracle-instantclient-devel, which can be downloaded from http://www.oracle.com/technetwork/database/features/instant-client/index-100365.html. You will also need the php-oci8 RPM, which can be downloaded from http://oss.oracle.com/projects/php/files/EL5/.

So after you have downloaded the RPMs, go ahead and install the packages and create the symlink for libclntsh.so.

$ rpm -Uvh oracle-instantclient-basic-##.#.#.rpm
$ rpm -Uvh oracle-instantclient-devel-##.#.#.rpm
$ cd /usr/include/oracle/##.#/[client|client64]
$ ln -s libclntsh.so.##.# libclntsh.so

Now you are going to want to set up your environment settings. It is important to set all Oracle environment variables before starting Apache or running a PHP script, so that the OCI8 process environment is correctly initialized; setting environment variables in PHP scripts can lead to obvious and non-obvious problems. You can also add the Instant Client library path to /etc/ld.so.conf.

$ LD_LIBRARY_PATH=/usr/lib/oracle/##.#/[client|client64]/lib:${LD_LIBRARY_PATH}
$ export LD_LIBRARY_PATH

And now for the big finish. Here is a simple connection script to test it all out.

<?php
// Connect using the Easy Connect syntax: host:port/service_name
$c = oci_connect( 'USERNAME',
                  'PASSWORD',
                  'SERVERNAME:PORT/SERVICE_NAME' );

if( $c ) {
    $s = oci_parse( $c, 'SELECT TABLE_NAME FROM all_tables' );

    oci_execute( $s );

    while( $res = oci_fetch_array( $s, OCI_ASSOC ) ) {
        echo $res['TABLE_NAME'] . "\n";
    }
}
?>

For a complete list of functions and additional install resources, check out the following sites:

http://php.net/manual/en/book.oci8.php
http://wiki.oracle.com/page/PHP
http://www.oracle.com/technetwork/articles/technote-php-instant-084410.html