Configuring MySQL as a Hive Metastore

Hive uses DerbyDB to store its metadata by default, if you have ever tried having more than one person, or multiple services, such as the Hive Web UI or Hive Service connect, it doesn’t work really well and can cause problems. To improve this I recommend use MySQL as the metastore_db.

 

Install a MySQL server and client either on the same server as Hive or a remote server.

Log into MySQL and create the metastore and user permission.

 

shell> mysql -u root -p

 

mysql > create database hive_metastore;

mysql > CREATE USER ‘hive’@'[localhost|remotehost]’ IDENTIFIED BY ‘metastore’;

mysql > GRANT ALL PRIVILEGES ON hive_metastore.* TO ‘hive’@'[localhost|remotehost]”;

 

Now download the MySQL JDBC driver from http://www.mysql.com/downloads/connector/j/an install it into you hive library path “install_path/hive/lib”

 

Now configure your hive-site.xml file which should be in “install_path/hive/conf/hive-site.xml”.

 

<?xml version=”1.0″?>

<?xml-stylesheet type=”text/xsl” href=”configuration.xsl”?>

<configuration>

 

<property>

<name>javax generic xenical.jdo.option.ConnectionURL</name>

<value>jdbc:mysql://[localhost|remotehost]/metastore</value>

<description>JDBC connect string for a JDBC metastore</description>

</property>

 

<property>

<name>javax.jdo.option.ConnectionDriverName</name>

<value>com.mysql.jdbc.Driver</value>

<description>Driver class name for a JDBC metastore</description>

</property>

 

<property>

<name>javax.jdo.option.ConnectionUserName</name>

<value>hive</value>

<description>username to use against metastore database</description>

</property>

 

<property>

<name>javax.jdo.option.ConnectionPassword</name>

<value>metastore</value>

<description>password to use against metastore database</description>

</property>

 

<property>

<name>datanucleus.autoCreateSchema</name>

<value>false</value>

<description>creates necessary schema on a startup if one doesn’t exist. Set this to false, after creating it once</description>

</property>

 

<property>

 

More information at:https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin

Hadoop, Hive, Oh my.

Sorry, I haven’t really wrote anything in the last couple of months, but after a year or more of reading about Hadoop and poking at it. I have actually got the chance to do something with it.

A year or so ago I helped build a MySQL database to help analyze user data. A large amount of user data, billions of rows of CSV files and like any good DBA and/or DB Developer would do. We started importing and normalizing the data, so we could aggregate it into rational database. This works great, except a few problems start to show up.

The first is the bigger your tables get the slow the database gets, even with table partitioning and several other tricks of the trade it gets slow. Let’s be honest here, a single query summing up billions of rows is going to take a little while no matter what the database is, unless it’s vertically partitioned or you are using some type shard architecture.

The second problem we had was that we were expected to store the original CSV files for a up to two years, for business reasons, and no matter what the SAN Storage vendors tell you SAN storage is not as cheap as they would like you to think it is, even if you compress the files. On top of that, there was the fact, that if we need to reload the files for some reason, such as the ‘powers that be’ deciding they needed this other column now. We would have to decompress the files and reload the data.

Then the last problem of which most of us know all too well little or no budget. Well, there goes must of the commercial products, okay all of them.

Now, this is where I see my big chance to bring up Hadoop. First billions of rows and columns check, second cheap commodity hardware check, and last but not leased open source little check.

The next thing that I realized, we have way more people that know SQL, than we do java and pythons, which are the programming languages of choice for writing mapreduce. Good thing I am not the only one to have this problem. Facebook developers have written an application which sits on top of Hadoop call Hive. Hive is great it allows users that are already familiar with SQL to write similar queries in HiveQL and then Hive takes that query and then transforms it into mapreduce queries.

It proof of concept time. I was able to spin up a 14 VM Hadoop cluster in about two days, copy over a sizable amount of test data in another. I spent a couple week playing around with HiveQL and converting a few SQL queries over to HiveQL. With the results being I am able to process five days in the same time it was taking to process one day. Not bad for 14 VM.

So stay tune for more blog entry on Hadoop, Hive, and the crew.

Using Iozone for Filesystem Benchmarking

If you have been around computer systems long enough everyone knows how import disk performance is, espeacially with database systems. There’s the standard htparm -tT and dd test the everyone does, but it really does give you the whole picture. What you really want is to test read, write, re-read, re-write, read backwards, read strided, fread, fwrite, random read, pread ,mmap, aio_read, and aio_write. For that I would recommend using Iozone. It gives you a better idea of what’s going on.

http://www.iozone.org