Sorry, I haven’t really written anything in the last couple of months, but after a year or more of reading about Hadoop and poking at it, I have finally gotten the chance to do something with it.
A year or so ago I helped build a MySQL database to analyze user data. A large amount of user data: billions of rows across CSV files. Like any good DBA and/or database developer would, we started importing and normalizing the data so we could aggregate it into a relational database. This worked great, until a few problems started to show up.
The first is that the bigger your tables get, the slower the database gets, even with table partitioning and several other tricks of the trade. Let’s be honest here: a single query summing up billions of rows is going to take a little while no matter what the database is, unless it’s vertically partitioned or you are using some type of sharded architecture.
The second problem we had was that we were expected to store the original CSV files for up to two years, for business reasons, and no matter what the SAN storage vendors tell you, SAN storage is not as cheap as they would like you to think it is, even if you compress the files. On top of that, if we needed to reload the files for some reason, such as the ‘powers that be’ deciding they needed this other column now, we would have to decompress the files and reload the data.
Then there is the last problem, the one most of us know all too well: little or no budget. Well, there goes most of the commercial products. Okay, all of them.
Now, this is where I saw my big chance to bring up Hadoop. First, billions of rows and columns: check. Second, cheap commodity hardware: check. And last but not least, open source: check.
The next thing I realized was that we have way more people who know SQL than we do Java and Python, which are the programming languages of choice for writing MapReduce jobs. Good thing I am not the only one to have had this problem: Facebook developers have written an application that sits on top of Hadoop called Hive. Hive is great. It allows users who are already familiar with SQL to write similar queries in HiveQL, and Hive then transforms those queries into MapReduce jobs.
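To give a feel for what this looks like, here is a rough sketch, with table and column names made up purely for illustration. You can point an external Hive table at raw CSV files sitting in HDFS and then query it with familiar SQL-ish syntax; Hive compiles the query into MapReduce jobs behind the scenes:

```sql
-- Hypothetical example: expose raw CSVs in HDFS as a Hive table.
CREATE EXTERNAL TABLE user_events (
  user_id    BIGINT,
  event_date STRING,
  bytes_used BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/raw/user_events';

-- A familiar-looking aggregate; Hive turns this into MapReduce jobs.
SELECT event_date, SUM(bytes_used) AS total_bytes
FROM user_events
GROUP BY event_date;
```

A nice side effect of the external-table approach is that the CSV files stay in HDFS as-is, so if someone decides they need another column later, it can be a matter of updating the table definition rather than decompressing and re-importing everything.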
It was proof-of-concept time. I was able to spin up a 14-VM Hadoop cluster in about two days and copy over a sizable amount of test data in another. I spent a couple of weeks playing around with HiveQL and converting a few SQL queries over to it. The result: I am able to process five days of data in the same time it was taking to process one day. Not bad for 14 VMs.
So stay tuned for more blog entries on Hadoop, Hive, and the crew.