Category: Databases Administration

Covering NoSQL, Relational Database, Data Visualization, and Reporting.

  • Looking at the Hadoop MapReduce Capacity, Fair, and HOD Schedulers.

    Today I started looking at the different MapReduce schedulers, because I would like to be able to start processing new jobs as slots become available. So I started looking at the other schedulers that come with Hadoop.

    The Capacity Scheduler:

    The Capacity Scheduler is designed to run Hadoop Map-Reduce as a shared, multi-tenant cluster in an operator-friendly manner while maximizing the throughput and the utilization of the cluster while running Map-Reduce applications.

    The Fair Scheduler:

    Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, task slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time. Unlike the default Hadoop scheduler, which forms a queue of jobs, this lets short jobs finish in reasonable time while not starving long jobs. It is also an easy way to share a cluster between multiple users. Fair sharing can also work with job priorities – the priorities are used as weights to determine the fraction of total compute time that each job gets.

    The Hod Scheduler:

    Hadoop On Demand (HOD) is a system for provisioning and managing independent Hadoop MapReduce and Hadoop Distributed File System (HDFS) instances on a shared cluster of nodes. HOD is a tool that makes it easy for administrators and users to quickly set up and use Hadoop. HOD is also a very useful tool for Hadoop developers and testers who need to share a physical cluster for testing their own Hadoop versions.

    I decided to start with the Fair Scheduler, since it seems to fit my needs; a rough sketch of the configuration is below, and I will try to keep you informed of my progress.
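    Here is roughly what enabling it looks like on a classic (MRv1) cluster, going by the Fair Scheduler docs for Hadoop 1.x. Treat it as a sketch: the allocation file path, pool name, and numbers are made-up placeholders, and the fair scheduler jar from contrib has to be on the JobTracker classpath.

    In mapred-site.xml:

    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>
    <property>
      <name>mapred.fairscheduler.allocation.file</name>
      <value>/etc/hadoop/fair-scheduler.xml</value>
    </property>

    And a minimal fair-scheduler.xml with one example pool ("analytics" is just a made-up name):

    <?xml version="1.0"?>
    <allocations>
      <pool name="analytics">
        <minMaps>10</minMaps>
        <minReduces>5</minReduces>
        <weight>2.0</weight>
      </pool>
    </allocations>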

    –Happy Data

     


  • Apache Bigtop Hadoop

    The primary goal of Bigtop is to build a community around the packaging, deployment and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects.

    If you're looking for easy-to-install packaging or something to set up in a software repo, I suggest you check it out; a rough install sketch is below.

    https://cwiki.apache.org/BIGTOP/index.html
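    If you want to kick the tires on a CentOS box, the general flow is something like the following. This is only a sketch: the repo URL is a placeholder and the exact package names depend on the Bigtop release and OS you pick, so check the wiki above for the real ones.

    $ sudo wget -O /etc/yum.repos.d/bigtop.repo <bigtop-repo-url-for-your-release-and-os>
    $ yum search hadoop
    $ sudo yum install hadoop hadoop-conf-pseudo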

    peace!

  • MySQL Cheat Sheet

    Found this today. Not too bad for a quick reference sheet. MySQL Cheat Sheet

  • Hortonworks Releases Its (Conservative) Hadoop Platform

    Check this article out.

    “Hortonworks sets itself apart from Cloudera and MapR by sticking with standards-based software and Apache’s proven 1.0 code line.”

  • Configuring MySQL as a Hive Metastore

    By default Hive uses Derby to store its metadata. If you have ever tried to have more than one person, or multiple services such as the Hive Web UI or the Hive service, connect at the same time, it doesn't work very well and can cause problems. To improve this I recommend using MySQL as the metastore database.

     

    Install a MySQL server and client either on the same server as Hive or a remote server.

    Log into MySQL and create the metastore database and the user permissions.

     

    shell> mysql -u root -p

    mysql> CREATE DATABASE hive_metastore;
    mysql> CREATE USER 'hive'@'[localhost|remotehost]' IDENTIFIED BY 'metastore';
    mysql> GRANT ALL PRIVILEGES ON hive_metastore.* TO 'hive'@'[localhost|remotehost]';

     

    Now download the MySQL JDBC driver (Connector/J) from http://www.mysql.com/downloads/connector/j/ and install it into your Hive library path "install_path/hive/lib".
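    Something like this works, assuming you grabbed the tar.gz version of Connector/J (the version number below is just a placeholder):

    shell> tar -xzf mysql-connector-java-x.y.z.tar.gz
    shell> cp mysql-connector-java-x.y.z/mysql-connector-java-x.y.z-bin.jar install_path/hive/lib/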

     

    Now configure your hive-site.xml file which should be in “install_path/hive/conf/hive-site.xml”.

     

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>

      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://[localhost|remotehost]/hive_metastore</value>
        <description>JDBC connect string for a JDBC metastore</description>
      </property>

      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
        <description>Driver class name for a JDBC metastore</description>
      </property>

      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
        <description>username to use against metastore database</description>
      </property>

      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>metastore</value>
        <description>password to use against metastore database</description>
      </property>

      <property>
        <name>datanucleus.autoCreateSchema</name>
        <value>false</value>
        <description>Creates the necessary schema on startup if one doesn't exist. Set this to false after the schema has been created once.</description>
      </property>

    </configuration>

    More information at: https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin
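    Once the metastore schema exists (run the metastore schema script that ships with Hive once, or temporarily set datanucleus.autoCreateSchema to true for the first start), a quick sanity check is to create a table from the Hive CLI and then look for it in MySQL. The test table name below is made up; TBLS and TBL_NAME come from the standard metastore schema.

    shell> hive -e "CREATE TABLE metastore_smoke_test (id INT); SHOW TABLES;"

    mysql> USE hive_metastore;
    mysql> SELECT TBL_NAME FROM TBLS;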

  • Hadoop, Hive, Oh my.

    Sorry, I haven't really written anything in the last couple of months, but after a year or more of reading about Hadoop and poking at it, I have actually gotten the chance to do something with it.

    A year or so ago I helped build a MySQL database to help analyze user data. A large amount of user data: billions of rows of CSV files. Like any good DBA and/or DB developer would, we started importing and normalizing the data so we could aggregate it into a relational database. This worked great, except a few problems started to show up.

    The first is that the bigger your tables get, the slower the database gets. Even with table partitioning and several other tricks of the trade, it gets slow. Let's be honest here: a single query summing up billions of rows is going to take a little while no matter what the database is, unless it's vertically partitioned or you are using some type of shard architecture.

    The second problem we had was that we were expected to store the original CSV files for up to two years, for business reasons, and no matter what the SAN storage vendors tell you, SAN storage is not as cheap as they would like you to think it is, even if you compress the files. On top of that, there was the fact that if we needed to reload the files for some reason, such as the ‘powers that be’ deciding they needed this other column now, we would have to decompress the files and reload the data.

    Then there is the last problem, which most of us know all too well: little or no budget. Well, there go most of the commercial products; okay, all of them.

    Now, this is where I saw my big chance to bring up Hadoop. First, billions of rows and columns: check. Second, cheap commodity hardware: check. And last but not least, open source: check.

    The next thing I realized was that we have way more people who know SQL than we do Java and Python, which are the programming languages of choice for writing MapReduce jobs. Good thing I am not the only one to have this problem. Facebook developers have written an application that sits on top of Hadoop called Hive. Hive is great: it allows users who are already familiar with SQL to write similar queries in HiveQL, and Hive then transforms each query into MapReduce jobs.
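    To give a feel for HiveQL, here is the kind of thing I mean. The table, columns, and path are made-up placeholders, not our actual schema: an external table pointed at a directory of CSV files, and a summing query that Hive turns into MapReduce behind the scenes.

    CREATE EXTERNAL TABLE user_events (
      user_id    BIGINT,
      event_date STRING,
      bytes_used BIGINT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/user_events';

    SELECT user_id, SUM(bytes_used) AS total_bytes
    FROM user_events
    GROUP BY user_id;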

    It's proof-of-concept time. I was able to spin up a 14-VM Hadoop cluster in about two days and copy over a sizable amount of test data in another. I spent a couple of weeks playing around with HiveQL and converting a few SQL queries over to HiveQL, with the result being that I am able to process five days of data in the same time it was taking to process one day. Not bad for 14 VMs.

    So stay tuned for more blog entries on Hadoop, Hive, and the crew.

  • Using Mysqldump to Backup Your Stored Procedures, Functions and Triggers

    Now that MySQL has stored procedures, functions, and triggers, I have found myself needing to back them up or make copies to move to a different MySQL server. It's actually pretty easy to do with mysqldump. Here is the command line for it.

    shell> mysqldump --routines --no-create-info --no-data --no-create-db --skip-opt <database> > outputfile.sql

    Then, if you need to restore it:

    shell> mysql <database> < outputfile.sql
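    And if the copy is headed to a different MySQL server, something along these lines does it; the host name is just a placeholder, and the target database has to exist first.

    shell> mysqladmin -h otherhost -u root -p create <database>
    shell> mysql -h otherhost -u root -p <database> < outputfile.sql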

    more info: http://dev.mysql.com/doc/refman/5.5/en/mysqldump.html

  • Slacking over the Holidays

    So December was an interesting month for me, with a few changes in my personal career and the holidays, and I took a week off. That of course led to me spending two weeks catching up on work. To top it off, my boss, or someone above him, decided that we needed to take a look at our service architecture. So of course I am now in research mode, which should hopefully lead to some good blog posts. Stay tuned.

  • Connecting PHP to An Oracle Instance on RedHat or CentOS 5

    Here lately it seems that everyone wants to connect to Oracle, but I have to admit this was the first time someone has asked me to get PHP to talk to Oracle. It was a lot less painful than I thought it would be, so here is what I did.

    Along with the standard PHP RPMs, you need to install a couple of additional RPMs from Oracle. These are oracle-instantclient-basic and oracle-instantclient-devel, which can be downloaded from http://www.oracle.com/technetwork/database/features/instant-client/index-100365.html. You will also need the php-oci8 RPM, which can be downloaded from http://oss.oracle.com/projects/php/files/EL5/.

    So after you have downloaded the RPMs, go ahead and install the packages and create the symlink for libclntsh.so.

    $ rpm -Uvh oracle-instantclient-basic-##.#.#.rpm
    $ rpm -Uvh oracle-instantclient-devel-##.#.#.rpm
    $ cd /usr/include/oracle/##.#/[client|client64]
    $ ln -s libclntsh.so.##.# libclntsh.so

    Now you are going to want to set up your environment settings. It is important to set all Oracle environment variables before starting Apache or running a PHP script, so that the OCI8 process environment is correctly initialized. Setting environment variables in PHP scripts can lead to obvious or non-obvious problems. You can also add the Instant Client library path to /etc/ld.so.conf (a quick sketch of that follows the export below).

    $ LD_LIBRARY_PATH=/usr/lib/oracle/##.#/[client|client64]/lib:${LD_LIBRARY_PATH}
    $ export LD_LIBRARY_PATH
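    The /etc/ld.so.conf route looks roughly like this; the file name under /etc/ld.so.conf.d/ is just one I made up, and ##.# is the Instant Client version as above.

    $ echo /usr/lib/oracle/##.#/[client|client64]/lib > /etc/ld.so.conf.d/oracle-instantclient.conf
    $ ldconfig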

    And now for the big finish. Here is a simple connection script to test it all out.

    <?php
    // USERNAME, PASSWORD, and SERVERNAME:PORT/SERVICE_NAME are placeholders for your own values.
    $c = oci_connect( 'USERNAME',
                      'PASSWORD',
                      'SERVERNAME:PORT/SERVICE_NAME' );

    if ( $c ) {

        $s = oci_parse( $c, 'SELECT TABLE_NAME FROM all_tables' );

        oci_execute( $s );

        while ( $res = oci_fetch_array( $s, OCI_ASSOC ) ) {
            echo $res['TABLE_NAME'] . "\n";
        }

    }
    ?>

    For a complete list of functions and additional install resources, check out the following sites:

    http://php.net/manual/en/book.oci8.php
    http://wiki.oracle.com/page/PHP
    http://www.oracle.com/technetwork/articles/technote-php-instant-084410.html