Blog

  • Wrangling Customer Usage Data with Hadoop

    Here is our session from the Hadoop Summit 2013.

     

    Title: Wrangling Customer Usage Data with Hadoop

    Slides: http://www.slideshare.net/Hadoop_Summit/hall-johnson-june271100amroom211v2

    Description:

    At Clearwire we have a big data challenge: Processing millions of unique usage records comprising terabytes of data for millions of customers every week. Historically, massive purpose-built database solutions were used to process data, but weren't particularly fast, nor did they lend themselves to analysis. As mobile data volumes increase exponentially, we needed a scalable solution that could process usage data for billing, provide a data analysis platform, and inexpensively store the data indefinitely. The solution? A Hadoop-based platform allowed us to architect and deploy an end-to-end solution based on a combination of physical data nodes and virtual edge nodes in less than six months. This solution allowed us to turn off our legacy usage processing solution and reduce processing times from hours to as little as 15 minutes. This improvement has enabled Clearwire to deliver actionable usage data to partners faster and more predictably than ever before. Usage processing was just the beginning; we're now turning to the raw data stored in Hadoop, adding new data sources, and starting to analyze the data. Clearwire is now able to put multiple data sources in the hands of our analysts for further discovery and actionable intelligence.

     

  • Something new, writing Android App in HTML 5

    I wanted to write an App for my Android Tablet, something easy so I could get a basic understanding of what it would take to write an app. I figured that I could knock something out fairly quickly, but I didn’t know it would be this easy.

    I'm a fairly big user of Eclipse, so I installed the Android SDK and off I went.

    I created a New Android Project and edited main.xml so it looked like the following.

    --- code starts ---

    <?xml version="1.0" encoding="utf-8"?>

    <LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
                         android:layout_width="fill_parent"
                         android:layout_height="fill_parent"
                         android:orientation="vertical">
      <WebView
                         android:layout_width="fill_parent"
                         android:layout_height="fill_parent"
                         android:id="@+id/webView" />
    </LinearLayout>
    --- code stops ---

    I then updated my Java code so that it would use the WebView toolkit and open my HTML5 app.

    --- code starts ---
    package com.rogerhosto.myandroidapp;
    import android.app.Activity;
    import android.os.Bundle;
    import android.webkit.WebView;
    public class MyAndroidAppActivity extends Activity {
    	/** Called when the activity is first created. */
    	@Override
    	public void onCreate(Bundle savedInstanceState) {
    		super.onCreate(savedInstanceState);
    		setContentView(R.layout.main);
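    		// Find the WebView declared in main.xml, enable JavaScript, and load the bundled HTML5 page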
    		WebView webView = (WebView)findViewById(R.id.webView);
    		webView.getSettings().setJavaScriptEnabled(true);
    		webView.loadUrl("file:///android_asset/www/index.html");
    	}
    }

    --- code stops ---

    Then I created a www directory in the existing assets directory and created index.html for all my HTML5 code. That's it.
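
    If you want a starting point, a bare-bones index.html might look something like the sketch below. The content is only a placeholder; the important part is that the file ends up at assets/www/index.html so the loadUrl() call above can find it.

    --- code starts ---
    <!DOCTYPE html>
    <html>
    <head>
      <meta charset="utf-8">
      <title>My HTML5 App</title>
    </head>
    <body>
      <h1>Hello from HTML5</h1>
      <!-- Placeholder page: your HTML5 markup, CSS, and JavaScript go here -->
      <script>
        document.body.appendChild(document.createTextNode("JavaScript is enabled."));
      </script>
    </body>
    </html>
    --- code stops ---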

    Pretty Easy.

     

  • Windows Azure HDInsight ( Hadoop on Windows )

    Lately I have been asked by a lot of my co-workers if Hadoop runs on Windows. After going to the Hadoop Summit last month, I have been able to tell them about Azure HDInsight, which is basically Apache Hadoop running on Windows Azure.

    It appears that Microsoft has been working with Hortonworks to bring Apache Hadoop to Windows, and here is the end product.

    http://www.windowsazure.com/en-us/documentation/services/hdinsight/

    So if you are interested in Hadoop on Windows, check it out.

  • Hadoop to Hadoop Copy

    Recently I needed to copy the contents of one Hadoop cluster to another for geo redundancy. Thankfully, instead of having to write something to do it, Hadoop supplies a handy tool for the job: DistCp (distributed copy).

     

    DistCp is a tool used for large inter/intra-cluster copying. It uses Map/Reduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list. Its Map/Reduce pedigree has endowed it with some quirks in both its semantics and execution. The purpose of this document is to offer guidance for common tasks and to elucidate its model.

     

    Here are the basics of using it:

     

    bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
              hdfs://nn2:8020/bar/foo

     

    This will expand the namespace under /foo/bar on nn1 into a temporary file, partition its contents among a set of map tasks, and start a copy on each TaskTracker from nn1 to nn2. Note that DistCp expects absolute paths.

     

    Here is how you can handle multiple source directories on the command line:

     

    bash$ hadoop distcp hdfs://nn1:8020/foo/a \
              hdfs://nn1:8020/foo/b \
              hdfs://nn2:8020/bar/foo
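
    Since my copy is for geo redundancy, I will be re-running it on a schedule, and DistCp has options for that kind of repeat run. Here is a sketch based on the DistCp documentation (the paths and the map count are just placeholders): -update copies only the files that are missing or have changed on the target, and -m caps the number of simultaneous map tasks doing the copying.

    bash$ hadoop distcp -update -m 20 \
              hdfs://nn1:8020/foo/bar \
              hdfs://nn2:8020/bar/foo

    There is also an -overwrite option if you want files on the target clobbered unconditionally.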

  • Hortonworks Road Show "Big Business Value from Big Data and Hadoop"

    This morning I went to the Hortonworks Road Show. It wasn't bad. I have to say, out of the Hadoop vendors I have talked to, I like Hortonworks' business model the best.

    The fact that they are a large committer to the Apache Hadoop Project, along with several other sub-projects such as the Apache Ambari Project, doesn't hurt. They seem to be more community-based than the others. If you have a chance, or know someone who would like a good introduction to Hadoop, I would recommend that they go.

    http://info.hortonworks.com/RoadShowFall2012.html?mktotrk=roadshow

    –Peace

  • Working with Hadoop Streaming

    Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer. For example:

    shell> $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar  -input myInputDirs -output myOutputDir -mapper /bin/cat -reducer /bin/wc

    If you are using the tar package from Apache Hadoop, you can find hadoop-streaming.jar at $HADOOP_HOME/contrib/streaming/hadoop-streaming-xxx.jar
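
    Of course /bin/cat and /bin/wc already exist on every node. If your mapper or reducer is a script of your own, streaming can ship it with the job using the -file option. The command below is a sketch; myMapper.py and myReducer.py are just placeholder names for whatever scripts you write:

    shell> $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
              -input myInputDirs \
              -output myOutputDir \
              -mapper myMapper.py \
              -reducer myReducer.py \
              -file myMapper.py \
              -file myReducer.py

    The scripts simply read lines from stdin and write tab-separated key/value pairs to stdout, which is all streaming expects.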

  • Amazon Relational Database Service (Amazon RDS)

    It appears that Amazon is introducing a new service specifically targeted at relational databases. You can choose from MySQL, Oracle, and Microsoft SQL Server.

    Amazon Relational Database Service (Amazon RDS) is a web service that makes it easy to set up, operate, and scale a relational database in the cloud. It provides cost-efficient and resizable capacity while managing time-consuming database administration tasks, freeing you up to focus on your applications and business.

  • Looking at the Hadoop MapReduce Capacity, Fair, and HOD Schedulers.

    Today I started looking at the different MapReduce schedulers, because I would like to be able to start processing new jobs as slots become available. So I took a look at the other schedulers that ship with Hadoop.

    The Capacity Scheduler:

    The Capacity Scheduler is designed to run Hadoop Map-Reduce as a shared, multi-tenant cluster in an operator-friendly manner while maximizing the throughput and the utilization of the cluster while running Map-Reduce applications.

    The Fair Scheduler:

    Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, task slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time. Unlike the default Hadoop scheduler, which forms a queue of jobs, this lets short jobs finish in reasonable time while not starving long jobs. It is also an easy way to share a cluster between multiple users. Fair sharing can also work with job priorities – the priorities are used as weights to determine the fraction of total compute time that each job gets.

    The HOD Scheduler:

    Hadoop On Demand (HOD) is a system for provisioning and managing independent Hadoop MapReduce and Hadoop Distributed File System (HDFS) instances on a shared cluster of nodes. HOD is a tool that makes it easy for administrators and users to quickly set up and use Hadoop. HOD is also a very useful tool for Hadoop developers and testers who need to share a physical cluster for testing their own Hadoop versions.

    I decided to start with the Fair Scheduler, since it seems to fit my needs, and I will try to keep you informed of my progress.
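
    For anyone who wants to follow along, enabling the Fair Scheduler is mostly a mapred-site.xml change on the JobTracker. The snippet below is only a sketch based on the Fair Scheduler documentation for the classic (MR1) JobTracker, and the allocation file path is just an example location, so check the property names against your Hadoop version:

    <!-- mapred-site.xml: tell the JobTracker to use the Fair Scheduler -->
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>

    <!-- Optional: allocation file defining pools, weights, and minimum shares
         (the path below is just an example location) -->
    <property>
      <name>mapred.fairscheduler.allocation.file</name>
      <value>/etc/hadoop/conf/fair-scheduler.xml</value>
    </property>

    After restarting the JobTracker, the scheduler's pools should show up on the JobTracker web UI.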

    –Happy Data

     

  • Serengeti Soups Up Apache Hadoop

    The primary goal of Bigtop is to build a community around the packaging, deployment and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects.

    If you are looking for easy-to-install packaging or something to set up in a software repo, I suggest you check it out.

     

    https://cwiki.apache.org/BIGTOP/index.html

     

    peace!