Make way for Hadoop in the ‘Big Data’ craze

An interesting bit on Hadoop, though a little overhyped if you ask me.

http://www.marketwatch.com/story/make-way-for-hadoop-in-the-big-data-craze-2012-06-26?link=MW_latest_news

–Regards

Working with Hadoop Streaming

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer. For example:

shell> $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input myInputDirs -output myOutputDir -mapper /bin/cat -reducer /bin/wc

If you are using the tar package from Apache Hadoop, you can find hadoop-streaming.jar at $HADOOP_HOME/contrib/streaming/hadoop-streaming-xxx.jar
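To go one step beyond /bin/cat and /bin/wc, here is a minimal sketch of a word-count mapper and reducer in Python. The file name wordcount.py and the "map"/"reduce" arguments are my own choices for illustration, not anything Hadoop requires. Streaming only expects each script to read lines from stdin and write tab-separated key/value lines to stdout.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts for each word.

    Assumes the input is sorted by key, which is how streaming
    hands mapper output to the reducer.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    # Hypothetical command line: "wordcount.py map" or "wordcount.py reduce".
    if len(sys.argv) > 1 and sys.argv[1] in ("map", "reduce"):
        step = map_lines if sys.argv[1] == "map" else reduce_lines
        for record in step(sys.stdin):
            print(record)
```

You can try it without a cluster by piping a file through it: cat input.txt | python wordcount.py map | sort | python wordcount.py reduce. On a cluster, you would presumably pass the same two commands as -mapper and -reducer and ship the script with the -file option, but check the streaming docs for your Hadoop version.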

Amazon Relational Database Service (Amazon RDS)

It appears that Amazon is introducing a new service specifically targeted at relational databases. You can choose from MySQL, Oracle, and Microsoft SQL Server.

Amazon Relational Database Service (Amazon RDS) is a web service that makes it easy to set up, operate, and scale a relational database in the cloud. It provides cost-efficient and resizable capacity while managing time-consuming database administration tasks, freeing you up to focus on your applications and business.

Looking at the Hadoop MapReduce Capacity, Fair, and Hod Schedulers.

Today I started looking at the different MapReduce schedulers that come with Hadoop, because I would like to be able to start processing new jobs as soon as slots become available.

The Capacity Scheduler:

The Capacity Scheduler is designed to run Hadoop Map-Reduce as a shared, multi-tenant cluster in an operator-friendly manner while maximizing the throughput and utilization of the cluster.

The Fair Scheduler:

Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, task slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time. Unlike the default Hadoop scheduler, which forms a queue of jobs, this lets short jobs finish in reasonable time while not starving long jobs. It is also an easy way to share a cluster among multiple users. Fair sharing can also work with job priorities: the priorities are used as weights to determine the fraction of total compute time that each job gets.
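To make the weighting idea concrete, here is a toy calculation in Python. It is only a sketch of the proportional-share arithmetic described above, not how the Fair Scheduler is actually implemented, and the job names, weights, and slot count are made up.

```python
def fair_shares(total_slots, job_weights):
    """Split total_slots among jobs in proportion to their weights."""
    total_weight = sum(job_weights.values())
    return {
        job: total_slots * weight / total_weight
        for job, weight in job_weights.items()
    }


# A single job gets the whole (hypothetical) 100-slot cluster.
print(fair_shares(100, {"etl": 1}))

# Equal weights give equal shares; a weight of 2 doubles a job's share.
print(fair_shares(100, {"etl": 1, "report": 1}))
print(fair_shares(100, {"etl": 2, "adhoc": 1, "report": 1}))
```

With a weight of 2, the etl job ends up with half of the slots while the other two jobs split the rest evenly.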

The Hod Scheduler:

Hadoop On Demand (HOD) is a system for provisioning and managing independent Hadoop MapReduce and Hadoop Distributed File System (HDFS) instances on a shared cluster of nodes. HOD is a tool that makes it easy for administrators and users to quickly set up and use Hadoop. HOD is also a very useful tool for Hadoop developers and testers who need to share a physical cluster for testing their own Hadoop versions.

I decided to start with the Fair Scheduler, since it seems to fit my needs, and I will try to keep you informed of my progress.

–Happy Data

Serengeti Soups Up Apache Hadoop

The primary goal of Bigtop is to build a community around the packaging, deployment and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects.

If you are looking for easy-to-install packaging, or something to set up in a software repo, I suggest you check it out.

https://cwiki.apache.org/BIGTOP/index.html

peace!

Roger Hosto

Open Source Geek

Monthly Archives: June 2012
