Hadoop to Hadoop Copy

Recently I needed to copy the contents of one Hadoop cluster to another for geo-redundancy. Thankfully, instead of having to write something to do it myself, Hadoop supplies a handy tool for the job: DistCp (distributed copy).

 

DistCp is a tool used for large inter/intra-cluster copying. It uses Map/Reduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list. Its Map/Reduce pedigree has endowed it with some quirks in both its semantics and execution. The purpose of this document is to offer guidance for common tasks and to elucidate its model.

 

Here is the basic usage:

 

bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
                    hdfs://nn2:8020/bar/foo

 

This will expand the namespace under /foo/bar on nn1 into a temporary file, partition its contents among a set of map tasks, and start a copy on each TaskTracker from nn1 to nn2. Note that DistCp expects absolute paths.

 

Here is how you can handle multiple source directories on the command line:

 

bash$ hadoop distcp hdfs://nn1:8020/foo/a \
                    hdfs://nn1:8020/foo/b \
                    hdfs://nn2:8020/bar/foo
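For the geo-redundancy case you usually re-run the copy on a schedule, and the DistCp documentation covers two options that help there: -update copies only files that are missing or differ at the destination, and -f reads the list of sources from a file instead of the command line. The srclist path below is just an illustration, not a real file on my cluster:

```shell
# Sync only files that are missing or changed at the destination
bash$ hadoop distcp -update hdfs://nn1:8020/foo/bar \
                            hdfs://nn2:8020/bar/foo

# Read the list of source paths from a file (one URI per line)
bash$ hadoop distcp -f hdfs://nn1:8020/srclist \
                       hdfs://nn2:8020/bar/foo
```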

Hortonworks Road Show “Big Business Value from Big Data and Hadoop”

This morning I went to the Hortonworks Road Show. It wasn't bad. I have to say, out of the Hadoop vendors I have talked to, I like Hortonworks' business model the best.

The fact that they are a large committer to the Apache Hadoop project, along with several other sub-projects such as Apache Ambari, doesn't hurt. They seem to be more community-based than the others. If you have a chance, or know someone who would like a good introduction to Hadoop, I would recommend going.

http://info.hortonworks.com/RoadShowFall2012.html?mktotrk=roadshow

–Peace

Working with Hadoop Streaming

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer. For example:

shell> $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar  -input myInputDirs -output myOutputDir -mapper /bin/cat -reducer /bin/wc

If you are using the tar package from Apache Hadoop, you can find the streaming jar at $HADOOP_HOME/contrib/streaming/hadoop-streaming-xxx.jar.

Amazon Relational Database Service (Amazon RDS)

It appears that Amazon is introducing a new service specifically targeted at relational databases. You can choose from MySQL, Oracle, and Microsoft SQL Server.

Amazon Relational Database Service (Amazon RDS) is a web service that makes it easy to set up, operate, and scale a relational database in the cloud. It provides cost-efficient and resizable capacity while managing time-consuming database administration tasks, freeing you up to focus on your applications and business.