Here recently I need to copy the content of one hadoop cluster to another for geo redundancy. Thankfully instead of have to write something to do it, Hadoop supply a hand tool to do it “DistCp (distributed copy)”.
DistCp is a tool used for large inter/intra-cluster copying. It uses Map/Reduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list. Its Map/Reduce pedigree has endowed it with some quirks in both its semantics and execution. The purpose of this document is to offer guidance for common tasks and to elucidate its model.
Here are the basic for using:
bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
This will expand the namespace under /foo/bar on nn1 into a temporary file, partition its contents among a set of map tasks, and start a copy on each TaskTracker from nn1 to nn2. Note that DistCp expects absolute paths.
Here is how you can handle multiple source directories on the command line:
bash$ hadoop distcp hdfs://nn1:8020/foo/a \