Roger Hosto

Good Talk


Hadoop to Hadoop Copy

Posted on March 1, 2013 by webgeek

Recently I needed to copy the contents of one Hadoop cluster to another for geo-redundancy. Thankfully, instead of having to write something myself, Hadoop supplies a handy tool for the job: DistCp (distributed copy).

 

DistCp is a tool used for large inter/intra-cluster copying. It uses Map/Reduce to effect its distribution, error handling and recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which will copy a partition of the files specified in the source list. Its Map/Reduce pedigree has endowed it with some quirks in both its semantics and execution. The purpose of this document is to offer guidance for common tasks and to elucidate its model.

 

Here is the basic usage:

 

bash$ hadoop distcp hdfs://nn1:8020/foo/bar \
        hdfs://nn2:8020/bar/foo

 

This will expand the namespace under /foo/bar on nn1 into a temporary file, partition its contents among a set of map tasks, and start a copy on each TaskTracker from nn1 to nn2. Note that DistCp expects absolute paths.
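
Since DistCp expects absolute paths, it can be worth checking the arguments before launching a job. What follows is only a minimal sketch and not part of Hadoop: the function names is_absolute_uri and safe_distcp and the DRY_RUN variable are made up for illustration. With DRY_RUN left at its default of 1, the script just prints the command it would run.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Hadoop): refuse to call distcp unless every
# argument is an hdfs:// URI with an absolute path or a plain path starting with "/".

is_absolute_uri() {
    case "$1" in
        hdfs://*/*) return 0 ;;   # scheme, authority, then an absolute path
        /*)         return 0 ;;   # absolute path on the default filesystem
        *)          return 1 ;;   # anything else is treated as relative
    esac
}

safe_distcp() {
    local arg
    if [ "$#" -lt 2 ]; then
        echo "usage: safe_distcp SRC... DEST" >&2
        return 2
    fi
    for arg in "$@"; do
        if ! is_absolute_uri "$arg"; then
            echo "not an absolute path: $arg" >&2
            return 1
        fi
    done
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run: show the command instead of contacting a cluster.
        echo "hadoop distcp $*"
    else
        hadoop distcp "$@"
    fi
}

safe_distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
```

Running it with DRY_RUN=0 would hand the same arguments to the real hadoop distcp command, while a relative argument such as foo/bar is rejected before any job is submitted.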

 

Here is how you can handle multiple source directories on the command line:

 

bash$ hadoop distcp hdfs://nn1:8020/foo/a \
        hdfs://nn1:8020/foo/b \
        hdfs://nn2:8020/bar/foo
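
When the list of sources grows, typing them all on the command line gets unwieldy. DistCp also accepts a -f option that reads the sources from a file of URIs. The sketch below is an outline to adapt rather than a finished script: the variable names and the HDFS location hdfs://nn1:8020/srclist are illustrative assumptions, and the Hadoop commands are printed rather than executed.

```shell
#!/usr/bin/env bash
# Sketch: build a source list and show the distcp call that would consume it.
# SRCLIST_LOCAL, SRCLIST_HDFS and DEST are illustrative names, not fixed by Hadoop.
SRCLIST_LOCAL="$(mktemp)"
SRCLIST_HDFS="hdfs://nn1:8020/srclist"
DEST="hdfs://nn2:8020/bar/foo"

# One fully qualified source URI per line.
printf '%s\n' \
    "hdfs://nn1:8020/foo/a" \
    "hdfs://nn1:8020/foo/b" > "$SRCLIST_LOCAL"

# Printed rather than executed; drop the "echo" to run against a real cluster.
echo hadoop fs -put "$SRCLIST_LOCAL" "$SRCLIST_HDFS"
echo hadoop distcp -f "$SRCLIST_HDFS" "$DEST"
```

Keeping the list in a file also makes it easy to keep under version control, so the set of replicated directories is documented alongside the job that copies them.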

Category: Databases Administration, System Administration
