Using the DistCp tool

Use DistCp to copy files between various clusters.

The most common use of DistCp is an inter-cluster copy:

hadoop distcp hdfs://nn1:8020/source hdfs://nn2:8020/destination

Where hdfs://nn1:8020/source is the data source, and hdfs://nn2:8020/ destination is the destination.

This will expand the name space under /source on NameNode "nn1" into a temporary file, partition its contents among a set of map tasks, and start copying from "nn1" to "nn2". Note that DistCp requires absolute paths.

You can also specify multiple source directories:

hadoop distcp hdfs://nn1:8020/source/a hdfs://nn1:8020/source/b hdfs://
          nn2:8020/destination

Or specify multiple source directories from a file with the -f option:


hadoop distcp -f hdfs://nn1:8020/srclist hdfs://nn2:8020/destination

Where srclist contains:

hdfs://nn1:8020/source/a

hdfs://nn1:8020/source/b