Hadoop distcp - is it possible to keep every file the same (keep file size)?

When I run a simple distcp command:

hadoop distcp s3://src-bucket/src-dir s3://dest-bucket/dest-dir 

      

I am getting a slight mismatch between the size (in bytes) of src-dir and dest-dir:

>aws s3 ls --summarize s3://src-bucket/src-dir/
...
Total Objects: 12290
   Total Size: 64911104881181

>aws s3 ls --summarize s3://dest-bucket/dest-dir/
...
Total Objects: 12290
   Total Size: 64901040284124

      

My question is:

  • What could introduce this discrepancy? Is the content of the copied files still the same as the original? (A per-file comparison, sketched below, can help pin down which objects differ.)
  • Most importantly, are there any options I can set to ensure that each file looks exactly like its source counterpart (i.e. has the same file size)?
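For reference, a per-object listing of the two prefixes can be diffed to see exactly which files changed size. This is only a minimal sketch, assuming the bucket/prefix names above and object keys without spaces:

# List "key size" for every object under each prefix, strip the differing
# prefix from the key, then diff the two listings to find the objects
# whose sizes do not match.
aws s3 ls --recursive s3://src-bucket/src-dir/ | awk '{print $4, $3}' | sed 's|^src-dir/||' | sort > src-listing.txt
aws s3 ls --recursive s3://dest-bucket/dest-dir/ | awk '{print $4, $3}' | sed 's|^dest-dir/||' | sort > dest-listing.txt
diff src-listing.txt dest-listing.txt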




2 answers


  • What could introduce this discrepancy? Is the content of the copied files still the same as the original?

Is it possible that there were concurrent writes to src-dir while DistCp was running? For example, was a file open for writing by another application while DistCp was copying it, with that application still writing content into the file?

Eventual consistency effects on S3 could also come into play, particularly around updates to existing objects. If an application overwrites an existing object, then there is a window of time afterwards in which applications reading that object might see the old version or the new version. More information on this is available in the AWS documentation on the Amazon S3 Data Consistency Model.
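One way to check the concurrent-write theory is to look at the most recent modification times under the source prefix; any object modified after the DistCp job started would explain a size drift at the destination. A rough sketch, assuming the bucket/prefix names from the question:

# Sort the recursive listing by its date/time columns and show the most
# recently modified source objects; compare these timestamps against the
# DistCp job start time.
aws s3 ls --recursive s3://src-bucket/src-dir/ | sort -k1,2 | tail -n 20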

  • Most importantly, are there any options I can set to ensure that each file looks exactly like its source counterpart (i.e. has the same file size)?


In general, DistCp compares a CRC of each source file against the new copy at the destination to confirm that it was copied correctly. I notice that you are using an S3 file system rather than HDFS. For S3, as for many of the alternative file systems, there is a limitation that prevents this CRC comparison from being performed.
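You can see that limitation for yourself by asking each file system for a file checksum: HDFS returns a composite checksum that DistCp can compare, while the S3 connectors typically return none, which is why the CRC verification cannot be done there. A rough illustration with placeholder paths (exact behaviour depends on the Hadoop version and connector):

# HDFS exposes a checksum (an MD5-of-CRCs composite) that DistCp can compare.
hadoop fs -checksum hdfs://namenode:8020/some/file

# The S3 connectors generally expose no comparable checksum, so the
# post-copy CRC verification is skipped for S3 targets.
hadoop fs -checksum s3a://dest-bucket/dest-dir/some-file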

As an added note, S3FileSystem (URIs with s3:// for the scheme) is effectively deprecated, unmaintained by the Apache Hadoop community, and poorly supported. If possible, we recommend that users migrate to S3AFileSystem (URIs with s3a:// for the scheme) for improved functionality, performance, and support. See the Hadoop documentation on Integration with Amazon Web Services for more details.

If you cannot find an explanation for the behavior you are seeing with s3://, then it is possible you are hitting a bug, and you may be better off trying s3a://. (If you have existing data that was already written with s3://, then you would need to figure out some kind of migration for that data first, such as copying from the s3:// URI to an equivalent s3a:// URI.)
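For what it's worth, a migration along those lines might look like the sketch below. The bucket names come from the question, the src-dir-migrated prefix is hypothetical, and it assumes S3A credentials are already configured (for example via fs.s3a.access.key / fs.s3a.secret.key or an instance profile):

# Hypothetical one-time migration: copy data that was written through the
# old s3:// connector to a new prefix accessed through s3a://.
hadoop distcp s3://src-bucket/src-dir s3a://src-bucket/src-dir-migrated

# After that, run the original copy entirely over s3a://; -update copies
# only files that are missing or have changed at the destination.
hadoop distcp -update s3a://src-bucket/src-dir-migrated s3a://dest-bucket/dest-dir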





My bet is that the difference lies in how src is compressed and how dst is compressed (or not). So I would say:

1) check the .*compress.* settings with which src was created

2) make sure they match the .*compress.* settings of your distcp job

Compression algorithms - given the same settings - should produce deterministic results. So I suspect there is a mismatch between the compression at the origin and the compression (or lack of it) at the destination.
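One rough way to test this hypothesis (not part of the original answer) is to compare the magic bytes of a pair of corresponding objects; gzip-compressed files, for example, start with 1f 8b. The object key below is a placeholder:

# Stream the first few bytes of one source object and its copy and compare
# the magic numbers; differing headers would indicate a compression mismatch.
aws s3 cp s3://src-bucket/src-dir/part-00000 - | head -c 4 | xxd
aws s3 cp s3://dest-bucket/dest-dir/part-00000 - | head -c 4 | xxd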









