Posted to mapreduce-user@hadoop.apache.org by David Rosenstrauch <da...@darose.net> on 2013/02/28 07:12:24 UTC

Distcp reliability issue

I've run into an issue with the reliability of distcp.  Specifically, I
have a distcp job that appears to have skipped a few files - and yet
the job didn't fail.  I was hoping someone here might have some
suggestions for a fix or workaround.


So I ran a distcp job, copying from one Amazon S3 bucket to another.
The job did have some task failures - but every task that failed
eventually got re-run successfully.  The job as a whole completed
successfully, and reported that all files were copied successfully.

However, since it's important data, I re-ran the distcp afterwards just
to make sure everything had copied over successfully.  (And also because
I had canceled an earlier run of the same distcp, and I wanted to make
sure that didn't screw anything up.)  Although the re-run of the distcp
skipped over most of the files (as it should), it actually wound up
copying 7 files - i.e., 7 files that didn't get copied by the first job.
This obviously shouldn't have happened: the first run should have copied
all of the files, and the second run should have copied zero.

I have task logs (and job counters) saved that show all of this.
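In the meantime, the sanity check I'm doing by hand could be scripted: list the files (with sizes) on both the source and destination after the distcp, and diff the two listings.  A minimal sketch in Python, assuming the listings have already been parsed into path-to-size maps (the paths and sizes below are made up for illustration; in practice they'd come from something like the output of `hadoop fs -ls` against each bucket):

```python
def find_missing(source_listing, dest_listing):
    """Compare source and destination listings (dicts of path -> size in bytes).

    Returns (missing, size_mismatch): paths absent from the destination,
    and paths present on both sides but with differing sizes.
    """
    missing = sorted(p for p in source_listing if p not in dest_listing)
    size_mismatch = sorted(
        p for p, size in source_listing.items()
        if p in dest_listing and dest_listing[p] != size
    )
    return missing, size_mismatch


# Hypothetical listings - e.g. parsed from `hadoop fs -ls` on each bucket:
src = {"part-00000": 1024, "part-00001": 2048, "part-00002": 512}
dst = {"part-00000": 1024, "part-00002": 99}

missing, mismatched = find_missing(src, dst)
print(missing)     # paths that never got copied
print(mismatched)  # paths copied but with a different size
```

A size comparison won't catch silent corruption, but it would at least flag files that distcp dropped entirely, like the 7 I saw here.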

I think I remember a colleague of mine from a previous job running into 
a situation like this before, where he wound up having to run distcp 
jobs twice in order to reliably ensure that all files copied 
successfully.  But I don't know what (if anything) he eventually did to 
work around the issue.

Has anyone ever run into this before, and/or have any pointers to
discussions about this issue or a solution?  (Or even info about a
home-grown workaround you've used.)  Google didn't turn up much.

Thanks,

DR