Posted to common-user@hadoop.apache.org by S D <sd...@gmail.com> on 2009/03/23 05:45:43 UTC

Duplicate Output Directories in S3

I have a Hadoop Streaming program that crawls the web for data items,
processes each retrieved item, and then stores the results on S3. For each
processed item, a directory is created on S3 to hold the results of the
processing. At the conclusion of a run I've been getting a duplicate of
each directory. E.g., if I process items A1 and A2, I end up with two
directories for the results of A1 and two for the results of A2. The
corresponding directories are identical. I've checked my code and don't
see anything obvious that could cause this. Furthermore, it appears that
only one map task handles any given data item. Any suggestions on what
might be causing this?
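To make the setup concrete, here is a stripped-down sketch of the per-item
layout I mean (the bucket name, item ids, and file names below are made up
for illustration; my real job obviously does more than this). The comment
marks the spot where the task writes directly to S3:

```python
import sys

# Hypothetical bucket/prefix, not the real one from my job.
BUCKET = "s3://my-bucket/results"

def output_prefix(item_id):
    """Deterministic S3 'directory' (key prefix) for one processed item."""
    return "%s/%s/" % (BUCKET, item_id)

def mapper(lines, write_key):
    # Hadoop Streaming feeds one input record per line; here each line
    # is one data item id to crawl and process.
    for line in lines:
        item_id = line.strip()
        if not item_id:
            continue
        # The task uploads its results to S3 itself at this point, rather
        # than emitting to stdout and letting the job's output committer
        # write them. Note that any re-executed task attempt (a retry, or
        # a speculative duplicate) would repeat this upload.
        write_key(output_prefix(item_id) + "part-00000")

if __name__ == "__main__":
    mapper(sys.stdin, lambda key: sys.stdout.write(key + "\n"))
```

With input lines "A1" and "A2" this produces exactly one key prefix per
item, which is why the doubled directories surprise me.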

Thanks,
John