Posted to user@hadoop.apache.org by Ashish Paliwal <as...@gmail.com> on 2016/12/21 12:58:44 UTC

Hadoop MultipleOutputs API Issue

Hi,

Hadoop Map Reduce version: 2.2.0

We are using MultipleOutputs to write multiple output files from the Mapper (no
reducer). As per the requirement, MultipleOutputs should write to a directory
other than the job's default output directory, so we used the MultipleOutputs
method below to write to a different directory.

  public <K, V> void write(String namedOutput, K key, V value,
                           String baseOutputPath)

Now, if any map task runs for a long time, then (because speculative execution
is enabled) Hadoop starts a parallel attempt to finish the task early. Both
attempts then try to write to the same file in the same directory; the second
attempt fails with a "File already exists" error, and so does the job.
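The collision can be reproduced without Hadoop at all. The sketch below (plain
Java, no Hadoop dependency; the class and file names are made up for
illustration) simulates two speculative attempts resolving the same absolute
output path: the first create succeeds, the second fails exactly as described
above.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

public class CollisionDemo {
    // Simulates what happens when two speculative attempts of the same map
    // task write to the same absolute baseOutputPath: the first create
    // succeeds, the second one finds the file already present and fails.
    static String tryCreate(Path out) {
        try {
            Files.createFile(out); // atomic create; fails if the file exists
            return "created";
        } catch (FileAlreadyExistsException e) {
            return "File already exists";
        } catch (IOException e) {
            return "io error";
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("mo-demo");
        // Both attempts compute the identical output path, because there is
        // no per-attempt temporary directory (unlike FileOutputCommitter,
        // which gives each attempt its own _temporary/.../attempt_* dir).
        Path attempt1 = dir.resolve("part-m-00000");
        Path attempt2 = dir.resolve("part-m-00000");
        System.out.println(tryCreate(attempt1)); // created
        System.out.println(tryCreate(attempt2)); // File already exists
    }
}
```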

After analyzing this, we found that, unlike the default context writer,
*MultipleOutputs does not create any temporary directory*. It starts writing
directly into the output directory. The reason is that the FileOutputCommitter
used by the default context writer (and hence by the Application Master) is
different from the writer used by MultipleOutputs, so in the MultipleOutputs
case none of the FileOutputCommitter methods get called.

So is this a known issue, or the default behavior? And what is the solution to
this problem?
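Not an authoritative answer, but two mitigations are commonly tried for this
failure mode. First, passing a baseOutputPath that is *relative* to the job's
output directory should keep the files under the task attempt's
committer-managed work directory, so speculative attempts no longer share a
path (worth verifying against your 2.2.0 build). Second, map-side speculative
execution can simply be disabled, e.g.:

```xml
<!-- mapred-site.xml (or set per job): turn off map-side speculative
     execution so only one attempt ever writes the file. Property name
     is the Hadoop 2.x name. -->
<property>
  <name>mapreduce.map.speculative</name>
  <value>false</value>
</property>
```

The same setting can also be applied per job via
job.getConfiguration().setBoolean("mapreduce.map.speculative", false), at the
cost of losing speculation for slow map tasks.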


Regards,
Ashish.

Re: Hadoop MultipleOutputs API Issue

Posted by Ashish Paliwal <as...@gmail.com>.
Please share comments on the above issue.

Regards,
Ashish.
