Posted to issues@tez.apache.org by "Bikas Saha (JIRA)" <ji...@apache.org> on 2013/12/28 17:59:50 UTC

[jira] [Commented] (TEZ-624) Fix output committer to support multiple outputs

    [ https://issues.apache.org/jira/browse/TEZ-624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13858067#comment-13858067 ] 

Bikas Saha commented on TEZ-624:
--------------------------------

There are 3 aspects to this.
1) Multiple outputs from the same vertex
2) Multiple outputs from different vertices
3) Both of the above, but writing to the same output dir.

All of these can be solved by adding the output name to the output file name for the MROutput case. Currently the output files are named part-r-0000, where r is the task type and 0000 is the task id. We can replace the r with the name given to the output (as specified in the API). Alternatively, we can use the vertex name + output index. Either choice names the part file uniquely, so subsequent movements of these files during task and vertex commit will not collide with files written by other outputs.
The changes seem fairly straightforward. 
For the old API we need to change the MROutput.getOutputName() method.
For the new API we need to subclass FileOutputFormat and override the protected method getOutputName().
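
A minimal sketch of the naming idea, for illustration only. It does not touch MROutput or FileOutputFormat; the class PartFileNames, its method names, the underscore between vertex name and output index, and the five-digit padding are all assumptions made for this example and are not Tez or Hadoop API.

```java
import java.util.Locale;

// Hypothetical helper, not part of Tez or Hadoop: builds part file names
// under the current scheme and the two schemes proposed in the comment.
final class PartFileNames {
    private PartFileNames() {}

    // Assumption: task ids are zero-padded to five digits.
    private static final String ID_FORMAT = "%05d";

    // Current scheme: part-<task type>-<task id>, e.g. part-r-00000.
    static String current(char taskType, int taskId) {
        return String.format(Locale.ROOT, "part-%c-" + ID_FORMAT, taskType, taskId);
    }

    // Proposal 1: replace the task type with the output name.
    static String byOutputName(String outputName, int taskId) {
        requireSimpleName(outputName);
        return String.format(Locale.ROOT, "part-%s-" + ID_FORMAT, outputName, taskId);
    }

    // Proposal 2: vertex name + output index in place of the task type.
    // The "_" separator is an assumption of this sketch.
    static String byVertexAndIndex(String vertexName, int outputIndex, int taskId) {
        requireSimpleName(vertexName);
        return String.format(Locale.ROOT, "part-%s_%d-" + ID_FORMAT,
                vertexName, outputIndex, taskId);
    }

    // Reject names that could not be embedded in a single file name.
    private static void requireSimpleName(String name) {
        if (name == null || name.isEmpty() || name.contains("/")) {
            throw new IllegalArgumentException("unusable name: " + name);
        }
    }

    public static void main(String[] args) {
        // Two outputs of the same task writing to the same directory.
        System.out.println(current('r', 0));
        System.out.println(byOutputName("summary", 0));
        System.out.println(byOutputName("details", 0));
        System.out.println(byVertexAndIndex("reducer", 1, 0));
    }
}
```

With the current scheme, two outputs of the same task both produce part-r-00000 and collide when committed into one directory. With either proposed scheme the names differ as long as output names (or vertex name and output index pairs) are unique.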

> Fix output committer to support multiple outputs
> ------------------------------------------------
>
>                 Key: TEZ-624
>                 URL: https://issues.apache.org/jira/browse/TEZ-624
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Hitesh Shah
>            Assignee: Bikas Saha
>
> Output committers should be specified on each output and not per vertex.
