Posted to dev@tez.apache.org by Gurleen Dhody <Gu...@microsoft.com.INVALID> on 2019/03/06 22:28:14 UTC

OutputCommitter access Task Information in the final output Vertex

Hello Tez devs,

The current OutputCommitter, added as a data sink to the end vertex, lets us finalize the output.

Context -
Currently we generate 'temp output files' at the output stage. The filenames of these files are built from unique identifiers (including task index, task attempt number, task vertex index, numPhysicalOutputs).

Problem -
During the output committer stage I couldn't find a way to access task information (task index, task attempt number) of the final output vertex.

Why do I need task information?

  *   To recreate the paths of the 'temp output files' for final processing.
  *   To clean up after speculation. If speculation is turned on, the final output vertex may launch duplicate attempts of the same task, each generating similar temp output files that differ only in the task attempt number. In the end only one attempt is registered as successful and its output is used. The problem arises in a race condition where two or more attempts finish successfully and each leaves behind its own temp output files. We would like to know about the other attempts so that we can clean up these speculated temp output files.
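
To illustrate what we have in mind, here is a minimal sketch in plain Java (no Tez APIs). The class name, the filename layout, and the method names are all hypothetical and only stand in for our real scheme. The winning attempt number is passed in as a parameter, since obtaining it inside the committer is exactly the piece we are missing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of Tez.
final class TempOutputNames {

    // Assumed layout: tmp_v<vertexIndex>_t<taskIndex>_a<attemptNumber>_n<numPhysicalOutputs>
    private static final Pattern NAME =
            Pattern.compile("tmp_v(\\d+)_t(\\d+)_a(\\d+)_n(\\d+)");

    // Builds the temp output filename from the task identifiers.
    static String build(int vertexIndex, int taskIndex,
                        int attemptNumber, int numPhysicalOutputs) {
        return "tmp_v" + vertexIndex + "_t" + taskIndex
                + "_a" + attemptNumber + "_n" + numPhysicalOutputs;
    }

    // Returns the files written for the given task by any attempt other
    // than the winning one. Names that do not match the layout are ignored.
    static List<String> staleAttempts(List<String> fileNames,
                                      int taskIndex, int winningAttempt) {
        List<String> stale = new ArrayList<>();
        for (String fileName : fileNames) {
            Matcher m = NAME.matcher(fileName);
            if (!m.matches()) {
                continue;
            }
            int task = Integer.parseInt(m.group(2));
            int attempt = Integer.parseInt(m.group(3));
            if (task == taskIndex && attempt != winningAttempt) {
                stale.add(fileName);
            }
        }
        return stale;
    }
}
```

With the task index and the successful attempt number available at commit time, the committer could call build(...) to recreate the path of the winning file and staleAttempts(...) to find the speculated leftovers to delete.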

Is there a way to access the task information of the final output vertex from the OutputCommitter?

Appreciate any suggestions.

Thank You,
Gurleen