Posted to dev@pig.apache.org by "Rohini Palaniswamy (JIRA)" <ji...@apache.org> on 2015/03/04 23:59:38 UTC

[jira] [Commented] (PIG-4446) Support and add unit test for vertex level commit in Tez

    [ https://issues.apache.org/jira/browse/PIG-4446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347730#comment-14347730 ] 

Rohini Palaniswamy commented on PIG-4446:
-----------------------------------------

Per-vertex commit is needed when multiple outputs are produced in parallel. This usually happens in two cases:

      - A fork happens (multiquery cannot be applied) and one or more levels of vertices after the fork can run in parallel.

      - The user reads different subdirectories of a dataset, processes them in parallel, and produces different outputs. This results in multiple disconnected DAGs within a single Tez DAG.

In these cases, with MR, different MR jobs write the different outputs and thus run in parallel. If outputs are not produced as each stage ends in Tez, then dependent jobs that start based on output availability could start late, and it will look like a regression compared to MR.
For example, let's say there is a 5-vertex DAG where each vertex takes 30 minutes to finish and produces an HDFS output. A different pipeline is triggered by Oozie once the output of vertex 2 is available. With MR, the second pipeline would start after 1 hour. With per-vertex output commit, the second pipeline would still start after 1 hour. With DAG commit, the second pipeline would start after 2.5 hours.
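
The arithmetic in this example can be checked with a small sketch. It assumes the five vertices run one after another and that the commit itself takes no time; the function name and parameters are hypothetical and not part of Pig or Tez.

```python
def output_available_after(vertex_index, num_vertices, minutes_per_vertex,
                           per_vertex_commit):
    """Minutes until the output of vertex `vertex_index` (1-based) is visible.

    Assumes a linear chain of vertices and ignores the cost of the commit.
    """
    if not 1 <= vertex_index <= num_vertices:
        raise ValueError("vertex_index must be between 1 and num_vertices")
    # Per-vertex commit: output is visible once that vertex finishes.
    # DAG commit: output is visible only once the last vertex finishes.
    finished_vertices = vertex_index if per_vertex_commit else num_vertices
    return finished_vertices * minutes_per_vertex


per_vertex = output_available_after(2, 5, 30, per_vertex_commit=True)
dag_level = output_available_after(2, 5, 30, per_vertex_commit=False)
print(per_vertex / 60, "hours with per-vertex commit")
print(dag_level / 60, "hours with DAG commit")
```

With the numbers above, the downstream pipeline waits 60 minutes under per-vertex commit and 150 minutes under DAG commit.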

Enabling per-vertex commit is a config change: setting TEZ_AM_COMMIT_ALL_OUTPUTS_ON_DAG_SUCCESS to false. It would be easy to turn it on automatically based on the plan constructed for the above two cases. If needed, users can manually override it on the Pig command line or in a Pig script.
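
One way the plan-based decision could look is sketched below. This is not Pig's implementation: the plan is modeled as a plain dict of edges, "outputs produced in parallel" is approximated as two output-writing vertices with no path between them, and all function names are hypothetical. The property key string is assumed to be the value of TEZ_AM_COMMIT_ALL_OUTPUTS_ON_DAG_SUCCESS and should be verified against the Tez version in use.

```python
from itertools import combinations

# Assumed string value of TezConfiguration.TEZ_AM_COMMIT_ALL_OUTPUTS_ON_DAG_SUCCESS;
# verify against your Tez version.
COMMIT_ON_DAG_SUCCESS_KEY = "tez.am.commit-all-outputs-on-dag-success"


def reachable(edges, start):
    """Return every vertex reachable from `start` by following `edges`."""
    seen, stack = set(), [start]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def has_parallel_outputs(edges, output_vertices):
    """True if two output vertices have no path between them in either direction."""
    for a, b in combinations(sorted(output_vertices), 2):
        if b not in reachable(edges, a) and a not in reachable(edges, b):
            return True
    return False


def commit_setting(edges, output_vertices):
    """Pick per-vertex commit ("false") when outputs can be produced in parallel."""
    value = "false" if has_parallel_outputs(edges, output_vertices) else "true"
    return {COMMIT_ON_DAG_SUCCESS_KEY: value}


# Fork: v1 feeds v2 and v3, and both write an output.
print(commit_setting({"v1": ["v2", "v3"]}, {"v2", "v3"}))
# Disconnected sub-DAGs inside one Tez DAG.
print(commit_setting({"a1": ["a2"], "b1": ["b2"]}, {"a2", "b2"}))
```

A user override would then simply replace the computed value before the DAG is submitted.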

Currently, DAG-level commit and per-vertex commit are done serially in the Tez AM, which causes performance issues that are being fixed in TEZ-714. Additionally, MAPREDUCE-4815 will help make output commits faster for both MR and Tez.



> Support and add unit test for vertex level commit in Tez
> --------------------------------------------------------
>
>                 Key: PIG-4446
>                 URL: https://issues.apache.org/jira/browse/PIG-4446
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>             Fix For: 0.15.0
>
>
>   By default, Tez does AM-level commit, controlled by the setting TEZ_AM_COMMIT_ALL_OUTPUTS_ON_DAG_SUCCESS. It also supports vertex-level commit, which makes sense in some cases, and Pig should support it.


