Posted to issues@tez.apache.org by "Bikas Saha (JIRA)" <ji...@apache.org> on 2014/10/09 23:43:34 UTC
[jira] [Comment Edited] (TEZ-1649) ShuffleVertexManager auto reduce
parallelism can cause jobs to hang indefinitely (with ScatterGather edges)
[ https://issues.apache.org/jira/browse/TEZ-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165791#comment-14165791 ]
Bikas Saha edited comment on TEZ-1649 at 10/9/14 9:43 PM:
----------------------------------------------------------
Do we also need to augment this with a fix similar to TEZ-1494? See your comment at https://issues.apache.org/jira/browse/TEZ-1522?focusedCommentId=14119672&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14119672
If we do that, then this per-source-vertex check is probably not needed.
was (Author: bikassaha):
Do we also need to augment this with a fix similar to TEZ-1494. Per this comment from you on https://issues.apache.org/jira/browse/TEZ-1522?focusedCommentId=14119672&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14119672
> ShuffleVertexManager auto reduce parallelism can cause jobs to hang indefinitely (with ScatterGather edges)
> -----------------------------------------------------------------------------------------------------------
>
> Key: TEZ-1649
> URL: https://issues.apache.org/jira/browse/TEZ-1649
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Attachments: TEZ-1649.1.patch, TEZ-1649.png
>
>
> Consider the following DAG
> M1, M2 --> R1
> M2, M3 --> R2
> R1 --> R2
> All edges are Scatter-Gather.
> 1. Set R1's (1000 parallelism) min/max setting to 0.25f and 0.5f
> 2. Set R2's (21 parallelism) min/max setting to 0.2f and 0.3f
> 3. Let M1 send some data from HDFS (test.txt)
> 4. Let M2 (50 parallelism) generate some data and send it to R2
> 5. Let M3 (500 parallelism) generate some data and send it to R2
> - Since R2's min/max can be satisfied by events from M3 alone, R2 will change its parallelism more quickly than R1.
> - In the meantime, R1 changes its parallelism from 1000 to 20. This change is not propagated to R2, so R2 keeps waiting.
> Tested this on a small scale (20 node) cluster and it happens consistently.
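The hang described above can be illustrated with a toy model. This is a minimal sketch, not Tez code: the class and method names (StaleParallelismDemo, DownstreamView, onSourceParallelismChanged, and so on) are hypothetical, and the scheduling rule is reduced to a single minimum-fraction check. It assumes R2 keeps a snapshot of R1's task count (1000) and waits for a fraction (0.2) of that count to complete, while R1 actually runs only 20 tasks.

```java
class StaleParallelismDemo {

    /** Hypothetical stand-in for a downstream vertex's view of one source vertex. */
    static final class DownstreamView {
        private int expectedSourceTasks;   // snapshot of the source vertex's parallelism
        private int completedSourceTasks;  // source task completions seen so far
        private final double minFraction;  // fraction of source tasks required before scheduling

        DownstreamView(int expectedSourceTasks, double minFraction) {
            this.expectedSourceTasks = expectedSourceTasks;
            this.minFraction = minFraction;
        }

        void onSourceTaskCompleted() {
            completedSourceTasks++;
        }

        /** The notification that, per the report, never reaches R2. */
        void onSourceParallelismChanged(int newParallelism) {
            expectedSourceTasks = newParallelism;
        }

        boolean canSchedule() {
            return completedSourceTasks >= Math.ceil(minFraction * expectedSourceTasks);
        }
    }

    /** Returns whether R2 can ever schedule once all of R1's (reduced) tasks finish. */
    static boolean simulate(boolean propagateUpdate) {
        DownstreamView r2 = new DownstreamView(1000, 0.2); // R2's view of R1
        int r1NewParallelism = 20;                         // R1 auto-reduces 1000 -> 20
        if (propagateUpdate) {
            r2.onSourceParallelismChanged(r1NewParallelism);
        }
        for (int i = 0; i < r1NewParallelism; i++) {
            r2.onSourceTaskCompleted();                    // every R1 task completes
        }
        return r2.canSchedule();
    }

    public static void main(String[] args) {
        System.out.println("stale view can schedule:   " + simulate(false));
        System.out.println("updated view can schedule: " + simulate(true));
    }
}
```

With the stale snapshot, the threshold stays at 200 completions while only 20 can ever arrive, so the check never passes. Once the new parallelism is propagated, the threshold drops to 4 and the check passes.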
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)