You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@tez.apache.org by "Bikas Saha (JIRA)" <ji...@apache.org> on 2015/11/15 06:42:11 UTC

[jira] [Commented] (TEZ-2943) Auto reduce parallelism estimates badly on vertices with two disparate input sources

    [ https://issues.apache.org/jira/browse/TEZ-2943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005743#comment-15005743 ] 

Bikas Saha commented on TEZ-2943:
---------------------------------

This is a known issue because the logic does not account for each vertex separately. The initial code in the shuffle manager was naive and based on the MRR pattern where there was only one input. The code has changed to allow more (and the APIs have been added to provide more information) but these have not been incorporated into the managers logic. Some improvements have been made to use data size (instead of just task counts) but those have targeted the corner cases of having very little data. Similarly, there is logic to use per partition estimates to scheduler larger partition tasks first. However, some of the core logic on triggering the scheduling have not evolved as much, resulting in the above behavior. Its a known issue but without a prior jira tracking it. So thats why it may have fallen off the radar. Thanks for opening the jira. Now it will get fixed :)

> Auto reduce parallelism estimates badly on vertices with two disparate input sources
> ------------------------------------------------------------------------------------
>
>                 Key: TEZ-2943
>                 URL: https://issues.apache.org/jira/browse/TEZ-2943
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Jonathan Eagles
>            Assignee: Bikas Saha
>
> Excerpts from a DAG execution log highlight the issue.
> A DAG has a vertex scope-212 that has two input sources scope-210 and scope-211. The input properties have the following data movement properties.
> {noformat}
> scope-211: 
> -Tasks count: 72
> -OUTPUT_BYTES:93,975,296
> scope-210
> -Tasks count: 5
> -OUTPUT_BYTES: 2,315,364,586
> {noformat}
> Here is scope-212 auto reduce parallelism kicking in
> {noformat}
> 2015-11-11 19:46:28,829 [INFO] [App Shared Pool - #0] |vertexmanager.ShuffleVertexManager|: Reduce auto parallelism for vertex: scope-212 to 1 from 56 . Expected output: 101293660 based on actual output: 76299121 from 58 vertex manager events.  desiredTaskInputSize: 134217728 max slow start tasks:57.75 num sources completed:58
> {noformat}
> Some more background on why we determined the auto parallelism and started scheduling tasks
> {noformat}
> 2015-11-11 19:46:28,829 [INFO] [App Shared Pool - #0] |vertexmanager.ShuffleVertexManager|: Scheduling 56 tasks for vertex: scope-212 with totalTasks: 56. 58 source tasks completed out of 77. SourceTaskCompletedFraction: 0.7532467 min: 0.25 max: 0.75
> {noformat}
> It made this decision since 58/77 is roughly 75%. Most of the data came from scope-210. let's see how many of the 58 source tasks completed are from scope-210
> This is the first task attempt from scope-210
> {noformat}
> 2015-11-11 19:47:02,862 [INFO] [Dispatcher thread {Central}] |history.HistoryEventHandler|: [HISTORY][DAG:dag_1446259040496_444992_1][Event:TASK_ATTEMPT_FINISHED]: vertexName=scope-210, taskAttemptId=attempt_1446259040496_444992_1_02_000004_0, creationTime=1447271172510, allocationTime=1447271175440, startTime=1447271181036, finishTime=1447271222861, timeTaken=41825, status=SUCCEEDED, errorEnum=, diagnostics=
> {noformat}
> This is almost 30 seconds after the auto-reduce parallelism kicked in. So it seems we have a serious auto-reduce parallelism bug since it 1) doesn't require 75% percent of each source vertex and/or 2) it doesn't weight the expected bytes per vertex.
> This means that it launched only 1 task for 2.5G of data instead of 16
> This may relate to TEZ-1532 since events don't know signify which source vertex or task they originate from.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)