Posted to dev@tez.apache.org by "László Bodor (Jira)" <ji...@apache.org> on 2021/11/02 17:28:00 UTC

[jira] [Resolved] (TEZ-4336) ShuffleScheduler should try to report the original exception (when shuffle becomes unhealthy)

     [ https://issues.apache.org/jira/browse/TEZ-4336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

László Bodor resolved TEZ-4336.
-------------------------------
    Resolution: Fixed

> ShuffleScheduler should try to report the original exception (when shuffle becomes unhealthy)
> ---------------------------------------------------------------------------------------------
>
>                 Key: TEZ-4336
>                 URL: https://issues.apache.org/jira/browse/TEZ-4336
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>             Fix For: 0.10.2
>
>         Attachments: TEZ_4336_client_output.txt
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> In a client log, I can see something like:
> {code}
> ERROR : FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex re-running, vertexName=Map 1, vertexId=vertex_1632183109176_0005_8_03Vertex re-running, vertexName=Map 2, vertexId=vertex_1632183109176_0005_8_04Vertex failed, vertexName=Reducer 3, vertexId=vertex_1632183109176_0005_8_05, diagnostics=[Task failed, taskId=task_1632183109176_0005_8_05_000032, diagnostics=[TaskAttempt 0 killed, TaskAttempt 1 killed, TaskAttempt 2 killed, TaskAttempt 3 killed, TaskAttempt 4 killed, TaskAttempt 5 killed, TaskAttempt 6 killed, TaskAttempt 7 killed, TaskAttempt 8 killed, TaskAttempt 9 killed, TaskAttempt 10 killed, TaskAttempt 11 killed, TaskAttempt 12 killed, TaskAttempt 13 failed, info=[AttemptID:attempt_1632183109176_0005_8_05_000032_13 Timed out after 300 secs], TaskAttempt 14 killed, TaskAttempt 15 killed, TaskAttempt 16 killed, TaskAttempt 17 killed, TaskAttempt 18 killed, TaskAttempt 19 killed, TaskAttempt 20 killed, TaskAttempt 21 killed, TaskAttempt 22 killed, TaskAttempt 23 killed, TaskAttempt 24 killed, TaskAttempt 25 killed, TaskAttempt 26 killed, TaskAttempt 27 killed, TaskAttempt 28 killed, TaskAttempt 29 killed, TaskAttempt 30 killed, TaskAttempt 31 killed, TaskAttempt 32 killed, TaskAttempt 33 killed, TaskAttempt 34 killed, TaskAttempt 35 killed, TaskAttempt 36 killed, TaskAttempt 37 killed, TaskAttempt 38 killed, TaskAttempt 39 killed, TaskAttempt 40 killed, TaskAttempt 41 killed, TaskAttempt 42 killed, TaskAttempt 43 killed, TaskAttempt 44 killed, TaskAttempt 45 killed, TaskAttempt 46 killed, TaskAttempt 47 killed, TaskAttempt 48 killed, TaskAttempt 49 killed, TaskAttempt 50 killed, TaskAttempt 51 killed, TaskAttempt 52 killed, TaskAttempt 53 killed, TaskAttempt 54 killed, TaskAttempt 55 killed, TaskAttempt 56 killed, TaskAttempt 57 killed, TaskAttempt 58 killed, TaskAttempt 59 killed, TaskAttempt 60 killed, TaskAttempt 61 killed, TaskAttempt 62 killed, TaskAttempt 63 killed, TaskAttempt 64 killed, TaskAttempt 65 
killed, TaskAttempt 66 killed, TaskAttempt 67 killed, TaskAttempt 68 killed, TaskAttempt 69 killed, TaskAttempt 70 killed, TaskAttempt 71 killed, TaskAttempt 72 killed, TaskAttempt 73 killed, TaskAttempt 74 killed, TaskAttempt 75 killed, TaskAttempt 76 killed, TaskAttempt 77 killed, TaskAttempt 78 killed, TaskAttempt 79 killed, TaskAttempt 80 killed, TaskAttempt 81 killed, TaskAttempt 82 killed, TaskAttempt 83 killed, TaskAttempt 84 killed, TaskAttempt 85 killed, TaskAttempt 86 killed, TaskAttempt 87 killed, TaskAttempt 88 killed, TaskAttempt 89 killed, TaskAttempt 90 killed, TaskAttempt 91 killed, TaskAttempt 92 killed, TaskAttempt 93 killed, TaskAttempt 94 killed, TaskAttempt 95 killed, TaskAttempt 96 failed, info=[Error: Error while running task ( failure ) : org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in Fetcher_O {Map_2} #13
> 	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:306)
> 	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:288)
> 	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> 	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
> 	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
> 	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Map_2: Shuffle failed with too many fetch failures and insufficient progress!failureCounts=14, pendingInputs=4130, fetcherHealthy=false, reducerProgressedEnough=false, reducerStalled=false
> 	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.isShuffleHealthy(ShuffleScheduler.java:1055)
> 	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.ShuffleScheduler.copyFailed(ShuffleScheduler.java:793)
> 	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.setupConnection(FetcherOrderedGrouped.java:392)
> 	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.copyFromHost(FetcherOrderedGrouped.java:265)
> 	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.fetchNext(FetcherOrderedGrouped.java:184)
> 	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:196)
> 	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.FetcherOrderedGrouped.callInternal(FetcherOrderedGrouped.java:59)
> 	... 7 more
> , errorMessage=Shuffle Runner Failed:org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in Fetcher_O {Map_2} #13
> 	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:306)
> 	at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal(Shuffle.java:288)
> 	at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
> 	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
> 	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
> 	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}
> The message "Shuffle failed with too many fetch failures and insufficient progress!failureCounts=14" means that the underlying exception wasn't reported, only the shuffle failure. It would be good to have some details
> here. Currently, isShuffleHealthy simply creates a new exception:
> https://github.com/apache/tez/blob/5eeccf0e318e22cdcbbe202a9f554f93d138c207/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/ShuffleScheduler.java#L1059
> What if we stored the last exception (usually, most of them have the same root cause) and wrapped it somehow into this IOException?
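
The idea in the last paragraph could look roughly like the sketch below. It is a minimal illustration only, not the actual Tez code or the committed fix: the class ShuffleHealthSketch, the field lastFetchFailure, the maxFailures threshold, and the simplified copyFailed signature are all hypothetical. The only real API relied on is the standard java.io.IOException(String, Throwable) constructor, which attaches the stored exception as the cause.

```java
import java.io.IOException;
import java.net.ConnectException;

/** Hypothetical stand-in for the scheduler's health tracking; not the real ShuffleScheduler. */
class ShuffleHealthSketch {
  private final int maxFailures;      // hypothetical threshold, for illustration only
  private int failureCount = 0;
  private Throwable lastFetchFailure; // most recent fetch exception seen, if any

  ShuffleHealthSketch(int maxFailures) {
    this.maxFailures = maxFailures;
  }

  /** Records a failed copy, remembers its exception, then checks shuffle health. */
  synchronized void copyFailed(Throwable cause) throws IOException {
    failureCount++;
    if (cause != null) {
      lastFetchFailure = cause;
    }
    checkHealthy();
  }

  /** Throws once the threshold is reached, wrapping the last stored exception as the cause. */
  private void checkHealthy() throws IOException {
    if (failureCount >= maxFailures) {
      // Passing the stored exception as the cause makes it show up as "Caused by: ..."
      // in the stack trace instead of being lost.
      throw new IOException(
          "Shuffle failed with too many fetch failures, failureCounts=" + failureCount,
          lastFetchFailure);
    }
  }

  public static void main(String[] args) {
    ShuffleHealthSketch sketch = new ShuffleHealthSketch(2);
    try {
      sketch.copyFailed(new ConnectException("Connection refused"));
      sketch.copyFailed(new ConnectException("Connection refused"));
    } catch (IOException e) {
      // The original fetch exception is now reachable from the shuffle failure.
      System.out.println(e.getMessage() + " / cause: " + e.getCause());
    }
  }
}
```

Keeping only the most recent exception is the simplest choice and matches the observation that most fetch failures share a root cause; if they do not, the reported cause is just one sample of what went wrong.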


