Posted to issues@tez.apache.org by "László Bodor (Jira)" <ji...@apache.org> on 2020/09/16 04:27:00 UTC

[jira] [Comment Edited] (TEZ-4233) Map task should be blamed earlier for local fetch failures

    [ https://issues.apache.org/jira/browse/TEZ-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196667#comment-17196667 ] 

László Bodor edited comment on TEZ-4233 at 9/16/20, 4:26 AM:
-------------------------------------------------------------

In the meantime, I discussed this with [~ashutoshc]: this solution can be extended to detect remote fetch failures as well.
Currently, ShuffleHandler returns HTTP 500 in case of an unrecoverable issue (local file not found), for both [tez|https://github.com/apache/tez/blob/master/tez-plugins/tez-aux-services/src/main/java/org/apache/tez/auxservices/ShuffleHandler.java#L1087] and [llap|https://github.com/apache/hive/blob/master/llap-server/src/java/org/apache/hadoop/hive/llap/shufflehandler/ShuffleHandler.java#L815].

For instance, this is the exception logged in LLAP:
{code}
<11>1 2020-09-04T08:14:08.495Z query-executor-0-4.query-executor-0-service.compute-1599179173-5pzs.svc.cluster.local query-executor 1 193f7c8d-151d-438c-90e9-34ad5dfa104c [mdc@18060 class="shufflehandler.ShuffleHandler" level="ERROR" thread="New I/O worker #8"] Shuffle error in populating headers :
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find usercache/hive/appcache/application_1599179385771_0009/240/output/attempt_1599179385771_0009_240_05_000464_10_215071_0/file.out.index in any of the configured local directories
	at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:494)
	at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:166)
	at org.apache.hadoop.hive.llap.shufflehandler.ShuffleHandler$Shuffle$1.load(ShuffleHandler.java:664)
	at org.apache.hadoop.hive.llap.shufflehandler.ShuffleHandler$Shuffle$1.load(ShuffleHandler.java:657)
	at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3529)
	at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2278)
	at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2155)
	at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2045)
	at com.google.common.cache.LocalCache.get(LocalCache.java:3953)
	at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3976)
	at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4960)
	at org.apache.hadoop.hive.llap.shufflehandler.ShuffleHandler$Shuffle.getMapOutputInfo(ShuffleHandler.java:877)
	at org.apache.hadoop.hive.llap.shufflehandler.ShuffleHandler$Shuffle.populateHeaders(ShuffleHandler.java:912)
	at org.apache.hadoop.hive.llap.shufflehandler.ShuffleHandler$Shuffle.messageReceived(ShuffleHandler.java:809)
	at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
{code}

On the other side, the fetcher receives the HTTP 500 and prints the logs below, hitting this [path|https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/FetcherOrderedGrouped.java#L453]:
{code}
<12>1 2020-09-04T08:14:09.328Z query-executor-0-0.query-executor-0-service.compute-1599179173-5pzs.svc.cluster.local query-executor 1 4334e088-814d-4536-b906-8e19912686cf [mdc@18060 class="orderedgrouped.FetcherOrderedGrouped" level="WARN" thread="Fetcher_O {Map_1} #2"] Invalid map id: TTP/1.1 500 Internal Server Error
Content-Type: text/plain; charset=UTF, expected to start with attempt, partition: 13

<12>1 2020-09-04T08:14:09.328Z query-executor-0-0.query-executor-0-service.compute-1599179173-5pzs.svc.cluster.local query-executor 1 4334e088-814d-4536-b906-8e19912686cf [mdc@18060 class="orderedgrouped.FetcherOrderedGrouped" level="WARN" thread="Fetcher_O {Map_1} #2"] copyMapOutput failed for tasks [InputAttemptIdentifier [inputIdentifier=174, attemptNumber=3, pathComponent=attempt_1599179385771_0008_18_02_000174_3_55687_0, spillType=2, spillId=0]]
{code}
It then calls [copyFailed|https://github.com/apache/tez/blob/master/tez-runtime-library/src/main/java/org/apache/tez/runtime/library/common/shuffle/orderedgrouped/FetcherOrderedGrouped.java#L318]. I think that at this point ShuffleScheduler.copyFailed can let the AM know that this was a fatal problem.
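
To make the idea concrete, here is a minimal sketch of how a fetcher could tell a fatal failure from a retriable one based on the HTTP status. It is not Tez code: FetchFailureKind, FetchFailureClassifier and classify are hypothetical names, and the sketch assumes that an HTTP 500 from ShuffleHandler always means an unrecoverable problem on the serving side (for example the missing file.out.index above), which would have to be verified before relying on it.

```java
import java.net.HttpURLConnection;

/** Hypothetical classification of a failed shuffle fetch; not part of Tez. */
enum FetchFailureKind {
    /** Possibly transient (network, timeout): worth retrying before blaming the source. */
    RETRIABLE,
    /** The serving side could not read the map output from local disk: retrying will not help. */
    FATAL_SOURCE_OUTPUT_MISSING
}

final class FetchFailureClassifier {
    private FetchFailureClassifier() {}

    /**
     * Assumption: ShuffleHandler answers HTTP 500 only for unrecoverable
     * problems such as a map output file that cannot be found locally.
     */
    static FetchFailureKind classify(int httpStatus) {
        return httpStatus == HttpURLConnection.HTTP_INTERNAL_ERROR
            ? FetchFailureKind.FATAL_SOURCE_OUTPUT_MISSING
            : FetchFailureKind.RETRIABLE;
    }
}
```

With something like this, the copyFailed path could pass the resulting kind along, so that the AM is told about a fatal problem instead of waiting for repeated failures.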




> Map task should be blamed earlier for local fetch failures
> ----------------------------------------------------------
>
>                 Key: TEZ-4233
>                 URL: https://issues.apache.org/jira/browse/TEZ-4233
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: László Bodor
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: TEZ-4233.01.patch
>
>
> Fetch failures can be the result of a network issue or a disk issue. Currently, the AM doesn't know whether the original input read error was caused by a local fetch failure. I think that if a map output is reported as the subject of a local fetch failure, the AM should respond earlier and blame it as soon as possible. The hidden assumption here is that a disk read should never fail (or should fail only rarely compared to network issues).
> I detected this issue in a Kubernetes-based LLAP environment, where a daemon completely disappeared and a new daemon, running reducer tasks, assumed that it had the map outputs locally, which wasn't the case.
> This patch can help in container mode as well: we can assume that a local read should work, and if it doesn't, the original map output data should be regenerated as soon as possible.
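
The policy described in the issue could be sketched roughly as follows. This is an illustration only, not the content of TEZ-4233.01.patch: SourceAttemptBlamePolicy, onFetchFailure and the threshold parameter are hypothetical names, and the real AM logic involves more state than a simple counter.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical AM-side bookkeeping; class, method and threshold are illustrative, not Tez code. */
final class SourceAttemptBlamePolicy {
    private final int remoteFailureThreshold;
    // Number of remote (network) fetch failures reported per source attempt.
    private final Map<String, Integer> remoteFailures = new HashMap<>();

    SourceAttemptBlamePolicy(int remoteFailureThreshold) {
        this.remoteFailureThreshold = remoteFailureThreshold;
    }

    /** Returns true when the source (map) attempt should be blamed and re-run. */
    boolean onFetchFailure(String sourceAttemptId, boolean localFetch) {
        if (localFetch) {
            // A local disk read is assumed to (almost) never fail, so blame at once.
            return true;
        }
        // Network problems may be transient: blame only after repeated reports.
        int count = remoteFailures.merge(sourceAttemptId, 1, Integer::sum);
        return count >= remoteFailureThreshold;
    }
}
```

Under these assumptions, a single local fetch failure is enough to regenerate the map output, while remote failures still have to accumulate up to the configured threshold.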



--
This message was sent by Atlassian Jira
(v8.3.4#803005)