Posted to issues@tez.apache.org by "Gopal V (JIRA)" <ji...@apache.org> on 2018/07/09 23:24:00 UTC

[jira] [Commented] (TEZ-3968) Tez Job Fails with Shuffle failures too fast when NM returns a 401 error

    [ https://issues.apache.org/jira/browse/TEZ-3968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16537776#comment-16537776 ] 

Gopal V commented on TEZ-3968:
------------------------------

The core issue is that the 400+ retries happened in under a second and the downstream tasks exited without waiting for the producer retry to start off.

This does not happen when a machine is unreachable; but with an NM that has lost its data yet is otherwise healthy, the errors come back too fast for the producer retry to catch up and obsolete the older shuffle output.

This is a scenario where being slower would fix the issue and being faster makes it worse.
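
One way to picture the "being slower" fix: pace the retries for a failed map output over a minimum wall-clock window instead of exhausting them in under a second, so the producer re-run has time to start before the consumer gives up. A minimal sketch of that idea (not Tez code; the class, method, and limits below are hypothetical):

{code}
import java.io.IOException;

// Minimal sketch of paced fetch retries (hypothetical, not Tez code).
// The point: fail only after BOTH a retry budget AND a minimum elapsed
// time are exhausted, giving the AM time to re-run the producer.
public class PacedFetchRetry {

    interface Fetch { void run() throws IOException; }

    static void fetchWithPacing(Fetch fetch, int maxAttempts, long minWindowMs)
            throws IOException, InterruptedException {
        long start = System.currentTimeMillis();
        long backoffMs = 250;                    // initial backoff between retries
        int attempts = 0;
        while (true) {
            try {
                fetch.run();                     // the actual shuffle fetch
                return;                          // success
            } catch (IOException e) {
                attempts++;
                long elapsed = System.currentTimeMillis() - start;
                if (attempts >= maxAttempts && elapsed >= minWindowMs) {
                    throw e;                     // only now report a real fetch failure
                }
                Thread.sleep(Math.min(backoffMs, 5000));
                backoffMs *= 2;                  // exponential backoff caps the retry rate
            }
        }
    }
}
{code}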

> Tez Job Fails with Shuffle failures too fast when NM returns a 401 error
> ------------------------------------------------------------------------
>
>                 Key: TEZ-3968
>                 URL: https://issues.apache.org/jira/browse/TEZ-3968
>             Project: Apache Tez
>          Issue Type: Improvement
>    Affects Versions: 0.7.1
>            Reporter: Prabhu Joseph
>            Priority: Major
>
> Tez job failed because a reduce task failed on all four attempts while fetching a particular map output from one node. The NodeManager on which the map task had succeeded was stopped, had its local directories cleared (the disks were full), and was started again. This caused the shuffle failure in the NodeManager, as no job token is found after the restart.
> NodeManager Logs shows reason for Shuffle Failure:
> {code}
> 2018-07-05 00:26:00,371 WARN  mapred.ShuffleHandler (ShuffleHandler.java:messageReceived(947)) - Shuffle failure
> org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't find job token for job job_1530690553693_17267 !!
>         at org.apache.hadoop.mapreduce.security.token.JobTokenSecretManager.retrieveTokenSecret(JobTokenSecretManager.java:112)
>         at org.apache.hadoop.mapred.ShuffleHandler$Shuffle.verifyRequest(ShuffleHandler.java:1133)
>         at org.apache.hadoop.mapred.ShuffleHandler$Shuffle.messageReceived(ShuffleHandler.java:944)
> {code}
> Analysis of Application Logs:
> Application application_1530690553693_17267 failed with task task_1530690553693_17267_4_02_000496 failed on all four attempts.
> Four Attempts:
> {code}
> attempt_1530690553693_17267_4_02_000496_3 -> container_e270_1530690553693_17267_01_014554 -> bigdata2.openstacklocal
> attempt_1530690553693_17267_4_02_000496_2 -> container_e270_1530690553693_17267_01_014423 -> bigdata3.openstacklocal
> attempt_1530690553693_17267_4_02_000496_1 -> container_e270_1530690553693_17267_01_014311 -> bigdata4.openstacklocal
> attempt_1530690553693_17267_4_02_000496_0 -> container_e270_1530690553693_17267_01_014613 -> bigdata5.openstacklocal
> {code}
> All four attempts failed while fetching the same map output:
> {code}
> 2018-07-05 00:26:54,161 [WARN] [fetcher {Map_1} #51] |orderedgrouped.FetcherOrderedGrouped|: Failed to verify reply after connecting to bigdata6.openstacklocal:13562 with 1 inputs pending
> java.io.IOException: Server returned HTTP response code: 401 for URL: http://bigdata6.openstacklocal:13562/mapOutput?job=job_1530690553693_17267&reduce=496&map=attempt_1530690553693_17267_4_01_000874_0_10003
> {code}
> The failures are reported back to the AM correctly by Tez, but they are not treated as "source unhealthy" because the NodeManager is healthy again after the cleanup.
> {code}
> 2018-07-04 23:47:42,344 [INFO] [fetcher {Map_1} #10] |orderedgrouped.ShuffleScheduler|: Map_1: Reporting fetch failure for InputIdentifier: InputAttemptIdentifier [inputIdentifier=InputIdentifier [inputIndex=874], attemptNumber=0, pathComponent=ttempt_1530690553693_17267_4_01_000874_0_10003, spillType=0, spillId=-1] taskAttemptIdentifier: Map 1_000874_00 to AM.
> {code}
> There are approximately 460 errors like this reported back to the AM, and they keep getting marked as "fetcher unhealthy", probably because the restarted NM shows up as healthy.
> This scenario of shuffle failures is not handled, since the NM shows up as healthy. The mapper (the source InputIdentifier) has to be marked as unhealthy and rerun.
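
For context on where the 401 comes from: the ShuffleHandler only serves fetches for jobs whose tokens it knows about, and the restarted NM no longer has the token for the old job, so every fetch for that job's map outputs is rejected. The snippet below is a simplified paraphrase of that check for illustration only, not the actual Hadoop code; apart from the two method names visible in the stack trace, the names and types are made up.

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified paraphrase (not the actual Hadoop code) of the check that fails
// in the stack trace above: a restarted NM has an empty token map, so every
// fetch for the old job comes back as HTTP 401.
class SimplifiedShuffleAuth {

    private final Map<String, byte[]> jobTokens = new ConcurrentHashMap<>();

    byte[] retrieveTokenSecret(String jobId) {
        byte[] secret = jobTokens.get(jobId);
        if (secret == null) {
            // Mirrors the logged error: "Can't find job token for job ... !!"
            throw new SecurityException("Can't find job token for job " + jobId);
        }
        return secret;
    }

    int verifyRequest(String jobId, String expectedUrlHash) {
        try {
            byte[] secret = retrieveTokenSecret(jobId);
            // The real handler verifies an HMAC of the request URL against
            // expectedUrlHash using this secret; that part is omitted here.
            return 200;
        } catch (SecurityException e) {
            return 401;   // what the fetcher logs as "HTTP response code: 401"
        }
    }
}
{code}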
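
What the report is effectively asking for is that this class of failure be blamed on the source instead of the fetcher: repeated 401s for the same input against a node that YARN still reports as healthy will never succeed, so the producer's output should be declared lost and rerun. A rough sketch of that heuristic (illustrative only; the threshold, names, and the boolean "report as lost" result are hypothetical, not Tez APIs):

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative heuristic (not Tez code): repeated 401s against a healthy NM
// mean the map output is gone for good, so rerunning the producer is the only
// way forward; retrying the fetch faster just burns through the failure budget.
class SourceLostDetector {

    private static final int MAX_401S_PER_INPUT = 3;   // hypothetical threshold

    private final Map<String, Integer> unauthorizedCounts = new ConcurrentHashMap<>();

    /** Returns true when the source attempt should be reported as lost so the AM reruns it. */
    boolean onFetchFailure(String srcAttemptId, int httpStatus, boolean nodeHealthy) {
        if (httpStatus != 401 || !nodeHealthy) {
            return false;   // other failures keep the existing retry/blame logic
        }
        int count = unauthorizedCounts.merge(srcAttemptId, 1, Integer::sum);
        return count >= MAX_401S_PER_INPUT;
    }
}
{code}

The scheduler on the reducer side would call something like onFetchFailure(...) for every failed fetch and, when it returns true, report the input as obsolete instead of only counting another fetch failure against the reducer.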



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)