Posted to issues@tez.apache.org by "Rajesh Balamohan (JIRA)" <ji...@apache.org> on 2014/10/15 06:10:33 UTC
[jira] [Comment Edited] (TEZ-1637) Improved shuffle error handling across NM restarts
[ https://issues.apache.org/jira/browse/TEZ-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171956#comment-14171956 ]
Rajesh Balamohan edited comment on TEZ-1637 at 10/15/14 4:10 AM:
-----------------------------------------------------------------
>>>>
In the ScatterGather Fetcher, putBackRemainingMapOutputs(host); seems to be inconsistent.
>>>>
Fixed this. Added a simple testcase to verify this in TestFetcher.testWithRetry().
>>>>>
setupConnectionsWithRetry: I think it should just be called setupConnection.
>>>>>
Renamed.
>>>
Any reason to create a new list? Can "remaining" just be used, as in the other call?
>>>
"remaining" is a LinkedHashSet, whereas setupConnection() and ShuffleUtils accept List<InputAttemptIdentifier>. Hence the change.
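For illustration, a minimal sketch of that conversion. It assumes plain String values in place of InputAttemptIdentifier, and countAttempts() is a hypothetical stand-in for a method that, like setupConnection(), only accepts a List. Copying into an ArrayList keeps the insertion order of the LinkedHashSet.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class RemainingToList {
    // Hypothetical stand-in for an API that accepts only a List,
    // as setupConnection() is described to do in the comment above.
    static int countAttempts(List<String> attempts) {
        return attempts.size();
    }

    // Copies the set into a new list; iteration order of a LinkedHashSet
    // is insertion order, so the list preserves it.
    static List<String> toList(Set<String> remaining) {
        return new ArrayList<>(remaining);
    }

    public static void main(String[] args) {
        // Strings are used here instead of InputAttemptIdentifier (assumption).
        Set<String> remaining = new LinkedHashSet<>();
        remaining.add("attempt_2");
        remaining.add("attempt_1");
        remaining.add("attempt_2"); // duplicate, ignored by the set

        List<String> asList = toList(remaining);
        System.out.println(asList + " size=" + countAttempts(asList));
    }
}
```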
>>>
A custom exception may be a better option.
>>>
Introduced FetcherReadTimeoutException to address this.
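A rough sketch of the idea, not the code in the patch: a dedicated exception type lets the caller retry only on read timeouts and let every other I/O failure propagate. The class body of FetcherReadTimeoutException below, the Read interface, and readWithRetry() are all assumptions made for illustration.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

public class ReadTimeoutRetrySketch {
    // Illustrative definition only; the actual class in the patch may differ.
    static class FetcherReadTimeoutException extends IOException {
        FetcherReadTimeoutException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Hypothetical abstraction over "read some shuffle data".
    interface Read {
        String run() throws IOException;
    }

    // Translates a socket read timeout into the dedicated exception type.
    static String readOnce(Read read) throws IOException {
        try {
            return read.run();
        } catch (SocketTimeoutException e) {
            throw new FetcherReadTimeoutException("read timed out", e);
        }
    }

    // Retries only on FetcherReadTimeoutException; other IOExceptions propagate at once.
    static String readWithRetry(Read read, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        FetcherReadTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return readOnce(read);
            } catch (FetcherReadTimeoutException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last; // all attempts timed out
    }
}
```

Catching the specific type keeps the retry decision in one place, instead of string-matching on exception messages.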
>>>
We're primarily retrying on read errors. When a NodeManager goes down, is the connection timeout what prevents the connection from failing immediately? Assuming so, that's why we don't need retry logic in place there.
>>>
Yes
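To make the two timeouts concrete, here is a small sketch using java.net.HttpURLConnection. Whether the fetcher configures its connections exactly this way is an assumption; the URL and the timeout values are placeholders. The connect timeout bounds how long connection establishment may take, while the read timeout bounds each blocking read once connected.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class TimeoutConfigSketch {
    // Opens (but does not yet connect) an HTTP connection with separate
    // connect and read timeouts, both in milliseconds.
    static HttpURLConnection open(String url, int connectTimeoutMs, int readTimeoutMs)
            throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(connectTimeoutMs); // limit for establishing the connection
        conn.setReadTimeout(readTimeoutMs);       // limit for each read on the stream
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL and values, chosen only for the example.
        HttpURLConnection conn = open("http://localhost:8080/", 20_000, 5_000);
        System.out.println(conn.getConnectTimeout() + " / " + conn.getReadTimeout());
    }
}
```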
was (Author: rajesh.balamohan):
>>>>
In the ScatterGather Fetcher, putBackRemainingMapOutputs(host); seems to be inconsistent.
>>>>
Fixed this.
>>>>>
setupConnectionsWithRetry: I think it should just be called setupConnection.
>>>>>
Renamed.
>>>
Any reason to create a new list? Can "remaining" just be used, as in the other call?
>>>
"remaining" is a LinkedHashSet, whereas setupConnection() and ShuffleUtils accept List<InputAttemptIdentifier>. Hence the change.
>>>
A custom exception may be a better option.
>>>
Introduced FetcherReadTimeoutException to address this.
>>>
We're primarily retrying on read errors. When a NodeManager goes down, is the connection timeout what prevents the connection from failing immediately? Assuming so, that's why we don't need retry logic in place there.
>>>
Yes
> Improved shuffle error handling across NM restarts
> ---------------------------------------------------
>
> Key: TEZ-1637
> URL: https://issues.apache.org/jira/browse/TEZ-1637
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-1637.1.patch, TEZ-1637.2.patch, TEZ-1637.WIP.patch
>
>
> Similar to MAPREDUCE-5891: we need to make sure the Tez shuffle handler can handle NM restarts correctly. This is required for rolling upgrades.