Posted to issues@tez.apache.org by "Rajesh Balamohan (JIRA)" <ji...@apache.org> on 2014/10/15 06:10:33 UTC

[jira] [Comment Edited] (TEZ-1637) Improved shuffle error handling across NM restarts

    [ https://issues.apache.org/jira/browse/TEZ-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171956#comment-14171956 ] 

Rajesh Balamohan edited comment on TEZ-1637 at 10/15/14 4:10 AM:
-----------------------------------------------------------------

>>>>
In the ScatterGather Fetcher, the call to putBackRemainingMapOutputs(host) seems to be inconsistent.
>>>>
Fixed this. Added a simple testcase to verify this in TestFetcher.testWithRetry().

>>>>>
setupConnectionsWithRetry: I think it should just be called setupConnection.
>>>>>
Renamed.

>>>
Any reason to create a new list? Can "remaining" just be used like in the other call?
>>>
"remaining" is LinkedHashSet.  setupConnection() and ShuffleUtils accept List<InputAttemptIdentifier>.  Hence the change.

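The conversion described above can be sketched roughly as follows. This is an illustration only, not the code from the patch: String stands in for InputAttemptIdentifier, and RemainingToList, toList() and countAttempts() are hypothetical names. It shows that copying a LinkedHashSet into an ArrayList keeps the insertion order, so a List-based method sees the attempts in the same order as the set.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class RemainingToList {
    // Hypothetical stand-in for a method that, like setupConnection(),
    // accepts a List rather than a Set.
    static int countAttempts(List<String> attempts) {
        return attempts.size();
    }

    // Copies the set into a new list; a LinkedHashSet iterates in
    // insertion order, so the list preserves that order.
    static List<String> toList(Set<String> remaining) {
        return new ArrayList<>(remaining);
    }

    public static void main(String[] args) {
        Set<String> remaining = new LinkedHashSet<>();
        remaining.add("attempt_b");
        remaining.add("attempt_a");
        List<String> asList = toList(remaining);
        System.out.println(asList + " size=" + countAttempts(asList));
    }
}
```

The copy is shallow, so its cost is one pass over the set.
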
>>>
A custom exception may be a better option. 
>>>
Introduced FetcherReadTimeoutException to address this.
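
For readers unfamiliar with the pattern, a dedicated exception type lets callers tell a read timeout apart from other I/O failures. The sketch below is an assumption about the general shape, not the class from the patch: the IOException superclass, the constructor, and the ReadTimeoutDemo.readOrTranslate() helper are all hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.SocketTimeoutException;

// Sketch only: superclass and constructor are assumptions, not taken from the patch.
class FetcherReadTimeoutException extends IOException {
    FetcherReadTimeoutException(String message, Throwable cause) {
        super(message, cause);
    }
}

class ReadTimeoutDemo {
    // Hypothetical helper: reads one byte and translates a socket read
    // timeout into the dedicated exception type; other IOExceptions pass through.
    static int readOrTranslate(InputStream in) throws IOException {
        try {
            return in.read();
        } catch (SocketTimeoutException e) {
            throw new FetcherReadTimeoutException("read timed out", e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulated stream whose read always times out.
        InputStream timingOut = new InputStream() {
            @Override
            public int read() throws IOException {
                throw new SocketTimeoutException("simulated timeout");
            }
        };
        try {
            readOrTranslate(timingOut);
        } catch (FetcherReadTimeoutException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```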

>>>
We're primarily retrying on read errors. When a NodeManager goes down, is the connection timeout what prevents the connection from failing immediately? I assume that's why we don't need retry logic in place there.
>>>
Yes
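
The behaviour agreed on here (retry on read timeouts, no separate retry for connection failures) might look roughly like the sketch below. It is a simplified illustration, not the Fetcher code: RetrySketch, Fetch, fetchWithRetry() and the nested ReadTimeoutException are hypothetical names, and the real retry policy may use time limits rather than an attempt count.

```java
import java.io.IOException;
import java.net.ConnectException;

class RetrySketch {
    // Hypothetical stand-in for a dedicated read-timeout exception.
    static class ReadTimeoutException extends IOException {
        ReadTimeoutException(String message) {
            super(message);
        }
    }

    // Hypothetical abstraction over one fetch attempt.
    interface Fetch {
        String run() throws IOException;
    }

    // Retries only when the attempt fails with a read timeout. Any other
    // IOException (for example a connection failure) propagates at once.
    static String fetchWithRetry(Fetch fetch, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        ReadTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetch.run();
            } catch (ReadTimeoutException e) {
                last = e; // read timed out; try again if attempts remain
            }
        }
        throw last;
    }

    public static void main(String[] args) throws IOException {
        int[] calls = {0};
        // Fails with a read timeout twice, then succeeds.
        String result = fetchWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new ReadTimeoutException("simulated read timeout");
            }
            return "data";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");

        try {
            fetchWithRetry(() -> { throw new ConnectException("simulated"); }, 5);
        } catch (ConnectException e) {
            System.out.println("connection failure was not retried");
        }
    }
}
```

Keeping the retry decision tied to the exception type is what makes a dedicated read-timeout exception useful.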


> Improved shuffle error handling across NM restarts 
> ---------------------------------------------------
>
>                 Key: TEZ-1637
>                 URL: https://issues.apache.org/jira/browse/TEZ-1637
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-1637.1.patch, TEZ-1637.2.patch, TEZ-1637.WIP.patch
>
>
> Similar to MAPREDUCE-5891: we need to make sure the Tez shufflehandler can handle NM restarts correctly. This is required for rolling upgrades.


