Posted to issues@tez.apache.org by "Rajesh Balamohan (JIRA)" <ji...@apache.org> on 2015/10/21 19:09:27 UTC

[jira] [Comment Edited] (TEZ-2882) Consider improving fetch failure handling

    [ https://issues.apache.org/jira/browse/TEZ-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967481#comment-14967481 ] 

Rajesh Balamohan edited comment on TEZ-2882 at 10/21/15 5:09 PM:
-----------------------------------------------------------------

Thanks [~sseth]

This is the one I'm concerned about - and think is a candidate for special casing.
- For 1 input, "hasFailedAcrossNodes()" would take at least 16 failures before returning true. This should be good for small clusters? It is basically a combination of TEZ_RUNTIME_SHUFFLE_ACCEPTABLE_HOST_FETCH_FAILURE_FRACTION and the threshold governed by TEZ_RUNTIME_SHUFFLE_MIN_FAILURES_PER_HOST.
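
To illustrate how a per-host failure threshold and an acceptable host-failure fraction can combine, here is a minimal sketch. It is not the code from the patch: the class name, fields, constructor arguments, and the exact comparison are assumptions made for illustration, and it does not try to reproduce the figure of 16 failures mentioned above.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; names and logic are assumptions, not the Tez implementation.
final class HostFailureSketch {
    // Hypothetical stand-in for TEZ_RUNTIME_SHUFFLE_ACCEPTABLE_HOST_FETCH_FAILURE_FRACTION.
    private final float acceptableHostFailureFraction;
    // Hypothetical stand-in for TEZ_RUNTIME_SHUFFLE_MIN_FAILURES_PER_HOST.
    private final int minFailuresPerHost;
    // Number of hosts this task fetches from (assumed to be known up front).
    private final int totalHosts;
    // Fetch failures observed so far, keyed by host name.
    private final Map<String, Integer> failuresPerHost = new HashMap<>();

    HostFailureSketch(float acceptableHostFailureFraction, int minFailuresPerHost, int totalHosts) {
        this.acceptableHostFailureFraction = acceptableHostFailureFraction;
        this.minFailuresPerHost = minFailuresPerHost;
        this.totalHosts = totalHosts;
    }

    // Record one failed fetch attempt against the given host.
    void recordFailure(String host) {
        failuresPerHost.merge(host, 1, Integer::sum);
    }

    // A host counts as "bad" once it reaches minFailuresPerHost failures.
    // Report true when the fraction of bad hosts exceeds the acceptable fraction.
    boolean hasFailedAcrossNodes() {
        if (totalHosts <= 0) {
            return false; // nothing to compare against yet
        }
        int badHosts = 0;
        for (int count : failuresPerHost.values()) {
            if (count >= minFailuresPerHost) {
                badHosts++;
            }
        }
        float fraction = (float) badHosts / totalHosts;
        return fraction > acceptableHostFailureFraction;
    }
}
```

With 4 hosts, a fraction of 0.2 and a per-host minimum of 4, this sketch stays false through three failures on a single host and turns true on the fourth, since 1 of 4 hosts (0.25) then exceeds 0.2.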

"if (hostFailureFraction != -1) " - Float comparison
- Fixed.
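
The comment does not say how the comparison was changed. As one common way to avoid an equality test against a float sentinel, the sketch below uses a range check instead; the class and method names are hypothetical and not taken from the patch.

```java
// Illustrative sketch only; names are assumptions, not the Tez implementation.
final class FailureFractionSketch {
    // Sentinel meaning "fraction could not be computed".
    static final float NOT_COMPUTED = -1f;

    // Returns failedHosts / totalHosts, or the sentinel when there are no hosts.
    static float hostFailureFraction(int failedHosts, int totalHosts) {
        if (totalHosts <= 0) {
            return NOT_COMPUTED;
        }
        return (float) failedHosts / totalHosts;
    }

    // Range check instead of `fraction != -1`: any valid fraction is >= 0,
    // and NaN also fails this test, so it is treated as "not computed".
    static boolean isComputed(float fraction) {
        return fraction >= 0f;
    }
}
```

The range check also rejects NaN, which an inequality test against -1 would let through.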

"failedShufflesSinceLastCompletion" - Looking at this some more - do we need some mechanism to disable this?
- Fixed. Added "TEZ_RUNTIME_SHUFFLE_FAILED_CHECK_SINCE_LAST_COMPLETION" to disable this, along with a test.

"fetcherHealthy"
- Fixed it to compute with maxAllowedFailedFetchFraction.

Will commit it once Jenkins passes.


> Consider improving fetch failure handling
> -----------------------------------------
>
>                 Key: TEZ-2882
>                 URL: https://issues.apache.org/jira/browse/TEZ-2882
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-2882.1.patch, TEZ-2882.2.patch, TEZ-2882.3.patch, TEZ-2882.4.patch, TEZ-2882.5.patch
>
>



