Posted to issues@tez.apache.org by "Rajesh Balamohan (JIRA)" <ji...@apache.org> on 2015/10/21 19:09:27 UTC
[jira] [Comment Edited] (TEZ-2882) Consider improving fetch failure handling
[ https://issues.apache.org/jira/browse/TEZ-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14967481#comment-14967481 ]
Rajesh Balamohan edited comment on TEZ-2882 at 10/21/15 5:09 PM:
-----------------------------------------------------------------
Thanks [~sseth]
This is the one I'm concerned about - and think is a candidate for special casing.
- For 1 input, "hasFailedAcrossNodes()" would take at least 16 failures before returning true. This should be good for small clusters? Basically, it is a combination of TEZ_RUNTIME_SHUFFLE_ACCEPTABLE_HOST_FETCH_FAILURE_FRACTION and a threshold governed by TEZ_RUNTIME_SHUFFLE_MIN_FAILURES_PER_HOST.
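To make the interaction of the two settings concrete, here is a rough sketch of how a per-host minimum and a cluster-wide fraction could combine. This is only an illustration under stated assumptions, not the logic in the patch: the class name, method name, parameters, and the exact comparison are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; NOT the Tez implementation of hasFailedAcrossNodes().
class HostFailureSketch {

    /**
     * Returns true when the share of "bad" hosts exceeds the acceptable fraction.
     * A host counts as bad only once it has at least minFailuresPerHost failures.
     */
    static boolean failedAcrossHosts(Map<String, Integer> failuresByHost,
                                     int totalHosts,
                                     float acceptableFraction,
                                     int minFailuresPerHost) {
        if (totalHosts <= 0) {
            throw new IllegalArgumentException("totalHosts must be positive");
        }
        int badHosts = 0;
        for (int failures : failuresByHost.values()) {
            if (failures >= minFailuresPerHost) {
                badHosts++;
            }
        }
        // Assumed rule: strictly more than the acceptable fraction of hosts are bad.
        return (float) badHosts / totalHosts > acceptableFraction;
    }

    public static void main(String[] args) {
        Map<String, Integer> failures = new HashMap<>();
        failures.put("host1", 4);
        failures.put("host2", 1);
        // Illustrative values only; these are not the Tez defaults.
        System.out.println(failedAcrossHosts(failures, 10, 0.2f, 4));
    }
}
```

With a rule of this shape, a single source host on a small cluster has to cross the per-host minimum before the fraction check can trip at all, which is the behaviour being discussed above.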
"if (hostFailureFraction != -1) " - Float comparison
- Fixed.
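The actual change is in the attached patch and is not reproduced here. As a general illustration, one common way to avoid an equality test against a float sentinel is to treat any negative value as "unset". The class and method names below are hypothetical.

```java
// Hypothetical sketch; not taken from the TEZ-2882 patch.
class FractionSentinelSketch {

    // Assumed sentinel meaning "no fraction configured".
    static final float UNSET = -1f;

    /**
     * Range check instead of (fraction != -1): any negative value,
     * and NaN, is treated as unset.
     */
    static boolean isConfigured(float fraction) {
        return fraction >= 0f;
    }

    public static void main(String[] args) {
        System.out.println(isConfigured(UNSET));
        System.out.println(isConfigured(0.2f));
    }
}
```

The range check also tolerates other negative values a user might set by mistake, rather than matching exactly one sentinel.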
"failedShufflesSinceLastCompletion" - Looking at this some more - do we need some mechanism to disable this?
- Fixed. Added "TEZ_RUNTIME_SHUFFLE_FAILED_CHECK_SINCE_LAST_COMPLETION" to disable this, along with a test.
"fetcherHealthy"
- Fixed it to compute with maxAllowedFailedFetchFraction.
Will commit it once Jenkins passes.
was (Author: rajesh.balamohan):
Thanks @sseth
This is the one I'm concerned about - and think is a candidate for special casing.
- For 1 input, "hasFailedAcrossNodes()" would take at least 16 failures before returning true. This should be good for small clusters? Basically, it is a combination of TEZ_RUNTIME_SHUFFLE_ACCEPTABLE_HOST_FETCH_FAILURE_FRACTION and a threshold governed by TEZ_RUNTIME_SHUFFLE_MIN_FAILURES_PER_HOST.
"if (hostFailureFraction != -1) " - Float comparison
- Fixed.
"failedShufflesSinceLastCompletion" - Looking at this some more - do we need some mechanism to disable this?
- Fixed. Added "TEZ_RUNTIME_SHUFFLE_FAILED_CHECK_SINCE_LAST_COMPLETION" to disable this, along with a test.
"fetcherHealthy"
- Fixed it to compute with maxAllowedFailedFetchFraction.
Will commit it once Jenkins passes.
> Consider improving fetch failure handling
> -----------------------------------------
>
> Key: TEZ-2882
> URL: https://issues.apache.org/jira/browse/TEZ-2882
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Rajesh Balamohan
> Assignee: Rajesh Balamohan
> Attachments: TEZ-2882.1.patch, TEZ-2882.2.patch, TEZ-2882.3.patch, TEZ-2882.4.patch, TEZ-2882.5.patch
>
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)