
Posted to issues@tez.apache.org by "Rajesh Balamohan (JIRA)" <ji...@apache.org> on 2015/10/12 04:09:05 UTC

[jira] [Updated] (TEZ-2882) Consider improving fetch failure handling

     [ https://issues.apache.org/jira/browse/TEZ-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rajesh Balamohan updated TEZ-2882:
----------------------------------
    Attachment: TEZ-2882.1.patch

Changes:
- Reduced abortFailureLimit from 30 to 15. This should be fine for detecting read/connect issues, but it could be aggressive when the server returns a 500 internal server error (e.g. the server had disk issues: it was able to read the index file properly, but hit disk errors while streaming the actual contents and ended up throwing a 500 internal server error). In such cases, reducing this value from 30 to 15 might cause failures slightly more aggressively. This should be OK, since with a 500 internal server error there is hardly a chance for the server to report healthy output.
- Added the ability to detect failure rates since the last progress. Task health is checked based on this, which should improve the accuracy of deciding whether the consumer or the source has to be restarted. Also, the consumer would be restarted only when errors have happened across 20% of the hosts (e.g. failing to fetch from 1 host while succeeding from others is likely that 1 host's problem; failing to fetch from a large number of hosts is likely caused by the consumer).
- Added a set of tests for this. Added a simple test for checking the penalty as well.
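
To illustrate the 20%-of-hosts check described above, here is a minimal sketch. It is not the code in TEZ-2882.1.patch: the class name, method names and the choice to reset the failed-host set on progress are all assumptions made for illustration. Only the 20% figure comes from the description above.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical sketch of a per-consumer health check.
 * Not the actual Tez implementation; all names here are illustrative.
 */
public class FetchHealthSketch {
    /** Percentage of hosts that must have failed before the consumer is suspected (20%, from the note above). */
    static final int HOST_FAILURE_PERCENT = 20;

    private final int totalHosts;

    // Hosts with at least one failed fetch since the last progress.
    private final Set<String> failedHostsSinceProgress = new HashSet<>();

    public FetchHealthSketch(int totalHosts) {
        if (totalHosts <= 0) {
            throw new IllegalArgumentException("totalHosts must be positive");
        }
        this.totalHosts = totalHosts;
    }

    /** Record a failed fetch from the given host. */
    public void onFetchFailure(String host) {
        failedHostsSinceProgress.add(host);
    }

    /** Assumption: any progress resets the failure window. */
    public void onProgress() {
        failedHostsSinceProgress.clear();
    }

    /**
     * True when failures span at least 20% of the hosts, which points at the
     * consumer. Below that, the failing source hosts are the likelier culprits.
     * Integer arithmetic avoids floating-point rounding at the boundary.
     */
    public boolean consumerLooksUnhealthy() {
        return failedHostsSinceProgress.size() * 100 >= totalHosts * HOST_FAILURE_PERCENT;
    }
}
```

With 10 source hosts in this sketch, a failure on a single host leaves the consumer considered healthy, while failures on two distinct hosts reach the 20% mark.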

Not covered in this:
- In case the producer host gets restarted, the consumer could get a 404 error. This is handled in the same way as other types of read exceptions (e.g. 500 internal server error, shuffle header mismatch, etc.). Ideally, it might be good to restart the producer as soon as possible on the AM side on observing a 404 (instead of waiting for the retry cycle). This can be addressed in a separate ticket, as it would not cause any job hang currently.

[~sseth] - Please review when you find time.

> Consider improving fetch failure handling
> -----------------------------------------
>
>                 Key: TEZ-2882
>                 URL: https://issues.apache.org/jira/browse/TEZ-2882
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-2882.1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)