Posted to issues@tez.apache.org by "László Bodor (Jira)" <ji...@apache.org> on 2020/04/10 06:45:00 UTC

[jira] [Comment Edited] (TEZ-4139) Tez should consider node information for computing failure fraction

    [ https://issues.apache.org/jira/browse/TEZ-4139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080277#comment-17080277 ] 

László Bodor edited comment on TEZ-4139 at 4/10/20, 6:44 AM:
-------------------------------------------------------------

[~rajesh.balamohan]: could you please take a look at [^TEZ-4139.01.WIP.patch]?
Basically, I've changed the code to store attempt failures per source host. The tez-dag unit tests still pass, because I've adapted the calculation to the new data structure:
{code}
  private Map<String, Map<TezTaskAttemptID, Long>> uniquefailedOutputReports = Maps.newHashMap();
{code}
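To make the shape of this structure concrete, here is a minimal, self-contained sketch of per-host bookkeeping. It is not taken from the patch: the class and method names are hypothetical, plain Strings stand in for TezTaskAttemptID so it compiles without Tez, and deriving the total by summing the per-host sizes is an assumption about how totalUniqueReportsCount could be computed.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the per-host bookkeeping described above.
// Strings replace TezTaskAttemptID so the sketch has no Tez dependency.
class PerHostFailureReports {
  // host -> (reporting downstream attempt id -> time of its first report)
  private final Map<String, Map<String, Long>> reportsByHost = new HashMap<>();

  /** Records a failure report; a repeated report from the same attempt on the same host is ignored. */
  void report(String host, String attemptId, long timeMillis) {
    reportsByHost
        .computeIfAbsent(host, h -> new HashMap<>())
        .putIfAbsent(attemptId, timeMillis);
  }

  /** Number of unique reporting attempts recorded for one host. */
  int uniqueReportsForHost(String host) {
    Map<String, Long> reports = reportsByHost.get(host);
    return reports == null ? 0 : reports.size();
  }

  /** Assumed total: the sum of the per-host unique counts. */
  int totalUniqueReports() {
    int total = 0;
    for (Map<String, Long> reports : reportsByHost.values()) {
      total += reports.size();
    }
    return total;
  }
}
```

Note that with this assumed total, an attempt that reports under two different hosts is counted twice; whether that matches the original count depends on how the patch defines it.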
My question is: how exactly should the per-host information be taken into account? Consider the current failureFraction calculation:
{code}
   float failureFraction = runningTasks > 0 ? ((float) totalUniqueReportsCount) / runningTasks : 0;
{code}
In the example above, the denominator (runningTasks) is the number of running tasks in the actual vertex, and totalUniqueReportsCount is the original count (I made it work the same way as before, regardless of the underlying data structure).

If I want to change this calculation to take the failures for a given host into account, how should I change the denominator? (For the numerator, I'll most probably switch to the failure count per host.)

1. Leave the denominator unchanged: the failure fraction will then be lower, which I guess is not the intention (this would only work if the user sets "tez.task.max.allowed.output.failures.fraction" to a lower value).

2. Change the denominator somehow, maybe to reflect some "per vertex" number?

(3. Leave the denominator unchanged and introduce something like "tez.task.max.allowed.output.failures.fraction.per.source.host", set to a lower value by default?)
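To illustrate the trade-off between options 1 and 3, here is a rough arithmetic sketch. The class, the helper names and all numbers are hypothetical; only the fraction formula itself is taken from the snippet above. Option 2 is not sketched, since the replacement denominator is still an open question.

```java
// Hypothetical helper comparing the numerators discussed above.
class FailureFractionOptions {
  /** Same formula as in the comment: a count of unique reports over the running tasks. */
  static float fraction(int uniqueReports, int runningTasks) {
    return runningTasks > 0 ? ((float) uniqueReports) / runningTasks : 0;
  }

  /** True while the fraction does not exceed the configured maximum. */
  static boolean withinLimit(float fraction, float maxAllowedFraction) {
    return fraction <= maxAllowedFraction;
  }

  public static void main(String[] args) {
    int runningTasks = 10;      // made-up number of running tasks
    int totalReports = 4;       // made-up unique reports across all hosts
    int reportsForHost = 2;     // made-up unique reports for a single host
    float globalLimit = 0.25f;  // made-up value for the existing limit
    float perHostLimit = 0.1f;  // made-up value for a lower per-host limit (option 3)

    float current = fraction(totalReports, runningTasks);
    float perHost = fraction(reportsForHost, runningTasks);

    // Current behaviour: all reports against the existing limit.
    System.out.println("current within limit: " + withinLimit(current, globalLimit));
    // Option 1: per-host numerator, same denominator, same limit.
    System.out.println("option 1 within limit: " + withinLimit(perHost, globalLimit));
    // Option 3: per-host numerator, same denominator, separate lower limit.
    System.out.println("option 3 within limit: " + withinLimit(perHost, perHostLimit));
  }
}
```

With these made-up numbers, the per-host fraction stays under the existing limit even though the overall fraction exceeds it, which is the concern behind option 1; a separate, lower per-host limit as in option 3 would flag the host again.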


was (Author: abstractdog):
[~rajesh.balamohan]: could you please take a look at  [^TEZ-4139.01.WIP.patch] ?
basically I've changed to store attempt failures per source host
{code}
  private Map<String, Map<TezTaskAttemptID, Long>> uniquefailedOutputReports = Maps.newHashMap();
{code}
my question is, how could this be considered exactly? given the failureFraction calculation:
{code}
   float failureFraction = runningTasks > 0 ? ((float) totalUniqueReportsCount) / runningTasks : 0;
{code}
in this example above, the denominator (runningTasks) is the number of running tasks in the actual vertex, and totalUniqueReportsCount is the original count (I made it work in the same way as earlier, regardless of the underlying data structure)

if I want to change this calculation to take the failures for a given host into account, how should I change the denominator?  (as in the numerator, I'll most probably change to failure count per host)

1. by not changing denominator, I'll have a lower amount of failure fraction, which is not the intention I guess (this will only work if user sets "tez.task.max.allowed.output.failures.fraction" to a lower value)

2. changing the denominator somehow? maybe to reflect some "per vertex" number

(3. not changing the denominator and introducing something like "tez.task.max.allowed.output.failures.fraction.per.source.host" and set it to a lower value by default?)

> Tez should consider node information for computing failure fraction
> -------------------------------------------------------------------
>
>                 Key: TEZ-4139
>                 URL: https://issues.apache.org/jira/browse/TEZ-4139
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: László Bodor
>            Priority: Major
>         Attachments: TEZ-4139.01.WIP.patch
>
>
> When many downstream attempts fail to pull the information from a source task, the source task is marked as failed and is retried. Currently, the failure fraction is computed by looking at the unique downstream task attempts. However, it should also take node information into account when computing "failureFraction".
> https://github.com/apache/tez/blob/master/tez-dag/src/main/java/org/apache/tez/dag/app/dag/impl/TaskAttemptImpl.java#L1845-L1849


