Posted to commits@samza.apache.org by "Hai Lu (Jira)" <ji...@apache.org> on 2019/11/07 00:07:01 UTC

[jira] [Updated] (SAMZA-2266) Introduce a backoff when there are repeated failures for host-affinity allocations

     [ https://issues.apache.org/jira/browse/SAMZA-2266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hai Lu updated SAMZA-2266:
--------------------------
    Fix Version/s: 1.3

> Introduce a backoff when there are repeated failures for host-affinity allocations
> ----------------------------------------------------------------------------------
>
>                 Key: SAMZA-2266
>                 URL: https://issues.apache.org/jira/browse/SAMZA-2266
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Daniel Nishimura
>            Assignee: Daniel Nishimura
>            Priority: Major
>             Fix For: 1.3
>
>          Time Spent: 9h
>  Remaining Estimate: 0h
>
> The issue is that we retry allocation of a dead container (and retry again on each subsequent failure) within a very small window of time (<1min).
> We have observed that NMs take ~2mins to report themselves as unhealthy to the RM.
> If a job has host-affinity enabled, we therefore allocate containers on the same unhealthy host multiple times, and the repeated failures eventually kill the application.
> This ticket is to evaluate the feasibility of, and possibly implement, a fix that introduces a time backoff on retries of container allocation on the same host, so that we eventually get a different host once the unhealthy NM's status is updated.
> We may also want to look into abandoning host-affinity on the 8th attempt of restarting a container, so we don't kill the entire job.
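
The backoff idea described above could be sketched roughly as follows. This is only an illustration, not the actual Samza change: the class HostRetryBackoff, its method names, and the delay and attempt values are all hypothetical. It keeps a per-host failure count, doubles the wait before the same host may be requested again, and reports when host-affinity should be dropped for a container.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Samza's API. */
public class HostRetryBackoff {
  private final long baseDelayMs;
  private final long maxDelayMs;
  private final int maxAffinityAttempts;
  // Per-host failure count and timestamp of the most recent failure.
  private final Map<String, Integer> failures = new HashMap<>();
  private final Map<String, Long> lastFailureMs = new HashMap<>();

  public HostRetryBackoff(Duration baseDelay, Duration maxDelay, int maxAffinityAttempts) {
    this.baseDelayMs = baseDelay.toMillis();
    this.maxDelayMs = maxDelay.toMillis();
    this.maxAffinityAttempts = maxAffinityAttempts;
  }

  /** Record that a container placed on the given host has failed. */
  public void recordFailure(String host, long nowMs) {
    failures.merge(host, 1, Integer::sum);
    lastFailureMs.put(host, nowMs);
  }

  /** Exponential delay: base, 2x base, 4x base, ... capped at maxDelay. */
  public Duration delayFor(String host) {
    int n = failures.getOrDefault(host, 0);
    if (n == 0) {
      return Duration.ZERO;
    }
    int shift = Math.min(n - 1, 20); // bound the shift to avoid overflow
    return Duration.ofMillis(Math.min(baseDelayMs << shift, maxDelayMs));
  }

  /** True once enough time has passed to request the same host again. */
  public boolean mayRetryOnSameHost(String host, long nowMs) {
    Long last = lastFailureMs.get(host);
    return last == null || nowMs - last >= delayFor(host).toMillis();
  }

  /** True when the caller should fall back to requesting any host. */
  public boolean shouldAbandonHostAffinity(String host) {
    return failures.getOrDefault(host, 0) >= maxAffinityAttempts;
  }
}
```

With an assumed base delay of 30 seconds, the waits after the first three failures on a host are 30s, 60s and 120s, so by the third failure the wait is on the order of the ~2mins an NM takes to report itself unhealthy. The right base delay, cap, and attempt limit would need to be tuned against real cluster behavior.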



--
This message was sent by Atlassian Jira
(v8.3.4#803005)