Posted to commits@samza.apache.org by "Hai Lu (Jira)" <ji...@apache.org> on 2019/11/07 00:07:01 UTC
[jira] [Updated] (SAMZA-2266) Introduce a backoff when there are
repeated failures for host-affinity allocations
[ https://issues.apache.org/jira/browse/SAMZA-2266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hai Lu updated SAMZA-2266:
--------------------------
Fix Version/s: 1.3
> Introduce a backoff when there are repeated failures for host-affinity allocations
> ----------------------------------------------------------------------------------
>
> Key: SAMZA-2266
> URL: https://issues.apache.org/jira/browse/SAMZA-2266
> Project: Samza
> Issue Type: Bug
> Reporter: Daniel Nishimura
> Assignee: Daniel Nishimura
> Priority: Major
> Fix For: 1.3
>
> Time Spent: 9h
> Remaining Estimate: 0h
>
> The issue is that we retry allocation of a dead container, and retry again on each subsequent failure, within a very small window of time (<1 min).
> NMs (NodeManagers) have been observed to take ~2 min to report themselves as unhealthy to the RM (ResourceManager).
> If a job has host-affinity enabled, we therefore allocate containers on the same unhealthy host multiple times, and the repeated failures eventually kill the application.
> This ticket is to evaluate the feasibility of, and possibly implement, a fix that introduces a time backoff on retries of container allocation on the same host, so that we eventually get a different host once the unhealthy NM's status is updated.
> We may also want to look into abandoning host-affinity on the 8th attempt of restarting a container, so that we don't kill the entire job.
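
The two ideas in the description (a time backoff between retries on the same host, and giving up on host-affinity after repeated failures) could be sketched roughly as follows. This is only an illustration and not the fix that was committed for this ticket: the class HostAffinityBackoff, its method names, the "ANY_HOST" placeholder string, and every delay and threshold value are assumptions made for the example.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Samza's API. */
final class HostAffinityBackoff {
    /** Placeholder meaning "no host preference" (assumed name). */
    static final String ANY_HOST = "ANY_HOST";

    private final Duration baseDelay;
    private final Duration maxDelay;
    private final int maxPreferredHostAttempts;
    // Consecutive failure count per container id.
    private final Map<String, Integer> failuresByContainer = new HashMap<>();

    HostAffinityBackoff(Duration baseDelay, Duration maxDelay, int maxPreferredHostAttempts) {
        this.baseDelay = baseDelay;
        this.maxDelay = maxDelay;
        this.maxPreferredHostAttempts = maxPreferredHostAttempts;
    }

    /** Records one failure and returns how long to wait before re-requesting. */
    Duration recordFailure(String containerId) {
        int failures = failuresByContainer.merge(containerId, 1, Integer::sum);
        return delayFor(failures);
    }

    /** Clears the failure history once the container runs successfully. */
    void recordSuccess(String containerId) {
        failuresByContainer.remove(containerId);
    }

    /** Exponential backoff: the base delay doubles per failure, capped at maxDelay. */
    Duration delayFor(int failures) {
        if (failures <= 0) {
            return Duration.ZERO;
        }
        long cap = maxDelay.toMillis();
        long delay = baseDelay.toMillis();
        for (int i = 1; i < failures && delay < cap; i++) {
            delay *= 2;
        }
        return Duration.ofMillis(Math.min(delay, cap));
    }

    /** Keeps the preferred host until the failure threshold is reached, then drops affinity. */
    String hostFor(String containerId, String preferredHost) {
        int failures = failuresByContainer.getOrDefault(containerId, 0);
        return failures < maxPreferredHostAttempts ? preferredHost : ANY_HOST;
    }
}
```

With an assumed base delay of 30 seconds and a 4 minute cap, the third consecutive failure already waits 2 minutes, which is about the time an NM was observed to need before reporting itself unhealthy. The threshold for dropping host-affinity is a constructor parameter, so it can be set below the point at which the job itself would be killed.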
--
This message was sent by Atlassian Jira
(v8.3.4#803005)