Posted to commits@samza.apache.org by "Jake Maes (JIRA)" <ji...@apache.org> on 2017/11/23 00:37:00 UTC

[jira] [Updated] (SAMZA-1508) JobRunner should not return success until the job is healthy

     [ https://issues.apache.org/jira/browse/SAMZA-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jake Maes updated SAMZA-1508:
-----------------------------
    Description: 
It can be frustrating for users when run-app.sh returns success before the job is fully running.

This happens because the JobRunner currently waits for JobStatus=RUNNING, but in YARN, for example, that status is reached when the AM is launched, not when all of the containers are launched.
What can go wrong?
1. The job could stay stuck waiting for containers that it can't get because of capacity issues or an outage.
2. The job containers may immediately fail due to a runtime error.

In both cases, the user may go on their merry way because run-app.sh returned successfully, even though the job is already dead. They may not get alerted for some time.
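
For context, the current flow is roughly the following. This is a minimal sketch from memory, not the actual JobRunner source; the StreamJob/ApplicationStatus signatures shown may differ slightly from the real API:
{code:java}
import org.apache.samza.config.Config;
import org.apache.samza.job.ApplicationStatus;
import org.apache.samza.job.StreamJob;
import org.apache.samza.job.StreamJobFactory;

// Sketch of today's "wait for RUNNING" behavior (assumed signatures).
public class CurrentRunnerSketch {
  static boolean runAndWait(StreamJobFactory jobFactory, Config config, long timeoutMs) {
    StreamJob job = jobFactory.getJob(config);
    job.submit();
    // On YARN this unblocks as soon as the AM reports RUNNING, which can be
    // long before any task containers are actually up.
    ApplicationStatus status = job.waitForStatus(ApplicationStatus.Running, timeoutMs);
    return status == ApplicationStatus.Running;  // run-app.sh exits 0 on true
  }
}
{code}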

How do we fix it?
There are a few ways to fix it, each one progressively harder but progressively better:
1. Make the JobRunner reach out to the AM and monitor the needed-containers metric until it reaches 0
2. Expose a new health endpoint in the AM that only reports healthy once a heartbeat has been received from each of the containers, and have the JobRunner wait on it (with a timeout)
3. Expose a hook where users can write custom logic to determine job health

I think #1 gives the most bang for the buck, and its implementation can easily be extended to #2 later.
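
To make #1 concrete, below is a minimal sketch of the wait the JobRunner could perform. It assumes the AM serves its metrics as JSON over HTTP and that a "needed-containers" gauge appears in that output; the URL, metric name, and timeouts are placeholders, not the real endpoint:
{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: polls a hypothetical AM metrics URL until "needed-containers"
// reaches 0, or gives up after the timeout.
public class AmContainerWait {
  private static final Pattern NEEDED_CONTAINERS =
      Pattern.compile("\"needed-containers\"\\s*:\\s*(\\d+)");

  static boolean waitForAllContainers(String amMetricsUrl, long timeoutMs, long pollIntervalMs)
      throws Exception {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
      Matcher m = NEEDED_CONTAINERS.matcher(fetch(amMetricsUrl));
      if (m.find() && Integer.parseInt(m.group(1)) == 0) {
        return true;  // all requested containers have been allocated
      }
      Thread.sleep(pollIntervalMs);
    }
    return false;  // still waiting for containers when the timeout expired
  }

  private static String fetch(String url) throws Exception {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setConnectTimeout(5000);
    conn.setReadTimeout(5000);
    StringBuilder body = new StringBuilder();
    try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = in.readLine()) != null) {
        body.append(line);
      }
    } finally {
      conn.disconnect();
    }
    return body.toString();
  }
}
{code}
If the metric never reaches 0 within the timeout, run-app.sh would exit non-zero instead of reporting success.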

Other notes:
I don't think this is needed for standalone, since users deploy the processors themselves and can monitor those processes directly.


> JobRunner should not return success until the job is healthy
> ------------------------------------------------------------
>
>                 Key: SAMZA-1508
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1508
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Jake Maes
>            Assignee: Jake Maes
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)