Posted to commits@samza.apache.org by "Sriram Subramanian (JIRA)" <ji...@apache.org> on 2015/02/03 10:59:37 UTC

[jira] [Commented] (SAMZA-516) Support standalone Samza jobs

    [ https://issues.apache.org/jira/browse/SAMZA-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303026#comment-14303026 ] 

Sriram Subramanian commented on SAMZA-516:
------------------------------------------

Thanks Chris for the writeup. I am assuming your latest design doc has all the proposed changes.

1. Failover - 
    I can see three approaches for this. 

a. Standalone job starts containers as subprocesses (see the sketch after the pros/cons below) - 
     Pros
     - Existing containers do not need to be restarted in order to fail over the containers that were part of the failed standalone job
     - The degree of parallelism defined by jobs.container.count is unaffected. Even under failure, we maintain the same container count
     - The state of the existing containers is unaffected, and state restoration needs to happen only for the failed containers. This avoids any code that checks whether state is already available locally in order to skip restoration.
     Cons
     - Adds another level of hierarchy, but in practice it may not be that bad. You can map the standalone jobs to node managers in the YARN world; the containers are simply monitored by the standalone job, similar to YARN.
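
To make (a) concrete, here is a minimal sketch of what the parent standalone job could look like. It is only an illustration: the launch script name, the --container-id flag, and the restart loop are assumptions, not existing Samza interfaces.

import java.util.ArrayList;
import java.util.List;

public class StandaloneContainerLauncher {
    public static void main(String[] args) throws Exception {
        int containerCount = Integer.parseInt(args[0]);
        List<Process> containers = new ArrayList<>();

        // Start each container as a child JVM process.
        for (int id = 0; id < containerCount; id++) {
            containers.add(startContainer(id));
        }

        // Monitor the children and restart only the one that died, so the
        // healthy containers (and their local state) are left untouched.
        while (true) {
            for (int id = 0; id < containers.size(); id++) {
                if (!containers.get(id).isAlive()) {
                    containers.set(id, startContainer(id));
                }
            }
            Thread.sleep(1000);
        }
    }

    // Hypothetical launch command; a real job would invoke the Samza
    // container entry point with its full configuration.
    private static Process startContainer(int id) throws Exception {
        return new ProcessBuilder("bin/run-container.sh", "--container-id", String.valueOf(id))
            .inheritIO()
            .start();
    }
}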

b. Move tasks to other containers on failure
     Pros
     - On a standalone job failure, ownership of its tasks gets transferred via a rebalance. I am not sure this makes things any simpler. 
    Cons
     - The tasks can get redistributed amongst all the containers, and in the worst case this restarts every container in the job
     - Would need more code to make containers mutable, and I kind of dislike that. It seems easier to reason about assignments when they are immutable
     - We would need special code to avoid restoring state from Kafka for stores that are already present locally (a hypothetical sketch of such a check follows this list)
     - The jobs.container.count config no longer holds. The number of containers does not remain constant and can actually shrink over time if we have multiple failures.
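
For the state-restoration con above, the "special code" would amount to something like the following check before replaying the changelog. The OFFSET marker file and its format are invented here purely for illustration; they are not Samza's actual on-disk layout.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LocalStateCheck {
    // Returns true if the store directory already holds state restored up to
    // the checkpointed changelog offset, so replaying from Kafka can be skipped.
    static boolean canSkipRestore(Path storeDir, long checkpointedOffset) throws IOException {
        Path offsetFile = storeDir.resolve("OFFSET"); // assumed marker file
        if (!Files.exists(offsetFile)) {
            return false; // no local copy of the store at all
        }
        long localOffset = Long.parseLong(Files.readAllLines(offsetFile).get(0).trim());
        return localOffset >= checkpointedOffset;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("skip restore: "
            + canSkipRestore(Paths.get(args[0]), Long.parseLong(args[1])));
    }
}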

c. Have one standalone job per container

Here is another proposal

   We do not restrict ourselves to one standalone job per machine. Instead, we can have n standalone jobs per machine, where n is determined by the job coordinator. A single standalone job runs exactly one container, which maintains a 1:1 mapping between process and container.
 
Instead of starting a standalone job on each machine, the user simply starts one standalone job on any one of the machines. On startup, that job tries to become the job coordinator and, if it succeeds, proceeds to the assignment process. The job coordinator gets the list of machines and the number of containers, and assigns containers to machines (a simple round-robin strategy should be good enough). It then writes this assignment to ZK and starts one standalone job for each container. The container itself is left completely as is and does not require any change. The standalone jobs listen to ZK for job coordinator changes and for electing a new coordinator (this is no different from what you suggest). On a failover election, the new coordinator simply reads the assignment from ZK, checks that it still holds, and if not does a reassignment (this is required in all the approaches).
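
Roughly, the election-plus-assignment flow could look like the sketch below. It assumes Apache Curator for the ZooKeeper interaction; the ZK paths, the machine list, and the assignment encoding are all made up for illustration and are not an existing Samza API.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

import java.util.Arrays;
import java.util.List;

public class StandaloneJobCoordinator {
    public static void main(String[] args) throws Exception {
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
            "zk1:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();

        // Every standalone job races to become the job coordinator.
        LeaderLatch latch = new LeaderLatch(zk, "/samza/my-job/coordinator");
        latch.start();
        latch.await(); // blocks until this process wins the election

        // The winner assigns containers to machines round robin and records
        // the assignment in ZK so the other processes can pick it up.
        List<String> machines = Arrays.asList("host-a", "host-b", "host-c"); // assumed input
        int containerCount = 6; // e.g. jobs.container.count
        for (int c = 0; c < containerCount; c++) {
            String machine = machines.get(c % machines.size());
            zk.create().creatingParentsIfNeeded().forPath(
                "/samza/my-job/assignments/container-" + c, machine.getBytes());
        }
        // Each machine then starts one standalone job (one JVM) per container
        // assigned to it, preserving the 1:1 process-to-container mapping.
    }
}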

Pros

 - Keeps containers immutable and maintains a 1:1 process-to-container mapping
 - Starting a job is a lot easier: simply start one standalone job on any machine.
 - On failure, state needs to be restored only for the failed containers.
 - There is no scenario where all containers have to be restarted when one of them changes

Cons

 - Running one JVM process per container adds some overhead, but it should be minimal


 2. Standalone coordinator 

I really prefer that the coordinator not be a standalone process, since we would then need to implement an HA story around it. Having each standalone job redirect to the coordinator should solve the problem of the jumping UI, and should be less complex than coming up with an HA story. Also, with one container per standalone job process, leaking containers on a coordinator failure should not be an issue.
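
As a sketch of the redirection idea, each standalone job could expose a tiny HTTP endpoint that issues a 302 to whichever process currently holds the coordinator role. The port and the coordinator lookup are placeholders; in a real job the coordinator URL would be read from ZK.

import com.sun.net.httpserver.HttpServer;

import java.net.InetSocketAddress;

public class CoordinatorRedirect {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/", exchange -> {
            // Placeholder: a real job would look up the current coordinator's
            // URL from the coordinator znode in ZK.
            String coordinatorUrl = "http://coordinator-host:8081/";
            exchange.getResponseHeaders().add("Location", coordinatorUrl);
            exchange.sendResponseHeaders(302, -1); // redirect with no body
            exchange.close();
        });
        server.start();
    }
}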


> Support standalone Samza jobs
> -----------------------------
>
>                 Key: SAMZA-516
>                 URL: https://issues.apache.org/jira/browse/SAMZA-516
>             Project: Samza
>          Issue Type: Bug
>          Components: container
>    Affects Versions: 0.9.0
>            Reporter: Chris Riccomini
>            Assignee: Chris Riccomini
>              Labels: design, project
>         Attachments: DESIGN-SAMZA-516-0.md, DESIGN-SAMZA-516-0.pdf, DESIGN-SAMZA-516-1.md, DESIGN-SAMZA-516-1.pdf
>
>
> Samza currently supports two modes of operation out of the box: local and YARN. With local mode, a single Java process starts the JobCoordinator, creates a single container, and executes it locally. All partitions are processed within this container. With YARN, a YARN grid is required to execute the Samza job. In addition, SAMZA-375 introduces a patch to run Samza in Mesos.
> There have been several requests lately to be able to run Samza jobs without any resource manager (YARN, Mesos, etc.), but still run them in a distributed fashion.
> The goal of this ticket is to design and implement a samza-standalone module, which will:
> # Support executing a single Samza job in one or more containers.
> # Support failover, in cases where a machine is lost.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)