#  **SEP-22: Container Placements in Samza**

#  **Status**

**Current state**: [ UNDER DISCUSSION ]

**Discussion thread**: <link to mailing list DISCUSS thread>

**JIRA**: SAMZA-TBD

**Released**:

#  **Problem**

Samza operates in a multi-tenant environment with cluster managers like Yarn
and Mesos, where a single host can run multiple Samza containers. Often, due to
the soft limits configured for cluster managers like Yarn and the lack of dynamic
workload balancing in Samza, a host ends up underperforming and it becomes
desirable to move one or more containers from that host to other hosts. Today
this is not possible without affecting other jobs on the hot host or restarting
the affected job manually. In other use cases, such as resetting checkpoints of a
single container or supporting canary or rolling bounces, the ability to restart
a single container or a subset of containers without restarting the whole job is
highly desirable.

This doc addresses the problem of restarting/moving a single container of a
job without affecting other containers of the same job or other jobs running on
the same host.

#  **Motivation**

**Alleviating Hot Host Problems:** Yarn as a resource manager in Samza is
configured to operate on soft limits (a job sets vcores for a container, or takes
the default; this is the minimum guarantee of resources provided by Yarn), and
most customers go with the defaults and fail to right-size their job. In
addition, there is no notion of a dynamic cluster balancer in Samza at
LinkedIn today, although Yarn to some capacity acts as a cluster balancer if
configured with hard limits. Because of this, a host often lands in a situation
where the containers on it are underperforming (also referred to as a hot host)
because it has CPU-heavy containers running on it while some other hosts are
underutilized. It then becomes desirable to move one of these containers to a
different host. Today, the following workarounds exist:

  1. Rewrite the locality mapping in the coordinator stream and restart the job.
  2. Take the hot host out of rotation, which kills containers from all the jobs running on that host. Then trigger a restart of the job whose container is supposed to be moved, so that Yarn tries to allocate some other host for it. Once the container starts on another host, put the hot host back in rotation so that the other containers that were killed as a result of taking the host out can be restarted on it. It ain't easy!

If the ability to move a container existed, an operator could, at the simplest,
manually move containers to different hosts, or write simple scripts to
automate that.

**Canary / Rolling Bounces:** When there is a bug in the Samza framework code
that affects Samza container deployment, a Samza engineer needs to manually
restart the container process on the given machine with the given binary
version. This involves multiple steps (e.g. manually identifying and logging into
the container host before using kill -9 to stop the process), which is
inconvenient. With restart ability, the same system can be used to build support
for canary or rolling bounces for YARN-based Samza deployments; restart ability
can be easily extended to deploy a single container or a subset of containers
using a different version of application code.

**Resetting Checkpoints:** The Startpoint API has made resetting checkpoints easy,
but it still requires a developer to restart the job once the startpoints are
set. The restart can potentially be avoided with the ability to restart a single
container or a few containers.

**Draining a host:** Moving all running containers from a host to other hosts,
either sequentially or in parallel.

**Fix a Job in Degraded State:** Users often find it desirable to restart just
a single container, for example when only a few containers are underperforming
or are running into exceptions because some partitions have corrupt messages.
Today Samza kills the job if a container has run into exceptions more than a
[fixed configured](https://samza.apache.org/learn/documentation/1.0.0/jobs/configuration-table.html#cluster-manager-container-retry-count)
number of times. In these scenarios, it is desirable for users to keep the job
running in a degraded state, fix only one or a few containers of the job, and
issue a restart for them.

**Dynamic Workload Balancer:** The ability to move and restart containers is
the fundamental building block for developing a load balancer (like Cruise
Control) for Samza. At its simplest, this load balancer can be a script trying
to balance the cluster; later it can be built into a more sophisticated system.

**Heterogeneous containers:** Each Samza job today has homogeneous containers
(in regards to memory & vcore configurations). One of the desired use cases
for Samza in the future is the ability to restart a container with a different
size (memory & CPU), which can be used by any auto-sizing engine designed for
Samza.

**Selectively Spin Up StandBy Containers:** Samza has a Hot StandBy Containers
feature for reducing stateful restoration times. Enabling this feature for
a job involves at least doubling the number of containers (in the simplest case
every container has one standby replica). Customers are reluctant to enable
this since doubling the containers increases the cost to serve. To improve
adoption of this feature, we can build the ability to spin up StandBy
Containers for a single container or a subset of containers while the job is
running; these StandBy Containers can then be used for failover to reduce downtime.

#  **Requirements**

###  Goals

  * Build the ability to move a container (non-AM) of a running job from its existing host to a preferred host (if specified) or any host
    * If an existing stateful container has a standby container, move the active container to the standby container
    * If an existing stateful container is without a standby container, spin up a standby container and then move the active to the standby once the standby has bootstrapped, to reduce downtime [Stretch]
  * Build the ability to restart containers with the same host preferences
  * Build the ability to selectively spin up standbys for one or a subset of containers [Stretch]
  * Expose these APIs via a tool or dashboard for the Samza team to easily use during oncall

###  Non-Goals

  * This system does not intend to solve the problem of dynamic workload balance, i.e. Cruise Control. It may act as a building block for one later.
  * Solving the canary problem for the YARN-based deployment mode is out of scope for this solution; however, the system built should be easily extensible to support canary.
  * This system **will not** have built-in intelligence to find a better host match for a container; it will take simple decisions per the parameters passed by the user.

###  SLA / SCALE / LIMITS

  * At a time, the AM for a single job will only serve one request per container; parallel requests across containers are still supported. If a control request is underway, any other requests issued on the same container will be discarded.
  * The system should be capable of scaling to be used across different jobs at the same time.

###  General Requirements

At a high level the system can be divided into two logical components:

  1. **Container Placement Handler (CPH):** serves as a dispatcher of specific control requests from a control plane. It fetches container placement control actions from the control plane, queues them up, and dispatches requests on containers of a job (non-AM) with a policy (details below).
  2. **Container Placement Service:** an API layer built around the Cluster Manager to move containers. It reacts to control actions from CPH and interacts with the underlying cluster manager to execute them.

  * ####  **Requirements for Container Placement Handler**

  1. Should be able to handle multiple move requests for different containers at the same time for a single job
  2. Should provide a good monitoring mechanism
  3. Should be general enough to be coupled with any system, for example a [Control-plane](https://docs.google.com/document/d/1HMTbXv8vj0SBJB3dKG08LZpjL3JI9WXtL_Fua4DAx_w/edit?pli=1#heading=h.c5d479iogv5o)
  4. Should allow queuing requests for multiple containers across multiple jobs at the same time
  5. Should be easily extensible to build canary support for YARN deployments

  * ####  **Requirements for Container Placement Service**

  1. Under no circumstances should control actions result in an inconsistent / non-working state of the job (i.e., the number of running containers should equal the number of configured containers).
  2. Should not be tightly coupled with YARN; should be extensible to be used with any cluster manager implementation

###  Failure Scenarios & Assumptions

  1. Each control action will be associated with the deployment id of the app it is intended to be issued upon.
  2. Across job restarts, any requests issued for the previous deployment incarnation will not be respected. This includes AM restarts (today an AM restart implies a job restart).

#  **Glossary**

| **Term** | **Description** |
|---|---|
| Cluster Manager | Resource manager used by Samza, e.g. Yarn, Mesos, etc. |
| Job Coordinator (JC) (also referred to as the ApplicationMaster (AM) for Yarn) | Each Samza application has a JC that manages the application's workload, asks the RM for containers, and handles notifications when one of its containers fails. |
| Node Manager (NM) | A single node in a Yarn cluster |
| Resource Manager (RM) | Central service that provides scheduling, heartbeats, and liveness monitoring to all applications running in the YARN cluster |
| Host Affinity | The ability of Samza to allocate a container to the same machine across job restarts |
| Container processorId | Samza allocates containers resource ids, e.g. 0, 1, 2, which remain the same across restarts of a job |
  
#  **Proposed Changes**

###  API design

On the basis of the types of control actions, the commands are the following:

####  **1\. Restart Container**

_API_: restartContainer

_Description_:

_Active Container:_ Stop the container process on the source host and start it for:

  1. A stateless job on:
    1. The destination host (if specified; the destination host can be the source host as well)
    2. Any host otherwise (except the source host)
  2. A stateful job on:
    1. The destination host (if specified; the destination host can be the source host as well)
    2. The StandBy Container's host (if present)
    3. Any host otherwise (except the source host)

_StandBy Container:_ Stop the container process on the source host and start it on:

  1. The destination host (if specified and it matches the [StandBy Constraints](https://docs.google.com/document/d/18EdLZgrW-ZMIypqI2YX0YsAnTa0Zf_0Rt4tQxkLvZ10/edit#heading=h.8pqrp0s0raw0))
  2. Any host otherwise that matches the [StandBy Constraints](https://docs.google.com/document/d/18EdLZgrW-ZMIypqI2YX0YsAnTa0Zf_0Rt4tQxkLvZ10/edit#heading=h.8pqrp0s0raw0)

_Parameters_:

  * processor-id: Samza resource id of the container, e.g. 0, 1, 2
  * destination-host: container host URL [optional]

_Status codes_:

  * 202: Accepted
  * 401: Unauthorized
  * 409: Conflict (a control action is already active on the container)

_Returns_: Since this is an async API, this command returns a request-id which the user can then query for status.

_Failure Scenarios_: There are two cases under which a restart request might fail:

  1. The active container fails to stop; in this case we mark the request as failed.
  2. The stopped active container fails to start on the destination host; in this case we can set a flexible policy to either attempt a start on the source host or restart anywhere.
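To make the asynchronous contract concrete, here is a minimal, illustrative sketch of what issuing a restartContainer request could look like on the client side. The class and field names (ContainerPlacementRequest, deploymentId, requestId) are assumptions for illustration, not the actual Samza API.

```java
import java.util.Optional;
import java.util.UUID;

// Hypothetical request object; names are assumptions, not the actual Samza API.
public class RestartContainerExample {

  static class ContainerPlacementRequest {
    final String deploymentId;               // ties the action to the current deployment incarnation
    final String processorId;                // Samza resource id, e.g. "0", "1", "2"
    final Optional<String> destinationHost;  // empty => restart on the same / any eligible host
    final String requestId = UUID.randomUUID().toString(); // returned to the caller for status polling

    ContainerPlacementRequest(String deploymentId, String processorId, Optional<String> destinationHost) {
      this.deploymentId = deploymentId;
      this.processorId = processorId;
      this.destinationHost = destinationHost;
    }
  }

  public static void main(String[] args) {
    // Ask to restart processor "2"; since the API is async, the caller only gets a request-id back
    // and later queries containerStatus (or an equivalent endpoint) with it.
    ContainerPlacementRequest request =
        new ContainerPlacementRequest("deployment-attempt-001", "2", Optional.empty());
    System.out.println("Submitted restart request, request-id=" + request.requestId);
  }
}
```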

Note: To support canary, the above parameter list can easily be extended with the
following parameters:

_Parameters_:

  * user-version: user application version [optional]
  * samza-version: Samza framework version [optional]
  * jvm-args: arbitrary string to be used as JVM arguments [optional]

####  **2\. Container Status**

_API_: containerStatus

_Description_: Gives the status and info of the container, e.g. whether it is running or stopped and what control commands have been issued on it.

_Parameters_:

  * processor-id: Samza resource id of the container, e.g. 0, 1, 2

_Status codes_:

  * 202: Accepted
  * 401: Unauthorized

####  **3\. Enable & Disable StandBy [Stretch]**

_API_: controlStandBy

_Description_: Starts or stops a StandBy container for the given active container.

_Parameters_:

  * processor-id: Samza resource id of the container, e.g. 0, 1, 2

_Status codes_:

  * 202: Accepted
  * 401: Unauthorized

###  Architecture

For implementing a scalable container placement control system, the proposed
solution is divided into two parts:

####  Part 1. Container Placement Handler

  1. A control plane, as described above, sitting above the job, that allows control actions to be taken by multiple controllers like the Samza Dashboard or the Startpoints controller.
  2. ContainerPlacementHandler, a handler registered with the control plane that translates control actions into Container Placement Service API calls.

![](/confluence/download/attachments/135860312/Screen%20Shot%202019-11-10%20at%2011.11.10%20PM.png?version=1&modificationDate=1573456303057&api=v2)

This control plane can be implemented in the following ways.

**Option 1: Samza metastore service [Preferred]**

The Samza metastore will provide an API to write to the coordinator stream. One
simple way to expose the Container Placement API is for the Container Placement
Handler to have a coordinator stream consumer polling control messages from the
coordinator stream and acting on them. CPH will take actions, maintaining some
in-memory state with the Container Placement Service so as not to take an action
twice. In addition, since control actions are associated with a deployment id,
they are automatically invalidated across restarts.

**Pros**

  * No need to build authentication & authorization; already handled by the metadata auth service
  * No need to enable rate limiting, since requests are queued and the flow of requests can be regulated on the consumer side

**Cons**

  * If the AM dies there can still be queued requests in the coordinator stream; such requests have to be handled across AM restarts
  * The coordinator stream is log compacted, so control messages written to it need to be deleted to prevent it from growing to large sizes, which can affect job start times
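As a rough illustration of Option 1, here is a hedged sketch of the polling loop CPH might run over the coordinator stream / metastore. ControlMessageStore, ControlMessage, and ContainerPlacementService are hypothetical names standing in for whichever interfaces the metastore API ends up exposing.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: these interfaces are assumptions, not existing Samza classes.
public class ContainerPlacementHandlerLoop {

  static class ControlMessage {
    final String requestId;
    final String deploymentId;
    final String processorId;
    final String destinationHost; // null => ANY_HOST

    ControlMessage(String requestId, String deploymentId, String processorId, String destinationHost) {
      this.requestId = requestId;
      this.deploymentId = deploymentId;
      this.processorId = processorId;
      this.destinationHost = destinationHost;
    }
  }

  interface ControlMessageStore {          // assumed wrapper over the coordinator stream / metastore
    List<ControlMessage> poll();           // newly written control messages
    void delete(String requestId);         // messages must be deleted so the log-compacted stream stays small
  }

  interface ContainerPlacementService {    // API layer around the cluster manager (Part 2 below)
    void restartContainer(String processorId, String destinationHost);
  }

  private final Set<String> inFlight = new HashSet<>(); // in-memory state so an action is not taken twice

  void pollOnce(ControlMessageStore store, ContainerPlacementService service, String currentDeploymentId) {
    for (ControlMessage msg : store.poll()) {
      // Requests from a previous deployment incarnation, or duplicates for a container that already
      // has an action underway, are discarded (per the SLA section above).
      if (!msg.deploymentId.equals(currentDeploymentId) || inFlight.contains(msg.processorId)) {
        store.delete(msg.requestId);
        continue;
      }
      inFlight.add(msg.processorId);
      service.restartContainer(msg.processorId, msg.destinationHost);
    }
  }
}
```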

  
  
**Option 2: Stub a REST endpoint in JC [Rejected Alternative]**

It's fairly simple to embed a REST endpoint in the JC, or extend the [existing REST
endpoint](https://jarvis.corp.linkedin.com/codesearch/result/?path=samza-li%2Fsamza%2Fsamza-yarn%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fsamza%2Fwebapp&reponame=samza%2Fsamza-li&name=ApplicationMasterRestClient.java)
in the JC (which currently exposes configs) to support the APIs listed above.

**Pros**

  * Simple to extend the existing REST endpoint
  * No need to build authentication, since the AM runs on hosts which are blacklisted for anyone except the Samza team
  * If the AM dies, all outstanding requests are discarded (no additional handling needed)

**Cons**

  * Need to build an authorization layer around these REST endpoints
  * Loading the already heavily loaded Job Coordinator with another service might increase memory usage
  * Need to build a service for discovery, or rely on the Yarn embedded servlet
  

**Open Source Access:**

To expose this API to open source users, a metadata store writer tool can be
used. The same tool is going to give open source users access to the Startpoints
APIs.

####  Part 2. Container Placement Service

The Container Placement Service is a set of APIs built around the AM to move/restart
containers. The solution proposes to introduce a ContainerPlacementManager,
which is a single entity managing container placements for stateful and
stateless jobs and a metadata holder for container actions
(ControlActionMetadata), e.g. request_id, current status, requested resources,
etc.

![](/confluence/download/attachments/135860312/Screen%20Shot%202019-11-10%20at%2011.12.12%20PM.png?version=1&modificationDate=1573456365904&api=v2)
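A minimal sketch of the kind of per-action metadata ControlActionMetadata could hold; the field and method names here are assumptions based on the description above, not a finalized interface.

```java
import java.time.Instant;

// Illustrative sketch of ControlActionMetadata; field names are assumptions, not the final design.
public class ControlActionMetadata {

  public enum Status { ACCEPTED, IN_PROGRESS, SUCCEEDED, FAILED }

  private final String requestId;
  private final String processorId;        // Samza resource id, e.g. "0", "1", "2"
  private final String destinationHost;    // null => ANY_HOST
  private final Instant receivedAt = Instant.now();
  private volatile Status status = Status.ACCEPTED;

  public ControlActionMetadata(String requestId, String processorId, String destinationHost) {
    this.requestId = requestId;
    this.processorId = processorId;
    this.destinationHost = destinationHost;
  }

  // Status transitions would be driven by cluster-manager callbacks (stop, start, request expiry).
  public void markInProgress() { status = Status.IN_PROGRESS; }
  public void markSucceeded()  { status = Status.SUCCEEDED; }
  public void markFailed()     { status = Status.FAILED; }

  public Status getStatus()       { return status; }
  public String getRequestId()    { return requestId; }
  public String getProcessorId()  { return processorId; }
  public String getDestinationHost() { return destinationHost; }
  public Instant getReceivedAt()  { return receivedAt; }
}
```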

####  **2.1 Container Move**

####  **2.1.1 Stateless Container Move (Host Affinity Off)**

**Option 1:** **Reserve-Stop-Move:** Request the resource first and only make the
move once the resource is allocated by the ClusterResourceManager **(Preferred)**

**Pros**

  * The container only moves when a preferred resource is available for the container to start on
  * Offers stronger move semantics, which is more effective when used with a workload balancer

**Cons**

  * More complex to implement compared to Option 2

**Option 2: Stop-Reserve-Move:** Stop the container, then make a resource
request for preferred resources and attempt a start **(Rejected Alternative)**

**Pros**

  * Easier to implement

**Cons**

  * Weaker move semantics: if preferred resources are not available and the container is already stopped, there is a higher chance of moving the container to something other than the preferred host, which is ineffective for a workload balancer

**Orchestration of Stateless Container Move using Option 1:** The image below
shows the sequence of steps involved in moving a container. Note that there is a
configured number of retries for requesting preferred hosts from the
ClusterResourceManager.

  1. ContainerPlacementHandler dispatches the move request to ContainerProcessManager
  2. ContainerProcessManager registers the move with ContainerPlacementManager and issues a resource request with the ContainerAllocator (async)
  3. Now there are two possible scenarios:

3.1 The requested resources are allocated by the ClusterResourceManager

  1. In this case ContainerPlacementManager initiates a container shutdown and waits for the callback confirming that the container has stopped
  2. ContainerProcessManager, on receiving the callback, checks whether the stop was due to a move request and propagates it to ContainerPlacementManager, which issues a start-container request to ClusterResourceManager to start the container on the allocated resources
  3. If the container start request succeeds, a success notification is sent (updating the Control Action Metadata); otherwise a new request to start this container on ANY_HOST is issued

3.2 The resource request expires

  1. In cases where the resource request expires and the ClusterResourceManager is not able to allocate the requested resources, the request is retried a fixed number of times. If the request is still not fulfilled after the retries, the move is marked as failed and the container stays on the host it is running on.

**Failure Scenarios:**

  * If the ContainerPlacementManager is not able to stop the active container (3.1 step 1 above fails), we mark the control action request as failed
  * If the ClusterResourceManager fails to start the stopped active container on the accrued destination host, we attempt to start the container back on the source host; failing that, we attempt to start the container on ANY_HOST

![](/confluence/download/attachments/135860312/Screen%20Shot%202019-11-10%20at%2011.16.58%20PM.png?version=1&modificationDate=1573456639361&api=v2)
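For clarity, here is a simplified, synchronous rendering of the Reserve-Stop-Move flow and its fallbacks described above. The ClusterManager interface and its method names are assumptions used only to make the sequence readable, not the actual ContainerProcessManager / ClusterResourceManager API.

```java
// Simplified, illustrative flow of Option 1 (Reserve-Stop-Move); names are assumptions.
public class ReserveStopMoveFlow {

  interface ClusterManager {
    boolean requestResource(String host, int retries);     // blocks until allocated or expired (simplified)
    boolean stopContainer(String processorId);              // true once the stop callback arrives
    boolean startContainer(String processorId, String host);
  }

  static final String ANY_HOST = "ANY_HOST";

  /** Returns true if the move succeeded, false if the container stayed where it was. */
  boolean move(ClusterManager cm, String processorId, String sourceHost, String destinationHost, int retries) {
    // Steps 1-2: register the move and request the destination resource first (reserve before stopping).
    if (!cm.requestResource(destinationHost, retries)) {
      return false;            // 3.2: request expired, container keeps running on sourceHost
    }
    // 3.1 step 1: resource accrued, now stop the active container and wait for the stop callback.
    if (!cm.stopContainer(processorId)) {
      return false;            // stop failed: mark the control action as failed
    }
    // 3.1 steps 2-3: start on the accrued destination; on failure fall back to the source host, then ANY_HOST.
    return cm.startContainer(processorId, destinationHost)
        || cm.startContainer(processorId, sourceHost)
        || cm.startContainer(processorId, ANY_HOST);
  }
}
```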

####  **2.1.2 Stateful Container Move (Host Affinity On)**

A stateful container move is divided into two cases, based on whether the job
has a StandBy Container for it or not.

![](/confluence/download/attachments/135860312/Screen%20Shot%202019-11-10%20at%2011.19.52%20PM.png?version=1&modificationDate=1573456806049&api=v2)

**Case 1. Stateful Job has Standby Container Enabled**

Let's take the case where the job has a stateful container C on host H1 and a
StandBy container C' enabled on host H2.

**Option 1: Stop-Reserve-Move: Initiate a Stop on C & issue a Standby failover
(Preferred - Phase 1*)**

In this option, C is stopped on H1 and a failover is initiated from H1 to H2
with the current semantics of [failover of an active
container](https://cwiki.apache.org/confluence/display/SAMZA/SEP-19%3A+Hot+standby+state+for+Samza+applications)
in the Hot Standby feature. This ensures C moves either to H2 (in the best
case) or to some other host (when the Cluster Manager cannot allocate capacity on
H2, i.e. H2 has no spare capacity).

**Pros**

  * Easy to implement; the ability to do this already exists

**Cons**

  * C always moves off of H1, and there is a slim chance that C won't land on H2 as per the current semantics (stolen-host scenario: the Cluster Manager fails to allocate H2 for C)

  
  
**Option 2: Reserve-Stop-Move: Initiate a StandBy move first, then issue a move
for the active container (Preferred - Phase 2*)**

  1. In this option a request is first issued to stop C'
  2. Then H2 is requested from the cluster manager to start C
  3. A stop is issued on C only if H2 can be allocated (i.e. step 2 succeeds)
  4. Then C' is requested to start on ANY_HOST except H1 and H2

**Two ways to achieve this:**

  * A client can make two calls: first move C' to ANY_HOST apart from H1 and H2 (similar to a stateless move), then move C to H2 (similar to a stateless move), so the state maintenance for this lives on the client (see the sketch below)
  * Maintain state in the JC to accomplish this
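A hedged sketch of the first, client-driven approach: the client sequences two stateless-style moves itself and keeps all the state. PlacementClient is a hypothetical client interface, not an existing Samza class.

```java
// Illustrative only; PlacementClient is a hypothetical interface, not part of Samza.
public class StandbyAwareMoveClient {

  interface PlacementClient {
    /** Waits for an (async) move to finish; returns true if the move succeeded. */
    boolean moveAndWait(String processorId, String destinationHost);
  }

  static final String ANY_HOST = "ANY_HOST";

  /**
   * Move active container C (on H1) onto the host of its standby C' (H2).
   * All sequencing state lives on the client, as described in the first bullet above.
   */
  boolean moveActiveOntoStandby(PlacementClient client, String activeId, String standbyId, String h2) {
    // 1. First move the standby C' off of H2 to ANY_HOST (excluding H1/H2 per the standby constraints),
    //    which frees capacity on H2 for the active container.
    if (!client.moveAndWait(standbyId, ANY_HOST)) {
      return false;            // standby could not be moved; the active container is untouched
    }
    // 2. Then move the active C onto H2, exactly like a stateless move with a preferred host.
    return client.moveAndWait(activeId, h2);
  }
}
```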

  

**Pros**

  * Stronger move semantics, since the active container won't be stopped if it cannot be moved to the standby container

**Cons**

  * More complex compared to Option 1; needs to maintain more state


_* Phase 1 & Phase 2 refer to the implementation phases; please see the section
[Implementation Plan](https://docs.google.com/document/d/1uSQUUmEJUM8aigTKshsvvsR3utVEmP9xzfE4au6IKhc/edit#heading=h.ecuqmj3agqvi)_

**Option 3: Reserve-Stop-Move: First request H2 from the Cluster Manager and
initiate the failover once H2 is allocated as the preferred resource (Rejected
Alternative)**

  * In this option C' is stopped on H2 and a failover is initiated from H1 to H2 with the current semantics developed for the Hot Standby feature
  * If the move fails, only the standby is affected

**Pros**

  * Stronger move semantics are guaranteed; the active container won't be stopped unless it can be started on H2

**Cons**

  * Since a standby is already running on H2, there is a higher chance that the cluster manager might fail to allocate H2 and move requests will be ineffective (the host is running at full capacity)
  * Need to design a new failover scheme, hence more development time to implement

**Orchestration of StandBy Enabled Container Move using Option 1 (Phase 1)**

  1. ContainerPlacementHandler dispatches the move request to ContainerProcessManager
  2. ContainerProcessManager registers the move and initiates a failover with StandByContainerManager; then the following [failover sequence](https://cwiki.apache.org/confluence/display/SAMZA/SEP-19%3A+Hot+standby+state+for+Samza+applications) is followed


**Case 2. Stateful Job has Standby Container Disabled [Stretch]**

**Option 1: Stateful container move is equivalent to a stateless container move
(Phase 1)**

In this option, a stateful container moves like a stateless container and
bootstraps its state on the new host.

**Option 2:** **Optionally Spin Up StandBy container & then move (Phase 2)**

If we expose an API to spin up a StandBy Container for an active container, a
client can then make two requests: one to spin up a StandBy, then periodically
monitor the lag of the StandBy Container (using diagnostics). Once the StandBy
Container has a steady lag and is ready to move, this becomes Case 1
mentioned above. (More details TBD later.)

####  **2.2 Stateful & Stateless Container Restart**

This is much simpler compared to a move. We can either make a restart
equivalent to a move, or issue a stop on a container and have
ContainerProcessManager try to start that container again. Let's discuss these
options in detail:

**Option 1: Reserve-Stop-Start: Restart is equivalent to a move to the same host
[Preferred]**

  * In this option, resources are requested on the same host first, before issuing a stop on the container
  * Once the resources are accrued, the container is stopped & requested to start
  * So this is equivalent to the move semantics above

**Pros**

  * Strong restart semantics, since the container is not stopped if it cannot be restarted on the same resource

**Cons**

  * For a container to restart, it will be holding 2x resources on the same host from the time the resources are accrued to the time the active container is issued a stop

Both stateless & stateful container restarts will be equivalent to a stateless
container move on the same host; since the container is restarted on the same
host, no state needs to be moved or re-bootstrapped.

**Option 2: Stop-Reserve-Start: Restart is equivalent to stopping the container
first, then attempting to start it [Rejected Alternative]**

  * In this option, the container is issued a stop first
  * Once the container stops, resources are requested for starting it on the same host (last seen host)
  * Once the resources are accrued, the container is issued a start on the same host

**Pros**

  * At any point in time the container is only holding the resources it needs to start on a host

**Cons**

  * Weaker restart semantics, since there is a likely chance of restarting this container on a host other than the source host (because the container is already stopped and the Cluster Manager may not return resources on the source host)
  
**Case 2.1: Host Affinity is OFF**

In this case, when a user registers a restart request, ContainerPlacementManager
issues a stop on the current container. On receiving a callback from
ClusterResourceManager for the container stop, ContainerProcessManager requests
resources on the last seen host. But the allocator thread assigns this request
to any available ANY_HOST resource: since containers in Yarn are homogeneous and
host affinity is off, the container can be started on any available resource.

**Case 2.2: Host Affinity is ON**

In this case, a stop is issued on the container, and ContainerProcessManager
makes a request for resources on the last seen host to ClusterResourceManager.
The allocator thread, once it has resources on the preferred host (here the last
seen host), issues a request to run the container on it. If this request expires
then there are multiple possibilities, as sketched below:

  * If StandBy Containers are enabled, directly initiate a failover
  * Else if there are resources available on ANY_HOST, issue a container start request on that resource
  * Else issue an ANY_HOST request to ClusterResourceManager
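The fallback order above could look roughly like the following; the Cluster interface and method names are assumptions, not the actual allocator implementation.

```java
// Illustrative decision logic for Case 2.2 when the preferred-host (last seen host) request expires.
public class RestartExpiryPolicy {

  interface Cluster {
    boolean standbyEnabled(String processorId);
    boolean hasIdleResourceOnAnyHost();
    void initiateStandbyFailover(String processorId);
    void startOnAvailableResource(String processorId);
    void requestAnyHostResource(String processorId);
  }

  void onPreferredRequestExpired(Cluster cluster, String processorId) {
    if (cluster.standbyEnabled(processorId)) {
      cluster.initiateStandbyFailover(processorId);   // standby exists: fail over directly
    } else if (cluster.hasIdleResourceOnAnyHost()) {
      cluster.startOnAvailableResource(processorId);  // reuse an already-allocated resource
    } else {
      cluster.requestAnyHostResource(processorId);    // otherwise ask the cluster manager for ANY_HOST
    }
  }
}
```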

#  **Public Interfaces**


#  **Implementation and Test Plan**

##  Implementation Plan

**Phase 1** (Timeline: Q4 2019)

  * Design discussion & SEP
  * Stateless container
    * Move: Option 1
    * Restart ability
  * Stateful container with StandBy
    * Move: Option 1
    * Restart
  * Stateful container without StandBy
    * Move like a stateless container move
    * Restart ability
  * Expose this ability to be used by Samza oncalls / SREs

**Phase 2** (Timeline: TBD)

  * Stateful container without StandBy
    * API to enable StandBy
  * StandBy container move
  * Stateful container with StandBy
    * Move: Option 2
  * Integration with the control plane
  
##  Test Specifications

  * Complete testing plan: [https://docs.google.com/spreadsheets/d/1v-fw0pHxKRobGkALDCno4FuPCsBhdepQ86vIGLHWu54/edit?usp=drive_web&ouid=115242259601904072922](https://docs.google.com/spreadsheets/d/1v-fw0pHxKRobGkALDCno4FuPCsBhdepQ86vIGLHWu54/edit?usp=drive_web&ouid=115242259601904072922)
  * Local deployment testing with Virtual Private Cluster ([LXC with Samza](https://iwww.corp.linkedin.com/wiki/cf/display/~rmatharu/Virtual+Private+Cluster+%28VPC%29+for+use+with+Samza+YARN))
  * Create a test job, using a VPC or a real cluster, that automates testing of control actions on containers

#  **Compatibility, Deprecation, and Migration Plan**

  * The new interfaces & APIs introduced should not affect any part of the user code & should be backward compatible. No migration is needed for this new feature.
  * Jobs using an older Samza version shall be able to discard (ignore) the control messages written to the metastore

#  **Rejected Alternatives**

**Note:** The rejected alternatives are described above in the implementation,
alongside the preferred options, for the sake of reading consistency.

##  **References**

  1. [https://docs.google.com/document/d/18EdLZgrW-ZMIypqI2YX0YsAnTa0Zf_0Rt4tQxkLvZ10/edit](https://docs.google.com/document/d/18EdLZgrW-ZMIypqI2YX0YsAnTa0Zf_0Rt4tQxkLvZ10/edit)
  2. [https://www.cloudera.com/documentation/enterprise/5-10-x/topics/cm_mc_rolling_restart.html](https://www.cloudera.com/documentation/enterprise/5-10-x/topics/cm_mc_rolling_restart.html)
  3. [https://docs.google.com/document/d/16cQHipnAZHCerEgRW6icw9YbWA3TUcZ2bm-g9D79y6c/edit](https://docs.google.com/document/d/16cQHipnAZHCerEgRW6icw9YbWA3TUcZ2bm-g9D79y6c/edit)

  
  
  

  
  
  