You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@samza.apache.org by Abhishek Shivanna <ab...@gmail.com> on 2017/05/03 16:52:16 UTC

Re: [DISCUSS] SEP-3: Heart-beat mechanism between JobCoordinator and all running containers

Hey Jagadish,

Thank you for taking the time to review the design.
I agree with moving the heartbeat into the the LocalContainerRunner instead
of fitting it into the SamzaContainer. I will update the SEP with the new
design changes.
Also agree with the changes to the configuration and choosing suitable
defaults should be good enough.

Thanks,
Abhishek



On Wed, Apr 26, 2017 at 3:23 PM, Jagadish Venkatraman <
jagadish1989@gmail.com> wrote:

> Hi Abhishek,
>
> Heartbeat between the AM and container has been a long awaited Samza
> feature. It will go a long way in ensuring our reliability! +1 for this
> SEP.
>
> *High level comments:*
>
> Currently, the only use-case for the heartbeat mechanism seems to be when
> running Samza on Yarn. IMHO, it makes sense to pull the heart beat logic
> into the *LocalContainerRunner* instead of baking it into the
> *SamzaContainer* class. Long term, we can re-visit this when we have a
> pluggable liveness detection mechanism.
>
> I'm thinking of a flow like this:
>
> There is a separate component (or a thread) inside LocalContainerRunner
> that periodically polls the coordinator, and determines if it should
> continue running. If the coordinator determines that the container should
> not run, the *LocalContainerRunner* cleanly shuts-down the container and
> the process exits with a non-zero exit status.
>
> The following nice properties fall out:
>
>    - We can remove the proposed config *job.container.validator.enabled. *
>    - We can also remove the proposed *Killable* interface since
>    *SamzaContainer* (and runLoops) don't have to implement *Killable *
>    anymore. The life-cycle is managed by the *LocalContainerRunner* that
>    started it.
>
> *On the proposed public interfaces:*
>
> *job.container.validator.enabled:  *I am not in favor of adding this as a
> new public config. IIUC, When running Samza jobs on Yarn, we always want
> the validator/heartbeats to be enabled. OTOH, when running Samza jobs in
> standalone mode, we currently do not have a pluggable mechanism for
> heartbeat.
>
> *job.container.schedule.ms <http://job.container.schedule.ms>: *It does
> seem that we can pick a sensible default, and be done with it (instead of
> adding a new config)? Is there a reason this needs to be configurable?
>
> *On proposed Killable interface: *
>
> Not entirely sure we need this new "*Killable"* interface (esp. given that
> there's currently only one implementation - *SamzaContainer*).
>
>    - The *LocalContainerRunner* can instead directly invoke shut-down on
>    the *SamzaContainer* when its heart-beat expires. The extra level of
>    indirection (making *SamzaContainer* to implement *Killable*) is
>    probably unnecessary IMHO.
>
>
>    - Since, the *LocalContainerRunner* invokes *start/run* on the
>    *SamzaContainer*, it seems simpler also have it invoke *shutdown* on the
>    *SamzaContainer. *
>
> *Minor Comments:*
>
> >> Expose a REST endpoint (eg: /isContainerValid) who's purpose is to get
> requests from the Samza container periodically and respond back weather the
> container is in the Job Coordinator's current list of valid containers.
>
> Wondering if it'd be slightly cleaner to rename this to */heartBeatRequest*
> and return a *heartBeatResponse* as *CONTINUE, DIE*.  The name
> *isContainerValid
> * and the definition of validity does seem slightly broad?
>
> Thanks again for taking the time to draft the SEP, and volunteering to
> implement this. Nice work!
>
> Best,
> Jagadish
>
> On Mon, Apr 24, 2017 at 6:42 PM, Abhishek Shivanna <ab...@gmail.com>
> wrote:
>
> > Hi Everyone,
> >
> > In order to fix the issue of orphaned/leaky containers seen when the
> > YARN Node Manager crashes, I have created a SEP discussing the design for
> > implementing a heartbeat between the containers and the job coordinator:
> > https://cwiki.apache.org/confluence/display/SAMZA/SEP-
> > 3%3A+Heart-beat+mechanism+between+JobCoordinator+and+
> > all+running+containers
> >
> > Please take a look and provide feedback. I would also really appreciate
> > help in designing a way to propagate the error up from SamzaContainer in
> > order to exit the container with a non-zero exit code.
> >
> > Thanks,
> > Abhishek
> >
>
>
>
> --
> Jagadish V,
> Graduate Student,
> Department of Computer Science,
> Stanford University
>

Re: [DISCUSS] SEP-3: Heart-beat mechanism between JobCoordinator and all running containers

Posted by "Navina Ramesh (Apache)" <na...@apache.org>.

Abhishek,
Thanks for clarifying and updating the SEP.

Cheers!
Navina

On Wed, May 3, 2017 at 8:20 PM, Jagadish Venkatraman <jagadish1989@gmail.com
> wrote:

> Navina,
>
>
> >> The ContainerHeartbeatMonitor and the ContainerHeartbeatClient are both
> internal
> APIs and have a concrete implementations.
>
> More specifically, both of these are purely internal implementation classes
> (and have nothing to do with any pluggable public API that we expose)
>
> Best,
> Jagadish
>
> On Wed, May 3, 2017 at 7:34 PM, Abhishek Shivanna <ab...@gmail.com>
> wrote:
>
> > Hey Navina,
> >
> > Thank you for reviewing the SEP.
> >
> > > Are you planning on exposing this monitor class as a public api? What
> is
> > the significance of doing so?
> >
> > Sorry for the confusion of having implementation details under "public
> > interfaces".
> > The ContainerHeartbeatMonitor and the ContainerHeartbeatClient are both
> > internal APIs
> > and have a concrete implementations.
> >
> > > Is "Execution Container ID" the name of the environmental variable? I
> > don't
> > think environmental variables can contain whitespace??
> >
> > Again, confusion that stemmed from my initial draft. I have fixed the SEP
> > with the actual name in the implementation.
> >
> > > I think the first sentence corresponds to your design. The second one
> is
> > more of an implementation detail. You may want to split it up or just
> > discard one of them. I got confused reading them together because one
> talks
> > about adding to container and the other about the ContainerRunner.
> >
> > Fixed the SEP to make it more clear.
> >
> > Thanks,
> > Abhishek
> >
> >
> > On Wed, May 3, 2017 at 2:08 PM, Navina Ramesh (Apache) <
> navina@apache.org>
> > wrote:
> >
> > > Hi Abhishek,
> > > I checked your latest proposal in SEP and it looks good to me.
> > >
> > > QQ:
> > > > A new ContainerHeartbeatMonitor class that accepts a
> > > ContainerHeartbeatClient (which has the business logic to make
> heartbeat
> > > checks on the JC endpoint) and a callback.
> > >
> > > Are you planning on exposing this monitor class as a public api? What
> is
> > > the significance of doing so?
> > >
> > > > set an environment variable with the "Execution Container ID" during
> > > container launch. This can be read from the container to make requests
> to
> > > the above endpoint.
> > >
> > > Is "Execution Container ID" the name of the environmental variable? I
> > don't
> > > think environmental variables can contain whitespace??
> > >
> > > > On the container side we start a new thread that periodically polls
> > this
> > > endpoint described above to check if the container is valid. If its
> not,
> > we
> > > shutdown the run loop and raise an error (so that the exit code is non
> 0
> > so
> > > that YARN reschedules the container)
> > > The plan is to setup a monitor in the LocalContainerRunner class that
> > > schedules a thread to check the above endpoint at regular intervals. On
> > > failure the thread modifies state on the LocalContainerRunner to denote
> > > that there was an error. This state is checked during exit in the
> > > LocalContainerRunner to exit with a non-zero code.
> > >
> > > I think the first sentence corresponds to your design. The second one
> is
> > > more of an implementation detail. You may want to split it up or just
> > > discard one of them. I got confused reading them together because one
> > talks
> > > about adding to container and the other about the ContainerRunner.
> > >
> > > Design looks pretty elegant and easily portable.
> > >
> > > Thanks!
> > > Navina
> > >
> > >
> > > On Wed, May 3, 2017 at 9:52 AM, Abhishek Shivanna <ab...@gmail.com>
> > > wrote:
> > >
> > > > Hey Jagadish,
> > > >
> > > > Thank you for taking the time to review the design.
> > > > I agree with moving the heartbeat into the the LocalContainerRunner
> > > instead
> > > > of fitting it into the SamzaContainer. I will update the SEP with the
> > new
> > > > design changes.
> > > > Also agree with the changes to the configuration and choosing
> suitable
> > > > defaults should be good enough.
> > > >
> > > > Thanks,
> > > > Abhishek
> > > >
> > > >
> > > >
> > > > On Wed, Apr 26, 2017 at 3:23 PM, Jagadish Venkatraman <
> > > > jagadish1989@gmail.com> wrote:
> > > >
> > > > > Hi Abhishek,
> > > > >
> > > > > Heartbeat between the AM and container has been a long awaited
> Samza
> > > > > feature. It will go a long way in ensuring our reliability! +1 for
> > this
> > > > > SEP.
> > > > >
> > > > > *High level comments:*
> > > > >
> > > > > Currently, the only use-case for the heartbeat mechanism seems to
> be
> > > when
> > > > > running Samza on Yarn. IMHO, it makes sense to pull the heart beat
> > > logic
> > > > > into the *LocalContainerRunner* instead of baking it into the
> > > > > *SamzaContainer* class. Long term, we can re-visit this when we
> have
> > a
> > > > > pluggable liveness detection mechanism.
> > > > >
> > > > > I'm thinking of a flow like this:
> > > > >
> > > > > There is a separate component (or a thread) inside
> > LocalContainerRunner
> > > > > that periodically polls the coordinator, and determines if it
> should
> > > > > continue running. If the coordinator determines that the container
> > > should
> > > > > not run, the *LocalContainerRunner* cleanly shuts-down the
> container
> > > and
> > > > > the process exits with a non-zero exit status.
> > > > >
> > > > > The following nice properties fall out:
> > > > >
> > > > >    - We can remove the proposed config *job.container.validator.
> > > enabled.
> > > > *
> > > > >    - We can also remove the proposed *Killable* interface since
> > > > >    *SamzaContainer* (and runLoops) don't have to implement
> *Killable
> > *
> > > > >    anymore. The life-cycle is managed by the *LocalContainerRunner*
> > > that
> > > > >    started it.
> > > > >
> > > > > *On the proposed public interfaces:*
> > > > >
> > > > > *job.container.validator.enabled:  *I am not in favor of adding
> this
> > > as
> > > > a
> > > > > new public config. IIUC, When running Samza jobs on Yarn, we always
> > > want
> > > > > the validator/heartbeats to be enabled. OTOH, when running Samza
> jobs
> > > in
> > > > > standalone mode, we currently do not have a pluggable mechanism for
> > > > > heartbeat.
> > > > >
> > > > > *job.container.schedule.ms <http://job.container.schedule.ms>: *It
> > > does
> > > > > seem that we can pick a sensible default, and be done with it
> > (instead
> > > of
> > > > > adding a new config)? Is there a reason this needs to be
> > configurable?
> > > > >
> > > > > *On proposed Killable interface: *
> > > > >
> > > > > Not entirely sure we need this new "*Killable"* interface (esp.
> given
> > > > that
> > > > > there's currently only one implementation - *SamzaContainer*).
> > > > >
> > > > >    - The *LocalContainerRunner* can instead directly invoke
> shut-down
> > > on
> > > > >    the *SamzaContainer* when its heart-beat expires. The extra
> level
> > of
> > > > >    indirection (making *SamzaContainer* to implement *Killable*) is
> > > > >    probably unnecessary IMHO.
> > > > >
> > > > >
> > > > >    - Since, the *LocalContainerRunner* invokes *start/run* on the
> > > > >    *SamzaContainer*, it seems simpler also have it invoke
> *shutdown*
> > on
> > > > the
> > > > >    *SamzaContainer. *
> > > > >
> > > > > *Minor Comments:*
> > > > >
> > > > > >> Expose a REST endpoint (eg: /isContainerValid) who's purpose is
> to
> > > get
> > > > > requests from the Samza container periodically and respond back
> > weather
> > > > the
> > > > > container is in the Job Coordinator's current list of valid
> > containers.
> > > > >
> > > > > Wondering if it'd be slightly cleaner to rename this to
> > > > */heartBeatRequest*
> > > > > and return a *heartBeatResponse* as *CONTINUE, DIE*.  The name
> > > > > *isContainerValid
> > > > > * and the definition of validity does seem slightly broad?
> > > > >
> > > > > Thanks again for taking the time to draft the SEP, and volunteering
> > to
> > > > > implement this. Nice work!
> > > > >
> > > > > Best,
> > > > > Jagadish
> > > > >
> > > > > On Mon, Apr 24, 2017 at 6:42 PM, Abhishek Shivanna <
> > abkshvn@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Everyone,
> > > > > >
> > > > > > In order to fix the issue of orphaned/leaky containers seen when
> > the
> > > > > > YARN Node Manager crashes, I have created a SEP discussing the
> > design
> > > > for
> > > > > > implementing a heartbeat between the containers and the job
> > > > coordinator:
> > > > > > https://cwiki.apache.org/confluence/display/SAMZA/SEP-
> > > > > > 3%3A+Heart-beat+mechanism+between+JobCoordinator+and+
> > > > > > all+running+containers
> > > > > >
> > > > > > Please take a look and provide feedback. I would also really
> > > appreciate
> > > > > > help in designing a way to propagate the error up from
> > SamzaContainer
> > > > in
> > > > > > order to exit the container with a non-zero exit code.
> > > > > >
> > > > > > Thanks,
> > > > > > Abhishek
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Jagadish V,
> > > > > Graduate Student,
> > > > > Department of Computer Science,
> > > > > Stanford University
> > > > >
> > > >
> > >
> >
>
>
>
> --
> Jagadish V,
> Graduate Student,
> Department of Computer Science,
> Stanford University
>

Re: [DISCUSS] SEP-3: Heart-beat mechanism between JobCoordinator and all running containers

Posted by Navina Ramesh <nr...@linkedin.com.INVALID>.

Abhishek,
Thanks for clarifying and updating the SEP.

Cheers!
Navina

On Wed, May 3, 2017 at 8:20 PM, Jagadish Venkatraman <jagadish1989@gmail.com
> wrote:

> Navina,
>
>
> >> The ContainerHeartbeatMonitor and the ContainerHeartbeatClient are both
> internal
> APIs and have a concrete implementations.
>
> More specifically, both of these are purely internal implementation classes
> (and have nothing to do with any pluggable public API that we expose)
>
> Best,
> Jagadish
>
> On Wed, May 3, 2017 at 7:34 PM, Abhishek Shivanna <ab...@gmail.com>
> wrote:
>
> > Hey Navina,
> >
> > Thank you for reviewing the SEP.
> >
> > > Are you planning on exposing this monitor class as a public api? What
> is
> > the significance of doing so?
> >
> > Sorry for the confusion of having implementation details under "public
> > interfaces".
> > The ContainerHeartbeatMonitor and the ContainerHeartbeatClient are both
> > internal APIs
> > and have a concrete implementations.
> >
> > > Is "Execution Container ID" the name of the environmental variable? I
> > don't
> > think environmental variables can contain whitespace??
> >
> > Again, confusion that stemmed from my initial draft. I have fixed the SEP
> > with the actual name in the implementation.
> >
> > > I think the first sentence corresponds to your design. The second one
> is
> > more of an implementation detail. You may want to split it up or just
> > discard one of them. I got confused reading them together because one
> talks
> > about adding to container and the other about the ContainerRunner.
> >
> > Fixed the SEP to make it more clear.
> >
> > Thanks,
> > Abhishek
> >
> >
> > On Wed, May 3, 2017 at 2:08 PM, Navina Ramesh (Apache) <
> navina@apache.org>
> > wrote:
> >
> > > Hi Abhishek,
> > > I checked your latest proposal in SEP and it looks good to me.
> > >
> > > QQ:
> > > > A new ContainerHeartbeatMonitor class that accepts a
> > > ContainerHeartbeatClient (which has the business logic to make
> heartbeat
> > > checks on the JC endpoint) and a callback.
> > >
> > > Are you planning on exposing this monitor class as a public api? What
> is
> > > the significance of doing so?
> > >
> > > > set an environment variable with the "Execution Container ID" during
> > > container launch. This can be read from the container to make requests
> to
> > > the above endpoint.
> > >
> > > Is "Execution Container ID" the name of the environmental variable? I
> > don't
> > > think environmental variables can contain whitespace??
> > >
> > > > On the container side we start a new thread that periodically polls
> > this
> > > endpoint described above to check if the container is valid. If its
> not,
> > we
> > > shutdown the run loop and raise an error (so that the exit code is non
> 0
> > so
> > > that YARN reschedules the container)
> > > The plan is to setup a monitor in the LocalContainerRunner class that
> > > schedules a thread to check the above endpoint at regular intervals. On
> > > failure the thread modifies state on the LocalContainerRunner to denote
> > > that there was an error. This state is checked during exit in the
> > > LocalContainerRunner to exit with a non-zero code.
> > >
> > > I think the first sentence corresponds to your design. The second one
> is
> > > more of an implementation detail. You may want to split it up or just
> > > discard one of them. I got confused reading them together because one
> > talks
> > > about adding to container and the other about the ContainerRunner.
> > >
> > > Design looks pretty elegant and easily portable.
> > >
> > > Thanks!
> > > Navina
> > >
> > >
> > > On Wed, May 3, 2017 at 9:52 AM, Abhishek Shivanna <ab...@gmail.com>
> > > wrote:
> > >
> > > > Hey Jagadish,
> > > >
> > > > Thank you for taking the time to review the design.
> > > > I agree with moving the heartbeat into the the LocalContainerRunner
> > > instead
> > > > of fitting it into the SamzaContainer. I will update the SEP with the
> > new
> > > > design changes.
> > > > Also agree with the changes to the configuration and choosing
> suitable
> > > > defaults should be good enough.
> > > >
> > > > Thanks,
> > > > Abhishek
> > > >
> > > >
> > > >
> > > > On Wed, Apr 26, 2017 at 3:23 PM, Jagadish Venkatraman <
> > > > jagadish1989@gmail.com> wrote:
> > > >
> > > > > Hi Abhishek,
> > > > >
> > > > > Heartbeat between the AM and container has been a long awaited
> Samza
> > > > > feature. It will go a long way in ensuring our reliability! +1 for
> > this
> > > > > SEP.
> > > > >
> > > > > *High level comments:*
> > > > >
> > > > > Currently, the only use-case for the heartbeat mechanism seems to
> be
> > > when
> > > > > running Samza on Yarn. IMHO, it makes sense to pull the heart beat
> > > logic
> > > > > into the *LocalContainerRunner* instead of baking it into the
> > > > > *SamzaContainer* class. Long term, we can re-visit this when we
> have
> > a
> > > > > pluggable liveness detection mechanism.
> > > > >
> > > > > I'm thinking of a flow like this:
> > > > >
> > > > > There is a separate component (or a thread) inside
> > LocalContainerRunner
> > > > > that periodically polls the coordinator, and determines if it
> should
> > > > > continue running. If the coordinator determines that the container
> > > should
> > > > > not run, the *LocalContainerRunner* cleanly shuts-down the
> container
> > > and
> > > > > the process exits with a non-zero exit status.
> > > > >
> > > > > The following nice properties fall out:
> > > > >
> > > > >    - We can remove the proposed config *job.container.validator.
> > > enabled.
> > > > *
> > > > >    - We can also remove the proposed *Killable* interface since
> > > > >    *SamzaContainer* (and runLoops) don't have to implement
> *Killable
> > *
> > > > >    anymore. The life-cycle is managed by the *LocalContainerRunner*
> > > that
> > > > >    started it.
> > > > >
> > > > > *On the proposed public interfaces:*
> > > > >
> > > > > *job.container.validator.enabled:  *I am not in favor of adding
> this
> > > as
> > > > a
> > > > > new public config. IIUC, When running Samza jobs on Yarn, we always
> > > want
> > > > > the validator/heartbeats to be enabled. OTOH, when running Samza
> jobs
> > > in
> > > > > standalone mode, we currently do not have a pluggable mechanism for
> > > > > heartbeat.
> > > > >
> > > > > *job.container.schedule.ms <http://job.container.schedule.ms>: *It
> > > does
> > > > > seem that we can pick a sensible default, and be done with it
> > (instead
> > > of
> > > > > adding a new config)? Is there a reason this needs to be
> > configurable?
> > > > >
> > > > > *On proposed Killable interface: *
> > > > >
> > > > > Not entirely sure we need this new "*Killable"* interface (esp.
> given
> > > > that
> > > > > there's currently only one implementation - *SamzaContainer*).
> > > > >
> > > > >    - The *LocalContainerRunner* can instead directly invoke
> shut-down
> > > on
> > > > >    the *SamzaContainer* when its heart-beat expires. The extra
> level
> > of
> > > > >    indirection (making *SamzaContainer* to implement *Killable*) is
> > > > >    probably unnecessary IMHO.
> > > > >
> > > > >
> > > > >    - Since, the *LocalContainerRunner* invokes *start/run* on the
> > > > >    *SamzaContainer*, it seems simpler also have it invoke
> *shutdown*
> > on
> > > > the
> > > > >    *SamzaContainer. *
> > > > >
> > > > > *Minor Comments:*
> > > > >
> > > > > >> Expose a REST endpoint (eg: /isContainerValid) who's purpose is
> to
> > > get
> > > > > requests from the Samza container periodically and respond back
> > weather
> > > > the
> > > > > container is in the Job Coordinator's current list of valid
> > containers.
> > > > >
> > > > > Wondering if it'd be slightly cleaner to rename this to
> > > > */heartBeatRequest*
> > > > > and return a *heartBeatResponse* as *CONTINUE, DIE*.  The name
> > > > > *isContainerValid
> > > > > * and the definition of validity does seem slightly broad?
> > > > >
> > > > > Thanks again for taking the time to draft the SEP, and volunteering
> > to
> > > > > implement this. Nice work!
> > > > >
> > > > > Best,
> > > > > Jagadish
> > > > >
> > > > > On Mon, Apr 24, 2017 at 6:42 PM, Abhishek Shivanna <
> > abkshvn@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Everyone,
> > > > > >
> > > > > > In order to fix the issue of orphaned/leaky containers seen when
> > the
> > > > > > YARN Node Manager crashes, I have created a SEP discussing the
> > design
> > > > for
> > > > > > implementing a heartbeat between the containers and the job
> > > > coordinator:
> > > > > > https://cwiki.apache.org/confluence/display/SAMZA/SEP-
> > > > > > 3%3A+Heart-beat+mechanism+between+JobCoordinator+and+
> > > > > > all+running+containers
> > > > > >
> > > > > > Please take a look and provide feedback. I would also really
> > > appreciate
> > > > > > help in designing a way to propagate the error up from
> > SamzaContainer
> > > > in
> > > > > > order to exit the container with a non-zero exit code.
> > > > > >
> > > > > > Thanks,
> > > > > > Abhishek
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Jagadish V,
> > > > > Graduate Student,
> > > > > Department of Computer Science,
> > > > > Stanford University
> > > > >
> > > >
> > >
> >
>
>
>
> --
> Jagadish V,
> Graduate Student,
> Department of Computer Science,
> Stanford University
>



-- 
Navina R.

Re: [DISCUSS] SEP-3: Heart-beat mechanism between JobCoordinator and all running containers

Posted by Jagadish Venkatraman <ja...@gmail.com>.

Navina,


>> The ContainerHeartbeatMonitor and the ContainerHeartbeatClient are both internal
APIs and have a concrete implementations.

More specifically, both of these are purely internal implementation classes
(and have nothing to do with any pluggable public API that we expose)

Best,
Jagadish

On Wed, May 3, 2017 at 7:34 PM, Abhishek Shivanna <ab...@gmail.com> wrote:

> Hey Navina,
>
> Thank you for reviewing the SEP.
>
> > Are you planning on exposing this monitor class as a public api? What is
> the significance of doing so?
>
> Sorry for the confusion of having implementation details under "public
> interfaces".
> The ContainerHeartbeatMonitor and the ContainerHeartbeatClient are both
> internal APIs
> and have a concrete implementations.
>
> > Is "Execution Container ID" the name of the environmental variable? I
> don't
> think environmental variables can contain whitespace??
>
> Again, confusion that stemmed from my initial draft. I have fixed the SEP
> with the actual name in the implementation.
>
> > I think the first sentence corresponds to your design. The second one is
> more of an implementation detail. You may want to split it up or just
> discard one of them. I got confused reading them together because one talks
> about adding to container and the other about the ContainerRunner.
>
> Fixed the SEP to make it more clear.
>
> Thanks,
> Abhishek
>
>
> On Wed, May 3, 2017 at 2:08 PM, Navina Ramesh (Apache) <na...@apache.org>
> wrote:
>
> > Hi Abhishek,
> > I checked your latest proposal in SEP and it looks good to me.
> >
> > QQ:
> > > A new ContainerHeartbeatMonitor class that accepts a
> > ContainerHeartbeatClient (which has the business logic to make heartbeat
> > checks on the JC endpoint) and a callback.
> >
> > Are you planning on exposing this monitor class as a public api? What is
> > the significance of doing so?
> >
> > > set an environment variable with the "Execution Container ID" during
> > container launch. This can be read from the container to make requests to
> > the above endpoint.
> >
> > Is "Execution Container ID" the name of the environmental variable? I
> don't
> > think environmental variables can contain whitespace??
> >
> > > On the container side we start a new thread that periodically polls
> this
> > endpoint described above to check if the container is valid. If its not,
> we
> > shutdown the run loop and raise an error (so that the exit code is non 0
> so
> > that YARN reschedules the container)
> > The plan is to setup a monitor in the LocalContainerRunner class that
> > schedules a thread to check the above endpoint at regular intervals. On
> > failure the thread modifies state on the LocalContainerRunner to denote
> > that there was an error. This state is checked during exit in the
> > LocalContainerRunner to exit with a non-zero code.
> >
> > I think the first sentence corresponds to your design. The second one is
> > more of an implementation detail. You may want to split it up or just
> > discard one of them. I got confused reading them together because one
> talks
> > about adding to container and the other about the ContainerRunner.
> >
> > Design looks pretty elegant and easily portable.
> >
> > Thanks!
> > Navina
> >
> >
> > On Wed, May 3, 2017 at 9:52 AM, Abhishek Shivanna <ab...@gmail.com>
> > wrote:
> >
> > > Hey Jagadish,
> > >
> > > Thank you for taking the time to review the design.
> > > I agree with moving the heartbeat into the the LocalContainerRunner
> > instead
> > > of fitting it into the SamzaContainer. I will update the SEP with the
> new
> > > design changes.
> > > Also agree with the changes to the configuration and choosing suitable
> > > defaults should be good enough.
> > >
> > > Thanks,
> > > Abhishek
> > >
> > >
> > >
> > > On Wed, Apr 26, 2017 at 3:23 PM, Jagadish Venkatraman <
> > > jagadish1989@gmail.com> wrote:
> > >
> > > > Hi Abhishek,
> > > >
> > > > Heartbeat between the AM and container has been a long awaited Samza
> > > > feature. It will go a long way in ensuring our reliability! +1 for
> this
> > > > SEP.
> > > >
> > > > *High level comments:*
> > > >
> > > > Currently, the only use-case for the heartbeat mechanism seems to be
> > when
> > > > running Samza on Yarn. IMHO, it makes sense to pull the heart beat
> > logic
> > > > into the *LocalContainerRunner* instead of baking it into the
> > > > *SamzaContainer* class. Long term, we can re-visit this when we have
> a
> > > > pluggable liveness detection mechanism.
> > > >
> > > > I'm thinking of a flow like this:
> > > >
> > > > There is a separate component (or a thread) inside
> LocalContainerRunner
> > > > that periodically polls the coordinator, and determines if it should
> > > > continue running. If the coordinator determines that the container
> > should
> > > > not run, the *LocalContainerRunner* cleanly shuts-down the container
> > and
> > > > the process exits with a non-zero exit status.
> > > >
> > > > The following nice properties fall out:
> > > >
> > > >    - We can remove the proposed config *job.container.validator.
> > enabled.
> > > *
> > > >    - We can also remove the proposed *Killable* interface since
> > > >    *SamzaContainer* (and runLoops) don't have to implement *Killable
> *
> > > >    anymore. The life-cycle is managed by the *LocalContainerRunner*
> > that
> > > >    started it.
> > > >
> > > > *On the proposed public interfaces:*
> > > >
> > > > *job.container.validator.enabled:  *I am not in favor of adding this
> > as
> > > a
> > > > new public config. IIUC, When running Samza jobs on Yarn, we always
> > want
> > > > the validator/heartbeats to be enabled. OTOH, when running Samza jobs
> > in
> > > > standalone mode, we currently do not have a pluggable mechanism for
> > > > heartbeat.
> > > >
> > > > *job.container.schedule.ms <http://job.container.schedule.ms>: *It
> > does
> > > > seem that we can pick a sensible default, and be done with it
> (instead
> > of
> > > > adding a new config)? Is there a reason this needs to be
> configurable?
> > > >
> > > > *On proposed Killable interface: *
> > > >
> > > > Not entirely sure we need this new "*Killable"* interface (esp. given
> > > that
> > > > there's currently only one implementation - *SamzaContainer*).
> > > >
> > > >    - The *LocalContainerRunner* can instead directly invoke shut-down
> > on
> > > >    the *SamzaContainer* when its heart-beat expires. The extra level
> of
> > > >    indirection (making *SamzaContainer* to implement *Killable*) is
> > > >    probably unnecessary IMHO.
> > > >
> > > >
> > > >    - Since, the *LocalContainerRunner* invokes *start/run* on the
> > > >    *SamzaContainer*, it seems simpler also have it invoke *shutdown*
> on
> > > the
> > > >    *SamzaContainer. *
> > > >
> > > > *Minor Comments:*
> > > >
> > > > >> Expose a REST endpoint (eg: /isContainerValid) who's purpose is to
> > get
> > > > requests from the Samza container periodically and respond back
> weather
> > > the
> > > > container is in the Job Coordinator's current list of valid
> containers.
> > > >
> > > > Wondering if it'd be slightly cleaner to rename this to
> > > */heartBeatRequest*
> > > > and return a *heartBeatResponse* as *CONTINUE, DIE*.  The name
> > > > *isContainerValid
> > > > * and the definition of validity does seem slightly broad?
> > > >
> > > > Thanks again for taking the time to draft the SEP, and volunteering
> to
> > > > implement this. Nice work!
> > > >
> > > > Best,
> > > > Jagadish
> > > >
> > > > On Mon, Apr 24, 2017 at 6:42 PM, Abhishek Shivanna <
> abkshvn@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Everyone,
> > > > >
> > > > > In order to fix the issue of orphaned/leaky containers seen when
> the
> > > > > YARN Node Manager crashes, I have created a SEP discussing the
> design
> > > for
> > > > > implementing a heartbeat between the containers and the job
> > > coordinator:
> > > > > https://cwiki.apache.org/confluence/display/SAMZA/SEP-
> > > > > 3%3A+Heart-beat+mechanism+between+JobCoordinator+and+
> > > > > all+running+containers
> > > > >
> > > > > Please take a look and provide feedback. I would also really
> > appreciate
> > > > > help in designing a way to propagate the error up from
> SamzaContainer
> > > in
> > > > > order to exit the container with a non-zero exit code.
> > > > >
> > > > > Thanks,
> > > > > Abhishek
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Jagadish V,
> > > > Graduate Student,
> > > > Department of Computer Science,
> > > > Stanford University
> > > >
> > >
> >
>



-- 
Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University

Re: [DISCUSS] SEP-3: Heart-beat mechanism between JobCoordinator and all running containers

Posted by Abhishek Shivanna <ab...@gmail.com>.

Hey Navina,

Thank you for reviewing the SEP.

> Are you planning on exposing this monitor class as a public api? What is
the significance of doing so?

Sorry for the confusion of having implementation details under "public
interfaces".
The ContainerHeartbeatMonitor and the ContainerHeartbeatClient are both
internal APIs
and have a concrete implementations.

> Is "Execution Container ID" the name of the environmental variable? I
don't
think environmental variables can contain whitespace??

Again, confusion that stemmed from my initial draft. I have fixed the SEP
with the actual name in the implementation.

> I think the first sentence corresponds to your design. The second one is
more of an implementation detail. You may want to split it up or just
discard one of them. I got confused reading them together because one talks
about adding to container and the other about the ContainerRunner.

Fixed the SEP to make it more clear.

Thanks,
Abhishek


On Wed, May 3, 2017 at 2:08 PM, Navina Ramesh (Apache) <na...@apache.org>
wrote:

> Hi Abhishek,
> I checked your latest proposal in SEP and it looks good to me.
>
> QQ:
> > A new ContainerHeartbeatMonitor class that accepts a
> ContainerHeartbeatClient (which has the business logic to make heartbeat
> checks on the JC endpoint) and a callback.
>
> Are you planning on exposing this monitor class as a public api? What is
> the significance of doing so?
>
> > set an environment variable with the "Execution Container ID" during
> container launch. This can be read from the container to make requests to
> the above endpoint.
>
> Is "Execution Container ID" the name of the environmental variable? I don't
> think environmental variables can contain whitespace??
>
> > On the container side we start a new thread that periodically polls this
> endpoint described above to check if the container is valid. If its not, we
> shutdown the run loop and raise an error (so that the exit code is non 0 so
> that YARN reschedules the container)
> The plan is to setup a monitor in the LocalContainerRunner class that
> schedules a thread to check the above endpoint at regular intervals. On
> failure the thread modifies state on the LocalContainerRunner to denote
> that there was an error. This state is checked during exit in the
> LocalContainerRunner to exit with a non-zero code.
>
> I think the first sentence corresponds to your design. The second one is
> more of an implementation detail. You may want to split it up or just
> discard one of them. I got confused reading them together because one talks
> about adding to container and the other about the ContainerRunner.
>
> Design looks pretty elegant and easily portable.
>
> Thanks!
> Navina
>
>
> On Wed, May 3, 2017 at 9:52 AM, Abhishek Shivanna <ab...@gmail.com>
> wrote:
>
> > Hey Jagadish,
> >
> > Thank you for taking the time to review the design.
> > I agree with moving the heartbeat into the the LocalContainerRunner
> instead
> > of fitting it into the SamzaContainer. I will update the SEP with the new
> > design changes.
> > Also agree with the changes to the configuration and choosing suitable
> > defaults should be good enough.
> >
> > Thanks,
> > Abhishek
> >
> >
> >
> > On Wed, Apr 26, 2017 at 3:23 PM, Jagadish Venkatraman <
> > jagadish1989@gmail.com> wrote:
> >
> > > Hi Abhishek,
> > >
> > > Heartbeat between the AM and container has been a long awaited Samza
> > > feature. It will go a long way in ensuring our reliability! +1 for this
> > > SEP.
> > >
> > > *High level comments:*
> > >
> > > Currently, the only use-case for the heartbeat mechanism seems to be
> when
> > > running Samza on Yarn. IMHO, it makes sense to pull the heart beat
> logic
> > > into the *LocalContainerRunner* instead of baking it into the
> > > *SamzaContainer* class. Long term, we can re-visit this when we have a
> > > pluggable liveness detection mechanism.
> > >
> > > I'm thinking of a flow like this:
> > >
> > > There is a separate component (or a thread) inside LocalContainerRunner
> > > that periodically polls the coordinator, and determines if it should
> > > continue running. If the coordinator determines that the container
> should
> > > not run, the *LocalContainerRunner* cleanly shuts-down the container
> and
> > > the process exits with a non-zero exit status.
> > >
> > > The following nice properties fall out:
> > >
> > >    - We can remove the proposed config *job.container.validator.
> enabled.
> > *
> > >    - We can also remove the proposed *Killable* interface since
> > >    *SamzaContainer* (and runLoops) don't have to implement *Killable *
> > >    anymore. The life-cycle is managed by the *LocalContainerRunner*
> that
> > >    started it.
> > >
> > > *On the proposed public interfaces:*
> > >
> > > *job.container.validator.enabled:  *I am not in favor of adding this
> as
> > a
> > > new public config. IIUC, When running Samza jobs on Yarn, we always
> want
> > > the validator/heartbeats to be enabled. OTOH, when running Samza jobs
> in
> > > standalone mode, we currently do not have a pluggable mechanism for
> > > heartbeat.
> > >
> > > *job.container.schedule.ms <http://job.container.schedule.ms>: *It
> does
> > > seem that we can pick a sensible default, and be done with it (instead
> of
> > > adding a new config)? Is there a reason this needs to be configurable?
> > >
> > > *On proposed Killable interface: *
> > >
> > > Not entirely sure we need this new "*Killable"* interface (esp. given
> > that
> > > there's currently only one implementation - *SamzaContainer*).
> > >
> > >    - The *LocalContainerRunner* can instead directly invoke shut-down
> on
> > >    the *SamzaContainer* when its heart-beat expires. The extra level of
> > >    indirection (making *SamzaContainer* to implement *Killable*) is
> > >    probably unnecessary IMHO.
> > >
> > >
> > >    - Since, the *LocalContainerRunner* invokes *start/run* on the
> > >    *SamzaContainer*, it seems simpler also have it invoke *shutdown* on
> > the
> > >    *SamzaContainer. *
> > >
> > > *Minor Comments:*
> > >
> > > >> Expose a REST endpoint (eg: /isContainerValid) who's purpose is to
> get
> > > requests from the Samza container periodically and respond back weather
> > the
> > > container is in the Job Coordinator's current list of valid containers.
> > >
> > > Wondering if it'd be slightly cleaner to rename this to
> > */heartBeatRequest*
> > > and return a *heartBeatResponse* as *CONTINUE, DIE*.  The name
> > > *isContainerValid
> > > * and the definition of validity does seem slightly broad?
> > >
> > > Thanks again for taking the time to draft the SEP, and volunteering to
> > > implement this. Nice work!
> > >
> > > Best,
> > > Jagadish
> > >
> > > On Mon, Apr 24, 2017 at 6:42 PM, Abhishek Shivanna <ab...@gmail.com>
> > > wrote:
> > >
> > > > Hi Everyone,
> > > >
> > > > In order to fix the issue of orphaned/leaky containers seen when the
> > > > YARN Node Manager crashes, I have created a SEP discussing the design
> > for
> > > > implementing a heartbeat between the containers and the job
> > coordinator:
> > > > https://cwiki.apache.org/confluence/display/SAMZA/SEP-
> > > > 3%3A+Heart-beat+mechanism+between+JobCoordinator+and+
> > > > all+running+containers
> > > >
> > > > Please take a look and provide feedback. I would also really
> appreciate
> > > > help in designing a way to propagate the error up from SamzaContainer
> > in
> > > > order to exit the container with a non-zero exit code.
> > > >
> > > > Thanks,
> > > > Abhishek
> > > >
> > >
> > >
> > >
> > > --
> > > Jagadish V,
> > > Graduate Student,
> > > Department of Computer Science,
> > > Stanford University
> > >
> >
>

Re: [DISCUSS] SEP-3: Heart-beat mechanism between JobCoordinator and all running containers

Posted by "Navina Ramesh (Apache)" <na...@apache.org>.

Hi Abhishek,
I checked your latest proposal in SEP and it looks good to me.

QQ:
> A new ContainerHeartbeatMonitor class that accepts a
ContainerHeartbeatClient (which has the business logic to make heartbeat
checks on the JC endpoint) and a callback.

Are you planning on exposing this monitor class as a public api? What is
the significance of doing so?

> set an environment variable with the "Execution Container ID" during
container launch. This can be read from the container to make requests to
the above endpoint.

Is "Execution Container ID" the name of the environmental variable? I don't
think environmental variables can contain whitespace??

> On the container side we start a new thread that periodically polls this
endpoint described above to check if the container is valid. If its not, we
shutdown the run loop and raise an error (so that the exit code is non 0 so
that YARN reschedules the container)
The plan is to setup a monitor in the LocalContainerRunner class that
schedules a thread to check the above endpoint at regular intervals. On
failure the thread modifies state on the LocalContainerRunner to denote
that there was an error. This state is checked during exit in the
LocalContainerRunner to exit with a non-zero code.

I think the first sentence corresponds to your design. The second one is
more of an implementation detail. You may want to split it up or just
discard one of them. I got confused reading them together because one talks
about adding to container and the other about the ContainerRunner.

Design looks pretty elegant and easily portable.

Thanks!
Navina


On Wed, May 3, 2017 at 9:52 AM, Abhishek Shivanna <ab...@gmail.com> wrote:

> Hey Jagadish,
>
> Thank you for taking the time to review the design.
> I agree with moving the heartbeat into the the LocalContainerRunner instead
> of fitting it into the SamzaContainer. I will update the SEP with the new
> design changes.
> Also agree with the changes to the configuration and choosing suitable
> defaults should be good enough.
>
> Thanks,
> Abhishek
>
>
>
> On Wed, Apr 26, 2017 at 3:23 PM, Jagadish Venkatraman <
> jagadish1989@gmail.com> wrote:
>
> > Hi Abhishek,
> >
> > Heartbeat between the AM and container has been a long awaited Samza
> > feature. It will go a long way in ensuring our reliability! +1 for this
> > SEP.
> >
> > *High level comments:*
> >
> > Currently, the only use-case for the heartbeat mechanism seems to be when
> > running Samza on Yarn. IMHO, it makes sense to pull the heart beat logic
> > into the *LocalContainerRunner* instead of baking it into the
> > *SamzaContainer* class. Long term, we can re-visit this when we have a
> > pluggable liveness detection mechanism.
> >
> > I'm thinking of a flow like this:
> >
> > There is a separate component (or a thread) inside LocalContainerRunner
> > that periodically polls the coordinator, and determines if it should
> > continue running. If the coordinator determines that the container should
> > not run, the *LocalContainerRunner* cleanly shuts-down the container and
> > the process exits with a non-zero exit status.
> >
> > The following nice properties fall out:
> >
> >    - We can remove the proposed config *job.container.validator.enabled.
> *
> >    - We can also remove the proposed *Killable* interface since
> >    *SamzaContainer* (and runLoops) don't have to implement *Killable *
> >    anymore. The life-cycle is managed by the *LocalContainerRunner* that
> >    started it.
> >
> > *On the proposed public interfaces:*
> >
> > *job.container.validator.enabled:  *I am not in favor of adding this as
> a
> > new public config. IIUC, When running Samza jobs on Yarn, we always want
> > the validator/heartbeats to be enabled. OTOH, when running Samza jobs in
> > standalone mode, we currently do not have a pluggable mechanism for
> > heartbeat.
> >
> > *job.container.schedule.ms <http://job.container.schedule.ms>: *It does
> > seem that we can pick a sensible default, and be done with it (instead of
> > adding a new config)? Is there a reason this needs to be configurable?
> >
> > *On proposed Killable interface: *
> >
> > Not entirely sure we need this new "*Killable"* interface (esp. given
> that
> > there's currently only one implementation - *SamzaContainer*).
> >
> >    - The *LocalContainerRunner* can instead directly invoke shut-down on
> >    the *SamzaContainer* when its heart-beat expires. The extra level of
> >    indirection (making *SamzaContainer* to implement *Killable*) is
> >    probably unnecessary IMHO.
> >
> >
> >    - Since, the *LocalContainerRunner* invokes *start/run* on the
> >    *SamzaContainer*, it seems simpler also have it invoke *shutdown* on
> the
> >    *SamzaContainer. *
> >
> > *Minor Comments:*
> >
> > >> Expose a REST endpoint (eg: /isContainerValid) who's purpose is to get
> > requests from the Samza container periodically and respond back weather
> the
> > container is in the Job Coordinator's current list of valid containers.
> >
> > Wondering if it'd be slightly cleaner to rename this to
> */heartBeatRequest*
> > and return a *heartBeatResponse* as *CONTINUE, DIE*.  The name
> > *isContainerValid
> > * and the definition of validity does seem slightly broad?
> >
> > Thanks again for taking the time to draft the SEP, and volunteering to
> > implement this. Nice work!
> >
> > Best,
> > Jagadish
> >
> > On Mon, Apr 24, 2017 at 6:42 PM, Abhishek Shivanna <ab...@gmail.com>
> > wrote:
> >
> > > Hi Everyone,
> > >
> > > In order to fix the issue of orphaned/leaky containers seen when the
> > > YARN Node Manager crashes, I have created a SEP discussing the design
> for
> > > implementing a heartbeat between the containers and the job
> coordinator:
> > > https://cwiki.apache.org/confluence/display/SAMZA/SEP-
> > > 3%3A+Heart-beat+mechanism+between+JobCoordinator+and+
> > > all+running+containers
> > >
> > > Please take a look and provide feedback. I would also really appreciate
> > > help in designing a way to propagate the error up from SamzaContainer
> in
> > > order to exit the container with a non-zero exit code.
> > >
> > > Thanks,
> > > Abhishek
> > >
> >
> >
> >
> > --
> > Jagadish V,
> > Graduate Student,
> > Department of Computer Science,
> > Stanford University
> >
>