Posted to dev@apex.apache.org by Mohit Jotwani <mo...@datatorrent.com> on 2016/12/01 05:22:20 UTC

Re: "ExcludeNodes" for an Apex application

This is a practical scenario: developers may need to exclude certain nodes
because those nodes are reserved for mission-critical applications. It
would be good to have this feature.

I understand that Stram should not get into resource management and should
still rely on Yarn. However, as the App Master, it should have the right to
reject nodes offered by Yarn and request other resources.

Regards,
Mohit

On Thu, Dec 1, 2016 at 2:34 AM, Sandesh Hegde <sa...@datatorrent.com>
wrote:

> Apex automatically blacklists troublesome nodes; please take a look at
> the following attributes:
>
> MAX_CONSECUTIVE_CONTAINER_FAILURES_FOR_BLACKLIST
> https://www.datatorrent.com/docs/apidocs/com/datatorrent/api/Context.DAGContext.html#MAX_CONSECUTIVE_CONTAINER_FAILURES_FOR_BLACKLIST
>
> BLACKLISTED_NODE_REMOVAL_TIME_MILLIS
>
> Thanks
>
>
>
> On Wed, Nov 30, 2016 at 12:56 PM Munagala Ramanath <ra...@datatorrent.com>
> wrote:
>
> Not sure if this is what Milind had in mind but we often run into
> situations where the dev group working with Apex has no control over
> cluster configuration -- to make any changes to the cluster they need to
> go through an elaborate process that can take many days.
>
> Meanwhile, if they notice that a particular node is consistently causing
> problems for their app, having a simple way to exclude it would be very
> helpful since it gives them a way to bypass communication and process
> issues within their own organization.
>
> Ram
>
> On Wed, Nov 30, 2016 at 10:58 AM, Sanjay Pujare <sa...@datatorrent.com>
> wrote:
>
> > To me both use cases appear to be generic resource management use cases.
> > For example, a randomly rebooting node is not good for any purpose esp.
> > long running apps so it is a bit of a stretch to imagine that these nodes
> > will be acceptable for some batch jobs in Yarn. So such a node should be
> > marked “Bad” or Unavailable in Yarn itself.
> >
> > The second use case is also a typical anti-affinity use case which
> > ideally should be implemented in Yarn – Milind’s example can also apply
> > to non-Apex batch jobs. In any case it looks like Yarn still doesn’t
> > have it (https://issues.apache.org/jira/browse/YARN-1042) so if Apex
> > needs it we will need to do it ourselves.
> >
> > On 11/30/16, 10:39 AM, "Munagala Ramanath" <ra...@datatorrent.com> wrote:
> >
> >     But then, what's the solution to the 2 problem scenarios that Milind
> >     describes ?
> >
> >     Ram
> >
> >     On Wed, Nov 30, 2016 at 10:34 AM, Sanjay Pujare <sanjay@datatorrent.com>
> >     wrote:
> >
> >     > I think “exclude nodes” and such is really the job of the resource
> >     > manager i.e. Yarn. So I am not sure taking over some of these tasks
> >     > in Apex would be very useful.
> >     >
> >     > I agree with Amol that apps should be node neutral. Resource
> >     > management in Yarn together with fault tolerance in Apex should
> >     > minimize the need for this feature although I am sure one can find
> >     > use cases.
> >     >
> >     >
> >     > On 11/29/16, 10:41 PM, "Amol Kekre" <am...@datatorrent.com> wrote:
> >     >
> >     >     We do have this feature in Yarn, but that applies to all
> >     >     applications. I am not sure if Yarn has anti-affinity. This
> >     >     feature may be used, but in general there is danger in an
> >     >     application taking over resource allocation. Another quirk is
> >     >     that big data apps should ideally be node-neutral. This is a
> >     >     good idea, if we are able to carve out something where the need
> >     >     is app-specific.
> >     >
> >     >     Thks
> >     >     Amol
> >     >
> >     >
> >     >     On Tue, Nov 29, 2016 at 10:00 PM, Milind Barve <milindb@gmail.com>
> >     >     wrote:
> >     >
> >     >     > We have seen the 2 cases mentioned below, where it would have
> >     >     > been nice if Apex allowed us to exclude a node from the cluster
> >     >     > for an application.
> >     >     >
> >     >     > 1. A node in the cluster had gone bad (was randomly rebooting)
> >     >     > and so an Apex app should not use it - other apps can use it as
> >     >     > they were batch jobs.
> >     >     > 2. A node is being used for a mission critical app (could be an
> >     >     > Apex app itself), but another Apex app which is mission critical
> >     >     > should not be using resources on that node.
> >     >     >
> >     >     > Can we have a way in which Stram and YARN can coordinate with
> >     >     > each other to not use a set of nodes for the application? It can
> >     >     > be done in 2 ways -
> >     >     >
> >     >     > 1. Have a list of "exclude" nodes with Stram - when YARN
> >     >     > allocates resources on any of these, STRAM rejects them and gets
> >     >     > resources allocated again from YARN.
> >     >     > 2. Have a list of nodes that can be used for an app - this can
> >     >     > be a part of config. However, I don't think this would be the
> >     >     > right way to do so as we will need support from YARN as well.
> >     >     > Further, this might be difficult to change at runtime if need
> >     >     > be.
> >     >     >
> >     >     > Any thoughts?
> >     >     >
> >     >     >
> >     >     > --
> >     >     > ~Milind bee at gee mail dot com
> >     >     >
> >     >
> >     >
> >     >
> >     >
> >
> >
> >
> >
>
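
The first option in Milind's original mail (keep an "exclude" list in Stram,
reject any container YARN offers on those hosts, and ask again) can be
pictured with a small model. This is only a sketch in Python, not Apex or
YARN code: the `Container` class, the `partition_offers` function and the
host names are hypothetical and exist purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Container:
    """Hypothetical stand-in for a container offered by the resource manager."""
    container_id: str
    host: str


def partition_offers(offers, excluded_hosts):
    """Split offered containers into (accepted, rejected) by host name."""
    excluded = {h.lower() for h in excluded_hosts}
    accepted, rejected = [], []
    for c in offers:
        # A container on an excluded host is rejected; everything else is kept.
        (rejected if c.host.lower() in excluded else accepted).append(c)
    return accepted, rejected


offers = [
    Container("c1", "node1.example.com"),
    Container("c2", "node7.example.com"),
    Container("c3", "node3.example.com"),
]
accepted, rejected = partition_offers(offers, {"node7.example.com"})

# In a real application master the rejected containers would be released
# and the same amount of resources requested again.
print([c.container_id for c in accepted])
print([c.container_id for c in rejected])
```

The sketch only covers the filtering step; how often the master would have
to re-request before it gets an acceptable host depends on the scheduler
and is not modelled here.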

Re: "ExcludeNodes" for an Apex application

Posted by Mohit Jotwani <mo...@datatorrent.com>.
I would agree with Milind.

Regards,
Mohit

On Fri, Dec 2, 2016 at 12:49 PM, Milind Barve <mi...@gmail.com> wrote:

> Additionally, this would apply to Stram as well i.e. the master should also
> not be deployed on these nodes. Not sure if anti-affinity goes beyond
> operators.

Re: "ExcludeNodes" for an Apex application

Posted by Munagala Ramanath <ra...@datatorrent.com>.
The OP claims (in the comment to the first response) that he actually tried
the proposed solution and it did not work for him, and he shows the RM code
fragment that is clobbering his preference.

Ram

On Fri, Dec 2, 2016 at 12:17 AM, Sandesh Hegde <sa...@datatorrent.com>
wrote:

> Yarn allows the AppMaster to run on a selected node. Apex shouldn't
> select the blacklisted nodes, so it is possible to avoid running the
> Apex containers on certain nodes.
>
> http://stackoverflow.com/questions/29302659/run-my-own-application-master-on-a-specific-node-in-a-yarn-cluster

Re: "ExcludeNodes" for an Apex application

Posted by Milind Barve <mi...@gmail.com>.
So all Apex will need to do is make sure, as part of the initial
configuration validations, that the node selected to run the master is not
part of the "excludeNode" list.

On Fri, Dec 2, 2016 at 1:47 PM, Sandesh Hegde <sa...@datatorrent.com>
wrote:

> Yarn allows the AppMaster to run on a selected node. Apex shouldn't
> select the blacklisted nodes, so it is possible to avoid running the
> Apex containers on certain nodes.
>
> http://stackoverflow.com/questions/29302659/run-my-own-application-master-on-a-specific-node-in-a-yarn-cluster



-- 
~Milind bee at gee mail dot com
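
The launch-time validation Milind describes in this message could look
roughly like the following. It is a sketch in Python under stated
assumptions: the function `validate_master_host`, the comma-separated form
of the exclude list and the host names are all hypothetical, and nothing
here is an existing Apex setting.

```python
def validate_master_host(master_host, exclude_nodes):
    """Return master_host, or raise ValueError if it is on the exclude list."""
    # Normalise the entries so that stray spaces and case do not matter.
    excluded = {h.strip().lower() for h in exclude_nodes if h.strip()}
    if master_host.strip().lower() in excluded:
        raise ValueError(f"master host {master_host!r} is in the exclude list")
    return master_host


# Hypothetical exclude list as it might appear in a configuration value.
exclude_nodes = "node7.example.com, node9.example.com".split(",")

chosen = validate_master_host("node1.example.com", exclude_nodes)
print(chosen)
```

Failing at validation time keeps the error close to the configuration that
caused it, before any container has been started.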

Re: "ExcludeNodes" for an Apex application

Posted by Sandesh Hegde <sa...@datatorrent.com>.
Yarn allows the AppMaster to run on a selected node. Apex shouldn't
select the blacklisted nodes, so it is possible to avoid running the
Apex containers on certain nodes.

http://stackoverflow.com/questions/29302659/run-my-own-application-master-on-a-specific-node-in-a-yarn-cluster


On Thu, Dec 1, 2016 at 11:52 PM Amol Kekre <am...@datatorrent.com> wrote:

> Yarn will deploy the AM (Stram) on a node of its choice, thereby rendering
> any attribute within the app unenforceable in terms of not deploying the
> master on a node.
>
> Thks
> Amol

Re: "ExcludeNodes" for an Apex application

Posted by Munagala Ramanath <ra...@datatorrent.com>.
Agree it should be via YARN; the poison pill would be the final barrier in
the event all other mechanisms have failed -- sort of like an API call which
documents that a parameter should be non-null but nevertheless checks it
internally and throws an exception if it finds null.

Additionally, it also helps teams that do not have control over YARN
configuration.

Ram
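
The poison pill described above can be sketched as a startup guard. This is
a Python illustration only, not Stram code: the functions
`check_not_excluded` and `main`, the exit code and the host names are
hypothetical. `socket.gethostname()` is the standard-library call used when
no host name is passed in.

```python
import socket
import sys


def check_not_excluded(excluded_hosts, hostname=None):
    """Return a diagnostic string if this host is excluded, else None."""
    host = (hostname or socket.gethostname()).lower()
    if host in {h.lower() for h in excluded_hosts}:
        return f"application master started on excluded host {host}"
    return None


def main(excluded_hosts, hostname=None):
    """Return a non-zero exit code (after printing why) on an excluded host."""
    diagnostic = check_not_excluded(excluded_hosts, hostname)
    if diagnostic is not None:
        print(diagnostic, file=sys.stderr)
        return 1
    return 0


# The host name is passed explicitly here so the example is deterministic.
exit_code = main({"node7.example.com"}, hostname="node7.example.com")
print(exit_code)
```

As in the null-check analogy, the guard documents the expectation and still
enforces it. The objection quoted below also applies: exiting this way
terminates the master for a reason unrelated to the application itself, so
it would be a last-resort barrier rather than the primary mechanism.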

On Fri, Dec 2, 2016 at 7:15 AM, Amol Kekre <am...@datatorrent.com> wrote:

> Stram node exclusion should be done via Yarn; a poison pill is not a good
> way as it induces a termination for the wrong reasons.
>
> Thks
> Amol

Re: "ExcludeNodes" for an Apex application

Posted by Amol Kekre <am...@datatorrent.com>.
Stram node exclusion should be done via Yarn; a poison pill is not a good
way as it induces a termination for the wrong reasons.

Thks
Amol


On Fri, Dec 2, 2016 at 7:13 AM, Munagala Ramanath <ra...@datatorrent.com>
wrote:

> Could STRAM include a poison pill where it simply exits with a diagnostic
> if its host name is blacklisted?
>
> Ram

Re: "ExcludeNodes" for an Apex application

Posted by Munagala Ramanath <ra...@datatorrent.com>.
Could STRAM include a poison pill where it simply exits with a diagnostic
if its host name is blacklisted?

Ram

On Thu, Dec 1, 2016 at 11:52 PM, Amol Kekre <am...@datatorrent.com> wrote:

> Yarn will deploy the AM (Stram) on a node of its choice, thereby rendering
> any attribute within the app unenforceable in terms of not deploying the
> master on a node.
>
> Thks
> Amol

Re: "ExcludeNodes" for an Apex application

Posted by Amol Kekre <am...@datatorrent.com>.
Yarn will deploy the AM (Stram) on a node of its choice, thereby rendering
any attribute within the app unenforceable in terms of not deploying the
master on a node.

Thks
Amol


On Thu, Dec 1, 2016 at 11:19 PM, Milind Barve <mi...@gmail.com> wrote:

> Additionally, this would apply to Stram as well i.e. the master should also
> not be deployed on these nodes. Not sure if anti-affinity goes beyond
> operators.

Re: "ExcludeNodes" for an Apex application

Posted by Milind Barve <mi...@gmail.com>.
Additionally, this would apply to Stram as well i.e. the master should also
not be deployed on these nodes. Not sure if anti-affinity goes beyond
operators.

On Fri, Dec 2, 2016 at 12:47 PM, Milind Barve <mi...@gmail.com> wrote:

> My previous mail explains it, but I just forgot to add: -1 to covering
> this under anti-affinity.
>
> On Fri, Dec 2, 2016 at 12:46 PM, Milind Barve <mi...@gmail.com> wrote:
>
>> While it is possible to extend anti-affinity to take care of this, I feel
>> it will cause confusion from a user perspective. As a user, when I think
>> about anti-affinity, what comes to mind right away is a relative relation
>> between operators.
>>
>> On the other hand, the current ask is not that, but a relation at an
>> application level w.r.t. a node. (Further, we might even think of extending
>> this at an operator level - which would mean do not deploy an operator on a
>> particular node)
>>
>> We would be better off clearly articulating and allowing users to
>> configure it separately as against using anti-affinity.
>>
>> On Fri, Dec 2, 2016 at 10:03 AM, Bhupesh Chawda <bh...@datatorrent.com>
>> wrote:
>>
>>> Okay, I think that serves an alternate purpose of detecting any newly
>>> gone
>>> bad node and excluding it.
>>>
>>> +1 for covering the original scenario under anti-affinity.
>>>
>>> ~ Bhupesh
>>>
>>> On Fri, Dec 2, 2016 at 9:14 AM, Munagala Ramanath <ra...@datatorrent.com>
>>> wrote:
>>>
>>> > It only takes effect after failures -- no way to exclude from the
>>> get-go.
>>> >
>>> > Ram
>>> >
>>> > On Dec 1, 2016 7:15 PM, "Bhupesh Chawda" <bh...@datatorrent.com>
>>> wrote:
>>> >
>>> > > As suggested by Sandesh, the parameter
>>> > > MAX_CONSECUTIVE_CONTAINER_FAILURES_FOR_BLACKLIST seems to do exactly
>>> > what
>>> > > is needed.
>>> > > Why would this not work?
>>> > >
>>> > > ~ Bhupesh
>>> > >
>>> >
>>>
>>
>>
>>
>> --
>> ~Milind bee at gee mail dot com
>>
>
>
>
> --
> ~Milind bee at gee mail dot com
>



-- 
~Milind bee at gee mail dot com

Re: "ExcludeNodes" for an Apex application

Posted by Milind Barve <mi...@gmail.com>.
My previous mail explains it, but I forgot to add: -1 to covering this
under anti-affinity.

On Fri, Dec 2, 2016 at 12:46 PM, Milind Barve <mi...@gmail.com> wrote:

> While it is possible to extend anti-affinity to take care of this, I feel
> it will cause confusion from a user perspective. As a user, when I think
> about anti-affinity, what comes to mind right away is a relative relation
> between operators.
>
> On the other hand, the current ask is not that, but a relation at an
> application level w.r.t. a node. (Further, we might even think of extending
> this at an operator level - which would mean do not deploy an operator on a
> particular node)
>
> We would be better off clearly articulating and allowing users to
> configure it separately as against using anti-affinity.
>
> On Fri, Dec 2, 2016 at 10:03 AM, Bhupesh Chawda <bh...@datatorrent.com>
> wrote:
>
>> Okay, I think that serves an alternate purpose of detecting any newly gone
>> bad node and excluding it.
>>
>> +1 for covering the original scenario under anti-affinity.
>>
>> ~ Bhupesh
>>
>> On Fri, Dec 2, 2016 at 9:14 AM, Munagala Ramanath <ra...@datatorrent.com>
>> wrote:
>>
>> > It only takes effect after failures -- no way to exclude from the
>> get-go.
>> >
>> > Ram
>> >
>> > On Dec 1, 2016 7:15 PM, "Bhupesh Chawda" <bh...@datatorrent.com>
>> wrote:
>> >
>> > > As suggested by Sandesh, the parameter
>> > > MAX_CONSECUTIVE_CONTAINER_FAILURES_FOR_BLACKLIST seems to do exactly
>> > what
>> > > is needed.
>> > > Why would this not work?
>> > >
>> > > ~ Bhupesh
>> > >
>> >
>>
>
>
>
> --
> ~Milind bee at gee mail dot com
>



-- 
~Milind bee at gee mail dot com

Re: "ExcludeNodes" for an Apex application

Posted by Milind Barve <mi...@gmail.com>.
While it is possible to extend anti-affinity to take care of this, I feel
it will cause confusion from a user perspective. As a user, when I think
about anti-affinity, what comes to mind right away is a relative relation
between operators.

On the other hand, the current ask is not that, but a relation at the
application level w.r.t. a node. (Further, we might even think of extending
this to the operator level, which would mean "do not deploy this operator on
a particular node".)

We would be better off clearly articulating this and allowing users to
configure it separately, rather than through anti-affinity.
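
To make the distinction concrete, here is a minimal Java sketch. It is not
Apex code: the class PlacementRules and its methods excludeNode, keepApart
and isAllowed are hypothetical names chosen for illustration. It models
anti-affinity as a relation between two operators and node exclusion as a
relation between the application and a host, kept as two separate settings.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model for illustration only; not the Apex API.
final class PlacementRules {
    // Application-level rule: hosts that no operator of this app may use.
    private final Set<String> excludedNodes = new HashSet<>();
    // Operator-level rule: operators that must not share a host.
    private final Map<String, Set<String>> apart = new HashMap<>();

    void excludeNode(String host) {
        excludedNodes.add(host);
    }

    void keepApart(String opA, String opB) {
        apart.computeIfAbsent(opA, k -> new HashSet<>()).add(opB);
        apart.computeIfAbsent(opB, k -> new HashSet<>()).add(opA);
    }

    // placement maps an operator name to the host it already runs on.
    boolean isAllowed(String operator, String host, Map<String, String> placement) {
        if (excludedNodes.contains(host)) {
            return false; // absolute: does not depend on any other operator
        }
        for (String other : apart.getOrDefault(operator, Collections.<String>emptySet())) {
            if (host.equals(placement.get(other))) {
                return false; // relative: depends on where the other operator runs
            }
        }
        return true;
    }

    public static void main(String[] args) {
        PlacementRules rules = new PlacementRules();
        rules.excludeNode("node20");
        rules.keepApart("reader", "writer");

        Map<String, String> placement = new HashMap<>();
        placement.put("reader", "node1");

        System.out.println(rules.isAllowed("writer", "node20", placement));
        System.out.println(rules.isAllowed("writer", "node1", placement));
        System.out.println(rules.isAllowed("writer", "node2", placement));
    }
}
```

In this sketch the exclusion check never consults the placement of other
operators, which is the sense in which it differs from anti-affinity.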

On Fri, Dec 2, 2016 at 10:03 AM, Bhupesh Chawda <bh...@datatorrent.com>
wrote:

> Okay, I think that serves an alternate purpose of detecting any newly gone
> bad node and excluding it.
>
> +1 for covering the original scenario under anti-affinity.
>
> ~ Bhupesh
>
> On Fri, Dec 2, 2016 at 9:14 AM, Munagala Ramanath <ra...@datatorrent.com>
> wrote:
>
> > It only takes effect after failures -- no way to exclude from the get-go.
> >
> > Ram
> >
> > On Dec 1, 2016 7:15 PM, "Bhupesh Chawda" <bh...@datatorrent.com>
> > wrote:
> >
> > > As suggested by Sandesh, the parameter
> > > MAX_CONSECUTIVE_CONTAINER_FAILURES_FOR_BLACKLIST seems to do exactly
> > > what is needed.
> > > Why would this not work?
> > >
> > > ~ Bhupesh
> > >
> >
>



-- 
~Milind bee at gee mail dot com

Re: "ExcludeNodes" for an Apex application

Posted by Bhupesh Chawda <bh...@datatorrent.com>.
Okay, I think that serves an alternate purpose: detecting any node that has
newly gone bad and excluding it.

+1 for covering the original scenario under anti-affinity.

~ Bhupesh

On Fri, Dec 2, 2016 at 9:14 AM, Munagala Ramanath <ra...@datatorrent.com>
wrote:

> It only takes effect after failures -- no way to exclude from the get-go.
>
> Ram
>
> On Dec 1, 2016 7:15 PM, "Bhupesh Chawda" <bh...@datatorrent.com> wrote:
>
> > As suggested by Sandesh, the parameter
> > MAX_CONSECUTIVE_CONTAINER_FAILURES_FOR_BLACKLIST seems to do exactly
> > what is needed.
> > Why would this not work?
> >
> > ~ Bhupesh
> >
>

Re: "ExcludeNodes" for an Apex application

Posted by Munagala Ramanath <ra...@datatorrent.com>.
It only takes effect after failures -- no way to exclude from the get-go.

Ram
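
A rough Java sketch of the difference, under stated assumptions: NodeFilter
and its methods are hypothetical and this is not how Stram implements
blacklisting. The threshold stands in for
MAX_CONSECUTIVE_CONTAINER_FAILURES_FOR_BLACKLIST, and the sketch assumes a
node is blacklisted once its consecutive failure count reaches the
threshold. Timed removal from the blacklist
(BLACKLISTED_NODE_REMOVAL_TIME_MILLIS) is left out.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the two behaviours; not Apex's implementation.
final class NodeFilter {
    // Stands in for MAX_CONSECUTIVE_CONTAINER_FAILURES_FOR_BLACKLIST.
    private final int maxConsecutiveFailures;
    private final Map<String, Integer> consecutiveFailures = new HashMap<>();
    // The feature requested in this thread: hosts rejected before any failure.
    private final Set<String> excludedUpFront = new HashSet<>();

    NodeFilter(int maxConsecutiveFailures) {
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    void excludeUpFront(String host) {
        excludedUpFront.add(host);
    }

    void containerFailed(String host) {
        consecutiveFailures.merge(host, 1, Integer::sum);
    }

    void containerSucceeded(String host) {
        consecutiveFailures.remove(host); // a success resets the streak
    }

    boolean isUsable(String host) {
        if (excludedUpFront.contains(host)) {
            return false; // rejected from the get-go
        }
        // Failure-based blacklisting only kicks in after enough failures.
        return consecutiveFailures.getOrDefault(host, 0) < maxConsecutiveFailures;
    }

    public static void main(String[] args) {
        NodeFilter filter = new NodeFilter(3);
        System.out.println(filter.isUsable("node7")); // no failures recorded yet
        filter.containerFailed("node7");
        filter.containerFailed("node7");
        filter.containerFailed("node7");
        System.out.println(filter.isUsable("node7")); // threshold reached
        filter.excludeUpFront("node20");
        System.out.println(filter.isUsable("node20")); // never tried at all
    }
}
```

With failure-based blacklisting alone, the application has to lose
containers on a bad node before that node is avoided; the up-front set
avoids it without any failure.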

On Dec 1, 2016 7:15 PM, "Bhupesh Chawda" <bh...@datatorrent.com> wrote:

> As suggested by Sandesh, the parameter
> MAX_CONSECUTIVE_CONTAINER_FAILURES_FOR_BLACKLIST seems to do exactly what
> is needed.
> Why would this not work?
>
> ~ Bhupesh
>

Re: "ExcludeNodes" for an Apex application

Posted by Bhupesh Chawda <bh...@datatorrent.com>.
As suggested by Sandesh, the parameter
MAX_CONSECUTIVE_CONTAINER_FAILURES_FOR_BLACKLIST seems to do exactly what
is needed.
Why would this not work?

~ Bhupesh

Re: "ExcludeNodes" for an Apex application

Posted by AJAY GUPTA <aj...@gmail.com>.
Hi,

Can't we make use of the existing Node Label + queue feature in Yarn to
achieve this? Though we would have to redeploy the cluster, it's still
possible to exclude nodes.
https://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/NodeLabel.html


Thanks,
Ajay

On Fri, Dec 2, 2016 at 5:57 AM, Amol Kekre <am...@datatorrent.com> wrote:

> I agree, this should be on top of affinity work
>
> Thks
> Amol
>
> On Thu, Dec 1, 2016 at 1:01 PM, Pramod Immaneni <pr...@datatorrent.com>
> wrote:
>
> > I see a host locality available as an attribute in DAG for individual
> > operators. If affinity doesn't support this today, we could probably add
> > it. You could also make setting a blacklist directly a convenience
> function
> > on top of affinity.
> >
> > On Thu, Dec 1, 2016 at 11:58 AM, Sandesh Hegde <sa...@datatorrent.com>
> > wrote:
> >
> > > Pramod,
> > >
> > > How to specify,  "don't deploy any operators on Node20" using
> > > anti-affinity?
> > >
> > > I don't see any examples here,
> > > http://apex.apache.org/docs/apex/application_development/#
> affinity-rules
> > >
> > >
> > > On Thu, Dec 1, 2016 at 11:31 AM Pramod Immaneni <
> pramod@datatorrent.com>
> > > wrote:
> > >
> > > > Shouldn't this be already covered by anti-affinity. Today users can
> > > specify
> > > > multiple affinity rules, for each rule they can specify positive or
> > > > negative affinity, locality and operator selection. If an affinity
> rule
> > > > specifying negative affinity, node locality and all operators, does
> not
> > > > work then let's fix that scenario instead of creating a new option.
> > > >
> > > > On Thu, Dec 1, 2016 at 11:17 AM, Sandesh Hegde <
> > sandesh@datatorrent.com>
> > > > wrote:
> > > >
> > > > > I have created a jira, for adding the list of blacklisted nodes,
> > > > > https://issues.apache.org/jira/browse/APEXCORE-584
> > > > >
> > > > > On Wed, Nov 30, 2016 at 11:06 PM Sanjay Pujare <
> > sanjay@datatorrent.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > Yes, Ram explained to me that in practice this would be a useful
> > > > feature
> > > > > > for Apex devops who typically have no control over Hadoop/Yarn
> > > cluster.
> > > > > >
> > > > > > On 11/30/16, 9:22 PM, "Mohit Jotwani" <mo...@datatorrent.com>
> > wrote:
> > > > > >
> > > > > >     This is a practical scenario where developers would be
> required
> > > to
> > > > > > exclude
> > > > > >     certain nodes as they might be required for some mission
> > critical
> > > > > >     applications. It would be good to have this feature.
> > > > > >
> > > > > >     I understand that Stram should not get into resourcing and
> > still
> > > > rely
> > > > > > on
> > > > > >     Yarn, however, as the App Master it should have the right to
> > > reject
> > > > > the
> > > > > >     nodes offered by Yarn and request for other resources.
> > > > > >
> > > > > >     Regards,
> > > > > >     Mohit
> > > > > >
> > > > > >     On Thu, Dec 1, 2016 at 2:34 AM, Sandesh Hegde <
> > > > > sandesh@datatorrent.com
> > > > > > >
> > > > > >     wrote:
> > > > > >
> > > > > >     > Apex has automatic blacklisting of the troublesome nodes,
> > > please
> > > > > > take a
> > > > > >     > look at the following attributes,
> > > > > >     >
> > > > > >     > MAX_CONSECUTIVE_CONTAINER_FAILURES_FOR_BLACKLIST
> > > > > >     > https://www.datatorrent.com/docs/apidocs/com/datatorrent/
> > > > > >     > api/Context.DAGContext.html#MAX_CONSECUTIVE_CONTAINER_
> > > > > >     > FAILURES_FOR_BLACKLIST
> > > > > >     >
> > > > > >     > BLACKLISTED_NODE_REMOVAL_TIME_MILLIS
> > > > > >     >
> > > > > >     > Thanks
> > > > > >     >
> > > > > >     >
> > > > > >     >
> > > > > >     > On Wed, Nov 30, 2016 at 12:56 PM Munagala Ramanath <
> > > > > > ram@datatorrent.com>
> > > > > >     > wrote:
> > > > > >     >
> > > > > >     > Not sure if this is what Milind had in mind but we often
> run
> > > into
> > > > > >     > situations where the dev group
> > > > > >     > working with Apex has no control over cluster configuration
> > --
> > > to
> > > > > > make any
> > > > > >     > changes to the cluster they need to
> > > > > >     > go through an elaborate process that can take many days.
> > > > > >     >
> > > > > >     > Meanwhile, if they notice that a particular node is
> > > consistently
> > > > > > causing
> > > > > >     > problems for their
> > > > > >     > app, having a simple way to exclude it would be very
> helpful
> > > > since
> > > > > > it gives
> > > > > >     > them a way
> > > > > >     > to bypass communication and process issues within their own
> > > > > > organization.
> > > > > >     >
> > > > > >     > Ram
> > > > > >     >
> > > > > >     > On Wed, Nov 30, 2016 at 10:58 AM, Sanjay Pujare <
> > > > > > sanjay@datatorrent.com>
> > > > > >     > wrote:
> > > > > >     >
> > > > > >     > > To me both use cases appear to be generic resource
> > management
> > > > use
> > > > > > cases.
> > > > > >     > > For example, a randomly rebooting node is not good for
> any
> > > > > purpose
> > > > > > esp.
> > > > > >     > > long running apps so it is a bit of a stretch to imagine
> > that
> > > > > > these nodes
> > > > > >     > > will be acceptable for some batch jobs in Yarn. So such a
> > > node
> > > > > > should be
> > > > > >     > > marked “Bad” or Unavailable in Yarn itself.
> > > > > >     > >
> > > > > >     > > Second use case is also typical anti-affinity use case
> > which
> > > > > > ideally
> > > > > >     > > should be implemented in Yarn – Milind’s example can also
> > > apply
> > > > > to
> > > > > >     > non-Apex
> > > > > >     > > batch jobs. In any case it looks like Yarn still doesn’t
> > have
> > > > it
> > > > > (
> > > > > >     > > https://issues.apache.org/jira/browse/YARN-1042) so if
> > Apex
> > > > > needs
> > > > > > it we
> > > > > >     > > will need to do it ourselves.
> > > > > >     > >
> > > > > >     > > On 11/30/16, 10:39 AM, "Munagala Ramanath" <
> > > > ram@datatorrent.com>
> > > > > > wrote:
> > > > > >     > >
> > > > > >     > >     But then, what's the solution to the 2 problem
> > scenarios
> > > > that
> > > > > > Milind
> > > > > >     > >     describes ?
> > > > > >     > >
> > > > > >     > >     Ram
> > > > > >     > >
> > > > > >     > >     On Wed, Nov 30, 2016 at 10:34 AM, Sanjay Pujare <
> > > > > >     > > sanjay@datatorrent.com>
> > > > > >     > >     wrote:
> > > > > >     > >
> > > > > >     > >     > I think “exclude nodes” and such is really the job
> of
> > > the
> > > > > > resource
> > > > > >     > > manager
> > > > > >     > >     > i.e. Yarn. So I am not sure taking over some of
> these
> > > > tasks
> > > > > > in Apex
> > > > > >     > > would
> > > > > >     > >     > be very useful.
> > > > > >     > >     >
> > > > > >     > >     > I agree with Amol that apps should be node neutral.
> > > > > Resource
> > > > > >     > > management in
> > > > > >     > >     > Yarn together with fault tolerance in Apex should
> > > > minimize
> > > > > > the need
> > > > > >     > > for
> > > > > >     > >     > this feature although I am sure one can find use
> > cases.
> > > > > >     > >     >
> > > > > >     > >     >
> > > > > >     > >     > On 11/29/16, 10:41 PM, "Amol Kekre" <
> > > > amol@datatorrent.com>
> > > > > > wrote:
> > > > > >     > >     >
> > > > > >     > >     >     We do have this feature in Yarn, but that
> applies
> > > to
> > > > > all
> > > > > >     > > applications.
> > > > > >     > >     > I am
> > > > > >     > >     >     not sure if Yarn has anti-affinity. This
> feature
> > > may
> > > > be
> > > > > > used,
> > > > > >     > > but in
> > > > > >     > >     >     general there is danger is an application
> taking
> > > over
> > > > > > resource
> > > > > >     > >     > allocation.
> > > > > >     > >     >     Another quirk is that big data apps should
> > ideally
> > > be
> > > > > >     > > node-neutral.
> > > > > >     > >     > This is
> > > > > >     > >     >     a good idea, if we are able to carve out
> > something
> > > > > where
> > > > > > need
> > > > > >     > is
> > > > > >     > > app
> > > > > >     > >     >     specific.
> > > > > >     > >     >
> > > > > >     > >     >     Thks
> > > > > >     > >     >     Amol
> > > > > >     > >     >
> > > > > >     > >     >
> > > > > >     > >     >     On Tue, Nov 29, 2016 at 10:00 PM, Milind Barve
> <
> > > > > >     > > milindb@gmail.com>
> > > > > >     > >     > wrote:
> > > > > >     > >     >
> > > > > >     > >     >     > We have seen 2 cases mentioned below, where,
> it
> > > > would
> > > > > > have
> > > > > >     > > been nice
> > > > > >     > >     > if
> > > > > >     > >     >     > Apex allowed us to exclude a node from the
> > > cluster
> > > > > for
> > > > > > an
> > > > > >     > >     > application.
> > > > > >     > >     >     >
> > > > > >     > >     >     > 1. A node in the cluster had gone bad (was
> > > randomly
> > > > > >     > rebooting)
> > > > > >     > > and
> > > > > >     > >     > so an
> > > > > >     > >     >     > Apex app should not use it - other apps can
> use
> > > it
> > > > as
> > > > > > they
> > > > > >     > were
> > > > > >     > >     > batch jobs.
> > > > > >     > >     >     > 2. A node is being used for a mission
> critical
> > > app
> > > > > > (Could be
> > > > > >     > > an Apex
> > > > > >     > >     > app
> > > > > >     > >     >     > itself), but another Apex app which is
> mission
> > > > > critical
> > > > > >     > should
> > > > > >     > > not
> > > > > >     > >     > be using
> > > > > >     > >     >     > resources on that node.
> > > > > >     > >     >     >
> > > > > >     > >     >     > Can we have a way in which, Stram and YARN
> can
> > > > > > coordinate
> > > > > >     > > between
> > > > > >     > >     > each
> > > > > >     > >     >     > other to not use a set of nodes for the
> > > > application.
> > > > > > It an be
> > > > > >     > > done
> > > > > >     > >     > in 2 way
> > > > > >     > >     >     > s-
> > > > > >     > >     >     >
> > > > > >     > >     >     > 1. Have a list of "exclude" nodes with Stram-
> > > when
> > > > > YARN
> > > > > >     > > allcates
> > > > > >     > >     > resources
> > > > > >     > >     >     > on either of these, STRAM rejects and gets
> > > > resources
> > > > > >     > allocated
> > > > > >     > > again
> > > > > >     > >     > frm
> > > > > >     > >     >     > YARN
> > > > > >     > >     >     > 2. Have a list of nodes that can be used for
> an
> > > > app -
> > > > > > This
> > > > > >     > can
> > > > > >     > > be a
> > > > > >     > >     > part of
> > > > > >     > >     >     > config. Hwever, I don't think this would be a
> > > right
> > > > > > way to do
> > > > > >     > > so as
> > > > > >     > >     > we will
> > > > > >     > >     >     > need support from YARN as well. Further, this
> > > might
> > > > > be
> > > > > >     > > difficult to
> > > > > >     > >     > change
> > > > > >     > >     >     > at runtim if need be.
> > > > > >     > >     >     >
> > > > > >     > >     >     > Any thoughts?
> > > > > >     > >     >     >
> > > > > >     > >     >     >
> > > > > >     > >     >     > --
> > > > > >     > >     >     > ~Milind bee at gee mail dot com
> > > > > >     > >     >     >
> > > > > >     > >     >
> > > > > >     > >     >
> > > > > >     > >     >
> > > > > >     > >     >
> > > > > >     > >
> > > > > >     > >
> > > > > >     > >
> > > > > >     > >
> > > > > >     >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: "ExcludeNodes" for an Apex application

Posted by Amol Kekre <am...@datatorrent.com>.
I agree, this should be built on top of the affinity work.

Thks
Amol

On Thu, Dec 1, 2016 at 1:01 PM, Pramod Immaneni <pr...@datatorrent.com>
wrote:

> I see a host locality available as an attribute in DAG for individual
> operators. If affinity doesn't support this today, we could probably add
> it. You could also make setting a blacklist directly a convenience function
> on top of affinity.
>
> On Thu, Dec 1, 2016 at 11:58 AM, Sandesh Hegde <sa...@datatorrent.com>
> wrote:
>
> > Pramod,
> >
> > How to specify,  "don't deploy any operators on Node20" using
> > anti-affinity?
> >
> > I don't see any examples here,
> > http://apex.apache.org/docs/apex/application_development/#affinity-rules
> >
> >
> > On Thu, Dec 1, 2016 at 11:31 AM Pramod Immaneni <pr...@datatorrent.com>
> > wrote:
> >
> > > Shouldn't this be already covered by anti-affinity. Today users can
> > specify
> > > multiple affinity rules, for each rule they can specify positive or
> > > negative affinity, locality and operator selection. If an affinity rule
> > > specifying negative affinity, node locality and all operators, does not
> > > work then let's fix that scenario instead of creating a new option.
> > >
> > > On Thu, Dec 1, 2016 at 11:17 AM, Sandesh Hegde <
> sandesh@datatorrent.com>
> > > wrote:
> > >
> > > > I have created a jira, for adding the list of blacklisted nodes,
> > > > https://issues.apache.org/jira/browse/APEXCORE-584
> > > >
> > > > On Wed, Nov 30, 2016 at 11:06 PM Sanjay Pujare <
> sanjay@datatorrent.com
> > >
> > > > wrote:
> > > >
> > > > > Yes, Ram explained to me that in practice this would be a useful
> > > feature
> > > > > for Apex devops who typically have no control over Hadoop/Yarn
> > cluster.
> > > > >
> > > > > On 11/30/16, 9:22 PM, "Mohit Jotwani" <mo...@datatorrent.com>
> wrote:
> > > > >
> > > > >     This is a practical scenario where developers would be required
> > to
> > > > > exclude
> > > > >     certain nodes as they might be required for some mission
> critical
> > > > >     applications. It would be good to have this feature.
> > > > >
> > > > >     I understand that Stram should not get into resourcing and
> still
> > > rely
> > > > > on
> > > > >     Yarn, however, as the App Master it should have the right to
> > reject
> > > > the
> > > > >     nodes offered by Yarn and request for other resources.
> > > > >
> > > > >     Regards,
> > > > >     Mohit
> > > > >
> > > > >     On Thu, Dec 1, 2016 at 2:34 AM, Sandesh Hegde <
> > > > sandesh@datatorrent.com
> > > > > >
> > > > >     wrote:
> > > > >
> > > > >     > Apex has automatic blacklisting of the troublesome nodes,
> > please
> > > > > take a
> > > > >     > look at the following attributes,
> > > > >     >
> > > > >     > MAX_CONSECUTIVE_CONTAINER_FAILURES_FOR_BLACKLIST
> > > > >     > https://www.datatorrent.com/docs/apidocs/com/datatorrent/
> > > > >     > api/Context.DAGContext.html#MAX_CONSECUTIVE_CONTAINER_
> > > > >     > FAILURES_FOR_BLACKLIST
> > > > >     >
> > > > >     > BLACKLISTED_NODE_REMOVAL_TIME_MILLIS
> > > > >     >
> > > > >     > Thanks
> > > > >     >
> > > > >     >
> > > > >     >
> > > > >     > On Wed, Nov 30, 2016 at 12:56 PM Munagala Ramanath <
> > > > > ram@datatorrent.com>
> > > > >     > wrote:
> > > > >     >
> > > > >     > Not sure if this is what Milind had in mind but we often run
> > into
> > > > >     > situations where the dev group
> > > > >     > working with Apex has no control over cluster configuration
> --
> > to
> > > > > make any
> > > > >     > changes to the cluster they need to
> > > > >     > go through an elaborate process that can take many days.
> > > > >     >
> > > > >     > Meanwhile, if they notice that a particular node is
> > consistently
> > > > > causing
> > > > >     > problems for their
> > > > >     > app, having a simple way to exclude it would be very helpful
> > > since
> > > > > it gives
> > > > >     > them a way
> > > > >     > to bypass communication and process issues within their own
> > > > > organization.
> > > > >     >
> > > > >     > Ram
> > > > >     >
> > > > >     > On Wed, Nov 30, 2016 at 10:58 AM, Sanjay Pujare <
> > > > > sanjay@datatorrent.com>
> > > > >     > wrote:
> > > > >     >
> > > > >     > > To me both use cases appear to be generic resource
> management
> > > use
> > > > > cases.
> > > > >     > > For example, a randomly rebooting node is not good for any
> > > > purpose
> > > > > esp.
> > > > >     > > long running apps so it is a bit of a stretch to imagine
> that
> > > > > these nodes
> > > > >     > > will be acceptable for some batch jobs in Yarn. So such a
> > node
> > > > > should be
> > > > >     > > marked “Bad” or Unavailable in Yarn itself.
> > > > >     > >
> > > > >     > > Second use case is also typical anti-affinity use case
> which
> > > > > ideally
> > > > >     > > should be implemented in Yarn – Milind’s example can also
> > apply
> > > > to
> > > > >     > non-Apex
> > > > >     > > batch jobs. In any case it looks like Yarn still doesn’t
> have
> > > it
> > > > (
> > > > >     > > https://issues.apache.org/jira/browse/YARN-1042) so if
> Apex
> > > > needs
> > > > > it we
> > > > >     > > will need to do it ourselves.
> > > > >     > >
> > > > >     > > On 11/30/16, 10:39 AM, "Munagala Ramanath" <
> > > ram@datatorrent.com>
> > > > > wrote:
> > > > >     > >
> > > > >     > >     But then, what's the solution to the 2 problem
> scenarios
> > > that
> > > > > Milind
> > > > >     > >     describes ?
> > > > >     > >
> > > > >     > >     Ram
> > > > >     > >
> > > > >     > >     On Wed, Nov 30, 2016 at 10:34 AM, Sanjay Pujare <
> > > > >     > > sanjay@datatorrent.com>
> > > > >     > >     wrote:
> > > > >     > >
> > > > >     > >     > I think “exclude nodes” and such is really the job of
> > the
> > > > > resource
> > > > >     > > manager
> > > > >     > >     > i.e. Yarn. So I am not sure taking over some of these
> > > tasks
> > > > > in Apex
> > > > >     > > would
> > > > >     > >     > be very useful.
> > > > >     > >     >
> > > > >     > >     > I agree with Amol that apps should be node neutral.
> > > > Resource
> > > > >     > > management in
> > > > >     > >     > Yarn together with fault tolerance in Apex should
> > > minimize
> > > > > the need
> > > > >     > > for
> > > > >     > >     > this feature although I am sure one can find use
> cases.
> > > > >     > >     >
> > > > >     > >     >
> > > > >     > >     > On 11/29/16, 10:41 PM, "Amol Kekre" <
> > > amol@datatorrent.com>
> > > > > wrote:
> > > > >     > >     >
> > > > >     > >     >     We do have this feature in Yarn, but that applies
> > to
> > > > all
> > > > >     > > applications.
> > > > >     > >     > I am
> > > > >     > >     >     not sure if Yarn has anti-affinity. This feature
> > may
> > > be
> > > > > used,
> > > > >     > > but in
> > > > >     > >     >     general there is danger is an application taking
> > over
> > > > > resource
> > > > >     > >     > allocation.
> > > > >     > >     >     Another quirk is that big data apps should
> ideally
> > be
> > > > >     > > node-neutral.
> > > > >     > >     > This is
> > > > >     > >     >     a good idea, if we are able to carve out
> something
> > > > where
> > > > > need
> > > > >     > is
> > > > >     > > app
> > > > >     > >     >     specific.
> > > > >     > >     >
> > > > >     > >     >     Thks
> > > > >     > >     >     Amol
> > > > >     > >     >
> > > > >     > >     >
> > > > >     > >     >     On Tue, Nov 29, 2016 at 10:00 PM, Milind Barve <
> > > > >     > > milindb@gmail.com>
> > > > >     > >     > wrote:
> > > > >     > >     >
> > > > >     > >     >     > We have seen 2 cases mentioned below, where, it
> > > would
> > > > > have
> > > > >     > > been nice
> > > > >     > >     > if
> > > > >     > >     >     > Apex allowed us to exclude a node from the
> > cluster
> > > > for
> > > > > an
> > > > >     > >     > application.
> > > > >     > >     >     >
> > > > >     > >     >     > 1. A node in the cluster had gone bad (was
> > randomly
> > > > >     > rebooting)
> > > > >     > > and
> > > > >     > >     > so an
> > > > >     > >     >     > Apex app should not use it - other apps can use
> > it
> > > as
> > > > > they
> > > > >     > were
> > > > >     > >     > batch jobs.
> > > > >     > >     >     > 2. A node is being used for a mission critical
> > app
> > > > > (Could be
> > > > >     > > an Apex
> > > > >     > >     > app
> > > > >     > >     >     > itself), but another Apex app which is mission
> > > > critical
> > > > >     > should
> > > > >     > > not
> > > > >     > >     > be using
> > > > >     > >     >     > resources on that node.
> > > > >     > >     >     >
> > > > >     > >     >     > Can we have a way in which, Stram and YARN can
> > > > > coordinate
> > > > >     > > between
> > > > >     > >     > each
> > > > >     > >     >     > other to not use a set of nodes for the
> > > application.
> > > > > It an be
> > > > >     > > done
> > > > >     > >     > in 2 way
> > > > >     > >     >     > s-
> > > > >     > >     >     >
> > > > >     > >     >     > 1. Have a list of "exclude" nodes with Stram-
> > when
> > > > YARN
> > > > >     > > allcates
> > > > >     > >     > resources
> > > > >     > >     >     > on either of these, STRAM rejects and gets
> > > resources
> > > > >     > allocated
> > > > >     > > again
> > > > >     > >     > frm
> > > > >     > >     >     > YARN
> > > > >     > >     >     > 2. Have a list of nodes that can be used for an
> > > app -
> > > > > This
> > > > >     > can
> > > > >     > > be a
> > > > >     > >     > part of
> > > > >     > >     >     > config. Hwever, I don't think this would be a
> > right
> > > > > way to do
> > > > >     > > so as
> > > > >     > >     > we will
> > > > >     > >     >     > need support from YARN as well. Further, this
> > might
> > > > be
> > > > >     > > difficult to
> > > > >     > >     > change
> > > > >     > >     >     > at runtim if need be.
> > > > >     > >     >     >
> > > > >     > >     >     > Any thoughts?
> > > > >     > >     >     >
> > > > >     > >     >     >
> > > > >     > >     >     > --
> > > > >     > >     >     > ~Milind bee at gee mail dot com
> > > > >     > >     >     >
> > > > >     > >     >
> > > > >     > >     >
> > > > >     > >     >
> > > > >     > >     >
> > > > >     > >
> > > > >     > >
> > > > >     > >
> > > > >     > >
> > > > >     >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: "ExcludeNodes" for an Apex application

Posted by Pramod Immaneni <pr...@datatorrent.com>.
I see that host locality is available as a DAG attribute for individual
operators. If affinity doesn't support this today, we could probably add
it. You could also make setting a blacklist directly a convenience function
on top of affinity.
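
As a sketch of what such a convenience function might look like, assuming
affinity rules were extended to name a host: the Java below is illustrative
only. BlacklistHelper, HostAntiAffinityRule and blacklist are hypothetical
names, not part of the Apex API. The helper expands a list of blacklisted
hosts into one host-level anti-affinity rule per host, each covering all
operators.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not the Apex API.
final class BlacklistHelper {
    // Assumed rule shape: "these operators must not run on this host".
    static final class HostAntiAffinityRule {
        final String host;
        final List<String> operators;

        HostAntiAffinityRule(String host, List<String> operators) {
            this.host = host;
            this.operators = new ArrayList<>(operators); // defensive copy
        }
    }

    // Convenience function: one rule per blacklisted host, covering every operator.
    static List<HostAntiAffinityRule> blacklist(List<String> hosts, List<String> allOperators) {
        List<HostAntiAffinityRule> rules = new ArrayList<>();
        for (String host : hosts) {
            rules.add(new HostAntiAffinityRule(host, allOperators));
        }
        return rules;
    }

    public static void main(String[] args) {
        List<HostAntiAffinityRule> rules = blacklist(
                Arrays.asList("node20", "node21"),
                Arrays.asList("input", "parser", "output"));
        for (HostAntiAffinityRule rule : rules) {
            System.out.println("keep " + rule.operators + " off " + rule.host);
        }
    }
}
```

A user would then supply only the list of hosts, and the rule expansion
would stay an internal detail of the affinity machinery.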

On Thu, Dec 1, 2016 at 11:58 AM, Sandesh Hegde <sa...@datatorrent.com>
wrote:

> Pramod,
>
> How to specify,  "don't deploy any operators on Node20" using
> anti-affinity?
>
> I don't see any examples here,
> http://apex.apache.org/docs/apex/application_development/#affinity-rules
>
>
> On Thu, Dec 1, 2016 at 11:31 AM Pramod Immaneni <pr...@datatorrent.com>
> wrote:
>
> > Shouldn't this be already covered by anti-affinity. Today users can
> specify
> > multiple affinity rules, for each rule they can specify positive or
> > negative affinity, locality and operator selection. If an affinity rule
> > specifying negative affinity, node locality and all operators, does not
> > work then let's fix that scenario instead of creating a new option.
> >
> > On Thu, Dec 1, 2016 at 11:17 AM, Sandesh Hegde <sa...@datatorrent.com>
> > wrote:
> >
> > > I have created a jira, for adding the list of blacklisted nodes,
> > > https://issues.apache.org/jira/browse/APEXCORE-584
> > >
> > > On Wed, Nov 30, 2016 at 11:06 PM Sanjay Pujare <sanjay@datatorrent.com
> >
> > > wrote:
> > >
> > > > Yes, Ram explained to me that in practice this would be a useful
> > feature
> > > > for Apex devops who typically have no control over Hadoop/Yarn
> cluster.
> > > >
> > > > On 11/30/16, 9:22 PM, "Mohit Jotwani" <mo...@datatorrent.com> wrote:
> > > >
> > > >     This is a practical scenario where developers would be required
> to
> > > > exclude
> > > >     certain nodes as they might be required for some mission critical
> > > >     applications. It would be good to have this feature.
> > > >
> > > >     I understand that Stram should not get into resourcing and still
> > rely
> > > > on
> > > >     Yarn, however, as the App Master it should have the right to
> reject
> > > the
> > > >     nodes offered by Yarn and request for other resources.
> > > >
> > > >     Regards,
> > > >     Mohit
> > > >
> > > >     On Thu, Dec 1, 2016 at 2:34 AM, Sandesh Hegde <
> > > sandesh@datatorrent.com
> > > > >
> > > >     wrote:
> > > >
> > > >     > Apex has automatic blacklisting of the troublesome nodes,
> please
> > > > take a
> > > >     > look at the following attributes,
> > > >     >
> > > >     > MAX_CONSECUTIVE_CONTAINER_FAILURES_FOR_BLACKLIST
> > > >     > https://www.datatorrent.com/docs/apidocs/com/datatorrent/
> > > >     > api/Context.DAGContext.html#MAX_CONSECUTIVE_CONTAINER_
> > > >     > FAILURES_FOR_BLACKLIST
> > > >     >
> > > >     > BLACKLISTED_NODE_REMOVAL_TIME_MILLIS
> > > >     >
> > > >     > Thanks
> > > >     >
> > > >     >
> > > >     >
> > > >     > On Wed, Nov 30, 2016 at 12:56 PM Munagala Ramanath <
> > > > ram@datatorrent.com>
> > > >     > wrote:
> > > >     >
> > > >     > Not sure if this is what Milind had in mind but we often run
> into
> > > >     > situations where the dev group
> > > >     > working with Apex has no control over cluster configuration --
> to
> > > > make any
> > > >     > changes to the cluster they need to
> > > >     > go through an elaborate process that can take many days.
> > > >     >
> > > >     > Meanwhile, if they notice that a particular node is
> consistently
> > > > causing
> > > >     > problems for their
> > > >     > app, having a simple way to exclude it would be very helpful
> > since
> > > > it gives
> > > >     > them a way
> > > >     > to bypass communication and process issues within their own
> > > > organization.
> > > >     >
> > > >     > Ram
> > > >     >
> > > >     > On Wed, Nov 30, 2016 at 10:58 AM, Sanjay Pujare <
> > > > sanjay@datatorrent.com>
> > > >     > wrote:
> > > >     >
> > > >     > > To me both use cases appear to be generic resource management
> > use
> > > > cases.
> > > >     > > For example, a randomly rebooting node is not good for any
> > > purpose
> > > > esp.
> > > >     > > long running apps so it is a bit of a stretch to imagine that
> > > > these nodes
> > > >     > > will be acceptable for some batch jobs in Yarn. So such a
> node
> > > > should be
> > > >     > > marked “Bad” or Unavailable in Yarn itself.
> > > >     > >
> > > >     > > Second use case is also typical anti-affinity use case which
> > > > ideally
> > > >     > > should be implemented in Yarn – Milind’s example can also
> apply
> > > to
> > > >     > non-Apex
> > > >     > > batch jobs. In any case it looks like Yarn still doesn’t have
> > it
> > > (
> > > >     > > https://issues.apache.org/jira/browse/YARN-1042) so if Apex
> > > needs
> > > > it we
> > > >     > > will need to do it ourselves.
> > > >     > >
> > > >     > > On 11/30/16, 10:39 AM, "Munagala Ramanath" <
> > ram@datatorrent.com>
> > > > wrote:
> > > >     > >
> > > >     > >     But then, what's the solution to the 2 problem scenarios
> > that
> > > > Milind
> > > >     > >     describes ?
> > > >     > >
> > > >     > >     Ram
> > > >     > >
> > > >     > >     On Wed, Nov 30, 2016 at 10:34 AM, Sanjay Pujare <
> > > >     > > sanjay@datatorrent.com>
> > > >     > >     wrote:
> > > >     > >
> > > >     > >     > I think “exclude nodes” and such is really the job of
> the
> > > > resource
> > > >     > > manager
> > > >     > >     > i.e. Yarn. So I am not sure taking over some of these
> > tasks
> > > > in Apex
> > > >     > > would
> > > >     > >     > be very useful.
> > > >     > >     >
> > > >     > >     > I agree with Amol that apps should be node neutral.
> > > Resource
> > > >     > > management in
> > > >     > >     > Yarn together with fault tolerance in Apex should
> > minimize
> > > > the need
> > > >     > > for
> > > >     > >     > this feature although I am sure one can find use cases.
> > > >     > >     >
> > > >     > >     >
> > > >     > >     > On 11/29/16, 10:41 PM, "Amol Kekre" <
> > amol@datatorrent.com>
> > > > wrote:
> > > >     > >     >
> > > >     > >     >     We do have this feature in Yarn, but that applies
> to
> > > all
> > > >     > > applications.
> > > >     > >     > I am
> > > >     > >     >     not sure if Yarn has anti-affinity. This feature
> may
> > be
> > > > used,
> > > >     > > but in
> > > >     > >     >     general there is danger is an application taking
> over
> > > > resource
> > > >     > >     > allocation.
> > > >     > >     >     Another quirk is that big data apps should ideally
> be
> > > >     > > node-neutral.
> > > >     > >     > This is
> > > >     > >     >     a good idea, if we are able to carve out something
> > > where
> > > > need
> > > >     > is
> > > >     > > app
> > > >     > >     >     specific.
> > > >     > >     >
> > > >     > >     >     Thks
> > > >     > >     >     Amol
> > > >     > >     >
> > > >     > >     >
> > > >     > >     >     On Tue, Nov 29, 2016 at 10:00 PM, Milind Barve <
> > > >     > > milindb@gmail.com>
> > > >     > >     > wrote:
> > > >     > >     >
> > > >     > >     >     > We have seen 2 cases mentioned below, where, it
> > would
> > > > have
> > > >     > > been nice
> > > >     > >     > if
> > > >     > >     >     > Apex allowed us to exclude a node from the
> cluster
> > > for
> > > > an
> > > >     > >     > application.
> > > >     > >     >     >
> > > >     > >     >     > 1. A node in the cluster had gone bad (was
> randomly
> > > >     > rebooting)
> > > >     > > and
> > > >     > >     > so an
> > > >     > >     >     > Apex app should not use it - other apps can use
> it
> > as
> > > > they
> > > >     > were
> > > >     > >     > batch jobs.
> > > >     > >     >     > 2. A node is being used for a mission critical
> app
> > > > (Could be
> > > >     > > an Apex
> > > >     > >     > app
> > > >     > >     >     > itself), but another Apex app which is mission
> > > critical
> > > >     > should
> > > >     > > not
> > > >     > >     > be using
> > > >     > >     >     > resources on that node.
> > > >     > >     >     >
> > > >     > >     >     > Can we have a way in which, Stram and YARN can
> > > > coordinate
> > > >     > > between
> > > >     > >     > each
> > > >     > >     >     > other to not use a set of nodes for the
> > application.
> > > > It an be
> > > >     > > done
> > > >     > >     > in 2 way
> > > >     > >     >     > s-
> > > >     > >     >     >
> > > >     > >     >     > 1. Have a list of "exclude" nodes with Stram-
> when
> > > YARN
> > > >     > > allcates
> > > >     > >     > resources
> > > >     > >     >     > on either of these, STRAM rejects and gets
> > resources
> > > >     > allocated
> > > >     > > again
> > > >     > >     > frm
> > > >     > >     >     > YARN
> > > >     > >     >     > 2. Have a list of nodes that can be used for an
> > app -
> > > > This
> > > >     > can
> > > >     > > be a
> > > >     > >     > part of
> > > >     > >     >     > config. Hwever, I don't think this would be a
> right
> > > > way to do
> > > >     > > so as
> > > >     > >     > we will
> > > >     > >     >     > need support from YARN as well. Further, this
> might
> > > be
> > > >     > > difficult to
> > > >     > >     > change
> > > >     > >     >     > at runtim if need be.
> > > >     > >     >     >
> > > >     > >     >     > Any thoughts?
> > > >     > >     >     >
> > > >     > >     >     >
> > > >     > >     >     > --
> > > >     > >     >     > ~Milind bee at gee mail dot com
> > > >     > >     >     >
> > > >     > >     >
> > > >     > >     >
> > > >     > >     >
> > > >     > >     >
> > > >     > >
> > > >     > >
> > > >     > >
> > > >     > >
> > > >     >
> > > >
> > > >
> > > >
> > > >
> > >
> >
>

Re: "ExcludeNodes" for an Apex application

Posted by Sandesh Hegde <sa...@datatorrent.com>.
Pramod,

How do I specify "don't deploy any operators on Node20" using anti-affinity?

I don't see any examples of this here:
http://apex.apache.org/docs/apex/application_development/#affinity-rules


On Thu, Dec 1, 2016 at 11:31 AM Pramod Immaneni <pr...@datatorrent.com>
wrote:

> Shouldn't this be already covered by anti-affinity. Today users can specify
> multiple affinity rules, for each rule they can specify positive or
> negative affinity, locality and operator selection. If an affinity rule
> specifying negative affinity, node locality and all operators, does not
> work then let's fix that scenario instead of creating a new option.

Re: "ExcludeNodes" for an Apex application

Posted by Pramod Immaneni <pr...@datatorrent.com>.
Shouldn't this already be covered by anti-affinity? Today users can specify
multiple affinity rules; for each rule they can specify positive or
negative affinity, locality, and operator selection. If an affinity rule
specifying negative affinity, node locality, and all operators does not
work, then let's fix that scenario instead of creating a new option.
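
For illustration, here is a minimal sketch of what a node-local rule over a
set of operators constrains. It is a simplified model and not the Apex API:
the names AffinitySketch, Rule and satisfied are made up for this example,
and only node locality is modeled.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model for illustration only; not the Apex affinity API.
final class AffinitySketch {
  enum Type { AFFINITY, ANTI_AFFINITY }

  /** A node-local rule over a set of operator names. */
  static final class Rule {
    final Type type;
    final Set<String> operators;

    Rule(Type type, Set<String> operators) {
      this.type = type;
      this.operators = new HashSet<>(operators);
    }
  }

  /**
   * Checks a placement (operator name -> host name) against a rule.
   * AFFINITY: all placed operators share one host.
   * ANTI_AFFINITY: no two placed operators share a host.
   */
  static boolean satisfied(Rule rule, Map<String, String> placement) {
    Set<String> hosts = new HashSet<>();
    int placed = 0;
    for (String op : rule.operators) {
      String host = placement.get(op);
      if (host == null) {
        continue; // operator not deployed yet, nothing to check
      }
      placed++;
      hosts.add(host);
    }
    if (rule.type == Type.AFFINITY) {
      return hosts.size() <= 1;
    }
    // Anti-affinity holds when every placed operator has its own host.
    return hosts.size() == placed;
  }
}
```

Note that in this model a rule only relates operators to each other and
carries no host name.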

On Thu, Dec 1, 2016 at 11:17 AM, Sandesh Hegde <sa...@datatorrent.com>
wrote:

> I have created a jira, for adding the list of blacklisted nodes,
> https://issues.apache.org/jira/browse/APEXCORE-584

Re: "ExcludeNodes" for an Apex application

Posted by Sandesh Hegde <sa...@datatorrent.com>.
I have created a JIRA for adding a list of blacklisted nodes:
https://issues.apache.org/jira/browse/APEXCORE-584
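
As a rough sketch of the idea discussed in this thread (the app master keeps
an exclude list, rejects containers allocated on those hosts, and asks for
replacements), and assuming nothing about how APEXCORE-584 will actually be
implemented: the names ExcludeListSketch, Decision and review below are
hypothetical, and hosts are plain strings rather than YARN container objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these names come from Apex or YARN.
final class ExcludeListSketch {

  /** Outcome of reviewing one batch of allocated containers. */
  static final class Decision {
    final List<String> accepted = new ArrayList<>(); // hosts we keep
    final List<String> rejected = new ArrayList<>(); // hosts we hand back

    /** Number of replacement containers to request again. */
    int toReRequest() {
      return rejected.size();
    }
  }

  private final Set<String> excludedHosts;

  ExcludeListSketch(Set<String> excludedHosts) {
    this.excludedHosts = new HashSet<>(excludedHosts);
  }

  /** Splits the allocated hosts into accepted and rejected ones. */
  Decision review(List<String> allocatedHosts) {
    Decision decision = new Decision();
    for (String host : allocatedHosts) {
      if (excludedHosts.contains(host)) {
        decision.rejected.add(host);
      } else {
        decision.accepted.add(host);
      }
    }
    return decision;
  }

  public static void main(String[] args) {
    ExcludeListSketch sketch =
        new ExcludeListSketch(new HashSet<>(Arrays.asList("node20")));
    Decision d = sketch.review(Arrays.asList("node1", "node20", "node3"));
    System.out.println("accepted=" + d.accepted
        + " rejected=" + d.rejected
        + " reRequest=" + d.toReRequest());
  }
}
```

In a real app master the rejected containers would have to be released and
new ones requested, which this sketch only counts.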

On Wed, Nov 30, 2016 at 11:06 PM Sanjay Pujare <sa...@datatorrent.com>
wrote:

> Yes, Ram explained to me that in practice this would be a useful feature
> for Apex devops who typically have no control over Hadoop/Yarn cluster.

Re: "ExcludeNodes" for an Apex application

Posted by Sanjay Pujare <sa...@datatorrent.com>.
Yes, Ram explained to me that in practice this would be a useful feature for Apex devops, who typically have no control over the Hadoop/Yarn cluster.

On 11/30/16, 9:22 PM, "Mohit Jotwani" <mo...@datatorrent.com> wrote:

    This is a practical scenario where developers would be required to exclude
    certain nodes as they might be required for some mission critical
    applications. It would be good to have this feature.
    
    I understand that Stram should not get into resourcing and still rely on
    Yarn, however, as the App Master it should have the right to reject the
    nodes offered by Yarn and request for other resources.
    
    Regards,
    Mohit
    
    On Thu, Dec 1, 2016 at 2:34 AM, Sandesh Hegde <sa...@datatorrent.com>
    wrote:
    
    > Apex has automatic blacklisting of the troublesome nodes, please take a
    > look at the following attributes,
    >
    > MAX_CONSECUTIVE_CONTAINER_FAILURES_FOR_BLACKLIST
    > https://www.datatorrent.com/docs/apidocs/com/datatorrent/
    > api/Context.DAGContext.html#MAX_CONSECUTIVE_CONTAINER_
    > FAILURES_FOR_BLACKLIST
    >
    > BLACKLISTED_NODE_REMOVAL_TIME_MILLIS
    >
    > Thanks
    >
    >
    >
    > On Wed, Nov 30, 2016 at 12:56 PM Munagala Ramanath <ra...@datatorrent.com>
    > wrote:
    >
    > Not sure if this is what Milind had in mind but we often run into
    > situations where the dev group
    > working with Apex has no control over cluster configuration -- to make any
    > changes to the cluster they need to
    > go through an elaborate process that can take many days.
    >
    > Meanwhile, if they notice that a particular node is consistently causing
    > problems for their
    > app, having a simple way to exclude it would be very helpful since it gives
    > them a way
    > to bypass communication and process issues within their own organization.
    >
    > Ram
    >
    > On Wed, Nov 30, 2016 at 10:58 AM, Sanjay Pujare <sa...@datatorrent.com>
    > wrote:
    >
    > > To me both use cases appear to be generic resource management use cases.
    > > For example, a randomly rebooting node is not good for any purpose esp.
    > > long running apps so it is a bit of a stretch to imagine that these nodes
    > > will be acceptable for some batch jobs in Yarn. So such a node should be
    > > marked “Bad” or Unavailable in Yarn itself.
    > >
    > > The second use case is also a typical anti-affinity use case, which
    > > ideally should be implemented in Yarn – Milind’s example can also
    > > apply to non-Apex batch jobs. In any case it looks like Yarn still
    > > doesn’t have it (https://issues.apache.org/jira/browse/YARN-1042) so
    > > if Apex needs it we will need to do it ourselves.
    > >
    > > On 11/30/16, 10:39 AM, "Munagala Ramanath" <ra...@datatorrent.com> wrote:
    > >
    > >     But then, what's the solution to the 2 problem scenarios that
    > >     Milind describes?
    > >
    > >     Ram
    > >
    > >     On Wed, Nov 30, 2016 at 10:34 AM, Sanjay Pujare
    > >     <sanjay@datatorrent.com> wrote:
    > >
    > >     > I think “exclude nodes” and such is really the job of the
    > >     > resource manager, i.e. Yarn. So I am not sure taking over some
    > >     > of these tasks in Apex would be very useful.
    > >     >
    > >     > I agree with Amol that apps should be node neutral. Resource
    > >     > management in Yarn together with fault tolerance in Apex should
    > >     > minimize the need for this feature, although I am sure one can
    > >     > find use cases.
    > >     >
    > >     >
    > >     > On 11/29/16, 10:41 PM, "Amol Kekre" <am...@datatorrent.com> wrote:
    > >     >
    > >     >     We do have this feature in Yarn, but that applies to all
    > >     >     applications. I am not sure if Yarn has anti-affinity. This
    > >     >     feature may be used, but in general there is danger in an
    > >     >     application taking over resource allocation. Another quirk
    > >     >     is that big data apps should ideally be node-neutral. This
    > >     >     is a good idea, if we are able to carve out something where
    > >     >     the need is app-specific.
    > >     >
    > >     >     Thks
    > >     >     Amol
    > >     >
    > >     >
    > >     >     On Tue, Nov 29, 2016 at 10:00 PM, Milind Barve
    > >     >     <milindb@gmail.com> wrote:
    > >     >
    > >     >     > We have seen 2 cases, mentioned below, where it would have
    > >     >     > been nice if Apex allowed us to exclude a node from the
    > >     >     > cluster for an application.
    > >     >     >
    > >     >     > 1. A node in the cluster had gone bad (was randomly
    > >     >     > rebooting) and so an Apex app should not use it - other
    > >     >     > apps can use it as they were batch jobs.
    > >     >     > 2. A node is being used for a mission critical app (could
    > >     >     > be an Apex app itself), but another Apex app which is
    > >     >     > mission critical should not be using resources on that
    > >     >     > node.
    > >     >     >
    > >     >     > Can we have a way in which Stram and YARN can coordinate
    > >     >     > with each other to not use a set of nodes for the
    > >     >     > application? It can be done in 2 ways:
    > >     >     >
    > >     >     > 1. Have a list of "exclude" nodes with Stram - when YARN
    > >     >     > allocates resources on any of these, Stram rejects them
    > >     >     > and gets resources allocated again from YARN.
    > >     >     > 2. Have a list of nodes that can be used for an app - this
    > >     >     > can be a part of config. However, I don't think this would
    > >     >     > be the right way to do so as we will need support from
    > >     >     > YARN as well. Further, this might be difficult to change
    > >     >     > at runtime if need be.
    > >     >     >
    > >     >     > Any thoughts?
    > >     >     >
    > >     >     >
    > >     >     > --
    > >     >     > ~Milind bee at gee mail dot com
    > >     >     >
    > >     >
    > >     >
    > >     >
    > >     >
    > >
    > >
    > >
    > >
    >