
Posted to dev@ambari.apache.org by Bhuvnesh Chaudhary <bc...@pivotal.io> on 2016/03/12 00:24:08 UTC

Blueprints - RCO - Related question.

Hello Sebastian, Alejandro, Andrew,

Referring to the discussion on RB: https://reviews.apache.org/r/43948
<https://reviews.apache.org/r/43948/#review120537>, it appears that while
deploying clusters using Blueprints, RCO is not honored. Please confirm if
this understanding is correct.

While running internal test suites for HAWQ, we deploy the clusters using
BP, and we need a specific order in which the HAWQ components must be
initialized / started.

"HAWQ Standby" component should be initialized after "HAWQ Master"
component as it has to copy the contents from HAWQ Master. However, since
RCO is not honored, we often run into failures because HAWQ Standby is
started / initialized before HAWQ Master.

Could you please let us know if there is any work already going on to
bring RCO dependencies to Blueprints? If not, is there any other
alternative which can be used to enforce the dependency locally, or
something else which you suggest?

Thanks,
Bhuvnesh Chaudhary
Email: bchaudhary@pivotal.io
Desk: +1-650-846-1696 | Mobile: +1-973-906-6976

Re: Blueprints - RCO - Related question.

Posted by Eric Yang <er...@gmail.com>.

I'm fine with a flag, but I would prefer RCO as the default, since the
default behavior changed only in the last 6 months.  It would be better
to restore the v1 behavior.

regards,
Eric

Re: Blueprints - RCO - Related question.

Posted by Bhuvnesh Chaudhary <bc...@pivotal.io>.

I have created a placeholder JIRA documenting the feature and if we all
agree let's do it.
https://issues.apache.org/jira/browse/AMBARI-15417

Thanks,
Bhuvnesh Chaudhary
Email: bchaudhary@pivotal.io
Desk: +1-650-846-1696 | Mobile: +1-973-906-6976

Re: Blueprints - RCO - Related question.

Posted by Alejandro Fernandez <af...@hortonworks.com>.

I agree configuring this with a flag is ideal.

Thanks,
Alejandro

On Mon, Mar 14, 2016 at 6:38 AM, Robert Nettleton <
rnettleton@hortonworks.com> wrote:

> Hi Bhuvnesh,
>
> You are correct.  The Blueprints deployment mechanism in Ambari no longer
> relies on Role-command ordering to install or start components across the
> cluster.
>
> This change to Blueprints was actually implemented in Ambari 2.1.0, so it
> has been around for several releases now.  The new approach was implemented
> to improve the performance times of cluster deployments, and provide better
> support for dynamic scaling of clusters.
>
> That being said, the new deployment mechanism does indeed remove the
> guarantee of ordering, which can potentially cause some problems for
> certain types of clusters.  There were also changes implemented on the
> Ambari Agent side to mitigate this problem of ordering.  The ambari-agent
> will now retry INSTALL and START operations if those operations happen to
> fail.  The START operation is probably the most relevant in your case, and
> is also the operation that does show the ordering issues you've mentioned
> in some deployments.
>
> The idea is that the ambari-agent retries should help to resolve any
> issues with services starting in an unexpected order.
>
> This ambari-agent feature is on by default, but can be configured in a
> more fine-grained fashion by setting some properties in "cluster-env" in
> your Blueprint or Cluster Creation Template.
>
> Unfortunately, this is not documented very well, but the three properties
> in question are set by default in the BlueprintConfigurationProcessor in
> the following method:
>
>
> org.apache.ambari.server.controller.internal.BlueprintConfigurationProcessor#setRetryConfiguration
>
> The properties set in this method allow control over the types of
> operations that are retried, the max number of retries attempted, and the
> maximum amount of time that the agent should attempt a retry.
>
> We've seen many clusters using this new approach, and have not run into
> that many problems with respect to ordering.
>
> One possible problem we've seen is in a small number of components that
> launch services as a background command.  In that case, the ambari-agent
> cannot detect that a retry is required, and so cannot attempt a restart of
> a failed service.  This problem can usually be resolved with
> component-specific retries.
>
> I don't know much about the HAWQ component, but I would expect that
> customizing the retry settings may help this problem.  Do the HAWQ
> components implement retry attempts when booting up?
>
> Hope this helps.
>
> Thanks,
> Bob
>
>
>
>
> On Mar 11, 2016, at 7:18 PM, Alejandro Fernandez <
> afernandez@hortonworks.com> wrote:
>
> > +others who have more insight into BluePrints
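
For reference, a sketch of how the retry settings Bob describes might be
supplied through the "cluster-env" section of a Blueprint. The property
names (command_retry_enabled, commands_to_retry,
command_retry_max_time_in_sec) are given from memory and should be treated
as assumptions to verify against setRetryConfiguration; the helper
retry_cluster_env is hypothetical:

```python
import json


def retry_cluster_env(commands=("INSTALL", "START"), max_time_sec=600, enabled=True):
    """Build a cluster-env entry for a Blueprint's "configurations" list.

    The property names are assumed, not confirmed by this thread; check
    BlueprintConfigurationProcessor#setRetryConfiguration before relying on them.
    """
    return {
        "cluster-env": {
            "properties": {
                # Whether the agent retries failed commands at all.
                "command_retry_enabled": "true" if enabled else "false",
                # Comma-separated list of command types to retry.
                "commands_to_retry": ",".join(commands),
                # Upper bound, in seconds, on how long the agent keeps retrying.
                "command_retry_max_time_in_sec": str(max_time_sec),
            }
        }
    }


if __name__ == "__main__":
    # Example: retry only START, for up to 20 minutes.
    fragment = {"configurations": [retry_cluster_env(("START",), 1200)]}
    print(json.dumps(fragment, indent=2))
```

The fragment would be merged into the "configurations" array of the
Blueprint or the Cluster Creation Template.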

Re: Blueprints - RCO - Related question.

Posted by Bhuvnesh Chaudhary <bc...@pivotal.io>.

Thank you very much Robert for the detailed explanation. It helps
to understand the background.

Regarding HAWQ capitalizing on retry: we can potentially tweak the
scripts to verify whether HAWQ has already been initialized, and change
the way init is done so that it can make use of retry.
Currently a retry is attempted, but init has certain prerequisite checks
which fail after the first failed install attempt, so the retry is not
successful either.
I will have to investigate this.
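
A rough sketch of what a retry-friendly init could look like, assuming a
marker file records a completed init and that a leftover data directory
without the marker means an earlier attempt failed. The names INIT_MARKER,
init_component and do_init are hypothetical and are not taken from the HAWQ
service scripts:

```python
import shutil
from pathlib import Path

INIT_MARKER = ".init_complete"  # hypothetical marker written after a successful init


def init_component(data_dir: Path, do_init) -> str:
    """Run do_init(data_dir) so that a retry after a failure can succeed."""
    marker = data_dir / INIT_MARKER
    if marker.exists():
        # A previous attempt finished; nothing to do on retry.
        return "already-initialized"
    if data_dir.exists():
        # Leftovers from a failed attempt would trip the prerequisite
        # checks, so remove them before trying again.
        shutil.rmtree(data_dir)
        outcome = "cleaned-and-initialized"
    else:
        outcome = "initialized"
    data_dir.mkdir(parents=True)
    do_init(data_dir)  # may raise; the marker is then never written
    marker.touch()
    return outcome
```

The point of the design is that the cleanup lives inside the init step
itself, so the agent's retry needs no knowledge of the component.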

Regarding alternatives:
Was the option of a flag in blueprints to enable / disable RCO
considered? Say, use_rco is true by default, and anyone who wants
different behavior can override it in the blueprint.

As Eric noted in his email, in some cases retries can also increase the
amount of time required, due to:
1) the number of retries before the operation completes successfully or
fails completely;
2) cleanup steps which a service may require before a retry (as HAWQ
currently does); services must incorporate that logic.

Also with RCO, the sequence of startup is predictable and all the
dependencies will be met.
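
To illustrate the point about predictability: if the RCO entries are
treated as a dependency graph, any topological order of that graph starts
each component after the ones it depends on. A minimal sketch, assuming
RCO-style entries of the form command -> prerequisites; the entries below
are illustrative and not copied from any stack's role_command_order.json:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Illustrative RCO-style map: each command lists the commands that must
# complete before it may run.
RCO = {
    "HAWQSTANDBY-START": ["HAWQMASTER-START"],
    "HAWQMASTER-START": ["NAMENODE-START", "DATANODE-START"],
    "DATANODE-START": ["NAMENODE-START"],
}


def start_order(deps):
    """Return the commands in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for command in start_order(RCO):
        print(command)
```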

So making the use of RCO configurable in blueprints would probably
satisfy both camps: those who want to use RCO and those who do not.
Your thoughts?




Thanks,
Bhuvnesh Chaudhary
Email: bchaudhary@pivotal.io
Desk: +1-650-846-1696 | Mobile: +1-973-906-6976

Re: Blueprints - RCO - Related question.

Posted by Eric Yang <er...@gmail.com>.

We have a use case where a service depends on Sqoop, Hive Metastore, HBase
Client, and Hadoop Client on a worker node.  We found that Hadoop Client is
sometimes not yet installed when our service installation has already
started.  This looks like a big problem for our use case.  Is there a way
to keep RCO by using a flag?  Parallel install with retries is the Chef and
Puppet approach to configuring distributed, loosely coupled services that
have no tight relationship between nodes.  It doesn't solve the problem of
virtual services where a component depends on the availability of other
services.  We have been scratching our heads over this since August of last
year.  It is good to know the cause so we can work out the kinks.

Suppose the component is also so large that it takes 60 minutes to download
and install.  We can bump up the retries for Hadoop Client to a very large
number, but does this mean that while the large component is retrying, the
Hadoop clients may be installed in parallel, so that the second attempt of
the large component could succeed?  In this use case the new optimization
doesn't seem to improve installation time, because Ambari frequently needs
120 minutes to get through a second installation attempt.

regards,
Eric

On Mon, Mar 14, 2016 at 6:38 AM, Robert Nettleton <
rnettleton@hortonworks.com> wrote:

> Hi Bhuvnesh,
>
> You are correct.  The Blueprints deployment mechanism in Ambari no longer
> relies on Role-command ordering to install or start components across the
> cluster.
>
> This change to Blueprints was actually implemented in Ambari 2.1.0, so it
> has been around for several releases now.  The new approach was implemented
> to improve the performance times of cluster deployments, and provide better
> support for dynamic scaling of clusters.
>
> That being said, the new deployment mechanism does indeed remove the
> guarantee of ordering, which can potentially cause some problems for
> certain types of clusters.  There were also changes implemented on the
> Ambari Agent side to mitigate this problem or ordering.  The ambari-agent
> will now retry INSTALL and START operations if those operations happen to
> fail.  The START operation is probably the most relevant in your case, and
> is also the operation that does show the ordering issues you’ve mentioned
> in some deployments.
>
> The idea is that the ambari-agent retries should help to resolve any
> issues with services starting in an unexpected order.
>
> This ambari-agent feature is on by default, but can be configured in a
> more fine-grained fashion by setting some properties in “cluster-env” in
> your Blueprint or Cluster Creation Template.
>
> Unfortunately, this is not documented very well, but the three properties
> in question are set by default in the BlueprintConfigurationProcessor in
> the following method:
>
>
> org.apache.ambari.server.controller.internal.BlueprintConfigurationProcessor#setRetryConfiguration
>
> The properties set in this method allow control over the types of
> operations that are retried, the max number of retries attempted, and the
> maximum amount of time that the agent should attempt a retry.
>
> We’ve seen many clusters using this new approach, and have not run into
> that many problems with respect to ordering.
>
> One possible problem we’ve seen is in a small number of components that
> launch services as a background command.  In that case, the ambari-agent
> cannot detect that a retry is required, and so cannot attempt a restart of
> a failed service.  This problem can usually be resolved with
> component-specific retries.
>
> I don’t know much about the HAWQ component, but I would expect that
> customizing the retry settings may help this problem.  Do the HAWQ
> components implement retry attempts when booting up?
>
> Hope this helps.
>
> Thanks,
> Bob
>
>
>
>
> On Mar 11, 2016, at 7:18 PM, Alejandro Fernandez <
> afernandez@hortonworks.com> wrote:
>
> > +others who have more insight into BluePrints
> >
> > On 3/11/16, 3:24 PM, "Bhuvnesh Chaudhary" <bc...@pivotal.io> wrote:
> >
> >> Hello Sebastian, Alejandro, Andrew,
> >>
> >> Referring to the discussion on RB: https://reviews.apache.org/r/43948
> >> <https://reviews.apache.org/r/43948/#review120537>, it appears that while
> >> deploying clusters using Blueprints, RCO is not honored. Please confirm if
> >> this understanding is correct.
> >>
> >> While running internal test suites for HAWQ, we deploy the clusters using
> >> BP, and we need a specific order in which the HAWQ components must be
> >> initialized / started.
> >>
> >> "HAWQ Standby" component should be initialized after "HAWQ Master"
> >> component as it has to copy the contents from HAWQ Master. However, since
> >> RCO is not honored, we often come across issues as HAWQ Standby start /
> >> initialization before HAWQ Master.
> >>
> >> Could you please let us know if there any work already going on for
> >> bringing in RCO dependency for Blueprints, if not is there any other
> >> alternative which can be used to enforce the dependency locally, or
> >> something else which you suggest.
> >>
> >> Thanks,
> >> Bhuvnesh Chaudhary
> >> Email: bchau <bc...@gopivotal.com>dhary@pivotal.io
> >> Desk: +1-650-846-1696 | Mobile: +1-973-906-6976
> >
>
>

Re: Blueprints - RCO - Related question.

Posted by Robert Nettleton <rn...@hortonworks.com>.

Hi Bhuvnesh,

You are correct.  The Blueprints deployment mechanism in Ambari no longer relies on Role-command ordering to install or start components across the cluster.

This change to Blueprints was actually implemented in Ambari 2.1.0, so it has been around for several releases now.  The new approach was implemented to improve cluster deployment times and to provide better support for dynamic scaling of clusters.

That being said, the new deployment mechanism does indeed remove the guarantee of ordering, which can potentially cause some problems for certain types of clusters.  There were also changes implemented on the Ambari Agent side to mitigate this problem of ordering.  The ambari-agent will now retry INSTALL and START operations if those operations happen to fail.  The START operation is probably the most relevant in your case, and is also the operation that does show the ordering issues you’ve mentioned in some deployments.

The idea is that the ambari-agent retries should help to resolve any issues with services starting in an unexpected order.  
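
To make the retry idea concrete, here is a minimal sketch of a time-bounded retry loop. It is not the ambari-agent implementation; run_with_retry and FakeCluster are hypothetical names invented for illustration, and the toy "master becomes ready after a few polls" behaviour is an assumption.

```python
import time


def run_with_retry(command, max_time_sec, delay_sec=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Call command() until it returns True or the time budget runs out.

    Returns (succeeded, attempts). Hypothetical helper, not Ambari agent code.
    """
    deadline = clock() + max_time_sec
    attempts = 0
    while True:
        attempts += 1
        if command():
            return True, attempts
        # Give up if waiting for the next attempt would exceed the budget.
        if clock() + delay_sec > deadline:
            return False, attempts
        sleep(delay_sec)


class FakeCluster:
    """Toy stand-in: the master becomes available after a few polls."""

    def __init__(self, master_ready_after):
        self.polls = 0
        self.master_ready_after = master_ready_after

    def start_standby(self):
        # The standby can only start once the master is up.
        self.polls += 1
        return self.polls > self.master_ready_after


cluster = FakeCluster(master_ready_after=2)
ok, attempts = run_with_retry(cluster.start_standby, max_time_sec=5,
                              delay_sec=0.01)
print(ok, attempts)
```

In this sketch the out-of-order START simply fails twice and succeeds on the third attempt, which is the situation the retries are meant to absorb. A command that reports success while its real work fails in the background would never trigger a retry here.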

This ambari-agent feature is on by default, but can be configured in a more fine-grained fashion by setting some properties in “cluster-env” in your Blueprint or Cluster Creation Template. 

Unfortunately, this is not documented very well, but the three properties in question are set by default in the BlueprintConfigurationProcessor in the following method:

org.apache.ambari.server.controller.internal.BlueprintConfigurationProcessor#setRetryConfiguration

The properties set in this method allow control over the types of operations that are retried, the max number of retries attempted, and the maximum amount of time that the agent should attempt a retry. 
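
As a rough illustration of overriding these settings, the sketch below builds a “cluster-env” entry for a Blueprint. The property names and defaults are recalled from the Ambari source and are not stated in this thread, so treat them as assumptions and check setRetryConfiguration in your Ambari version. The blueprint name, stack values and the retry_cluster_env helper are hypothetical.

```python
import json


def retry_cluster_env(enabled=True, commands=("INSTALL", "START"),
                      max_time_sec=600):
    """Return a Blueprint "configurations" entry with retry overrides.

    Property names are assumptions recalled from
    BlueprintConfigurationProcessor#setRetryConfiguration; verify them.
    """
    return {
        "cluster-env": {
            "properties": {
                "command_retry_enabled": "true" if enabled else "false",
                "commands_to_retry": ",".join(commands),
                "command_retry_max_time_in_sec": str(max_time_sec),
            }
        }
    }


# Hypothetical skeleton Blueprint; host groups omitted for brevity.
blueprint = {
    "Blueprints": {
        "blueprint_name": "example-blueprint",
        "stack_name": "HDP",
        "stack_version": "2.4",
    },
    "configurations": [retry_cluster_env(max_time_sec=1200)],
    "host_groups": [],
}

print(json.dumps(blueprint, indent=2))
```

Here the time budget for retries is raised to 1200 seconds while INSTALL and START remain the retried operations.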

We’ve seen many clusters using this new approach, and have not run into that many problems with respect to ordering.  

One possible problem we’ve seen is in a small number of components that launch services as a background command.  In that case, the ambari-agent cannot detect that a retry is required, and so cannot attempt a restart of a failed service.  This problem can usually be resolved with component-specific retries.  

I don’t know much about the HAWQ component, but I would expect that customizing the retry settings may help with this problem.  Do the HAWQ components implement retry attempts when booting up?

Hope this helps.  

Thanks,
Bob

On Mar 11, 2016, at 7:18 PM, Alejandro Fernandez <af...@hortonworks.com> wrote:

> +others who have more insight into BluePrints
> 
> On 3/11/16, 3:24 PM, "Bhuvnesh Chaudhary" <bc...@pivotal.io> wrote:
> 
>> Hello Sebastian, Alejandro, Andrew,
>> 
>> Referring to the discussion on RB: https://reviews.apache.org/r/43948
>> <https://reviews.apache.org/r/43948/#review120537>, it appears that while
>> deploying clusters using Blueprints, RCO is not honored. Please confirm if
>> this understanding is correct.
>> 
>> While running internal test suites for HAWQ, we deploy the clusters using
>> BP, and we need a specific order in which the HAWQ components must be
>> initialized / started.
>> 
>> "HAWQ Standby" component should be initialized after "HAWQ Master"
>> component as it has to copy the contents from HAWQ Master. However, since
>> RCO is not honored, we often come across issues as HAWQ Standby start /
>> initialization before HAWQ Master.
>> 
>> Could you please let us know if there any work already going on for
>> bringing in RCO dependency for Blueprints, if not is there any other
>> alternative which can be used to enforce the dependency locally, or
>> something else which you suggest.
>> 
>> Thanks,
>> Bhuvnesh Chaudhary
>> Email: bchau <bc...@gopivotal.com>dhary@pivotal.io
>> Desk: +1-650-846-1696 | Mobile: +1-973-906-6976
>

Re: Blueprints - RCO - Related question.

Posted by Eric Yang <er...@gmail.com>.

Deploying a cluster using a Blueprint must respect RCO.  Otherwise,
traditional configuration management systems like Chef and Puppet can
already do the job of Ambari.  The advantage of Ambari over other systems
is that cross-node service dependencies are derived from a dependency
description model, which allows the Ambari server to create an
orchestration plan accordingly.  If there is a feature request to optimize
installation time, that request should be handled without sacrificing RCO
dependencies.  This sounds like a critical defect to fix.
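
As a sketch of what an orchestration plan derived from a dependency description could look like, the snippet below turns "X must run after Y" rules into ordered stages, where everything inside one stage may run in parallel. It is not Ambari's scheduler. The rule table is a hypothetical example loosely modelled on role-command-order entries, and staged_plan is an invented helper. It requires Python 3.9+ for graphlib.

```python
from graphlib import TopologicalSorter

# Hypothetical rules: each key must run after every command in its list.
rco = {
    "HAWQSTANDBY-START": ["HAWQMASTER-START"],
    "HAWQMASTER-START": ["NAMENODE-START", "DATANODE-START"],
}


def staged_plan(deps):
    """Group commands into stages that respect the given dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular rules
    stages = []
    while sorter.is_active():
        # Everything returned here has all of its prerequisites done.
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


for number, stage in enumerate(staged_plan(rco), start=1):
    print(number, stage)
```

With these rules the standby start always lands in a later stage than the master start, while independent commands still share a stage, so ordering and parallelism are not mutually exclusive.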

On Fri, Mar 11, 2016 at 4:18 PM, Alejandro Fernandez <
afernandez@hortonworks.com> wrote:

> +others who have more insight into BluePrints
>
> On 3/11/16, 3:24 PM, "Bhuvnesh Chaudhary" <bc...@pivotal.io> wrote:
>
> >Hello Sebastian, Alejandro, Andrew,
> >
> >Referring to the discussion on RB: https://reviews.apache.org/r/43948
> ><https://reviews.apache.org/r/43948/#review120537>, it appears that while
> >deploying clusters using Blueprints, RCO is not honored. Please confirm if
> >this understanding is correct.
> >
> >While running internal test suites for HAWQ, we deploy the clusters using
> >BP, and we need a specific order in which the HAWQ components must be
> >initialized / started.
> >
> >"HAWQ Standby" component should be initialized after "HAWQ Master"
> >component as it has to copy the contents from HAWQ Master. However, since
> >RCO is not honored, we often come across issues as HAWQ Standby start /
> >initialization before HAWQ Master.
> >
> >Could you please let us know if there any work already going on for
> >bringing in RCO dependency for Blueprints, if not is there any other
> >alternative which can be used to enforce the dependency locally, or
> >something else which you suggest.
> >
> >Thanks,
> >Bhuvnesh Chaudhary
> >Email: bchau <bc...@gopivotal.com>dhary@pivotal.io
> >Desk: +1-650-846-1696 | Mobile: +1-973-906-6976
>
>

Re: Blueprints - RCO - Related question.

Posted by Alejandro Fernandez <af...@hortonworks.com>.

+others who have more insight into BluePrints

On 3/11/16, 3:24 PM, "Bhuvnesh Chaudhary" <bc...@pivotal.io> wrote:

>Hello Sebastian, Alejandro, Andrew,
>
>Referring to the discussion on RB: https://reviews.apache.org/r/43948
><https://reviews.apache.org/r/43948/#review120537>, it appears that while
>deploying clusters using Blueprints, RCO is not honored. Please confirm if
>this understanding is correct.
>
>While running internal test suites for HAWQ, we deploy the clusters using
>BP, and we need a specific order in which the HAWQ components must be
>initialized / started.
>
>"HAWQ Standby" component should be initialized after "HAWQ Master"
>component as it has to copy the contents from HAWQ Master. However, since
>RCO is not honored, we often come across issues as HAWQ Standby start /
>initialization before HAWQ Master.
>
>Could you please let us know if there any work already going on for
>bringing in RCO dependency for Blueprints, if not is there any other
>alternative which can be used to enforce the dependency locally, or
>something else which you suggest.
>
>Thanks,
>Bhuvnesh Chaudhary
>Email: bchau <bc...@gopivotal.com>dhary@pivotal.io
>Desk: +1-650-846-1696 | Mobile: +1-973-906-6976