Posted to user@helix.apache.org by Brent <br...@gmail.com> on 2021/08/04 22:41:36 UTC

Temporarily preventing a state transition

I had asked a question a while back about how to deal with a failed state
transition (
http://mail-archives.apache.org/mod_mbox/helix-user/202009.mbox/%3CCA64308A-E7DB-408A-B7C8-81583B5A9851@gmail.com%3E)
and the correct answer there was to throw an exception to cause an ERROR
state in the state machine.
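
(For reference, the pattern from that earlier answer looks roughly like the
following; the readiness check and promotion helper are just placeholders I
made up for illustration.)

@Transition(to = "MASTER", from = "SLAVE")
public void onBecomeMasterFromSlave(Message message, NotificationContext context) {
    // Throwing from the callback is the supported way to push this
    // partition into ERROR state when the transition cannot be performed.
    if (!isFullySynced(message.getPartitionName())) {  // placeholder readiness check
        throw new IllegalStateException(
            "Partition " + message.getPartitionName() + " is not ready to become Master");
    }
    promoteToMaster(message.getPartitionName());  // placeholder promotion logic
}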

I have a slightly different but related question now.  I'm using
the org.apache.helix.model.MasterSlaveSMD.  In our system, it can take a
long time (maybe 30 minutes) for a Slave partition to become fully in sync
with its Master partition.  Under normal circumstances, until a Slave has
finished syncing data from the Master, it should not be eligible for
promotion to Master.

So let's say a node (maybe newly added to the cluster) is the Slave for
partition 22 and has been online for 10 minutes (not long enough to have
sync-ed everything from the existing partition 22 Master), and it receives a
state transition message from Helix saying it should go from Slave->Master.
Is it possible to temporarily reject that transition without going into
ERROR state for that partition?  ERROR seems like slightly the wrong state,
because while it's not a valid transition right now, it will be a valid
transition 20 minutes from now when the initial sync completes.

Is there a way to "fail" a transition like this without fully going into
ERROR state?  Or is there a different way I should be thinking about solving
this problem?  I expect this could be a frequent occurrence when new nodes
are added to the cluster.

Thank you for your time and help as always!

~Brent

Re: Temporarily preventing a state transition

Posted by Wang Jiajun <er...@gmail.com>.
You're welcome : )

On Fri, Aug 6, 2021, 8:20 AM Brent <br...@gmail.com> wrote:


Re: Temporarily preventing a state transition

Posted by Brent <br...@gmail.com>.
Great information.  Thank you again, I really appreciate all the thoughtful
& detailed responses.

On Thu, Aug 5, 2021 at 9:45 PM Wang Jiajun <er...@gmail.com> wrote:


Re: Temporarily preventing a state transition

Posted by Wang Jiajun <er...@gmail.com>.
In short, the controller considers a state transition still ongoing if: 1.
the message exists with status = READ, and 2. the instance is alive (its
liveInstance znode exists). The participant does not need to send any
additional signal until it updates its current state, which means the
transition is either done or in ERROR. The controller reads the current
states to get that information. And yes, the call should not return until
then.

The state transition timeout is configurable. By default, regular partition
state transitions have no timeout. If anything goes wrong during execution,
the Helix logic on the participant will catch it and throw an exception, and
the partition will end up in the ERROR state.

Regarding throttling, StateTransitionThrottleConfig would be the recommended
way. If it does not fit your needs, an alternative is to execute the
bootstrap task on a separate thread pool with a limited number of threads;
onBecomeStandbyFromOffline would then submit the task to that thread pool and
wait for the result. This is another way to limit the usage of critical
system resources. It is obviously more complicated, so please evaluate this
method carefully : )
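
Roughly something like the following sketch (the pool size and the bootstrap
helper are placeholders, and the usual java.util.concurrent imports are
assumed):

private final ExecutorService bootstrapPool = Executors.newFixedThreadPool(2);  // cap concurrent bootstraps

@Transition(to = "STANDBY", from = "OFFLINE")
public void onBecomeStandbyFromOffline(Message message, NotificationContext context) {
    // Submit the expensive bootstrap to the bounded pool, then block until it
    // finishes. The OFFLINE -> STANDBY transition does not complete until the
    // bootstrap is done, and at most 2 bootstraps run at a time on this node.
    Future<?> result = bootstrapPool.submit(
        () -> doLongRunningOperationToBootstrapNode(message));  // placeholder helper
    try {
        result.get();
    } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        // Throwing here moves the partition to ERROR, the supported failure path.
        throw new IllegalStateException("Bootstrap interrupted for " + message.getPartitionName(), e);
    } catch (ExecutionException e) {
        throw new IllegalStateException("Bootstrap failed for " + message.getPartitionName(), e);
    }
}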

Best Regards,
Jiajun


On Thu, Aug 5, 2021 at 2:03 PM Brent <br...@gmail.com> wrote:


Re: Temporarily preventing a state transition

Posted by Brent <br...@gmail.com>.
Ah!  OK, that's interesting. It seems like I *was* thinking about it the
wrong way.  Rather than try to stop the transition from STANDBY -> LEADER,
I need to make sure I don't let a node become STANDBY until it is ready to
be promoted to LEADER if necessary.

You said it does not hurt to have some long-running transition tasks (e.g.
OFFLINE -> STANDBY in my case).  So when Helix sends the transition message
to move from OFFLINE -> STANDBY for a given partition, how do I signal to
Helix that I'm still working?  Is it literally just that I don't return
from the agent Java callback until the transition is complete?  e.g.:

@Transition(to = "STANDBY", from = "OFFLINE")
public void onBecomeStandbyFromOffline(Message message, NotificationContext
context) {
    // this node has been asked to become standby for this partition

    DoLongRunningOperationToBootstrapNode(message);  // this could take 30
minutes

    // returning now means the node has successfully transitioned to STANDBY
}

Are there any default state transition timeouts or any other properties I
need to worry about updating?

To your other point, I think parallel bootstraps may be OK.  I was hoping a
StateTransitionThrottleConfig with ThrottleScope.INSTANCE could limit the
number of those, but that seems like it applies to ALL transitions on that
node, not just a particular type like OFFLINE->STANDBY.  I suspect I'd have
to use that carefully to make sure I'm never blocking a STANDBY -> LEADER
transition while waiting on OFFLINE -> STANDBY transitions.
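
For reference, the kind of config I was experimenting with is roughly the
following (a sketch from memory, so the exact constructor arguments and enum
names may be slightly off):

// Cap how many state transitions may be pending on any single instance.
StateTransitionThrottleConfig perInstanceThrottle = new StateTransitionThrottleConfig(
    StateTransitionThrottleConfig.RebalanceType.ANY,
    StateTransitionThrottleConfig.ThrottleScope.INSTANCE,
    2);

ConfigAccessor configAccessor = helixManager.getConfigAccessor();  // helixManager: existing HelixManager
ClusterConfig clusterConfig = configAccessor.getClusterConfig(clusterName);
clusterConfig.setStateTransitionThrottleConfigs(Collections.singletonList(perInstanceThrottle));
configAccessor.setClusterConfig(clusterName, clusterConfig);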

Thanks for changing my perspective on how I was seeing this problem
Jiajun.  Very helpful!

~Brent

On Thu, Aug 5, 2021 at 10:31 AM Wang Jiajun <er...@gmail.com> wrote:


Re: Temporarily preventing a state transition

Posted by Wang Jiajun <er...@gmail.com>.
In short, the key to this solution is to prevent the STANDBY -> LEADER
message from being sent before partitions are truly ready. We do not restrict
SYNCING -> STANDBY messages at all, so the controller will send the SYNCING
-> STANDBY message and wait until the transition (bootstrap) is done. After
that, it can go ahead and bring up the LEADER. It does not hurt to have some
long-running transition tasks in the system (as long as they are not STANDBY
-> LEADER, because the LEADER is serving traffic and we don't want a big gap
with no LEADER); that is essentially what happens here anyway. But it also
means bootstraps run in parallel. I'm not sure if this fits your need.

Best Regards,
Jiajun


On Thu, Aug 5, 2021 at 9:42 AM Brent <br...@gmail.com> wrote:


Re: Temporarily preventing a state transition

Posted by Brent <br...@gmail.com>.
Thank you for the response Jiajun!

On the inclusivity thing, I'm glad to hear we're moving to different
terminology.  Our code actually wraps the MS state machine and renames the
terminology to "Leader" and "Follower" everywhere visible to our users and
operators for similar reasons.  :-)   I thought the Leader/Standby SMD was
a bit different which was why I wasn't using it, but looking at the
definition, I guess the only difference is it doesn't seem to define an
ERROR state like the MS SMD does.  So for the rest of this thread, let's
use the LEADER/STANDBY terminology instead.

For context, I have 1000-2000 shards of a database where each shard can be
100GB+ in size, so bootstrapping nodes is expensive.  Your logic on
splitting up the STANDBY state into two states like SYNCING and STANDBY
makes sense (OFFLINE -> SYNCING -> STANDBY -> LEADER), though I'm still not
sure how I can prevent the state from transitioning from SYNCING to STANDBY
until the node is ready (i.e. has an up-to-date copy of the leader's
data).  Based on what you were saying, is it possible to have the Helix
controller tell a node it's in SYNCING state, but then have the node decide
when it's safe to transition itself to STANDBY?  Or can state transition
cancellation be used if the node isn't ready?  Or can I just let the
transition time out if the node isn't ready?

This seems like it would be a pretty common problem with large,
expensive-to-move data (e.g. a shard of a large database), especially when
adding a new node to an existing system and needing to bootstrap it from
nothing.  I suspect people do this and I'm just thinking about it the wrong
way or there's a Helix strategy that I'm just not grasping correctly.

For the LinkedIn folks on the list, what does Espresso do for bootstrapping
new nodes and avoiding this problem of them getting promoted to LEADER
before they're ready?  It seems like a similar problem to mine (stateful
node with large data that needs a leader/standby setup).

Thanks again!

~Brent

On Wed, Aug 4, 2021 at 6:32 PM Wang Jiajun <er...@gmail.com> wrote:


Re: Temporarily preventing a state transition

Posted by Wang Jiajun <er...@gmail.com>.
Hi Brent,

AFAIK, there is no way to tell the controller to suspend a certain state
transition. Even if you reject the transition (although rejection is not
officially supported either), the controller will probably just keep
retrying it in each subsequent rebalance pipeline.

Alternatively, from your description, I think "Slave" really covers 2 states
in your system: 1. a new Slave that is still out of sync, and 2. a fully
sync-ed Slave. Is it possible for you to define a customized state model that
differentiates these 2 states, e.g. Offline -> Syncing -> Slave?
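
For example, a customized definition could be built roughly like this (just a
sketch; the model name, priorities, and bounds need to fit your replication
setup):

StateModelDefinition.Builder builder = new StateModelDefinition.Builder("MasterSyncingSlave");  // placeholder name
builder.initialState("OFFLINE");
// States, highest priority first.
builder.addState("MASTER", 0);
builder.addState("SLAVE", 1);
builder.addState("SYNCING", 2);
builder.addState("OFFLINE", 3);
builder.addState("DROPPED", 4);
// At most one MASTER per partition; SLAVE replicas fill the rest.
builder.upperBound("MASTER", 1);
builder.dynamicUpperBound("SLAVE", "R");
// Legal transitions: a replica must pass through SYNCING before becoming a SLAVE.
builder.addTransition("OFFLINE", "SYNCING", 1);
builder.addTransition("SYNCING", "SLAVE", 2);
builder.addTransition("SLAVE", "MASTER", 3);
builder.addTransition("MASTER", "SLAVE", 4);
builder.addTransition("SLAVE", "OFFLINE", 5);
builder.addTransition("SYNCING", "OFFLINE", 6);
builder.addTransition("OFFLINE", "DROPPED", 7);
StateModelDefinition customModel = builder.build();
// Register it once per cluster, e.g. via HelixAdmin#addStateModelDef.
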
Even simpler, is it OK to restrict the definition of Slave to the 2nd case?
Meaning that before a partition has synced with the Master, it does not mark
itself as a Slave. This implies the Offline -> Slave transition would take
longer, but once it is done, the Slave partition would be fully ready.

BTW, we encourage users to use inclusive language. Maybe you can consider
switching to the LeaderStandby SMD? We might deprecate the MasterSlave SMD in
the near future.

Best Regards,
Jiajun


On Wed, Aug 4, 2021 at 3:41 PM Brent <br...@gmail.com> wrote:
