Posted to user@helix.apache.org by Vinayak Borkar <vb...@yahoo.com> on 2013/02/27 19:16:05 UTC

State transitions of partitions

Hi Guys,


I am trying to understand how state transitions work in Helix. My 
understanding is that once a controller decides to perform a state 
transition, this information is conveyed to the relevant participants. A 
method corresponding to the transitions is invoked on the state model 
object at the participant corresponding to the partition whose state 
needs to change.
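
For concreteness, this is roughly how I picture the participant side being
wired up, based on my reading of the examples (just a sketch; the class and
helper names below are placeholders of my own, not from any real system):

import org.apache.helix.NotificationContext;
import org.apache.helix.model.Message;
import org.apache.helix.participant.statemachine.StateModel;
import org.apache.helix.participant.statemachine.StateModelInfo;
import org.apache.helix.participant.statemachine.Transition;

// Sketch only: MyStateModel and copyDataForPartition are placeholder names.
@StateModelInfo(initialState = "OFFLINE", states = {"OFFLINE", "SLAVE", "MASTER"})
public class MyStateModel extends StateModel {
  private final String partition;

  public MyStateModel(String partition) {
    this.partition = partition;
  }

  // Invoked on this participant when the controller decides that this
  // partition should go from OFFLINE to SLAVE.
  @Transition(from = "OFFLINE", to = "SLAVE")
  public void onBecomeSlaveFromOffline(Message message, NotificationContext context) {
    copyDataForPartition(partition); // application-specific data movement
  }

  private void copyDataForPartition(String partition) {
    // copy or load whatever this partition needs
  }
}

// A participant would register a StateModelFactory that creates one such
// object per partition, e.g.:
//   manager.getStateMachineEngine()
//          .registerStateModelFactory("MasterSlave", new MyStateModelFactory());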

When state transitions involve data movement, performing the actual 
transition at the participant is not an instantaneous activity. So while 
the invocation of the transition function triggers the actual data 
movement, is there a way for the participant to indicate that the 
transition is complete? What is the state of the system while a 
transition is being effected on the cluster?

I am guessing Helix needs to model the state of an in-progress 
transition to correctly provide throttling guarantees on the cluster.

Where should I be looking in the code to understand how this works?


Thanks,
Vinayak

Re: State transitions of partitions

Posted by Vinayak Borkar <vb...@yahoo.com>.
Hi Kishore,


I think these discussions give me a great starting point to begin 
integration with Helix. Thanks for taking the time to answer all my 
questions.

Vinayak


On 2/28/13 11:40 PM, kishore g wrote:
> Hi Vinayak,
>
> This is how you can make the operation idempotent: irrespective of whether
> you are in case 1), 2), or 3), you always use the ClusterMessagingService
> first to check whether any node already has the data, and if so you pull it
> from there. If you don't get a response from the other nodes in the system,
> you fall back to case 2) and use the data you already have; otherwise you
> assume you are creating the resource for the first time. The search system
> at LinkedIn used to do this earlier, but now they don't really have this
> requirement since they assume the indexes are available when the node
> starts. I might be wrong here, but you get the idea.
>
> The messaging infrastructure is generic, but it is not intended to be an RPC
> mechanism between nodes. Messages travel via ZooKeeper, so it cannot really
> be used to achieve high throughput or low latency. The reason for this
> design is that messages persist across node restarts, which ensures every
> node processes the message. In cases where you don't need such persistence,
> it is possible to extend the messaging service to do RPC between nodes.
>
> Shirshanka is developing a Helix container module based on Netty that might
> allow one to extend the messaging service to do RPC within the cluster; it
> will also be useful for transferring files between nodes efficiently,
> without pulling the data into application memory, by using the sendfile API.
> It is actually a good utility and I can see it being used in multiple systems.
>
> Thanks,
> Kishore G
>
>
> On Thu, Feb 28, 2013 at 9:44 AM, Vinayak Borkar <vb...@yahoo.com> wrote:
>
>>
>>> Will this address your problem? We don't have distinct actions based on
>>> ERROR codes that the controller would understand and act on differently.
>>> Were you looking for something like that?
>>>
>>
>> I will need to think more about this. I think the retry mechanism might be
>> good enough for now.
>>
>>
>>
>>> Good point on not differentiating whether the partition once existed
>>> versus being newly created. We actually plan to modify the drop
>>> notification behavior; Jason/Terence are discussing this in another
>>> thread. Please add your suggestion to that thread. We should probably
>>> have create and drop methods (not transitions) on the participants.
>>>
>>
>> Currently, how do other systems that use Helix handle the bootstrapping
>> process? When a resource is created for the first time, the actions of a
>> participant are different as compared to other times when a resource
>> partition is expanded to use another instance. Specifically, there are
>> three cases that need to be handled with respect to bootstrapping:
>>
>> 1. A cluster is up and running, and a new resource is created and
>> rebalanced.
>> 2. A cluster that had resources is being started after being shut down.
>> 3. A cluster is running and a resource is already laid out on the cluster.
>> Then some partitions are moved to instances that previously did not have
>> any partitions of that resource.
>>
>> I looked through the examples and found the ClusterMessagingService
>> interface that can be used to send messages to instances in the cluster. I
>> can see that case 3 can be handled by using the messaging infrastructure. However,
>> both 1 and 2 will have the resource partitions start in the OFFLINE mode.
>> The messaging API cannot help because all instances in the cluster are in
>> the same boat for a particular resource in case 1 and case 2. So what is
>> the preferred way to know if you are in case 1 or in case 2? One way I see
>> is that if you have local artifacts matching the partitions that are
>> transitioning from OFFLINE -> SLAVE, one could infer it is case 2. Is
>> that how other systems solve this issue?
>>
>>
>> On a separate note, is the messaging infrastructure general purpose? As in,
>> can it be used by applications to perform RPC in the cluster, obviating
>> the need for a separate RPC mechanism like Avro? I can see that the handler
>> will need more code than one would need to write when using Avro to get RPC
>> working, but my question is about the design point of the messaging
>> infrastructure.
>>
>>
>>
>> Thanks,
>> Vinayak
>>
>


Re: State transitions of partitions

Posted by kishore g <g....@gmail.com>.
Hi Vinayak,

This is how you can make the operation idempotent: irrespective of whether
you are in case 1), 2), or 3), you always use the ClusterMessagingService
first to check whether any node already has the data, and if so you pull it
from there. If you don't get a response from the other nodes in the system,
you fall back to case 2) and use the data you already have; otherwise you
assume you are creating the resource for the first time. The search system at
LinkedIn used to do this earlier, but now they don't really have this
requirement since they assume the indexes are available when the node starts.
I might be wrong here, but you get the idea.
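
Roughly, that check could look like the sketch below. Only the Helix types are
real; the message subtype, the pull/bootstrap/local-data helpers, and the
exact Criteria settings are illustrative, so please double-check them against
the messaging examples.

import java.util.UUID;
import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.helix.ClusterMessagingService;
import org.apache.helix.Criteria;
import org.apache.helix.HelixManager;
import org.apache.helix.InstanceType;
import org.apache.helix.messaging.AsyncCallback;
import org.apache.helix.model.Message;
import org.apache.helix.model.Message.MessageType;

// Ask the rest of the cluster whether anyone already has this partition's data,
// then fall back to local data, and only then bootstrap from scratch.
void ensurePartitionData(HelixManager manager, String resourceName, String partitionName) {
  ClusterMessagingService messaging = manager.getMessagingService();

  Criteria criteria = new Criteria();
  criteria.setInstanceName("%");                                // any instance
  criteria.setRecipientInstanceType(InstanceType.PARTICIPANT);
  criteria.setResource(resourceName);
  criteria.setPartition(partitionName);
  criteria.setSessionSpecific(true);
  criteria.setSelfExcluded(true);                               // don't ask ourselves

  Message request = new Message(MessageType.USER_DEFINE_MSG, UUID.randomUUID().toString());
  request.setMsgSubType("CHECK_PARTITION_DATA");                // made-up subtype

  final AtomicBoolean peerHasData = new AtomicBoolean(false);
  AsyncCallback callback = new AsyncCallback() {
    @Override
    public void onTimeOut() {
      // nobody replied within the timeout
    }

    @Override
    public void onReplyMessage(Message reply) {
      peerHasData.set(true);                                    // a peer claims to have the data
    }
  };

  messaging.sendAndWait(criteria, request, callback, 10 * 1000);

  if (peerHasData.get()) {
    pullFromPeer();              // cases 1)/3): copy from a node that already has it
  } else if (haveLocalData()) {
    useLocalData();              // case 2): restart, reuse what is already on disk
  } else {
    bootstrapFreshPartition();   // resource is being created for the first time
  }
}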

The messaging infrastructure is generic, but it is not intended to be an RPC
mechanism between nodes. Messages travel via ZooKeeper, so it cannot really be
used to achieve high throughput or low latency. The reason for this design is
that messages persist across node restarts, which ensures every node processes
the message. In cases where you don't need such persistence, it is possible to
extend the messaging service to do RPC between nodes.

Shirshanka is developing a Helix container module based on Netty that might
allow one to extend the messaging service to do RPC within the cluster; it
will also be useful for transferring files between nodes efficiently, without
pulling the data into application memory, by using the sendfile API. It is
actually a good utility and I can see it being used in multiple systems.

Thanks,
Kishore G


On Thu, Feb 28, 2013 at 9:44 AM, Vinayak Borkar <vb...@yahoo.com> wrote:

>
>> Will this address your problem? We don't have distinct actions based on
>> ERROR codes that the controller would understand and act on differently.
>> Were you looking for something like that?
>>
>
> I will need to think more about this. I think the retry mechanism might be
> good enough for now.
>
>
>
>> Good point on not differentiating whether the partition once existed versus
>> being newly created. We actually plan to modify the drop notification
>> behavior; Jason/Terence are discussing this in another thread. Please
>> add your suggestion to that thread. We should probably have create and
>> drop methods (not transitions) on the participants.
>>
>
> Currently, how do other systems that use Helix handle the bootstrapping
> process? When a resource is created for the first time, the actions of a
> participant are different as compared to other times when a resource
> partition is expanded to use another instance. Specifically, there are
> three cases that need to be handled with respect to bootstrapping:
>
> 1. A cluster is up and running, and a new resource is created and
> rebalanced.
> 2. A cluster that had resources is being started after being shut down.
> 3. A cluster is running and a resource is already laid out on the cluster.
> Then some partitions are moved to instances that previously did not have
> any partitions of that resource.
>
> I looked through the examples and found the ClusterMessagingService
> interface that can be used to send messages to instances in the cluster. I
> can see that case 3 can be handled by using the messaging infrastructure. However,
> both 1 and 2 will have the resource partitions start in the OFFLINE mode.
> The messaging API cannot help because all instances in the cluster are in
> the same boat for a particular resource in case 1 and case 2. So what is
> the preferred way to know if you are in case 1 or in case 2? One way I see
> is that if you have local artifacts matching the partitions that are
> transitioning from OFFLINE -> SLAVE, one could infer it is case 2. Is
> that how other systems solve this issue?
>
>
> On a separate note, is the messaging infrastructure general purpose? As in,
> can it be used by applications to perform RPC in the cluster, obviating
> the need for a separate RPC mechanism like Avro? I can see that the handler
> will need more code than one would need to write when using Avro to get RPC
> working, but my question is about the design point of the messaging
> infrastructure.
>
>
>
> Thanks,
> Vinayak
>

Re: State transitions of partitions

Posted by Vinayak Borkar <vb...@yahoo.com>.
>
> Will this address your problem? We don't have distinct actions based on
> ERROR codes that the controller would understand and act on differently.
> Were you looking for something like that?

I will need to think more about this. I think the retry mechanism might
be good enough for now.

>
> Good point on not differentiating whether the partition once existed versus
> being newly created. We actually plan to modify the drop notification
> behavior; Jason/Terence are discussing this in another thread. Please
> add your suggestion to that thread. We should probably have create and
> drop methods (not transitions) on the participants.

Currently, how do other systems that use Helix handle the bootstrapping 
process? When a resource is created for the first time, the actions of a 
participant are different as compared to other times when a resource 
partition is expanded to use another instance. Specifically, there are 
three cases that need to be handled with respect to bootstrapping:

1. A cluster is up and running, and a new resource is created and 
rebalanced.
2. A cluster that had resources is being started after being shut down.
3. A cluster is running and a resource is already laid out on the 
cluster. Then some partitions are moved to instances that previously did 
not have any partitions of that resource.

I looked through the examples and found the ClusterMessagingService 
interface that can be used to send messages to instances in the cluster. 
I can see that case 3 can be handled by using the messaging infrastructure.
However, both 1 and 2 will have the resource partitions start in the 
OFFLINE mode. The messaging API cannot help because all instances in the 
cluster are in the same boat for a particular resource in case 1 and 
case 2. So what is the preferred way to know if you are in case 1 or in 
case 2? One way I see is that if you have local artifacts matching the 
partitions that are transitioning from OFFLINE -> SLAVE, one could
infer it is case 2. Is that how other systems solve this issue?
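
Concretely, inside the OFFLINE -> SLAVE callback I am imagining something like
the snippet below (purely illustrative; the dataDir layout and the two helper
methods are my own invention, not anything Helix prescribes):

// Fragment of a state model class; message/context come from Helix, while
// dataDir, openExistingReplica and bootstrapReplica are placeholders.
@Transition(from = "OFFLINE", to = "SLAVE")
public void onBecomeSlaveFromOffline(Message message, NotificationContext context) {
  File partitionDir = new File(dataDir,
      message.getResourceName() + File.separator + message.getPartitionName());
  if (partitionDir.exists()) {
    // Case 2: the cluster is restarting and this replica already exists locally.
    openExistingReplica(partitionDir);
  } else {
    // Case 1 (or 3): nothing local, so build or fetch the replica from scratch.
    bootstrapReplica(partitionDir);
  }
}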


On a separate note, is the messaging infrastructure general purpose? As 
in, can it be used by applications to perform RPC in the cluster,
obviating the need for a separate RPC mechanism like Avro? I can see 
that the handler will need more code than one would need to write when 
using Avro to get RPC working, but my question is about the design point 
of the messaging infrastructure.



Thanks,
Vinayak

Re: State transitions of partitions

Posted by kishore g <g....@gmail.com>.
It's my pleasure, and I'm glad you are liking the design choices. Feel free to
suggest/contribute changes.

Good question on propagating the exception to the controller. There are two
ways. First, if the transition throws an exception, we put that partition in
the ERROR state, and depending on the mode of execution the controller will
select another replica to satisfy the constraints. The partition will remain
permanently in the ERROR state until an admin manually invokes the RESET
partition API, which puts it back into OFFLINE (or whatever the initial state is).
Before getting to the next option, please note that you can actually have
timeouts and multiple retries for a transition (analogous to task attempts in
Hadoop); you can configure the number of retries before a transition is
considered failed. In general, such a failure is a transient error, for
example due to a configuration or setup issue, and in most such cases all
partitions go into the ERROR state. This is also reset on node restart, that
is, the controller won't remember that you failed the previous transition.

For a more permanent failure, like a disk crash or another catastrophic
scenario, the participant can request that the controller disable it, either
for that partition or for the entire node. This means the node has a permanent
issue, and it stays disabled across restarts, unlike ERROR, which gets reset
when you restart the node. The advantage of disabling is that the VM stays up,
so you can debug the issues without impacting cluster stability.
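
From the admin side, the two knobs look roughly like this (a sketch only; the
cluster, instance, resource, and partition names are made up, and please
double-check the exact method signatures):

import java.util.Arrays;
import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;

public class AdminRecoveryOps {
  public static void main(String[] args) {
    HelixAdmin admin = new ZKHelixAdmin("zkhost:2181");

    // Reset a partition that is stuck in ERROR back to its initial state.
    admin.resetPartition("MYCLUSTER", "node_12918", "MyDB", Arrays.asList("MyDB_4"));

    // Disable a misbehaving node entirely; unlike ERROR this survives restarts,
    // and the JVM stays up so you can debug without affecting the cluster.
    admin.enableInstance("MYCLUSTER", "node_12918", false);
  }
}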

Will this address your problem? We don't have distinct actions based on
ERROR codes that the controller would understand and act on differently.
Were you looking for something like that?

Good point on not differentiating whether the partition once existed versus
being newly created. We actually plan to modify the drop notification
behavior; Jason/Terence are discussing this in another thread. Please
add your suggestion to that thread. We should probably have create and
drop methods (not transitions) on the participants.

Thanks,
Kishore G

On Wed, Feb 27, 2013 at 4:42 PM, Vinayak Borkar <vb...@yahoo.com> wrote:

> Kishore,
>
> As always, thanks for the prompt response.
>
>
>
> On 2/27/13 10:53 AM, kishore g wrote:
>
>> Hi Vinayak,
>>
>> By default, a transition is not time bound (it can be short or really
>> long); you can do the data movement as part of the transition and return
>> from the transition after it is complete.
>>
>
> It looks like I can run long-running transitions in this call itself,
> unlike traditional event-based systems where the event callback needs to
> hand over long-running tasks to a separate worker. Furthermore, it looks
> like if the transition function throws an Exception, the server treats it
> as a transition failure so the controller can react to that. This is
> perfect! One question - what is the best way to propagate the exception to
> the controller so that the controller can take different actions based on
> different kinds of problems (transient issues vs. more permanent errors,
> for example)?
>
>
>
>> Let's say you have some transition called STARTED-BOOTSTRAPPED where you
>> copy the data, and say you have 100 nodes in the cluster that need to go
>> from STARTED to BOOTSTRAPPED. You can simply set the end state of the system like
>> N1:BOOTSTRAPPED
>> N2:BOOTSTRAPPED
>> ...
>> ...
>> N100:BOOTSTRAPPED
>>
>>
> Talking about STARTED/BOOTSTRAPPED, I noticed that the built-in MasterSlave
> StateModel has a transition from OFFLINE to DROPPED, but there is no
> corresponding transition for bringing on a fresh dataset partition.
>
> For example, imagine the case where a new dataset is created and installed
> as a resource. Every participant who will store a partition of that dataset
> will see an OFFLINE->SLAVE transition and then possibly a SLAVE->MASTER
> transition for the masters. However, what is the best way to differentiate
> between the initial on-boarding of a dataset and the case where a dataset
> partition is being moved to a participant? In both cases the participant
> sees the same transition, I think.
>
>
> Thanks,
> Vinayak
>

Re: State transitions of partitions

Posted by Vinayak Borkar <vb...@yahoo.com>.
Kishore,

As always, thanks for the prompt response.


On 2/27/13 10:53 AM, kishore g wrote:
> Hi Vinayak,
>
> By default, a transition is not time bound (it can be short or really
> long); you can do the data movement as part of the transition and return
> from the transition after it is complete.

It looks like I can run long-running transitions in this call itself,
unlike traditional event-based systems where the event callback needs to
hand over long-running tasks to a separate worker. Furthermore, it looks
like if the transition function throws an Exception, the server treats
it as a transition failure so the controller can react to that. This is
perfect! One question - what is the best way to propagate the exception
to the controller so that the controller can take different actions
based on different kinds of problems (transient issues vs. more
permanent errors, for example)?

>
> Let's say you have some transition called STARTED-BOOTSTRAPPED where you
> copy the data, and say you have 100 nodes in the cluster that need to go
> from STARTED to BOOTSTRAPPED. You can simply set the end state of the system like
> N1:BOOTSTRAPPED
> N2:BOOTSTRAPPED
> ...
> ...
> N100:BOOTSTRAPPED
>

Talking about STARTED/BOOTSTRAPPED, I noticed that the built-in
MasterSlave StateModel has a transition from OFFLINE to DROPPED, but
there is no corresponding transition for bringing on a fresh dataset
partition.

For example, imagine the case where a new dataset is created and
installed as a resource. Every participant who will store a partition of
that dataset will see an OFFLINE->SLAVE transition and then possibly a
SLAVE->MASTER transition for the masters. However, what is the best way
to differentiate between the initial on-boarding of a dataset and the
case where a dataset partition is being moved to a participant? In both
cases the participant sees the same transition, I think.


Thanks,
Vinayak

Re: State transitions of partitions

Posted by kishore g <g....@gmail.com>.
Hi Vinayak,

By default, a transition is not time bound (it can be short or really long);
you can do the data movement as part of the transition and return from the
transition after it is complete.

Helix is aware of the pending transition; it is maintained in ZK and modelled
as a first-class citizen of the state machine. This allows you to specify
throttling requirements on each transition type. The constraint can be set per
partition, resource, node, or cluster.

Let's say you have some transition called STARTED-BOOTSTRAPPED where you copy
the data, and say you have 100 nodes in the cluster that need to go from
STARTED to BOOTSTRAPPED. You can simply set the end state of the system like
N1:BOOTSTRAPPED
N2:BOOTSTRAPPED
...
...
N100:BOOTSTRAPPED

and set a constraint that only 10 nodes should do this transition at once.

Helix will then ensure that only 10 nodes in the cluster are doing this
transition at a time. Note that this is not 10 batches of 10 at a time, but
rather 10 in flight at all times, which means that as soon as one transition
is done another will be added.

See TestConstraint for how to use these features; we probably have Java APIs
but no CLI (Jason, please correct me if I am wrong).
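
In Java it would look roughly like this (from memory, so please verify the
attribute names against TestConstraint; the cluster name and constraint id are
made up):

import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.ClusterConstraints.ConstraintType;
import org.apache.helix.model.ConstraintItem;
import org.apache.helix.model.builder.ConstraintItemBuilder;

public class ThrottleBootstrapTransitions {
  public static void main(String[] args) {
    HelixAdmin admin = new ZKHelixAdmin("zkhost:2181");

    // Allow at most 10 STARTED->BOOTSTRAPPED transitions in flight across the cluster.
    ConstraintItem item = new ConstraintItemBuilder()
        .addConstraintAttribute("MESSAGE_TYPE", "STATE_TRANSITION")
        .addConstraintAttribute("TRANSITION", "STARTED-BOOTSTRAPPED")
        .addConstraintAttribute("CONSTRAINT_VALUE", "10")
        .build();

    admin.setConstraint("MYCLUSTER", ConstraintType.MESSAGE_CONSTRAINT,
        "bootstrapThrottle", item);
  }
}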

To see how it is implemented, see MessageThrottleStage.

thanks,
Kishore G

On Wed, Feb 27, 2013 at 10:16 AM, Vinayak Borkar <vb...@yahoo.com> wrote:

> Hi Guys,
>
>
> I am trying to understand how state transitions work in Helix. My
> understanding is that once a controller decides to perform a state
> transition, this information is conveyed to the relevant participants. A
> method corresponding to the transitions is invoked on the state model
> object at the participant corresponding to the partition whose state needs
> to change.
>
> When state transitions involve data movement, performing the actual
> transition at the participant is not an instantaneous activity. So while
> the invocation of the transition function triggers the actual data
> movement, is there a way for the participant to indicate that the
> transition is complete? What is the state of the system while a transition
> is being effected on the cluster?
>
> I am guessing Helix needs to model the state of an in-progress transition
> to correctly provide throttling guarantees on the cluster.
>
> Where should I be looking in the code to understand how this works?
>
>
> Thanks,
> Vinayak
>