Posted to user@helix.apache.org by William Morgan <wi...@morgan-fam.com> on 2022/08/29 18:45:00 UTC

Question regarding a partition that goes into a failure state

Hey folks,

I was wondering what guidance there would be on how to handle the following scenario:

  1.  We have a distributed DB with N shards; partitioning, failover, etc. are handled via Helix using the MasterSlave state model, the WagedRebalancer, and FULL_AUTO rebalance mode.
  2.  Let's say Shard 1 gets assigned to Host 1 and we successfully transition it to the MASTER state.
  3.  It stays alive and healthy for a period of time, but then a failure occurs that doesn't take the host offline yet prevents it from functioning fully. (A good example is corruption of the shard due to a disk failure where parts of the SSD have worn out.)
  4.  We can see that we're unable to write to disk and want to rebalance that shard elsewhere.

What would be the recommended way of doing step 4 using Helix?

I'm unsure of the best way to do this, because we would have to communicate to Helix that the shard has transitioned to an ERROR state on that host so it can be rebalanced elsewhere. Up to this point we've only reacted to state transitions sent to us by Helix, so I'm curious how feedback like this would be given to the Controller so it can rebalance correctly.
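
For context, the setup from step 1 looks roughly like the sketch below (cluster/resource names, partition count, and replica count are illustrative, not our actual config):

    import org.apache.helix.HelixAdmin;
    import org.apache.helix.controller.rebalancer.waged.WagedRebalancer;
    import org.apache.helix.manager.zk.ZKHelixAdmin;
    import org.apache.helix.model.IdealState;

    public class CreateResourceSketch {
      public static void main(String[] args) {
        HelixAdmin admin = new ZKHelixAdmin("zk-host:2181"); // illustrative ZK address

        // MasterSlave resource in FULL_AUTO rebalance mode (names/counts made up).
        admin.addResource("MyCluster", "MyDB", 8, "MasterSlave",
            IdealState.RebalanceMode.FULL_AUTO.name());

        // Point the resource at the WAGED rebalancer.
        IdealState idealState = admin.getResourceIdealState("MyCluster", "MyDB");
        idealState.setRebalancerClassName(WagedRebalancer.class.getName());
        admin.setResourceIdealState("MyCluster", "MyDB", idealState);

        // Set the replica count; the controller computes the actual assignment.
        admin.rebalance("MyCluster", "MyDB", 3);
      }
    }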

Thanks,
Will



Re: Question regarding a partition that goes into a failure state

Posted by William Morgan <wi...@morgan-fam.com>.
Ok, that makes sense. I'll make use of the disable API since it looks like it does what I require.

Thanks!

Will
________________________________
From: Junkai Xue <jx...@apache.org>
Sent: Monday, August 29, 2022 4:09 PM
To: user@helix.apache.org <us...@helix.apache.org>
Subject: Re: Question regarding a partition that goes into a failure state

The disable API only moves the MASTER role for that replica to another host; the replica on the original host is then marked OFFLINE.

If you prefer to move the replica out entirely, I would suggest a hybrid assignment model: under the RESOURCE folder you can create a resource config, and for that specific partition define a preference list the way SEMI_AUTO does. The result is a hybrid model where the other partitions still use the FULL_AUTO WAGED algorithm, but that partition's assignment follows the preference list you provide. Otherwise, Helix does not provide any API that lets you "move" a replica to another host in FULL_AUTO.

Best,

Junkai


Re: Question regarding a partition that goes into a failure state

Posted by Junkai Xue <jx...@apache.org>.
The disable API only moves the MASTER role for that replica to another host;
the replica on the original host is then marked OFFLINE.
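
For example, something along these lines (a rough sketch; the cluster, instance,
resource, and partition names are made up):

    import java.util.Collections;
    import org.apache.helix.HelixAdmin;
    import org.apache.helix.manager.zk.ZKHelixAdmin;

    public class DisablePartitionSketch {
      public static void main(String[] args) {
        HelixAdmin admin = new ZKHelixAdmin("zk-host:2181"); // illustrative ZK address

        // Disable the bad replica on the affected instance. The controller will
        // move the MASTER role to another host and mark this replica OFFLINE here.
        admin.enablePartition(false, "MyCluster", "host1_12000", "MyDB",
            Collections.singletonList("MyDB_1"));

        // Once the disk is repaired, the partition can be re-enabled:
        admin.enablePartition(true, "MyCluster", "host1_12000", "MyDB",
            Collections.singletonList("MyDB_1"));
      }
    }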

If you prefer to move the replica out entirely, I would suggest a hybrid
assignment model: under the RESOURCE folder you can create a resource config,
and for that specific partition define a preference list the way SEMI_AUTO
does. The result is a hybrid model where the other partitions still use the
FULL_AUTO WAGED algorithm, but that partition's assignment follows the
preference list you provide. Otherwise, Helix does not provide any API that
lets you "move" a replica to another host in FULL_AUTO.
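
A rough sketch of what I mean (I am writing the ResourceConfig preference list
part from memory, so please double check the exact setters against the
ResourceConfig and ConfigAccessor javadocs; all names here are illustrative):

    import java.util.Arrays;
    import java.util.Collections;
    import org.apache.helix.ConfigAccessor;
    import org.apache.helix.model.ResourceConfig;

    public class PinPartitionSketch {
      public static void main(String[] args) {
        ConfigAccessor configAccessor = new ConfigAccessor("zk-host:2181"); // illustrative ZK address

        // Resource config that pins only MyDB_1 via a preference list; all other
        // partitions keep their WAGED FULL_AUTO placement.
        ResourceConfig resourceConfig = new ResourceConfig("MyDB");
        resourceConfig.setPreferenceLists(Collections.singletonMap(
            "MyDB_1",                                      // the partition to pin
            Arrays.asList("host2_12000", "host3_12000"))); // first entry gets MASTER, like SEMI_AUTO

        configAccessor.setResourceConfig("MyCluster", "MyDB", resourceConfig);
      }
    }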

Best,

Junkai

On Mon, Aug 29, 2022 at 12:40 PM William Morgan <wi...@morgan-fam.com>
wrote:

> The idea is that we have N hosts, each with M shards assigned to it by
> Helix. There can be a situation where the host is still healthy overall,
> but a particular shard on that host isn't.
>
> So, if I'm understanding correctly, the way to communicate to Helix that
> the shard should be moved off the host would be to mark it as disabled via
> the HelixAdmin API.
>
> To delve further into marking a partition disabled on a host: what does
> this mean in the context of Helix? Just that the shard for that resource
> can no longer be scheduled onto that host?
>
> Thanks for the help!
>
> Will

Re: Question regarding a partition that goes into a failure state

Posted by William Morgan <wi...@morgan-fam.com>.
The idea is that we have N hosts, each with M shards assigned to it by Helix. There can be a situation where the host is still healthy overall, but a particular shard on that host isn't.

So, if I'm understanding correctly, the way to communicate to Helix that the shard should be moved off the host would be to mark it as disabled via the HelixAdmin API.

To delve further into marking a partition disabled on a host: what does this mean in the context of Helix? Just that the shard for that resource can no longer be scheduled onto that host?

Thanks for the help!

Will


________________________________
From: Junkai Xue <ju...@gmail.com>
Sent: Monday, August 29, 2022 2:55 PM
To: user@helix.apache.org <us...@helix.apache.org>
Subject: Re: Question regarding a partition that goes into a failure state

Hi Will.

To summarize, do you mean the node is broken, but somehow its connection stayed up so it kept holding the leadership of that replica?

It depends on the scenario:

  *   If the machine is gone and cannot do anything (e.g., accept SSH or Helix messages), the only thing you can do is bounce the machine.
  *   If it is just a partial failure (like a disk failure) but the main process is still functioning, then you can disable that partition for that instance using the HelixAdmin API. The leadership will be switched away from that host.

Please let me know if I've understood your scenario correctly.

Best,

Junkai


Re: Question regarding a partition that goes into a failure state

Posted by Junkai Xue <ju...@gmail.com>.
Hi Will.

To summarize, do you mean the node is broken, but somehow its connection
stayed up so it kept holding the leadership of that replica?

It depends on the scenario:

   - If the machine is gone and cannot do anything (e.g., accept SSH or
   Helix messages), the only thing you can do is bounce the machine.
   - If it is just a partial failure (like a disk failure) but the main
   process is still functioning, then you can disable that partition for
   that instance using the HelixAdmin API. The leadership will be switched
   away from that host.

Please let me know if I've understood your scenario correctly.

Best,

Junkai


-- 
Junkai Xue