Posted to user@helix.apache.org by Vinayak Borkar <vb...@yahoo.com> on 2013/04/18 06:28:47 UTC

Resource Partition Failure

Hi,


What is the expected way for a system to indicate to Helix that a 
partition of a resource has failed?

Say the bits on disk of a particular partition are found to be 
corrupted. Is there a way to tell Helix that this partition of the 
resource needs to "fail" without killing the whole node and hence 
destroying all other resources on that machine?

Thanks,
Vinayak

Re: Resource Partition Failure

Posted by Vinayak Borkar <vb...@yahoo.com>.
Kishore,

Thanks for the explanation. I saw that HelixAdmin had calls to reset 
partitions from error state -> initial state. So I was wondering if 
moving the partition to error state by the instance itself would be a 
good idea. But Ming's answer and your explanation obviate the need for that.
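
For readers unfamiliar with the reset call mentioned above, here is a
minimal sketch of resetting partitions out of the ERROR state. It is not
Helix code: ResetAdmin is a hypothetical stand-in declared locally so the
example compiles on its own, and its method is shaped like
HelixAdmin.resetPartition(clusterName, instanceName, resourceName,
partitionNames) as I understand that interface. Treat the signature as an
assumption and check it against your Helix version. All cluster, instance,
resource and partition names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the HelixAdmin reset call, declared here so the
// sketch needs no Helix dependency. The signature is an assumption.
interface ResetAdmin {
    void resetPartition(String clusterName, String instanceName,
                        String resourceName, List<String> partitionNames);
}

// Test double that records which partitions were asked to be reset.
class RecordingResetAdmin implements ResetAdmin {
    final List<String> resets = new ArrayList<>();

    @Override
    public void resetPartition(String clusterName, String instanceName,
                               String resourceName, List<String> partitionNames) {
        for (String partition : partitionNames) {
            resets.add(instanceName + "/" + resourceName + "/" + partition);
        }
    }
}

class ErrorPartitionReset {
    // Request a reset for partitions that have been repaired. Returns the
    // number of partitions included in the request (0 means no call made).
    static int resetRepaired(ResetAdmin admin, String cluster, String instance,
                             String resource, List<String> repaired) {
        if (repaired.isEmpty()) {
            return 0;
        }
        admin.resetPartition(cluster, instance, resource, repaired);
        return repaired.size();
    }
}

class ResetPartitionDemo {
    public static void main(String[] args) {
        RecordingResetAdmin admin = new RecordingResetAdmin();
        int count = ErrorPartitionReset.resetRepaired(
                admin, "MyCluster", "host1_12000", "MyDB",
                Arrays.asList("MyDB_1", "MyDB_4"));
        System.out.println(count + " partition(s) requested: " + admin.resets);
    }
}
```

In a real deployment the admin object would be the cluster's HelixAdmin
rather than the recording stand-in used here.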


Thanks,
Vinayak


On 4/17/13 11:29 PM, kishore g wrote:
> Ming is correct: you can use enablePartition(false) to disable only the
> corrupted partition on the node. This will trigger the rebalancer, which
> recomputes the ideal state.
>
> We thought about allowing an instance to move itself into the ERROR state,
> but we were worried that giving an instance control to change its own
> state automatically would be dangerous and make issues harder to debug.
>
> We do have a mechanism for the participant to send a request to the
> controller to initiate a transition; for example, you can send a message
> to the controller to disable a partition or instance. (This is different
> from disabling through HelixAdmin, though the end result is the same.)
>
> I didn't understand the second part: "which was then reset by possibly
> the controller"


Re: Resource Partition Failure

Posted by kishore g <g....@gmail.com>.
Ming is correct: you can use enablePartition(false) to disable only the
corrupted partition on the node. This will trigger the rebalancer, which
recomputes the ideal state.
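
As a rough illustration of the call described above, the sketch below
disables a single partition on one instance. It is a minimal sketch, not
Helix code: PartitionAdmin is a hypothetical stand-in declared here so the
example compiles on its own, with a method shaped like
HelixAdmin.enablePartition(enabled, clusterName, instanceName,
resourceName, partitionNames) as I understand that interface. The
parameter order is an assumption to verify against your Helix version, and
the cluster, instance, resource and partition names are invented.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the one HelixAdmin method used here, so the
// sketch compiles without Helix on the classpath. The parameter order is an
// assumption; check it against the HelixAdmin interface you actually use.
interface PartitionAdmin {
    void enablePartition(boolean enabled, String clusterName, String instanceName,
                         String resourceName, List<String> partitionNames);
}

// Test double that records calls instead of talking to a cluster.
class RecordingPartitionAdmin implements PartitionAdmin {
    final List<String> calls = new ArrayList<>();

    @Override
    public void enablePartition(boolean enabled, String clusterName, String instanceName,
                                String resourceName, List<String> partitionNames) {
        calls.add(enabled + "|" + clusterName + "|" + instanceName + "|"
                + resourceName + "|" + partitionNames);
    }
}

class CorruptPartitionHandler {
    // Disable only the corrupted partition on this instance. No other
    // partition or resource on the instance is named in the request.
    static void disableCorruptedPartition(PartitionAdmin admin, String cluster,
                                          String instance, String resource,
                                          String partition) {
        admin.enablePartition(false, cluster, instance, resource,
                Collections.singletonList(partition));
    }
}

class DisablePartitionDemo {
    public static void main(String[] args) {
        RecordingPartitionAdmin admin = new RecordingPartitionAdmin();
        CorruptPartitionHandler.disableCorruptedPartition(
                admin, "MyCluster", "host1_12000", "MyDB", "MyDB_3");
        System.out.println(admin.calls);
    }
}
```

Re-enabling the partition after repair would be the same call with true as
the first argument.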

We thought about allowing an instance to move itself into the ERROR state,
but we were worried that giving an instance control to change its own
state automatically would be dangerous and make issues harder to debug.

We do have a mechanism for the participant to send a request to the
controller to initiate a transition; for example, you can send a message
to the controller to disable a partition or instance. (This is different
from disabling through HelixAdmin, though the end result is the same.)

I didn't understand the second part: "which was then reset by possibly
the controller"




On Wed, Apr 17, 2013 at 11:00 PM, Vinayak Borkar <vb...@yahoo.com> wrote:

> That sounds more promising. Does disabling a partition trigger ideal state
> computation to rebalance the cluster?
>
> Ideally it would be great if the corrupted instance could move itself to
> the ERROR state which was then reset by possibly the controller. Is that
> possible?

Re: Resource Partition Failure

Posted by Vinayak Borkar <vb...@yahoo.com>.
That sounds more promising. Does disabling a partition trigger ideal 
state computation to rebalance the cluster?

Ideally it would be great if the corrupted instance could move itself to 
the ERROR state which was then reset by possibly the controller. Is that 
possible?




On 4/17/13 10:55 PM, Ming Fang wrote:
> How about HelixAdmin.enablePartition()?


Re: Resource Partition Failure

Posted by Ming Fang <mi...@mac.com>.
How about HelixAdmin.enablePartition()?

On Apr 18, 2013, at 1:53 AM, Vinayak Borkar <vb...@yahoo.com> wrote:

> Hi Ming Fang,
> 
> 
> Disabling an instance will take out all the resources hosted on it. I would like to disable only the corrupted partition on the system without impacting other resources.
> 
> Thanks,
> Vinayak


Re: Resource Partition Failure

Posted by Vinayak Borkar <vb...@yahoo.com>.
Hi Ming Fang,


Disabling an instance will take out all the resources hosted on it. I 
would like to disable only the corrupted partition on the system without 
impacting other resources.

Thanks,
Vinayak


On 4/17/13 10:43 PM, Ming Fang wrote:
> Try HelixAdmin.enableInstance()


Re: Resource Partition Failure

Posted by Ming Fang <mi...@mac.com>.
Try HelixAdmin.enableInstance()

On Apr 18, 2013, at 12:28 AM, Vinayak Borkar <vb...@yahoo.com> wrote:

> Hi,
> 
> 
> What is the expected way for a system to indicate to Helix that a partition of a resource has failed?
> 
> Say the bits on disk of a particular partition are found to be corrupted. Is there a way to tell Helix that this partition of the resource needs to "fail" without killing the whole node and hence destroying all other resources on that machine?
> 
> Thanks,
> Vinayak