Posted to user@zookeeper.apache.org by Charles Gordon <ch...@gmail.com> on 2010/05/28 17:28:43 UTC

Locking and Partial Failure

Hello,

I am new to using ZooKeeper, and I have a quick question about the locking
recipe found here:

http://hadoop.apache.org/zookeeper/docs/r3.1.2/recipes.html#sc_recipes_Locks

It appears to me that there is a flaw in this algorithm related to partial
failure, and I am curious to know how to fix it.

The algorithm follows these steps:

 1. Call "create()" with a pathname like "/some/path/to/parent/child-lock-".
 2. Call "getChildren()" on the lock node without the watch flag set.
 3. If the path created in step (1) has the lowest sequence number, you are
the master (skip the next steps).
 4. Otherwise, call "exists()" with the watch flag set on the child with the
next lowest sequence number.
 5. If "exists()" returns false, go to step (2), otherwise wait for a
notification from the path, then go to step (2).
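
Expressed against the Java client, the recipe reduces to roughly this (a
minimal sketch; it assumes the parent node already exists and ignores the
failure mode described next):

    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class NaiveLock {
        // Steps (1)-(5); lockDir is assumed to already exist.
        public static String acquire(ZooKeeper zk, String lockDir)
                throws KeeperException, InterruptedException {
            // Step 1: the server appends a monotonic sequence number.
            String path = zk.create(lockDir + "/child-lock-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.EPHEMERAL_SEQUENTIAL);
            String myNode = path.substring(path.lastIndexOf('/') + 1);
            while (true) {
                // Step 2: list the children without setting a watch.
                List<String> children = zk.getChildren(lockDir, false);
                Collections.sort(children); // same prefix, so this orders by sequence
                int me = children.indexOf(myNode);
                if (me == 0) {
                    return path; // Step 3: lowest sequence number, lock held
                }
                // Step 4: watch only the child just before ours.
                final CountDownLatch gone = new CountDownLatch(1);
                Stat stat = zk.exists(lockDir + "/" + children.get(me - 1),
                        event -> gone.countDown());
                // Step 5: if it vanished already, re-check; otherwise wait.
                if (stat != null) {
                    gone.await();
                }
            }
        }
    }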

The scenario that seems to be faulty is a partial failure in step (1).
Assume that my client program follows step (1) and calls "create()". Assume
that the call succeeds on the ZooKeeper server, but there is a
ConnectionLoss event right as the server sends the response (e.g., a network
partition, some dropped packets, the ZK server going down, etc.). Assume
further that the client immediately reconnects, so the session is not timed
out. At this point there is a child node that was created by my client, but
that my client does not know about (since it never received the response).
Since my client doesn't know about the child, it won't know to watch the
child just before it, and it also won't know to delete it. That means all
clients using that lock will fail to make progress as soon as the orphaned
child has the lowest sequence number. This state will continue until my
client closes its session (which may be a while, since I would like to have
a long-lived session). Correctness is maintained here, but liveness is not.
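
To make the failure concrete, here is roughly the shape of the client code
(a sketch; in the Java client the loss surfaces as
KeeperException.ConnectionLossException, and the helper name is mine):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class LockCreate {
        // After ConnectionLossException, the client cannot tell whether
        // its create() was applied on the server.
        public static String tryCreate(ZooKeeper zk, String lockDir)
                throws KeeperException, InterruptedException {
            try {
                return zk.create(lockDir + "/child-lock-", new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE,
                        CreateMode.EPHEMERAL_SEQUENTIAL);
            } catch (KeeperException.ConnectionLossException e) {
                // Two indistinguishable cases:
                //  (a) the request never reached the server: no znode exists;
                //  (b) the znode was created but the response was lost: an
                //      orphan now sits in the queue until this session ends.
                // Blindly retrying in case (b) adds a second child, because
                // the server assigns a fresh sequence number every time.
                throw e;
            }
        }
    }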

The only good solution I have found for this problem is to establish a new
session with ZooKeeper before acquiring a lock, and to close that session
immediately upon any connection loss in step (1). If everything works, the
session could be reused, but you'd need to guarantee that the session was
closed if there was a failure during creation of the child node. Are there
other good solutions?
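
In code, the workaround looks something like this (a sketch only, assuming
you can afford a session per lock attempt; the class name is mine):

    import java.io.IOException;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class FreshSessionLock {
        // One session per lock attempt; on connection loss during create(),
        // close the session so the ephemeral child cannot outlive it.
        public static String acquire(String connectString, String lockDir)
                throws IOException, KeeperException, InterruptedException {
            ZooKeeper zk = new ZooKeeper(connectString, 30000, event -> {});
            try {
                return zk.create(lockDir + "/child-lock-", new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE,
                        CreateMode.EPHEMERAL_SEQUENTIAL);
            } catch (KeeperException.ConnectionLossException e) {
                // If the server is reachable again, the session (and any
                // ephemeral child it owns) dies now; if not, it dies at the
                // session timeout. Either way the orphan cannot outlive it.
                zk.close();
                throw e;
            }
        }
    }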

I looked at the sample code that comes with the ZooKeeper distribution (I'm
using 3.2.2 right now), and it uses the current session ID as part of the
child node name. Then, if there is a failure during creation, it tries to
look up the child using that session ID. This isn't really helpful in the
environment I'm using, where a single session can be shared by multiple
threads, any of which could request a lock (so I can't uniquely identify a
lock by session ID). I could use the thread ID, but then I run the risk of a
thread being reused and getting the wrong lock. In any case, there is also
the risk that a second failure prevents me from looking up the lock after a
connection loss, so I'm right back to an orphaned lock child, as above. I
could, presumably, be careful enough with try/catch logic to prevent even
that case, but it makes for pretty bug-prone code. Also, as a side note,
that code appears to sort the child nodes by session ID first, then by
sequence number, which could cause locks to be ordered incorrectly.
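
For illustration, recovery by scanning for a unique token embedded in the
child name ("child-lock-<token>-<sequence>", say) might look like this (a
sketch; the naming scheme is just an assumption):

    import java.util.List;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;

    public class LockRecovery {
        // After a connection loss, scan the children for our token to learn
        // whether the earlier create() was actually applied on the server.
        public static String findMyChild(ZooKeeper zk, String lockDir,
                String token)
                throws KeeperException, InterruptedException {
            List<String> children = zk.getChildren(lockDir, false);
            for (String child : children) {
                if (child.contains(token)) {
                    return lockDir + "/" + child; // create() did succeed
                }
            }
            return null; // no such child: the create() was never applied
        }
    }

Note that once the prefix varies per client, the children must be ordered by
the trailing sequence number alone; a plain lexicographic sort reintroduces
exactly the ordering bug mentioned above.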

Thanks for any help you can provide!

Charles Gordon

Re: Locking and Partial Failure

Posted by Charles Gordon <ch...@gmail.com>.
It does look like a special case of that JIRA item. I read back through the
Chubby paper, and it sounds like they solve this problem with a similar
mechanism: they block the client until either it manages to re-establish
the session or the session timeout expires (at which point they return an
error to the application). That seems like the right thing to do here as
well.

I can solve this problem for myself by just treating a ConnectionLoss event
(while holding a lock) as the end of the session and clearing my application
state. It isn't ideal, but it does the job safely while guaranteeing
progress on the locks.
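
Concretely, I mean something like this (a sketch; the clear-state hook
stands in for whatever the application needs):

    import org.apache.zookeeper.Watcher;

    public class PessimisticLockPolicy {
        // Treat a disconnect exactly like session expiry: drop any lock the
        // application believes it holds rather than waiting to find out.
        public static Watcher watcher(Runnable clearLockState) {
            return event -> {
                switch (event.getState()) {
                    case Disconnected: // connection loss: assume lock gone
                    case Expired:      // session really is gone
                        clearLockState.run();
                        break;
                    default:
                        break;
                }
            };
        }
    }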

CG.

On Mon, May 31, 2010 at 1:54 PM, Ted Dunning <te...@gmail.com> wrote:

>
> Isn't this a special case of
> https://issues.apache.org/jira/browse/ZOOKEEPER-22 ?
>
> Is there any progress on this?

Re: Locking and Partial Failure

Posted by Ted Dunning <te...@gmail.com>.
Isn't this a special case of
https://issues.apache.org/jira/browse/ZOOKEEPER-22 ?

Is there any progress on this?

On Mon, May 31, 2010 at 12:34 PM, Patrick Hunt <ph...@apache.org> wrote:

> Hi Charles, any luck with this? Regarding the issues you found with the
> recipes, please enter a JIRA; it would be good to address them.
> https://issues.apache.org/jira/browse/ZOOKEEPER
>
> Regarding the use of session/thread IDs: might you use some sort of unique
> token that's dynamically assigned to the thread making a request on the
> shared session? The calling code could then be identified by that token in
> recovery cases.
>
> Patrick

Re: Locking and Partial Failure

Posted by Patrick Hunt <ph...@apache.org>.
Hi Charles, any luck with this? Regarding the issues you found with the
recipes, please enter a JIRA; it would be good to address them.
https://issues.apache.org/jira/browse/ZOOKEEPER

Regarding the use of session/thread IDs: might you use some sort of unique
token that's dynamically assigned to the thread making a request on the
shared session? The calling code could then be identified by that token in
recovery cases.
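
Something like the following, say (an illustrative sketch, not a tested
recipe):

    import java.util.UUID;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class TokenLock {
        // A fresh token per lock request, independent of session and thread
        // identity, embedded in the znode name.
        public static String create(ZooKeeper zk, String lockDir)
                throws KeeperException, InterruptedException {
            String token = UUID.randomUUID().toString();
            // Yields ".../child-lock-<token>-<seq>". After a connection
            // loss, whichever pooled thread handles recovery can scan
            // getChildren(lockDir, false) for the token to learn whether
            // this create() was applied.
            return zk.create(lockDir + "/child-lock-" + token + "-",
                    new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.EPHEMERAL_SEQUENTIAL);
        }
    }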

Patrick
