You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@curator.apache.org by Cameron McKenzie <mc...@gmail.com> on 2014/08/08 08:02:20 UTC

CURATOR-79

Guys,
I've been looking into a fix for CURATOR-79 (
https://issues.apache.org/jira/browse/CURATOR-79) and have found it to be
slightly more complicated than initially expected.

The locking recipes are using protected zNodes (i.e the zNode name contains
a random UUID that is tied to a particular builder instance) for locks,
which is sensible, but there seems to be an issue with this.

The protected logic basically looks for the cause of failure on a create,
and if it's connection loss, then it does an ensured deleted on the path it
was trying to create to ensure that it's removed if it did get created.

For CURATOR-79, and InterruptedException is causing this call to fail when
waiting for the response from ZK. This means that the protected logic does
not fire and we end up with an orphaned node.

It's possible with some ugliness to handle this in the InterprocesMutex,
but I think that maybe it's better fixed in the protected logic. Maybe the
protected logic could be modified so that it will occur on ConnectionLoss
or on any non-KeeperException (i.e. InterruptedException). This would cause
the zNode to be removed if it was created, and would fix this deadlock
issue.

I would welcome anyone's opinion on the way forward.
cheers
Cam

Re: CURATOR-79

Posted by Mike Drob <ma...@cloudera.com>.
+1


On Fri, Aug 8, 2014 at 4:58 PM, Cameron McKenzie <mc...@gmail.com>
wrote:

> The proposed alternative is pretty much the same thing, but moving the
> logic to the LockInternals class. The added complexity of this is that the
> LockInternals class doesn't know what the name of the zNode that was going
> to be created is called. We'd need to expose the generated UUID on the
> CreateBuilder.
>
> My preference is to fix this in the protected() code. We can either just
> handle the InterruptedException, or handle any non KeeperException (other
> than ConnectionLoss). The InterruptedException is what's happening in this
> case, but if we get an NPE or something in that code, the same thing would
> occur, so I think that perhaps we should just have a blanket catch all.
>
> I have coded this up already and all of the unit tests pass ok.
> cheers
>
>
> On Fri, Aug 8, 2014 at 11:05 PM, Mike Drob <ma...@cloudera.com> wrote:
>
> > Explicitly coding for a possible InterruptedException sounds good to me,
> > but we already know that I prefer (more) diverse Exception types.
> >
> > I'm not sure I understand what the proposed alternative is?
> >
> > Mike
> >
> >
> > On Fri, Aug 8, 2014 at 1:02 AM, Cameron McKenzie <mckenzie.cam@gmail.com
> >
> > wrote:
> >
> > > Guys,
> > > I've been looking into a fix for CURATOR-79 (
> > > https://issues.apache.org/jira/browse/CURATOR-79) and have found it to
> > be
> > > slightly more complicated than initially expected.
> > >
> > > The locking recipes are using protected zNodes (i.e the zNode name
> > contains
> > > a random UUID that is tied to a particular builder instance) for locks,
> > > which is sensible, but there seems to be an issue with this.
> > >
> > > The protected logic basically looks for the cause of failure on a
> create,
> > > and if it's connection loss, then it does an ensured deleted on the
> path
> > it
> > > was trying to create to ensure that it's removed if it did get created.
> > >
> > > For CURATOR-79, and InterruptedException is causing this call to fail
> > when
> > > waiting for the response from ZK. This means that the protected logic
> > does
> > > not fire and we end up with an orphaned node.
> > >
> > > It's possible with some ugliness to handle this in the
> InterprocesMutex,
> > > but I think that maybe it's better fixed in the protected logic. Maybe
> > the
> > > protected logic could be modified so that it will occur on
> ConnectionLoss
> > > or on any non-KeeperException (i.e. InterruptedException). This would
> > cause
> > > the zNode to be removed if it was created, and would fix this deadlock
> > > issue.
> > >
> > > I would welcome anyone's opinion on the way forward.
> > > cheers
> > > Cam
> > >
> >
>

Re: CURATOR-79

Posted by Cameron McKenzie <mc...@gmail.com>.
The proposed alternative is pretty much the same thing, but moving the
logic to the LockInternals class. The added complexity of this is that the
LockInternals class doesn't know what the name of the zNode that was going
to be created is called. We'd need to expose the generated UUID on the
CreateBuilder.

My preference is to fix this in the protected() code. We can either just
handle the InterruptedException, or handle any non KeeperException (other
than ConnectionLoss). The InterruptedException is what's happening in this
case, but if we get an NPE or something in that code, the same thing would
occur, so I think that perhaps we should just have a blanket catch all.

I have coded this up already and all of the unit tests pass ok.
cheers


On Fri, Aug 8, 2014 at 11:05 PM, Mike Drob <ma...@cloudera.com> wrote:

> Explicitly coding for a possible InterruptedException sounds good to me,
> but we already know that I prefer (more) diverse Exception types.
>
> I'm not sure I understand what the proposed alternative is?
>
> Mike
>
>
> On Fri, Aug 8, 2014 at 1:02 AM, Cameron McKenzie <mc...@gmail.com>
> wrote:
>
> > Guys,
> > I've been looking into a fix for CURATOR-79 (
> > https://issues.apache.org/jira/browse/CURATOR-79) and have found it to
> be
> > slightly more complicated than initially expected.
> >
> > The locking recipes are using protected zNodes (i.e the zNode name
> contains
> > a random UUID that is tied to a particular builder instance) for locks,
> > which is sensible, but there seems to be an issue with this.
> >
> > The protected logic basically looks for the cause of failure on a create,
> > and if it's connection loss, then it does an ensured deleted on the path
> it
> > was trying to create to ensure that it's removed if it did get created.
> >
> > For CURATOR-79, and InterruptedException is causing this call to fail
> when
> > waiting for the response from ZK. This means that the protected logic
> does
> > not fire and we end up with an orphaned node.
> >
> > It's possible with some ugliness to handle this in the InterprocesMutex,
> > but I think that maybe it's better fixed in the protected logic. Maybe
> the
> > protected logic could be modified so that it will occur on ConnectionLoss
> > or on any non-KeeperException (i.e. InterruptedException). This would
> cause
> > the zNode to be removed if it was created, and would fix this deadlock
> > issue.
> >
> > I would welcome anyone's opinion on the way forward.
> > cheers
> > Cam
> >
>

Re: CURATOR-79

Posted by Mike Drob <ma...@cloudera.com>.
Explicitly coding for a possible InterruptedException sounds good to me,
but we already know that I prefer (more) diverse Exception types.

I'm not sure I understand what the proposed alternative is?

Mike


On Fri, Aug 8, 2014 at 1:02 AM, Cameron McKenzie <mc...@gmail.com>
wrote:

> Guys,
> I've been looking into a fix for CURATOR-79 (
> https://issues.apache.org/jira/browse/CURATOR-79) and have found it to be
> slightly more complicated than initially expected.
>
> The locking recipes are using protected zNodes (i.e the zNode name contains
> a random UUID that is tied to a particular builder instance) for locks,
> which is sensible, but there seems to be an issue with this.
>
> The protected logic basically looks for the cause of failure on a create,
> and if it's connection loss, then it does an ensured deleted on the path it
> was trying to create to ensure that it's removed if it did get created.
>
> For CURATOR-79, and InterruptedException is causing this call to fail when
> waiting for the response from ZK. This means that the protected logic does
> not fire and we end up with an orphaned node.
>
> It's possible with some ugliness to handle this in the InterprocesMutex,
> but I think that maybe it's better fixed in the protected logic. Maybe the
> protected logic could be modified so that it will occur on ConnectionLoss
> or on any non-KeeperException (i.e. InterruptedException). This would cause
> the zNode to be removed if it was created, and would fix this deadlock
> issue.
>
> I would welcome anyone's opinion on the way forward.
> cheers
> Cam
>