Posted to user@zookeeper.apache.org by Gustavo Niemeyer <gu...@niemeyer.net> on 2010/08/23 16:34:31 UTC

Parent nodes & multi-step transactions

Greetings,

We (a development team at Canonical) have run into a situation and
I'm curious to understand what the general practice is, since I'm
sure this is a fairly common issue.

It's quite easy to describe it: say there's a parent node A somewhere
in the tree.  That node was created dynamically over the course of
running the system, because it's associated with some resource which
has its own life-span.  Now, under this node we put some control nodes
for different reasons (say, A/B), and we also want to track some
information which is related to a sequence of nodes (say, A/C/D-0,
A/C/D-1, etc).

So, we end up with something like this:

    A/B
    A/C/D-0
    A/C/D-1

The question here is about best-practices for taking care of nodes
like A/C.  It'd be fantastic to be able to create A's structure
together with A itself, otherwise we risk getting in a situation where
a client can see the node A before its "initialization" has been
finished (A/C doesn't exist yet).  In fact, A/C may never exist, since
it is possible for a client to die between the creation of A and C.
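
The window described above can be sketched as follows. This is a minimal illustration that assumes a hypothetical in-memory stand-in for the tree (`ToyTree` and `HalfInitDemo` are made-up names, not the ZooKeeper client API); it only shows that two separate creates leave a point at which A exists without A/C.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for a znode tree; not the ZooKeeper API.
class ToyTree {
    private final Set<String> paths = new TreeSet<>();

    // Creates one node; the parent must already exist. Paths start with "/".
    synchronized void create(String path) {
        String parent = path.substring(0, path.lastIndexOf('/'));
        if (!parent.isEmpty() && !paths.contains(parent)) {
            throw new IllegalStateException("no parent for " + path);
        }
        if (!paths.add(path)) {
            throw new IllegalStateException("node exists: " + path);
        }
    }

    synchronized boolean exists(String path) {
        return paths.contains(path);
    }
}

class HalfInitDemo {
    // True when A is visible but its expected child A/C is not.
    static boolean isHalfInitialized(ToyTree tree) {
        return tree.exists("/A") && !tree.exists("/A/C");
    }

    public static void main(String[] args) {
        ToyTree tree = new ToyTree();
        tree.create("/A");
        // A reader running here sees /A without /A/C. If the writer
        // crashes at this point, the tree stays in this state.
        System.out.println("half-initialized: " + isHalfInitialized(tree));
        tree.create("/A/C");
        System.out.println("half-initialized: " + isHalfInitialized(tree));
    }
}
```

Every reader that cares about A/C would need a check along the lines of `isHalfInitialized`, which is the per-client boilerplate the question is about.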

Anyway, I'm sure you all understand the problem.  The question here
is: this is pretty common, and quite boring to deal with properly on
every single client.  Is there any feature in the roadmap to deal with
this, and any common practice besides the obvious "check for
half-initialization and wait for A/C to be created or deal with
timeouts and whatnot" on every client?

I'm about to start writing another layer on top of Zookeeper's API, so
it'd be great to have some additional insight into this issue.

-- 
Gustavo Niemeyer
http://niemeyer.net
http://niemeyer.net/blog
http://niemeyer.net/twitter

Re: Parent nodes & multi-step transactions

Posted by Gustavo Niemeyer <gu...@niemeyer.net>.

> http://www.mail-archive.com/zookeeper-dev@hadoop.apache.org/msg08317.html
>
> Mostly looked like this wasn't on our roadmap for short term, but definitely
> something to think about longer term.

Nice, thanks for the pointer.  This looks precisely like the kind of
support we'd like to see implemented as well, except in our case it'd
be create/delete rather than set.

I agree with the rationale there as well: coordination gets
significantly more complex if one has to deal not only with the
problem at hand, but also with the intermediate states which have to
be put in place towards satisfying the necessary structure.

Regarding Ben's reservations, I don't share some of them:

1) Regarding blocking, the suggestion is just to group a finite number
of operations together, rather than a RDB-like begin/commit
transaction mechanism. The lack of these primitives means that the
workaround logic people have to put in place, besides being
significantly more complex to implement and understand (bugs!), also
is significantly more expensive computationally than something which
executed the operations at once (e.g. how expensive would it be to
create two nodes, vs. creating lock/liveness nodes and putting
watching in places everywhere).

2) My impression is that aborting the whole thing on any failures in
the grouped operations would be pretty easy to understand and would be
what most people would expect when using such a grouping primitive
(why would they want grouping otherwise?).

3) Regarding partitioning, my poor understanding becomes poorer.
Given all the guarantees of ordering and whatnot that ZK already has
to enforce, feels like it shouldn't be significantly harder to batch
things together, but I don't have enough understanding to say anything
here to be honest.

-- 
Gustavo Niemeyer
http://niemeyer.net
http://niemeyer.net/blog
http://niemeyer.net/twitter

Re: Parent nodes & multi-step transactions

Posted by Mahadev Konar <ma...@yahoo-inc.com>.

Hi Gustavo,
 There was some talk on the list of a startTransaction(), addTransaction(),
commit() kind of API.

Here is the link:

http://www.mail-archive.com/zookeeper-dev@hadoop.apache.org/msg08317.html

Mostly looked like this wasn't on our roadmap for short term, but definitely
something to think about longer term.

Thanks
mahadev


On 8/23/10 3:32 PM, "Gustavo Niemeyer" <gu...@niemeyer.net> wrote:

>> So, we end up with something like this:
>> 
>>    A/B
>>    A/C/D-0
>>    A/C/D-1
> 
> While people are thinking, let me ask this more explicitly: how hard
> would it be to add multi-step atomic actions to Zookeeper?
> 
> The interest is specifically to:
> 
> 1) Avoid intermediate states being persisted when the client creating
> the state crashes
> 
> 2) Avoid intermediate states being seen while a coordination structure
> is being put in place
> 
> I understand that there are tricks which may be used to avoid some of
> the related problems by dealing with locks, liveness nodes, and "side
> services" which monitor and clean up the state, but it'd be fantastic
> to have some internal support in Zookeeper to make these actions
> simpler and less error prone.  It feels like, given Zookeeper
> guarantees, it shouldn't be too hard to extend the protocol to offer
> some basic-level operation grouping (e.g. multi-create and
> multi-delete, at least).
> 
> Does that make sense?
> 
> --
> Gustavo Niemeyer
> http://niemeyer.net
> http://niemeyer.net/blog
> http://niemeyer.net/twitter
>

Re: Parent nodes & multi-step transactions

Posted by Gustavo Niemeyer <gu...@niemeyer.net>.

> So, we end up with something like this:
>
>    A/B
>    A/C/D-0
>    A/C/D-1

While people are thinking, let me ask this more explicitly: how hard
would it be to add multi-step atomic actions to Zookeeper?

The interest is specifically to:

1) Avoid intermediate states being persisted when the client creating
the state crashes

2) Avoid intermediate states being seen while a coordination structure
is being put in place

I understand that there are tricks which may be used to avoid some of
the related problems by dealing with locks, liveness nodes, and "side
services" which monitor and clean up the state, but it'd be fantastic
to have some internal support in Zookeeper to make these actions
simpler and less error prone.  It feels like, given Zookeeper
guarantees, it shouldn't be too hard to extend the protocol to offer
some basic-level operation grouping (e.g. multi-create and
multi-delete, at least).

Does that make sense?

-- 
Gustavo Niemeyer
http://niemeyer.net
http://niemeyer.net/blog
http://niemeyer.net/twitter

Re: Parent nodes & multi-step transactions

Posted by Gustavo Niemeyer <gu...@niemeyer.net>.

> Every functionality added to ZK will make it harder to maintain. The use case

Definitely, but it's hard to debate about features at that level.  If
we delete the whole code base, we have nothing to maintain, so given
this r.

> recursiveDelete, recursiveCreate: If you want to create /A/C/D-1 just use
> recursiveCreate and you will end up with  /A/C/D-1, even if the full parent
> path did not exist before.

You're missing the actual problem. Recursive create and delete are
non-issues per se.  They become issues once you want to use the ZK
filesystem state for coordination, which is the only advised use case
for ZK.  Other messages in this thread have already described the
problems related to intermediate state visibility, and some techniques
to deal with them.  The problem is that as the number of dynamic
pieces increases, the cost of maintaining all of that logic increases
too, and it becomes impractical.

ZK is great at what it does, and these compound atomic operations
target real use cases for what it's most useful at.  In my view, the
additional code complexity needed for this feature would not be that
great, and it would be next to nothing compared to the additional
logic which these realistic use cases require to deal with
intermediate states.

-- 
Gustavo Niemeyer
http://niemeyer.net
http://niemeyer.net/blog
http://niemeyer.net/twitter

Re: Parent nodes & multi-step transactions

Posted by Thomas Koch <th...@koch.ro>.

Gustavo Niemeyer:
> Hi Thomas,
> 
> > I have a very strong feeling against more complex operations in the ZK
> > server.
> 
> Can you please describe a little better what that feeling is about?
Every piece of functionality added to ZK will make it harder to maintain. The
use case you're asking for is IMHO easily solvable in a client-side helper
library. So there's no reason to let ZK solve your problems.

> > These are things that should be provided by a ZK client helper library.
> > The
> 
> Which things should be provided by client helper libraries? 
> [...]
> > zkclient library from 101tec for example gives you exactly that.
> 
> It's not clear to me what "exactly that" is in this context.  I've
> looked for the code and couldn't find an answer/alternative to the
> issues discussed in this thread.
recursiveDelete, recursiveCreate: If you want to create /A/C/D-1 just use 
recursiveCreate and you will end up with  /A/C/D-1, even if the full parent 
path did not exist before.

> > If you're planning to write another layer on top of the ZK API please
> > have a look at https://issues.apache.org/jira/browse/ZOOKEEPER-835
> 
> Looked there as well.  Also can't find anything relevant to this
> discussion.
>
> > I'm planning to provide an alternative java client API for 3.4.0 and
> > would then propose to deprecate the current one in the long run.
> > You can preview the new API at
> > http://github.com/thkoch2001/zookeeper/tree/operation_classes
> 
> And this is a full branch of ZK.  Tried checking out the commit
> messages or something to get an idea of what you mean, but also am
> unable to find answers to these problems.
The idea is to provide operation classes that can be handed around. So you can
create a list of create operations and hand the full list to a specific
executor. If the executor ignores NodeExists exceptions then you already have
an implementation of recursiveCreate:

// needs java.util.Arrays and java.util.List
List<Create> creates = Arrays.asList(
    new Create("/A"), new Create("/A/C"), new Create("/A/C/D-1"));
myExecutor.execute(creates);
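
A self-contained sketch of that idea, for readers who want to run it. All names here (`MiniTree`, `CreateOp`, `IgnoreExistsExecutor`, `NodeExistsError`) are hypothetical stand-ins written for this illustration; they are neither the ZooKeeper API nor the classes in the proposed branch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-ins; neither the ZooKeeper API nor the proposed one.
class NodeExistsError extends RuntimeException {
    NodeExistsError(String path) { super("node exists: " + path); }
}

class MiniTree {
    final Set<String> paths = new TreeSet<>();

    // Creates one node; the parent must already exist. Paths start with "/".
    void create(String path) {
        String parent = path.substring(0, path.lastIndexOf('/'));
        if (!parent.isEmpty() && !paths.contains(parent)) {
            throw new IllegalStateException("no parent for " + path);
        }
        if (!paths.add(path)) {
            throw new NodeExistsError(path);
        }
    }
}

// An operation object that can be built in one place and run in another.
class CreateOp {
    final String path;
    CreateOp(String path) { this.path = path; }
    void applyTo(MiniTree tree) { tree.create(path); }
}

class IgnoreExistsExecutor {
    private final MiniTree tree;
    IgnoreExistsExecutor(MiniTree tree) { this.tree = tree; }

    // Runs the operations in order, skipping nodes that already exist.
    // Each create is still a separate step, so this is NOT atomic.
    void execute(List<CreateOp> ops) {
        for (CreateOp op : ops) {
            try {
                op.applyTo(tree);
            } catch (NodeExistsError ignored) {
                // Already there: fine for a recursive create.
            }
        }
    }
}

class RecursiveCreateDemo {
    public static void main(String[] args) {
        MiniTree tree = new MiniTree();
        tree.create("/A"); // pretend /A was created earlier
        new IgnoreExistsExecutor(tree).execute(Arrays.asList(
            new CreateOp("/A"), new CreateOp("/A/C"), new CreateOp("/A/C/D-1")));
        System.out.println(tree.paths);
    }
}
```

Note that the loop applies the creates one at a time, so another client can still observe the tree between two of them.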

> If you actually have/know of solutions for the suggested problems
> which were not yet covered here, I'm very interested in knowing about
> them, but will need slightly more precise information.
An alternative would be to have a special znode in /A that signals that
the full structure has been set up correctly.

Best regards,

Thomas Koch, http://www.koch.ro

Re: Parent nodes & multi-step transactions

Posted by Gustavo Niemeyer <gu...@niemeyer.net>.

Hi Thomas,

> I have a very strong feeling against more complex operations in the ZK server.

Can you please describe a little better what that feeling is about?

> These are things that should be provided by a ZK client helper library. The

Which things should be provided by client helper libraries?  Client
libraries cannot provide atomic operations, which means that the
reasoning and logic which must happen on top of ZK to avoid
half-initialized states and observation of structure setup and teardown
must continue to be taken into account.  It basically means that to
avoid having a relatively simple batch operation, the reasoning which
must happen around ZK gets significantly more complex, or has to be
avoided entirely.

> zkclient library from 101tec for example gives you exactly that.

It's not clear to me what "exactly that" is in this context.  I've
looked for the code and couldn't find an answer/alternative to the
issues discussed in this thread.

> If you're planning to write another layer on top of the ZK API please have a
> look at https://issues.apache.org/jira/browse/ZOOKEEPER-835

Looked there as well.  Also can't find anything relevant to this discussion.

> I'm planning to provide an alternative java client API for 3.4.0 and would
> then propose to deprecate the current one in the long run.
> You can preview the new API at
> http://github.com/thkoch2001/zookeeper/tree/operation_classes

And this is a full branch of ZK.  Tried checking out the commit
messages or something to get an idea of what you mean, but also am
unable to find answers to these problems.

If you actually have/know of solutions for the suggested problems
which were not yet covered here, I'm very interested in knowing about
them, but will need slightly more precise information.

-- 
Gustavo Niemeyer
http://niemeyer.net
http://niemeyer.net/blog
http://niemeyer.net/twitter

Re: Parent nodes & multi-step transactions

Posted by Thomas Koch <th...@koch.ro>.

Hi Gustavo,

I have a very strong feeling against more complex operations in the ZK server. 
These are things that should be provided by a ZK client helper library. The 
zkclient library from 101tec for example gives you exactly that.
If you're planning to write another layer on top of the ZK API please have a 
look at https://issues.apache.org/jira/browse/ZOOKEEPER-835
I'm planning to provide an alternative java client API for 3.4.0 and would 
then propose to deprecate the current one in the long run.
You can preview the new API at
http://github.com/thkoch2001/zookeeper/tree/operation_classes
However we need to redo it on top of ZOOKEEPER-823 once it is applied to
trunk.

Best regards,

Thomas Koch, http://www.koch.ro

Re: Parent nodes & multi-step transactions

Posted by Gustavo Niemeyer <gu...@niemeyer.net>.

> My own opinion is that lots of these structure sorts of problems are solved
> by putting the structure into a single znode.  Atomic creation and update
> come for free at that point and we can even make the node ephemeral which we
> can't really do if there are children.

Sure, it makes sense that using a single znode gets rid of some of the
problems, after all we'd be effectively getting an atomic operation.
It also gets rid of many of the advantages of using ZooKeeper, though.
Independent changes become conflicts, watches fire more frequently
than they should, clients have to parse the whole blob to know what
has changed and filter locally, etc.
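
The "independent changes become conflicts" point can be sketched like this. The example assumes a hypothetical in-memory node with a version counter (`VersionedBlob` is a made-up class, loosely modelled on conditional updates that carry an expected version); the `b=...;c=...` payload format is likewise only for illustration.

```java
// Hypothetical model of one versioned node holding a whole structure.
class VersionedBlob {
    private String data = "";
    private int version = 0;

    synchronized int version() { return version; }
    synchronized String data() { return data; }

    // Conditional update: succeeds only if nobody wrote since expectedVersion.
    synchronized boolean setIfVersion(String newData, int expectedVersion) {
        if (expectedVersion != version) {
            return false; // conflict: caller must re-read and retry
        }
        data = newData;
        version++;
        return true;
    }
}

class BlobConflictDemo {
    public static void main(String[] args) {
        VersionedBlob blob = new VersionedBlob();
        blob.setIfVersion("b=0;c=0", 0);

        int seen = blob.version(); // both clients read the same version
        boolean first = blob.setIfVersion("b=1;c=0", seen);  // client 1 edits b
        boolean second = blob.setIfVersion("b=0;c=1", seen); // client 2 edits c
        System.out.println("first=" + first + " second=" + second);
    }
}
```

The two clients touch unrelated fields, yet the second write is rejected because both fields live in the same blob. With separate child nodes the two updates would not have interfered.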

> The natural representation is to have the nodes signal that they are
> handling a particular shard by creating an ephemeral file under a per shard
> directory.  This is nice because node failures cause automagical update of
> the data.  The dual is also natural ... we can create shard files under node
> directories.  That dual is a serious mistake, however, and it is much better
> to put all the dual information in a single node file that the node itself
> creates.  This allows ephemerality to maintain a correct view for us.

Interesting indeed.

(...)
> This doesn't eliminate all desire for transactions, but it gets rid of LOTs
> of them.

Thanks for these ideas.

-- 
Gustavo Niemeyer
http://niemeyer.net
http://niemeyer.net/blog
http://niemeyer.net/twitter

Re: Parent nodes & multi-step transactions

Posted by Ted Dunning <te...@gmail.com>.

My own opinion is that lots of these structure sorts of problems are solved
by putting the structure into a single znode.  Atomic creation and update
come for free at that point and we can even make the node ephemeral which we
can't really do if there are children.

It is tempting to use children and grand-children in ZK when this is needed,
but it is surprisingly useful to avoid this.

Take Katta as an example.  This is a sharded query system.  The master
knows about shards that need to be handled by nodes.  Nodes come on-line and
advertise their existence.  The master assigns shards to nodes.  The nodes
download the shards and advertise that they are handling them.  The
master has to handle node failures and recoveries.

The natural representation is to have the nodes signal that they are
handling a particular shard by creating an ephemeral file under a per shard
directory.  This is nice because node failures cause automagical update of
the data.  The dual is also natural ... we can create shard files under node
directories.  That dual is a serious mistake, however, and it is much better
to put all the dual information in a single node file that the node itself
creates.  This allows ephemerality to maintain a correct view for us.
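
One way to read the single-node-file suggestion is sketched below. It is an in-memory model only: `EphemeralRegistry`, the `/nodes/` prefix, and the session ids are assumptions made for this illustration, and this is not how Katta itself is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model: one ephemeral file per worker node, owned by a session.
class EphemeralRegistry {
    // path -> owning session id
    private final Map<String, String> owner = new HashMap<>();
    // path -> shards listed inside that single file
    private final Map<String, List<String>> content = new TreeMap<>();

    // The worker publishes every shard it serves in ONE file, in one write.
    void publish(String session, String node, List<String> shards) {
        String path = "/nodes/" + node;
        owner.put(path, session);
        content.put(path, new ArrayList<>(shards));
    }

    // Session loss removes everything that session owned, as ephemerals do.
    void expire(String session) {
        owner.entrySet().removeIf(e -> {
            if (e.getValue().equals(session)) {
                content.remove(e.getKey());
                return true;
            }
            return false;
        });
    }

    // The master derives the shard -> nodes view from the live files.
    Map<String, List<String>> shardView() {
        Map<String, List<String>> view = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : content.entrySet()) {
            String node = e.getKey().substring("/nodes/".length());
            for (String shard : e.getValue()) {
                view.computeIfAbsent(shard, k -> new ArrayList<>()).add(node);
            }
        }
        return view;
    }
}

class EphemeralDemo {
    public static void main(String[] args) {
        EphemeralRegistry reg = new EphemeralRegistry();
        reg.publish("s1", "node-1", Arrays.asList("shard-a", "shard-b"));
        reg.publish("s2", "node-2", Arrays.asList("shard-b"));
        reg.expire("s1"); // node-1 dies; its single file vanishes
        System.out.println(reg.shardView());
    }
}
```

Because each worker's assignments live in one file that disappears with its session, there is no partially cleaned-up set of per-shard entries to repair after a failure.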

There are other places where this idea works well.  One such thing is a
queue of tasks.  The queue itself can be represented as several files that
contain lots of tasks instead of keeping each task in a separate file.

This doesn't eliminate all desire for transactions, but it gets rid of LOTs
of them.

On Tue, Aug 24, 2010 at 12:31 AM, Dave Wright <wr...@gmail.com> wrote:

> For my $0.02, I really think it would be nice if ZK supported
> "lightweight transactions". By that, I simply mean that a batch of
> create/update/delete requests could be submitted in a single request,
> and be processed atomically (if any of the requests would fail, none
> are applied).
> I know transactions have been discussed before and discarded as adding
> too much complexity, but I think a simple version of transactions
> would be extremely helpful. A significant portion of our code is
> cleanup/workarounds for the inability to make several updates
> atomically. Should the time allow for me to work on any single
> feature, that's probably the one I would pick, although I'm concerned
> that there would be resistance to accepting upstream.
>
> -Dave Wright
>
> On Mon, Aug 23, 2010 at 6:51 PM, Gustavo Niemeyer <gu...@niemeyer.net>
> wrote:
> > Hi Mahadev,
> >
> >>  Usually the paradigm I like to suggest is to have something like
> >>
> >> /A/init
> >>
> >> Every client watches for the existence of this node and this node is
> only
> >> created after /A has been initialized with the creation of /A/C or other
> >> stuff.
> >>
> >> Would that work for you?
> >
> > Yeah, this is what I referred to as "liveness nodes" in my prior
> > ramblings, but I'm a bit sad about the amount of boilerplate work that
> > will have to be done to put something like this to use.  It feels like as
> > the size of the problem increases, it might become a bit hard to keep
> > the whole picture in mind.
> >
> > Here is a slightly more realistic example (still significantly
> > reduced), to give you an idea of the problem size:
> >
> > /services/wordpress/settings
> > /services/wordpress/units/wordpress-0/agent-connected
> > /services/wordpress/units/wordpress-1
> > /machines/machine-0/agent-connected
> > /machines/machine-0/units/wordpress-1
> > /machines/machine-1/units/wordpress-0
> >
> > There are quite a few dynamic nodes here which are created and
> > initialized on demand.  If we use these liveness nodes, we'll have to
> > not only set watches in several places, but also have some kind of
> > recovering daemon to heal a half-created state, and also filter
> > user-oriented feedback to avoid showing nodes which may be dead.  All
> > of that would be avoided if there was a way to have multi-step atomic
> > actions.  I'm almost pondering about a journal-like system on top of
> > the basic API, to avoid having to deal with this manually.
> >
> > --
> > Gustavo Niemeyer
> > http://niemeyer.net
> > http://niemeyer.net/blog
> > http://niemeyer.net/twitter
> >
>

Re: Parent nodes & multi-step transactions

Posted by Dave Wright <wr...@gmail.com>.

For my $0.02, I really think it would be nice if ZK supported
"lightweight transactions". By that, I simply mean that a batch of
create/update/delete requests could be submitted in a single request,
and be processed atomically (if any of the requests would fail, none
are applied).
I know transactions have been discussed before and discarded as adding
too much complexity, but I think a simple version of transactions
would be extremely helpful. A significant portion of our code is
cleanup/workarounds for the inability to make several updates
atomically. Should the time allow for me to work on any single
feature, that's probably the one I would pick, although I'm concerned
that there would be resistance to accepting upstream.
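
The all-or-nothing semantics described above can be sketched as follows. This is a toy in-memory model covering only create and delete; `BatchOp` and `AtomicTree` are hypothetical names for this illustration, and nothing here reflects how a ZooKeeper server would implement such a feature.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch of an all-or-nothing batch; not a ZooKeeper API.
class BatchOp {
    final boolean isCreate; // true = create, false = delete
    final String path;
    private BatchOp(boolean isCreate, String path) {
        this.isCreate = isCreate;
        this.path = path;
    }
    static BatchOp create(String path) { return new BatchOp(true, path); }
    static BatchOp delete(String path) { return new BatchOp(false, path); }
}

class AtomicTree {
    private TreeSet<String> paths = new TreeSet<>();

    synchronized boolean exists(String path) { return paths.contains(path); }

    // Applies every operation to a private copy and publishes the copy only
    // if all of them succeed; on any failure the visible tree is untouched.
    synchronized boolean applyAll(List<BatchOp> ops) {
        TreeSet<String> copy = new TreeSet<>(paths);
        for (BatchOp op : ops) {
            boolean ok = op.isCreate ? create(copy, op.path) : delete(copy, op.path);
            if (!ok) {
                return false; // abort: discard the copy
            }
        }
        paths = copy;
        return true;
    }

    // Fails if the parent is missing or the node already exists.
    private static boolean create(TreeSet<String> tree, String path) {
        String parent = path.substring(0, path.lastIndexOf('/'));
        boolean parentOk = parent.isEmpty() || tree.contains(parent);
        return parentOk && tree.add(path);
    }

    // Fails if the node is missing or still has children.
    private static boolean delete(TreeSet<String> tree, String path) {
        boolean hasChild = tree.stream().anyMatch(p -> p.startsWith(path + "/"));
        return !hasChild && tree.remove(path);
    }
}

class BatchDemo {
    public static void main(String[] args) {
        AtomicTree tree = new AtomicTree();
        boolean ok = tree.applyAll(Arrays.asList(
            BatchOp.create("/A"), BatchOp.create("/A/B"), BatchOp.create("/A/C")));
        // The second batch fails on the duplicate /A/C, so /A/E is not kept.
        boolean rejected = !tree.applyAll(Arrays.asList(
            BatchOp.create("/A/E"), BatchOp.create("/A/C")));
        System.out.println(ok + " " + rejected + " " + tree.exists("/A/E"));
    }
}
```

Readers only ever see the tree before or after a whole batch, which is what removes the need for cleanup code around half-applied updates.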

-Dave Wright

On Mon, Aug 23, 2010 at 6:51 PM, Gustavo Niemeyer <gu...@niemeyer.net> wrote:
> Hi Mahadev,
>
>>  Usually the paradigm I like to suggest is to have something like
>>
>> /A/init
>>
>> Every client watches for the existence of this node and this node is only
>> created after /A has been initialized with the creation of /A/C or other
>> stuff.
>>
>> Would that work for you?
>
> Yeah, this is what I referred to as "liveness nodes" in my prior
> ramblings, but I'm a bit sad about the amount of boilerplate work that
> will have to be done to put something like this to use.  It feels like as
> the size of the problem increases, it might become a bit hard to keep
> the whole picture in mind.
>
> Here is a slightly more realistic example (still significantly
> reduced), to give you an idea of the problem size:
>
> /services/wordpress/settings
> /services/wordpress/units/wordpress-0/agent-connected
> /services/wordpress/units/wordpress-1
> /machines/machine-0/agent-connected
> /machines/machine-0/units/wordpress-1
> /machines/machine-1/units/wordpress-0
>
> There are quite a few dynamic nodes here which are created and
> initialized on demand.  If we use these liveness nodes, we'll have to
> not only set watches in several places, but also have some kind of
> recovering daemon to heal a half-created state, and also filter
> user-oriented feedback to avoid showing nodes which may be dead.  All
> of that would be avoided if there was a way to have multi-step atomic
> actions.  I'm almost pondering about a journal-like system on top of
> the basic API, to avoid having to deal with this manually.
>
> --
> Gustavo Niemeyer
> http://niemeyer.net
> http://niemeyer.net/blog
> http://niemeyer.net/twitter
>

Re: Parent nodes & multi-step transactions

Posted by Gustavo Niemeyer <gu...@niemeyer.net>.

Hi Mahadev,

>  Usually the paradigm I like to suggest is to have something like
>
> /A/init
>
> Every client watches for the existence of this node and this node is only
> created after /A has been initialized with the creation of /A/C or other
> stuff.
>
> Would that work for you?

Yeah, this is what I referred to as "liveness nodes" in my prior
ramblings, but I'm a bit sad about the amount of boilerplate work that
will have to be done to put something like this to use.  It feels like as
the size of the problem increases, it might become a bit hard to keep
the whole picture in mind.

Here is a slightly more realistic example (still significantly
reduced), to give you an idea of the problem size:

/services/wordpress/settings
/services/wordpress/units/wordpress-0/agent-connected
/services/wordpress/units/wordpress-1
/machines/machine-0/agent-connected
/machines/machine-0/units/wordpress-1
/machines/machine-1/units/wordpress-0

There are quite a few dynamic nodes here which are created and
initialized on demand.  If we use these liveness nodes, we'll have to
not only set watches in several places, but also have some kind of
recovering daemon to heal a half-created state, and also filter
user-oriented feedback to avoid showing nodes which may be dead.  All
of that would be avoided if there was a way to have multi-step atomic
actions.  I'm almost pondering about a journal-like system on top of
the basic API, to avoid having to deal with this manually.

-- 
Gustavo Niemeyer
http://niemeyer.net
http://niemeyer.net/blog
http://niemeyer.net/twitter

Re: Parent nodes & multi-step transactions

Posted by Mahadev Konar <ma...@yahoo-inc.com>.

Hi Gustavo,
 Usually the paradigm I like to suggest is to have something like

/A/init

Every client watches for the existence of this node and this node is only
created after /A has been initialized with the creation of /A/C or other
stuff.
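
A minimal sketch of this marker pattern, assuming a hypothetical in-memory stand-in for the tree (`MarkerTree` and `InitMarkerDemo` are made-up names, not the ZooKeeper client API). With the real client the reader would more likely set a watch on the marker than poll for it.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentSkipListSet;

// Hypothetical in-memory stand-in for the tree; not the ZooKeeper API.
class MarkerTree {
    private final Set<String> paths = new ConcurrentSkipListSet<>();
    void create(String path) { paths.add(path); }
    boolean exists(String path) { return paths.contains(path); }
}

class InitMarkerDemo {
    // Writer: build the structure first, create the marker last.
    static void initialize(MarkerTree tree, String root, String... children) {
        tree.create(root);
        for (String child : children) {
            tree.create(root + "/" + child);
        }
        tree.create(root + "/init"); // signals "structure complete"
    }

    // Reader: treat the parent as usable only once the marker exists.
    static boolean isReady(MarkerTree tree, String root) {
        return tree.exists(root + "/init");
    }

    public static void main(String[] args) {
        MarkerTree tree = new MarkerTree();
        tree.create("/A"); // a writer that crashed right after this step
        System.out.println("/A ready: " + isReady(tree, "/A"));
        initialize(tree, "/B", "C");
        System.out.println("/B ready: " + isReady(tree, "/B"));
    }
}
```

The marker hides half-built parents from readers, but it does not remove them: something still has to clean up a parent like `/A` whose writer died before creating the marker.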

Would that work for you?

Thanks
mahadev


On 8/23/10 7:34 AM, "Gustavo Niemeyer" <gu...@niemeyer.net> wrote:

> Greetings,
> 
> We (a development team at Canonical) have run into a situation and
> I'm curious to understand what the general practice is, since I'm
> sure this is a fairly common issue.
> 
> It's quite easy to describe it: say there's a parent node A somewhere
> in the tree.  That node was created dynamically over the course of
> running the system, because it's associated with some resource which
> has its own life-span.  Now, under this node we put some control nodes
> for different reasons (say, A/B), and we also want to track some
> information which is related to a sequence of nodes (say, A/C/D-0,
> A/C/D-1, etc).
> 
> So, we end up with something like this:
> 
>     A/B
>     A/C/D-0
>     A/C/D-1
> 
> The question here is about best-practices for taking care of nodes
> like A/C.  It'd be fantastic to be able to create A's structure
> together with A itself, otherwise we risk getting in a situation where
> a client can see the node A before its "initialization" has been
> finished (A/C doesn't exist yet).  In fact, A/C may never exist, since
> it is possible for a client to die between the creation of A and C.
> 
> Anyway, I'm sure you all understand the problem.  The question here
> is: this is pretty common, and quite boring to deal with properly on
> every single client.  Is there any feature in the roadmap to deal with
> this, and any common practice besides the obvious "check for
> half-initialization and wait for A/C to be created or deal with
> timeouts and whatnot" on every client?
> 
> I'm about to start writing another layer on top of Zookeeper's API, so
> it'd be great to have some additional insight into this issue.
> 
> --
> Gustavo Niemeyer
> http://niemeyer.net
> http://niemeyer.net/blog
> http://niemeyer.net/twitter
>