You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@jackrabbit.apache.org by Jukka Zitting <ju...@gmail.com> on 2010/02/17 17:17:54 UTC

[jr3] Flat hierarchy

Hi,

This one's a quite frequently asked feature. Currently we store the
full list of child nodes in the same bundle or node state record with
the parent node. This makes it expensive to support nodes with large
amounts (>>1k) of child nodes.

The obvious solution to this problem is to use a data structure like a
B-tree to keep track of the child node entries. Doing this efficiently
while still supporting both orderable nodes and same-name-siblings
(i.e. nt:unstructured) probably requires some deeper thought. An added
benefit would be the ability to do this on top of an append-only
storage model.

Any other ideas on how we could do this?

BR,

Jukka Zitting

Re: [jr3] Flat hierarchy

Posted by Ian Boston <ie...@tfd.co.uk>.

On 18 Feb 2010, at 13:04, Stefan Guggisberg wrote:

> On Wed, Feb 17, 2010 at 5:17 PM, Jukka Zitting <ju...@gmail.com> wrote:
>> Hi,
>> 
>> This one's a quite frequently asked feature. Currently we store the
>> full list of child nodes in the same bundle or node state record with
>> the parent node. This makes it expensive to support nodes with large
>> amounts (>>1k) of child nodes.
> 
> in my experience notable write-performance impacts start with >>10-20k
> child nodes ;)
> 10-20k is quite a lot IMO, most applications probably won't ever hit
> this 'limit'.

I am not arguing for flat hierarchies but I spend a lot of time persuading UX designers and management that 100k or more child nodes in not acceptable in a hierarchical content system.

With an instance supporting 4M identified users with a unique user name, it hard to argue that /home/<uid> is not acceptable when there are so many other systems that support this. (we have modified the UserManager to scale)

> 
> i agree that flat hierarchies is an important feature, however we
> shouldn't compromise
> the performance for the 'standard' case with less than 1k child nodes
> just for the sake
> of supporting flat hierarchies. after all, the JCR api is inherently
> hierarchic.
> 
> maybe we should provide separate node types or node type attributes to
> indicate that
> instances of  that type may have a large number of child nodes. nodes
> of that node
> type would be represented/handled differently in the persistence layer.

This is the approach that we have taken, however its not without problems when trying to discover whats in /home

Ian

> 
> cheers
> stefan
> 
>> 
>> The obvious solution to this problem is to use a data structure like a
>> B-tree to keep track of the child node entries. Doing this efficiently
>> while still supporting both orderable nodes and same-name-siblings
>> (i.e. nt:unstructured) probably requires some deeper thought. An added
>> benefit would be the ability to do this on top of an append-only
>> storage model.
>> 
>> Any other ideas on how we could do this?
>> 
>> BR,
>> 
>> Jukka Zitting
>>

Re: [jr3] Flat hierarchy

Posted by Thomas Müller <th...@day.com>.

Hi,

A Jackrabbit repository is "some kind of" b-tree - just that the pages
never split and balanced automatically. Maybe using b-tree is
confusing? Let's call it "manual b-tree" then.

> i agree that flat hierarchies is an important feature, however we
> shouldn't compromise
> the performance for the 'standard' case with less than 1k child nodes

I agree. Using the b-tree style wouldn't slow down the standard case.
In the standard case things would stay exactly like they are now.

> add a "next" pointer
> This makes the data structure more complex but allows us to maintain support for orderable nodes.

That's definitely an option. I just wonder if it's really required. I
guess we will find out.

Regards,
Thomas

Re: [jr3] Flat hierarchy

Posted by Stefan Guggisberg <st...@gmail.com>.

On Wed, Feb 17, 2010 at 5:17 PM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> This one's a quite frequently asked feature. Currently we store the
> full list of child nodes in the same bundle or node state record with
> the parent node. This makes it expensive to support nodes with large
> amounts (>>1k) of child nodes.

in my experience notable write-performance impacts start with >>10-20k
child nodes ;)
10-20k is quite a lot IMO, most applications probably won't ever hit
this 'limit'.

i agree that flat hierarchies is an important feature, however we
shouldn't compromise
the performance for the 'standard' case with less than 1k child nodes
just for the sake
of supporting flat hierarchies. after all, the JCR api is inherently
hierarchic.

maybe we should provide separate node types or node type attributes to
indicate that
instances of  that type may have a large number of child nodes. nodes
of that node
type would be represented/handled differently in the persistence layer.

cheers
stefan

>
> The obvious solution to this problem is to use a data structure like a
> B-tree to keep track of the child node entries. Doing this efficiently
> while still supporting both orderable nodes and same-name-siblings
> (i.e. nt:unstructured) probably requires some deeper thought. An added
> benefit would be the ability to do this on top of an append-only
> storage model.
>
> Any other ideas on how we could do this?
>
> BR,
>
> Jukka Zitting
>

Re: [jr3] Flat hierarchy

Posted by Thomas Müller <th...@day.com>.

About micro-kernel: I think the micro-kernel shouldn't have to support
splitting large child lists into smaller lists. That could be done on
a higher level. What it should support is all the features required by
this mechanism:

- Support 'hidden' nodes (those are marked using a hidden property).
That means the path doesn't always map directly to the stored
structure. Therefore the micro-kernel should not be directly
responsible to build or interpret JCR paths (micro-kernel path are
similar but they don't always match the JCR path).

- The entry in the child node list may contain multiple properties (in
most cases the name, and the child node id; but sometimes also the
reference of the next child node). The number of properties for each
entry is the same however. For sorting, only the first element is
relevant.

- The child node list can always be stored in sorted order. But this
sorting doesn't always map to the JCR child node list.

Regards,
Thomas

Re: [jr3] Flat hierarchy

Posted by Jeff Yemin <je...@gmail.com>.

Thomas Müller-2 wrote:
> 
>> The part that's not clear to me is how this can be efficiently combined
>> with
>> an append-only storage format that's being discussed on the "[jr3]
>> Unified
>> persistence" thread.  It wouldn't be good if every time a list of
>> children
>> is modified the persistence layer has to make a complete copy of the
>> modified B-tree
> 
> You only have to update the b-tree node that is modified. That may be
> a hidden node (internal, hidden b-tree node) or a "real" node.
> 

In that case re-balancing of the tree will have to be taken into account,
where multiple internal nodes of the tree can be modified as a result of an
insertion or deletion.

Also, it seems to me that technically we're probably talking about a B+
tree, as it makes sense that the real child nodes would all be pointed to
from the leaf nodes.

Jeff

-- 
View this message in context: http://n4.nabble.com/jr3-Flat-hierarchy-tp1558925p1560979.html
Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.

Re: [jr3] Flat hierarchy

Posted by Alexander Klimetschek <ak...@day.com>.

On Thu, Feb 18, 2010 at 22:36, Justin Edelson <ju...@gmail.com> wrote:
> Perhaps JR3 should define a node type called unstructured-unordered :)
> Would be non-standard, of course.

He he ;-) But this could simply be documented clearly, eg. in the FAQ,
and provide a sample node type "my:bigpile" that has residual child
node definitions, but no orderable flag.

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetschek@day.com

Re: [jr3] Flat hierarchy

Posted by Justin Edelson <ju...@gmail.com>.

On 2/18/10 4:27 PM, Alexander Klimetschek wrote:
> On Thu, Feb 18, 2010 at 22:04, Jeff Yemin <je...@gmail.com> wrote:
>>
>> Thomas Müller-2 wrote:
>>> You are right. Unfortunately "orderable child nodes" is the default.
>>
>> Where in the spec does it say what the default is for orderable child nodes?
>> The only place I see it is in the CND for nt:unstructured.  Or is this a
>> Jackrabbit implementation default that users now rely on?
> 
> I think Thomas refers to the commonly used nt:unstructured that has
> orderable child nodes. Even if you inherit from it, you can't change
> that behavior.

Perhaps JR3 should define a node type called unstructured-unordered :)
Would be non-standard, of course.

Justin

> 
> 
>> Of the use cases I'm aware of for flat
>> hierarchies, none require an ordering of the children anyway, so this might
>> be an acceptable limitation in the implementation.
> 
> Right, there are either a low number of nodes that might be orderable
> or a large number of nodes, that are unordered. At least these are the
> cases that jr3 should be optimized for.
> 
> So for the unordered case, one should use a node type that does not
> have orderable child nodes. If someone choses to add a lot of nodes to
> an orderable child nodes list, it is ok to be slow.
> 
> Regards,
> Alex
>

Re: [jr3] Flat hierarchy

Posted by Alexander Klimetschek <ak...@day.com>.

On Thu, Feb 18, 2010 at 22:04, Jeff Yemin <je...@gmail.com> wrote:
>
> Thomas Müller-2 wrote:
>> You are right. Unfortunately "orderable child nodes" is the default.
>
> Where in the spec does it say what the default is for orderable child nodes?
> The only place I see it is in the CND for nt:unstructured.  Or is this a
> Jackrabbit implementation default that users now rely on?

I think Thomas refers to the commonly used nt:unstructured that has
orderable child nodes. Even if you inherit from it, you can't change
that behavior.

> Of the use cases I'm aware of for flat
> hierarchies, none require an ordering of the children anyway, so this might
> be an acceptable limitation in the implementation.

Right, there are either a low number of nodes that might be orderable
or a large number of nodes, that are unordered. At least these are the
cases that jr3 should be optimized for.

So for the unordered case, one should use a node type that does not
have orderable child nodes. If someone choses to add a lot of nodes to
an orderable child nodes list, it is ok to be slow.

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetschek@day.com

Re: [jr3] Flat hierarchy

Posted by Jeff Yemin <je...@gmail.com>.

Thomas Müller-2 wrote:
> 
> You are right. Unfortunately "orderable child nodes" is the default.
> 

Where in the spec does it say what the default is for orderable child nodes? 
The only place I see it is in the CND for nt:unstructured.  Or is this a
Jackrabbit implementation default that users now rely on?

Jeff

-- 
View this message in context: http://n4.nabble.com/jr3-Flat-hierarchy-tp1558925p1560831.html
Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.

Re: [jr3] Flat hierarchy

Posted by Thomas Müller <th...@day.com>.

Hi,

> Even without using orderBefore, the specification still requires a stable
> ordering of children for nodes that support orderable child nodes (see 23.1
> of JCR 2.0 spec).

Thanks for the link! I see now my idea violates the specification
which says (23.3) "When a child node is added to a node that has
orderable child nodes it is added to the end of the list." My idea was
to add the child node according to its name (until the order is
changed using "orderBefore").

> One possibility is to limit the use of the B-tree to nodes that do not support orderable children

You are right. Unfortunately "orderable child nodes" is the default.

Another solution is to keep a linked list from child node entry to
child node entry (only in this case). Let's see how complicated that
is.

By the way "same name siblings" would work fine (no linked list required).

Regards,
Thomas

Re: [jr3] Flat hierarchy

Posted by Jeff Yemin <je...@gmail.com>.

Thomas Müller-2 wrote:
> 
> If there are many child nodes, then we should try to optimize for the
> most important case. I think it doesn't make sense to optimize for the
> case that the long list (many thousand) children are manually
> re-ordered (using orderBefore).
> 

Even without using orderBefore, the specification still requires a stable
ordering of children for nodes that support orderable child nodes (see 23.1
of JCR 2.0 spec).  One possibility is to limit the use of the B-tree to
nodes that do not support orderable children, and fall back to the simple
array for orderable children.  Of the use cases I'm aware of for flat
hierarchies, none require an ordering of the children anyway, so this might
be an acceptable limitation in the implementation.

Jeff

-- 
View this message in context: http://n4.nabble.com/jr3-Flat-hierarchy-tp1558925p1560650.html
Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.

Re: [jr3] Flat hierarchy

Posted by Thomas Müller <th...@day.com>.

Hi,

> JCR requires lookup of children by name and/or position (for orderable
> children), so the implementation needs to support all these cases
> efficiently.  The trickiest one to handle is probably Node.getNodes(String
> namePattern) because it requires using both name and position together.

While it's true that all that needs to be supported, I doubt that we
should try to optimize for all cases. Otherwise the normal case will
be slower.

Usually, there are not that many child nodes. In that case lookup is
not a problem: both a array and a hash map can be used (in memory).

If there are many child nodes, then we should try to optimize for the
most important case. I think it doesn't make sense to optimize for the
case that the long list (many thousand) children are manually
re-ordered (using orderBefore).

Regards,
Thomas

Re: [jr3] Flat hierarchy

Posted by Jeff Yemin <je...@gmail.com>.

JCR requires lookup of children by name and/or position (for orderable
children), so the implementation needs to support all these cases
efficiently.  The trickiest one to handle is probably Node.getNodes(String
namePattern) because it requires using both name and position together.

Jeff

-- 
View this message in context: http://n4.nabble.com/jr3-Flat-hierarchy-tp1558925p1560524.html
Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.

Re: [jr3] Flat hierarchy

Posted by Thomas Müller <th...@day.com>.

Hi,

> I think Jukka is correct that the correct use of B-trees is to use one for
> each list of child nodes, not as a way to model the entire hierarchy.

If you are more comfortable with this view, that's OK. I probably
should have said: the whole repository is a "tree data structure".

> And there are modifications that can easily be applied to
> B-trees that deal with arbitrary (not based on a key) ordering of the nodes

Sure. Jackrabbit needs a way to quickly navigate to a node by path.
For that, you have to traverse the nodes and for each node you have to
find the correct child. To do that, it's better if the child node list
is ordered by name. Otherwise you have to iterate over all child nodes
until you find the right one. Or you need a secondary index (Lucene?).
And that's no matter if it's using a b-tree internally or not.

> The part that's not clear to me is how this can be efficiently combined with
> an append-only storage format that's being discussed on the "[jr3] Unified
> persistence" thread.  It wouldn't be good if every time a list of children
> is modified the persistence layer has to make a complete copy of the
> modified B-tree

You only have to update the b-tree node that is modified. That may be
a hidden node (internal, hidden b-tree node) or a "real" node.

Regards,
Thomas

Re: [jr3] Flat hierarchy

Posted by Jeff Yemin <je...@gmail.com>.

I think Jukka is correct that the correct use of B-trees is to use one for
each list of child nodes, not as a way to model the entire hierarchy.  I've
seen this implementation approach used for similar use cases, so I know that
it can work.  And there are modifications that can easily be applied to
B-trees that deal with arbitrary (not based on a key) ordering of the nodes,
even allowing you to skip to the n-th child efficiently (without having to
traverse the entire tree).

The part that's not clear to me is how this can be efficiently combined with
an append-only storage format that's being discussed on the "[jr3] Unified
persistence" thread.  It wouldn't be good if every time a list of children
is modified the persistence layer has to make a complete copy of the
modified B-tree, as that would defeat the purpose of use a b-tree in the
first place, wouldn't it?

Jeff
-- 
View this message in context: http://n4.nabble.com/jr3-Flat-hierarchy-tp1558925p1560288.html
Sent from the Jackrabbit - Dev mailing list archive at Nabble.com.

Re: [jr3] Flat hierarchy

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Thu, Feb 18, 2010 at 2:01 PM, Thomas Müller <th...@day.com> wrote:
> The repository is one large b-tree, and each JCR node is a b-tree node
> (except for the JCR nodes that don't have any child nodes: those are
> b-tree leaves).

I don't think we can model the entire repository as a B-tree (it's
certainly a tree, but the B-tree restructuring and balancing features
don't apply). Instead I'd store the child node lists of each node as
separate B-trees.

> Path lookups would still be fast (the same speed as now), except for
> large child node lists that were re-ordered. The difference is only
> for large child node lists. There is a difference between 'orderable'
> nodes (have the ability to reorder the child node list) and actually
> 're-ordered' child node lists. Is it acceptable if new nodes appear in
> lexicographic order in the child node list?

Instead of modifying the standard insertion order semantics for
orderable nodes, we could add a "next" pointer to the B-tree entries
to overlay a linked list on top of the B-tree structure. This makes
the data structure more complex but allows us to maintain support for
orderable nodes.

BR,

Jukka Zitting

Re: [jr3] Flat hierarchy

Posted by Thomas Müller <th...@day.com>.

Hi,

>> I would also use a b-tree structure. If a node has too many child
>> nodes, two new "invisible internal nodes" are created, and the list of
>> child nodes is split up. Those internal nodes wouldn't have any
>> properties.
>
> You mean a b-tree for each node? I think this could be a separate
> index, but one for the whole tree.

The repository is one large b-tree, and each JCR node is a b-tree node
(except for the JCR nodes that don't have any child nodes: those are
b-tree leaves). If a JCR node has many child nodes, then there is at
least one more level of b-tree nodes between the node and the child
nodes.

> I think supporting fast path lookups for orderable child nodes is a
> bit more important than flat hierarchies

Path lookups would still be fast (the same speed as now), except for
large child node lists that were re-ordered. The difference is only
for large child node lists. There is a difference between 'orderable'
nodes (have the ability to reorder the child node list) and actually
're-ordered' child node lists. Is it acceptable if new nodes appear in
lexicographic order in the child node list?

Regards,
Thomas

Re: [jr3] Flat hierarchy

Posted by Alexander Klimetschek <ak...@day.com>.

On Thu, Feb 18, 2010 at 12:34, Thomas Müller <th...@day.com> wrote:
> I would also use a b-tree structure. If a node has too many child
> nodes, two new "invisible internal nodes" are created, and the list of
> child nodes is split up. Those internal nodes wouldn't have any
> properties.

You mean a b-tree for each node? I think this could be a separate
index, but one for the whole tree.

> If the user changes the child node order (manually re-ordered the
> nodes), then the sort order is broken. Then the path lookup has to
> scan through all nodes. While that's much slower, I think it's
> acceptable.

I think supporting fast path lookups for orderable child nodes is a
bit more important than flat hierarchies, at least for CMS
applications. So we should make it fast in that case as well.

Regards,
Alex

-- 
Alexander Klimetschek
alexander.klimetschek@day.com

Re: [jr3] Flat hierarchy

Posted by Thomas Müller <th...@day.com>.

Hi,

I would also use a b-tree structure. If a node has too many child
nodes, two new "invisible internal nodes" are created, and the list of
child nodes is split up. Those internal nodes wouldn't have any
properties.

For efficient path lookup, the child node list should be sorted by
name. This is a bit tricky.

Currently, when adding a node, it is added as the last child. I
suggest to change that behavior, and add it at the "right place" by
default (so that the sort order is preserved). Like this, a path
lookup is very fast even if there are many child nodes (binary search
/ b-tree). Is that an acceptable change (usability and spec conform)?

If the user changes the child node order (manually re-ordered the
nodes), then the sort order is broken. Then the path lookup has to
scan through all nodes. While that's much slower, I think it's
acceptable. One alternative is to use a linked list (each child node
points to the next child node), which is very problematic for sharable
nodes.

So there would be a hidden flag 'child nodes are sorted by name'.

Regards,
Thomas