You are viewing a plain text version of this content. The canonical link for it is here.

Posted to oak-dev@jackrabbit.apache.org by Jukka Zitting <ju...@gmail.com> on 2014/06/20 18:06:22 UTC

Name index

Hi,

Here's an idea for an index structure (for now somewhat specific to
SegmentMK) for speeding up node name and property existence queries.
The index could also be used to avoid full repository traversals when
processing queries with unindexed property constraints. In some cases
it might also work pretty well for queries with multi-property
constraints for which we currently don't have very good support.

INDEX STRUCTURE

The basic idea is to keep track of how many times each node or
property name occurs within each subtree. The index structure would
look something like this:

    { "foo": { "a": 238, "b": 13 },
      "bar": { "a": 172, "c": 86 } }

Such an index would for example tell that there are 238 nodes named
"foo" or with a property named "foo" within the "/a" subtree, and 86
nodes with "bar" within the "/c" subtree.

The index structure would be repeated for each subtree that contains
more than two nodes. For example the index structure for "/a" could
be:

    { "foo": { "d": 137, "e": 100 },
      "bar": { "e": 100, "f": 72 } }

So we'd have 137 nodes with "foo" in the "/a/d" subtree and 100 in the
"/a/e" subtree. Based on the index we can also tell that there are no
"bar" nodes or properties within the "/a/d" subtree.

COST CALCULATION

To calculate the cost of a query like "//[@foo='...']", you'd look at
the index structure and add together the subtree counts for that name.
In this case the cost estimate would be:

    238 + 13 = 251

If multiple names are referenced in a query, like "//[@foo='...' and
@bar='...']", then you'd take the counts for both names and add
together the minimum value of each subtree. In this case the cost
estimate would be:

    min(238, 172) + min(13, 0) + min(0, 86) = 172

If a query has a path restriction, like in
"/jcr:root/a//[@foo='...']", we can use the index structure within
that path for a more specific cost estimate.

INDEX LOOKUP

Index lookup would be implemented as a tree traversal that's guided by
the subtree counts stored in the index.

When looking up a query like "//[@foo='...']", you'd first look a the
top-level index structure to find that the only nodes with "foo" items
are within the "/a" and "/b" subtrees, and would then descend to those
subtrees. In "/a" you'd find that there are more "foo" items within
"/a/d" and "/a/e" so that's where the traversal would continue. Also,
since 137 + 100 < 238, you'd report the "/a" path as a potential match
for the query. Finally, when encountering a leaf node for which no
explicit index data is present, you'd do a hasProperty("foo") check to
determine whether that path should be reported as a potential match.

And if multiple names are referenced in a query, like "//[@foo='...'
and @bar='...']", you'd use the minimum of the matching subtree counts
to direct the tree traversal. Since by looking at the index we can
tell that neither the "/b" nor the "/c" subtree contains both "foo"
and "bar" items, the traversal can be limited to just the "/a"
subtree, and in there to the "/a/e" subtree that is the only place
where both "foo" and "bar" can be found.

If a query has a path restriction, the guided traversal can start
directly at that path.

INDEX UPDATES

The index would be maintained by a normal index editor that for each
added/removed items would update the respective name counts at each
level of the index up to the root. These updates can be consolidated
at each level of the index to keep the number of writes down to a
minimum. The cost of updating the index would be O(d * k) where d is
the depth of the changes in a commit and k the number of added/removed
items. The cost is similar to that of the property index where k is
the number of added/removed values. Changing the value of an existing
property would trigger no updates to this index.

Since the index updates work their way recursively up to the root of
the tree, there is an inherent synchronization bottleneck in updating
the count values higher up the tree. This (and the way storage is
optimized) makes this index structure well suited for the SegmentMK,
but presents a problem for the DocMK. In there a better way to
implement a similar index might be to leverage the underlying indexing
capabilities of MongoDB or whichever database is used under the hood.
Alternatively it might be possible to adjust the index structure to
avoid this problem, though for now I don't see any good way of doing
so.

BR,

Jukka Zitting

Re: Name index

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Mon, Jun 23, 2014 at 3:46 AM, Thomas Mueller <mu...@adobe.com> wrote:
> What if a node contains millions of (direct) child nodes, how would one do
> an efficient lookup?

The index structure contains the names of the matching subtrees, which
allows us to avoid iterating over all the child nodes in such cases.

For example, a part of the index structure for a node with 10k child
nodes named "aNNNN" might look something like this:

    { "foo": { "a2135": 238, "a5382": 13 },
      "bar": { "a2135": 172, "a9821": 86 } }

When looking up nodes with "foo", you'd only need to consider the
"a2135" and "a5382" subtrees, and can ignore the remaining 9998
children.

> We have quite many (property) indexes, what would be the storage overhead?
> (I think it would be quite significant with about 100 property indexes.)

This would be a separate index, with nothing in common with the
property indexes.

Given the index structure and the way SegmentMK avoids storing
duplicate copies of repeating names, I expect the storage overhead to
be pretty low, < 1% of the repository size. Big-O calculation of the
storage overhead is a bit complicated, so I plan to do a simple
prototype to better estimate the actual overhead on a few real-world
repositories.

> How could this index structure speed up node name queries? Would the root
> node know all the names of all nodes in the repository?

The root node would know all the *distinct* names in the repository,
along with the subtrees that contain at least a single instance of
those names.

> What would be very, very, very valuable (to better estimate the cost of an
> index or traversal), is to know the number of descendent nodes of a node,
> see also OAK-894.

Checking this index for the count of "jcr:primaryType" would give a
pretty accurate estimate of the size of a subtree.

Alternatively the index could be easily be extended to also maintain
an exact subtree size count, for example by indexing a virtual ":node"
name for each node in the repository. Similarly a property count could
be implemented by indexing ":property" along with the normal name of
each property. Even further, we could pretty efficiently keep track of
the size of each subtree by indexing ":size" for the length() of each
value. With such extensions the index data for an nt:resource leaf
node could look like this:

    { "jcr:primaryType": 1,
      "jcr:uuid", 1,
      "jcr:data": 1,
      "jcr:lastModified": 1,
      "jcr:mimeType": 1,
      ":node": 1,
      ":property", 5,
      ":size", 15827 }

(Of course, to save storage space, such single-node index data would
not actually be stored in the repository, only aggregated to the index
entries higher up the tree.)

BR,

Jukka Zitting

Re: Name index

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

What if a node contains millions of (direct) child nodes, how would one do
an efficient lookup?

We have quite many (property) indexes, what would be the storage overhead?
(I think it would be quite significant with about 100 property indexes.)

> for speeding up node name and property existence queries.


How could this index structure speed up node name queries? Would the root
node know all the names of all nodes in the repository?

What would be very, very, very valuable (to better estimate the cost of an
index or traversal), is to know the number of descendent nodes of a node,
see also OAK-894.

Regards,
Thomas



On 20/06/14 18:06, "Jukka Zitting" <ju...@gmail.com> wrote:

>Hi,
>
>Here's an idea for an index structure (for now somewhat specific to
>SegmentMK) for speeding up node name and property existence queries.
>The index could also be used to avoid full repository traversals when
>processing queries with unindexed property constraints. In some cases
>it might also work pretty well for queries with multi-property
>constraints for which we currently don't have very good support.
>
>INDEX STRUCTURE
>
>The basic idea is to keep track of how many times each node or
>property name occurs within each subtree. The index structure would
>look something like this:
>
>    { "foo": { "a": 238, "b": 13 },
>      "bar": { "a": 172, "c": 86 } }
>
>Such an index would for example tell that there are 238 nodes named
>"foo" or with a property named "foo" within the "/a" subtree, and 86
>nodes with "bar" within the "/c" subtree.
>
>The index structure would be repeated for each subtree that contains
>more than two nodes. For example the index structure for "/a" could
>be:
>
>    { "foo": { "d": 137, "e": 100 },
>      "bar": { "e": 100, "f": 72 } }
>
>So we'd have 137 nodes with "foo" in the "/a/d" subtree and 100 in the
>"/a/e" subtree. Based on the index we can also tell that there are no
>"bar" nodes or properties within the "/a/d" subtree.
>
>COST CALCULATION
>
>To calculate the cost of a query like "//[@foo='...']", you'd look at
>the index structure and add together the subtree counts for that name.
>In this case the cost estimate would be:
>
>    238 + 13 = 251
>
>If multiple names are referenced in a query, like "//[@foo='...' and
>@bar='...']", then you'd take the counts for both names and add
>together the minimum value of each subtree. In this case the cost
>estimate would be:
>
>    min(238, 172) + min(13, 0) + min(0, 86) = 172
>
>If a query has a path restriction, like in
>"/jcr:root/a//[@foo='...']", we can use the index structure within
>that path for a more specific cost estimate.
>
>INDEX LOOKUP
>
>Index lookup would be implemented as a tree traversal that's guided by
>the subtree counts stored in the index.
>
>When looking up a query like "//[@foo='...']", you'd first look a the
>top-level index structure to find that the only nodes with "foo" items
>are within the "/a" and "/b" subtrees, and would then descend to those
>subtrees. In "/a" you'd find that there are more "foo" items within
>"/a/d" and "/a/e" so that's where the traversal would continue. Also,
>since 137 + 100 < 238, you'd report the "/a" path as a potential match
>for the query. Finally, when encountering a leaf node for which no
>explicit index data is present, you'd do a hasProperty("foo") check to
>determine whether that path should be reported as a potential match.
>
>And if multiple names are referenced in a query, like "//[@foo='...'
>and @bar='...']", you'd use the minimum of the matching subtree counts
>to direct the tree traversal. Since by looking at the index we can
>tell that neither the "/b" nor the "/c" subtree contains both "foo"
>and "bar" items, the traversal can be limited to just the "/a"
>subtree, and in there to the "/a/e" subtree that is the only place
>where both "foo" and "bar" can be found.
>
>If a query has a path restriction, the guided traversal can start
>directly at that path.
>
>INDEX UPDATES
>
>The index would be maintained by a normal index editor that for each
>added/removed items would update the respective name counts at each
>level of the index up to the root. These updates can be consolidated
>at each level of the index to keep the number of writes down to a
>minimum. The cost of updating the index would be O(d * k) where d is
>the depth of the changes in a commit and k the number of added/removed
>items. The cost is similar to that of the property index where k is
>the number of added/removed values. Changing the value of an existing
>property would trigger no updates to this index.
>
>Since the index updates work their way recursively up to the root of
>the tree, there is an inherent synchronization bottleneck in updating
>the count values higher up the tree. This (and the way storage is
>optimized) makes this index structure well suited for the SegmentMK,
>but presents a problem for the DocMK. In there a better way to
>implement a similar index might be to leverage the underlying indexing
>capabilities of MongoDB or whichever database is used under the hood.
>Alternatively it might be possible to adjust the index structure to
>avoid this problem, though for now I don't see any good way of doing
>so.
>
>BR,
>
>Jukka Zitting

Re: Name index

Posted by Jukka Zitting <ju...@gmail.com>.

Hi,

On Mon, Jun 23, 2014 at 3:07 AM, Davide Giannella
<gi...@gmail.com> wrote:
> What concern me most is the update part. AFAIU doing a node count it's
> not that cheap so I guess you were thinking something around
> getCount(MAX) and if the count == max do some estimation around what
> could be there over the MAX limit.

Since the counts are already included in the index structure, all we
need to do is iterate over the properties of the matching index node
and sum up the values. If that turns out to be too costly, we can even
keep a pre-calculated total count in an extra property.

> The other bit is that is very Segment specific.

Right. The proposed index is designed to leverage some of the benefits
of the SegmentMK model. The DocMK has different tradeoffs and thus can
better accommodate a differently organized index.

BTW, the same applies also to the property index. The default content
mirroring strategy used by the property index is actually unnecessary
overhead on the SegmentMK, where using a multi-valued property per
each index entry would be more efficient and would also easily give
more exact cost estimates as the number of matching paths would be
directly available.

> Last point is how/if we want the end user configure it. Within the
> repository as a standard property index?

I have two alternatives in mind:

1) Using a normal query index definition in /oak:index, with a custom
index type like "name".

2) Making the SegmentMK itself maintain such statistics in a hidden
subtree like /:segmentmk/names. There would be no need to explicitly
configure the index, and on repositories where that subtree is present
a matching QueryIndex implementation would automatically use it to
speed up affected queries.

I kind of like the latter approach, as it's conceptually similar to
the idea of using something like a MongoDB index to speed up certain
queries when using MongoMK.

> I'm saying because it will have an impact on repository size and performane
> for the update and maybe someone would like to say something like: I know
> property foo won't bee anywhere else than /a/b and in case I don't care. I
> would like to have only /a/b updated for foo.

As outlined in the proposal, the extra cost of updating this index is
actually pretty small. And the way the SegmentMK avoids duplicate
storage of names and values makes the size overhead pretty low.

So a non-configurable alternative like 2 might be just fine, or in
alternative 1 it would be possible to explicitly configure some
exclude rules like "I don't care about the 'foo' name, nor the '/bar'
subtree".

BR,

Jukka Zitting

Re: Name index

Posted by Davide Giannella <gi...@gmail.com>.

On 20/06/2014 18:06, Jukka Zitting wrote:
> Hi,
>
> Here's an idea for an index structure (for now somewhat specific to
> SegmentMK) for speeding up node name and property existence queries.
> ...
>
> INDEX UPDATES
>
> The index would be maintained by a normal index editor that for each
> added/removed items would update the respective name counts at each
> level of the index up to the root. These updates can be consolidated
> at each level of the index to keep the number of writes down to a
> minimum. The cost of updating the index would be O(d * k) where d is
> the depth of the changes in a commit and k the number of added/removed
> items. The cost is similar to that of the property index where k is
> the number of added/removed values. Changing the value of an existing
> property would trigger no updates to this index.
>
> Since the index updates work their way recursively up to the root of
> the tree, there is an inherent synchronization bottleneck in updating
> the count values higher up the tree. This (and the way storage is
> optimized) makes this index structure well suited for the SegmentMK,
> but presents a problem for the DocMK. In there a better way to
> implement a similar index might be to leverage the underlying indexing
> capabilities of MongoDB or whichever database is used under the hood.
> Alternatively it might be possible to adjust the index structure to
> avoid this problem, though for now I don't see any good way of doing
> so.
First of all I think it could work out well :)

What concern me most is the update part. AFAIU doing a node count it's
not that cheap so I guess you were thinking something around
getCount(MAX) and if the count == max do some estimation around what
could be there over the MAX limit.

The other bit is that is very Segment specific. As you already
highlighted we should come up with something different based on the
underlying used persistence.  It's not nice from a code point of view as
it will complicate things but I don't see any other way either as of
now. We could make it as if it's not there fall back on traversing as usual.

Last point is how/if we want the end user configure it. Within the
repository as a standard property index? I'm saying because it will have
an impact on repository size and performane for the update and maybe
someone would like to say something like: I know property foo won't bee
anywhere else than /a/b and in case I don't care. I would like to have
only /a/b updated for foo.

Cheers
Davide