Posted to oak-dev@jackrabbit.apache.org by Jukka Zitting <ju...@gmail.com> on 2012/11/20 17:24:32 UTC

Identifier- or hash-based access in the MicroKernel

Hi,

A lot of functionality in Oak (node states, the diff and hook
mechanisms, etc.) is based on walking down the tree hierarchy one
level at a time. For example, to access changes below /a/b/c,
oak-core will currently request paths /a, /a/b, /a/b/c and so on
from the underlying MK implementation.

This would work reasonably well with MK implementations that are
essentially a big hash table that maps the full path (and revision) to
the content at that location. Even then there's some space overhead, as
even tiny nodes (think of an ACL entry) get paired with the full path
(and revision) of the node. The current MongoMK with its path keys
works like this, though even there a secondary index is needed for the
path lookups.

The approach is less ideal for MK implementations (like the default
H2-based one) that have to traverse the path when some content is
accessed. For example, with the above oak-core access pattern, the
sequence of accessed nodes would be [ a, a, b, a, b, c ], where
ideally just [ a, b, c ] would suffice. The KernelNodeStore cache in
oak-core prevents this from being too big an issue, but ideally we'd
be able to avoid such extra levels of caching.
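
To make the access pattern above concrete, here is a minimal sketch
that just counts which nodes get touched. The class and method names
are made up for illustration and are not part of oak-core or the
MicroKernel API; it assumes an MK that resolves every path from the
root, one segment at a time.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not oak-core or MicroKernel code.
class PathTraversalCost {

    // Nodes an MK touches when it has to resolve the given absolute
    // path by walking down from the root, one segment at a time.
    static List<String> resolve(String path) {
        List<String> touched = new ArrayList<>();
        for (String segment : path.split("/")) {
            if (!segment.isEmpty()) {
                touched.add(segment);
            }
        }
        return touched;
    }

    // Nodes touched when the client requests every ancestor path in
    // turn, e.g. /a, /a/b and /a/b/c for the target /a/b/c.
    static List<String> walkByPath(String target) {
        List<String> touched = new ArrayList<>();
        int slash = target.indexOf('/', 1);
        while (slash != -1) {
            touched.addAll(resolve(target.substring(0, slash)));
            slash = target.indexOf('/', slash + 1);
        }
        touched.addAll(resolve(target));
        return touched;
    }

    public static void main(String[] args) {
        System.out.println(walkByPath("/a/b/c")); // [a, a, b, a, b, c]
        System.out.println(resolve("/a/b/c"));    // [a, b, c]
    }
}
```

The number of touched nodes grows quadratically with the depth of the
target, which is why the cache matters so much today.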

To solve that mismatch without impacting the overall architecture too
much I'd like to propose the following:

* When requested using the filter argument, the getNodes() call may
(but is not required to) return special ":hash" or ":id" properties as
parts of the (possibly otherwise empty) child node objects included in
the JSON response.

* When returned by getNodes(), those values can be used by the client
instead of the normal path argument when requesting the content of
such child nodes using other getNodes() calls. The MK implementation
is expected to automatically detect whether a given string argument is
a path, a hash or an identifier, possibly as simply as looking at
whether it starts with a slash.

* Both ":hash" and ":id" values are expected to uniquely identify a
specific immutable state of a node. The only difference is that the
inequality of two hashes implies the inequality of the referenced
nodes (which can be used by oak-core to optimize some operations),
whereas it's possible for two different ids to refer to nodes with the
exact same content.
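
The difference between the two kinds of values could be sketched
roughly as follows. This is only an illustration of the semantics
described above; the class, enum and method names are hypothetical and
not part of any existing API.

```java
// Hypothetical helper; names are illustrative, not MicroKernel API.
class NodeReferenceComparison {

    enum Result { SAME, DIFFERENT, UNKNOWN }

    // Equal hashes refer to the same immutable state, and unequal
    // hashes imply that the referenced nodes differ.
    static Result compareHashes(String hash1, String hash2) {
        return hash1.equals(hash2) ? Result.SAME : Result.DIFFERENT;
    }

    // Equal ids refer to the same immutable state, but two different
    // ids may still refer to nodes with exactly the same content, so
    // the caller has to fall back to comparing the content itself.
    static Result compareIds(String id1, String id2) {
        return id1.equals(id2) ? Result.SAME : Result.UNKNOWN;
    }
}
```

In both cases an equal value would let oak-core skip a whole subtree
when diffing; only hashes would also allow it to conclude that
something has changed without looking at the content.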

Such a solution would allow the following sequence

   getNodes("/") => { "a": {} }
   getNodes("/a") => { "b": {} }
   getNodes("/a/b") => { "c": {} }
   getNodes("/a/b/c") => {}

to become something like

   getNodes("/") => { "a": { ":id": "x" } }
   getNodes("x") => { "b": { ":id": "y" } }
   getNodes("y") => { "c": { ":id": "z" } }
   getNodes("z") => {}

with x, y and z being some implementation-specific identifiers, like
ObjectIDs in MongoDB.
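
As a rough sketch of the effect, consider a toy in-memory stand-in for
an MK that has to walk from the root for path arguments but can fetch
a node directly when given an identifier. Everything here (ToyKernel,
its fields, the read counter, the root id "r") is hypothetical and
only loosely modelled on getNodes(); the returned map of child names
to ids stands in for the JSON response.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy stand-in for an MK; hypothetical, not the MicroKernel interface.
class ToyKernel {

    // node id -> (child name -> child id)
    private final Map<String, Map<String, String>> nodesById = new LinkedHashMap<>();
    private final String rootId;
    int reads = 0; // number of node records fetched so far

    ToyKernel(String rootId) {
        this.rootId = rootId;
        nodesById.put(rootId, new LinkedHashMap<>());
    }

    void addChild(String parentId, String name, String childId) {
        nodesById.get(parentId).put(name, childId);
        nodesById.put(childId, new LinkedHashMap<>());
    }

    private Map<String, String> read(String id) {
        Map<String, String> node = nodesById.get(id);
        if (node == null) {
            throw new IllegalArgumentException("No such node: " + id);
        }
        reads++;
        return node;
    }

    // A leading slash means a path, resolved segment by segment from
    // the root; anything else is treated as an identifier.
    Map<String, String> getNodes(String pathOrId) {
        if (!pathOrId.startsWith("/")) {
            return read(pathOrId);
        }
        Map<String, String> node = read(rootId);
        for (String segment : pathOrId.split("/")) {
            if (!segment.isEmpty()) {
                String childId = node.get(segment);
                if (childId == null) {
                    throw new IllegalArgumentException("No such path: " + pathOrId);
                }
                node = read(childId);
            }
        }
        return node;
    }

    static ToyKernel sample() {
        ToyKernel mk = new ToyKernel("r");
        mk.addChild("r", "a", "x");
        mk.addChild("x", "b", "y");
        mk.addChild("y", "c", "z");
        return mk;
    }

    public static void main(String[] args) {
        ToyKernel byPath = sample();
        byPath.getNodes("/");
        byPath.getNodes("/a");
        byPath.getNodes("/a/b");
        byPath.getNodes("/a/b/c");
        System.out.println(byPath.reads); // 10 (the root is counted too)

        ToyKernel byId = sample();
        String x = byId.getNodes("/").get("a");
        String y = byId.getNodes(x).get("b");
        String z = byId.getNodes(y).get("c");
        byId.getNodes(z);
        System.out.println(byId.reads); // 4
    }
}
```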

In any case the MK implementation would still be required to support
access by full path.

BR,

Jukka Zitting

Re: Identifier- or hash-based access in the MicroKernel

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Fri, Nov 23, 2012 at 4:21 PM, Stefan Guggisberg
<st...@gmail.com> wrote:
> On Thu, Nov 22, 2012 at 4:56 PM, Jukka Zitting <ju...@gmail.com> wrote:
>> See the attachment in https://issues.apache.org/jira/browse/OAK-468
>
> i committed the API changes, MicroKernelImpl support and adapted
> integration tests in svn r1412898.

Great, thanks! I'll take a look at leveraging this on the oak-core side.

BR,

Jukka Zitting

Re: Identifier- or hash-based access in the MicroKernel

Posted by Stefan Guggisberg <st...@gmail.com>.
On Thu, Nov 22, 2012 at 4:56 PM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> On Wed, Nov 21, 2012 at 11:00 AM, Jukka Zitting <ju...@gmail.com> wrote:
>> On Tue, Nov 20, 2012 at 8:01 PM, Stefan Guggisberg
>> <st...@gmail.com> wrote:
>>> - do you have a proposal for the suggested MicroKernel API (java doc)
>>>   changes?
>>
>> I'll have one to share shortly...
>
> See the attachment in https://issues.apache.org/jira/browse/OAK-468

i committed the API changes, MicroKernelImpl support and adapted
integration tests in svn r1412898.

cheers
stefan

>
> BR,
>
> Jukka Zitting

Re: Identifier- or hash-based access in the MicroKernel

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Wed, Nov 21, 2012 at 11:00 AM, Jukka Zitting <ju...@gmail.com> wrote:
> On Tue, Nov 20, 2012 at 8:01 PM, Stefan Guggisberg
> <st...@gmail.com> wrote:
>> - do you have a proposal for the suggested MicroKernel API (java doc)
>>   changes?
>
> I'll have one to share shortly...

See the attachment in https://issues.apache.org/jira/browse/OAK-468

BR,

Jukka Zitting

Re: Identifier- or hash-based access in the MicroKernel

Posted by Stefan Guggisberg <st...@gmail.com>.
On Wed, Nov 21, 2012 at 10:00 AM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> On Tue, Nov 20, 2012 at 8:01 PM, Stefan Guggisberg
> <st...@gmail.com> wrote:
>> - returning an :id and/or :hash should be optional, i.e. we shouldn't
>>   require an implementation to return an :id or :hash for every path
>
> Exactly.
>
>> - i suggest we prefix the id/path getNodes parameter value with ':id:'
>>   and ':hash:'
>
> I'd leave the format up to the MK implementation to decide, with
> oak-core just passing whatever the MK gave as the ":id" or ":hash"
> attribute of a child object. For example, an MK that chooses to use
> ":id:" and ":hash:" prefixes for such values would work something
> like this:
>
>     getNodes("/") => { "a": { ":id": ":id:x" } }
>     getNodes(":id:x") => { "b": { ":id": ":id:y" } }
>     getNodes(":id:y") => { "c": { ":id": ":id:z" } }
>     getNodes(":id:z") => {}

ok, agreed.

cheers
stefan

>
>> - do you have a proposal for the suggested MicroKernel API (java doc)
>>   changes?
>
> I'll have one to share shortly...
>
> BR,
>
> Jukka Zitting

Re: Identifier- or hash-based access in the MicroKernel

Posted by Jukka Zitting <ju...@gmail.com>.
Hi,

On Tue, Nov 20, 2012 at 8:01 PM, Stefan Guggisberg
<st...@gmail.com> wrote:
> - returning an :id and/or :hash should be optional, i.e. we shouldn't
>   require an implementation to return an :id or :hash for every path

Exactly.

> - i suggest we prefix the id/path getNodes parameter value with ':id:'
>   and ':hash:'

I'd leave the format up to the MK implementation to decide, with
oak-core just passing whatever the MK gave as the ":id" or ":hash"
attribute of a child object. For example, an MK that chooses to use
":id:" and ":hash:" prefixes for such values would work something
like this:

    getNodes("/") => { "a": { ":id": ":id:x" } }
    getNodes(":id:x") => { "b": { ":id": ":id:y" } }
    getNodes(":id:y") => { "c": { ":id": ":id:z" } }
    getNodes(":id:z") => {}
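
Inside such an MK the detection could look roughly like the sketch
below. The class and its methods are hypothetical, and the ":id:" and
":hash:" prefixes are just the example scheme from above; another MK
would be free to pick a different format, since oak-core treats the
values as opaque strings.

```java
// Hypothetical sketch of argument detection inside one possible MK.
class PrefixedReference {

    enum Kind { PATH, ID, HASH }

    private static final String ID_PREFIX = ":id:";
    private static final String HASH_PREFIX = ":hash:";

    // Classifies a getNodes() argument by its leading characters.
    static Kind kindOf(String argument) {
        if (argument.startsWith("/")) {
            return Kind.PATH;
        } else if (argument.startsWith(ID_PREFIX)) {
            return Kind.ID;
        } else if (argument.startsWith(HASH_PREFIX)) {
            return Kind.HASH;
        }
        throw new IllegalArgumentException("Unrecognized node reference: " + argument);
    }

    // Strips the prefix, leaving the MK-internal key (or the path itself).
    static String keyOf(String argument) {
        switch (kindOf(argument)) {
            case ID:
                return argument.substring(ID_PREFIX.length());
            case HASH:
                return argument.substring(HASH_PREFIX.length());
            default:
                return argument;
        }
    }
}
```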

> - do you have a proposal for the suggested MicroKernel API (java doc)
>   changes?

I'll have one to share shortly...

BR,

Jukka Zitting

Re: Identifier- or hash-based access in the MicroKernel

Posted by Stefan Guggisberg <st...@gmail.com>.
hi jukka

On Tue, Nov 20, 2012 at 5:24 PM, Jukka Zitting <ju...@gmail.com> wrote:
> Hi,
>
> A lot of functionality in Oak (node states, the diff and hook
> mechanisms, etc.) is based on walking down the tree hierarchy one
> level at a time. For example, to access changes below /a/b/c,
> oak-core will currently request paths /a, /a/b, /a/b/c and so on
> from the underlying MK implementation.
>
> This would work reasonably well with MK implementations that are
> essentially a big hash table that maps the full path (and revision) to
> the content at that location. Even then there's some space overhead, as
> even tiny nodes (think of an ACL entry) get paired with the full path
> (and revision) of the node. The current MongoMK with its path keys
> works like this, though even there a secondary index is needed for the
> path lookups.
>
> The approach is less ideal for MK implementations (like the default
> H2-based one) that have to traverse the path when some content is
> accessed. For example, with the above oak-core access pattern, the
> sequence of accessed nodes would be [ a, a, b, a, b, c ], where
> ideally just [ a, b, c ] would suffice. The KernelNodeStore cache in
> oak-core prevents this from being too big an issue, but ideally we'd
> be able to avoid such extra levels of caching.
>
> To solve that mismatch without impacting the overall architecture too
> much I'd like to propose the following:
>
> * When requested using the filter argument, the getNodes() call may
> (but is not required to) return special ":hash" or ":id" properties as
> parts of the (possibly otherwise empty) child node objects included in
> the JSON response.
>
> * When returned by getNodes(), those values can be used by the client
> instead of the normal path argument when requesting the content of
> such child nodes using other getNodes() calls. The MK implementation
> is expected to automatically detect whether a given string argument is
> a path, a hash or an identifier, possibly as simply as looking at
> whether it starts with a slash.
>
> * Both ":hash" and ":id" values are expected to uniquely identify a
> specific immutable state of a node. The only difference is that the
> inequality of two hashes implies the inequality of the referenced
> nodes (which can be used by oak-core to optimize some operations),
> whereas it's possible for two different ids to refer to nodes with the
> exact same content.
>
> Such a solution would allow the following sequence
>
>    getNodes("/") => { "a": {} }
>    getNodes("/a") => { "b": {} }
>    getNodes("/a/b") => { "c": {} }
>    getNodes("/a/b/c") => {}
>
> to become something like
>
>    getNodes("/") => { "a": { ":id": "x" } }
>    getNodes("x") => { "b": { ":id": "y" } }
>    getNodes("y") => { "c": { ":id": "z" } }
>    getNodes("z") => {}
>
> with x, y and z being some implementation-specific identifiers, like
> ObjectIDs in MongoDB.
>
> In any case the MK implementation would still be required to support
> access by full path.

makes sense, +1 in general.

some comments:

- returning an :id and/or :hash should be optional, i.e. we shouldn't
  require an implementation to return an :id or :hash for every path
  (an implementation might e.g. want to persist an entire subtree as
  one single persistence entity)
- i suggest we prefix the id/path getNodes parameter value with ':id:'
  and ':hash:' (or some other scheme) when requesting nodes by hash or
  identifier to avoid a potential ambiguity (an implementation might
  support both access by hash and id).
- do you have a proposal for the suggested MicroKernel API (java doc)
  changes?

cheers
stefan

>
> BR,
>
> Jukka Zitting