You are viewing a plain text version of this content. The canonical link for it is here.

Posted to oak-dev@jackrabbit.apache.org by Ian Boston <ie...@tfd.co.uk> on 2012/10/14 02:49:24 UTC

Question: wide hierachies

Hi,
IIUC Oak has the potential to handle wide hierarchies where there are
millions of child nodes to a parent and hence support some of the use
cases commonly found in Social Media and cloud applications. This
question is about the implementation to support that.

IIUC from reading the code child nodes are accessed from the parent
node, and to avoid the situation in JR2 where child nodes were
represented as an array of objects stored with the parent node, child
nodes in Oak are represented in a internal tree structure where the
width of the tree is controlled to ensure concurrency and performance
is maintained. (did I understand that correctly ?).

In choosing this approach, was a possible alternative considered where
the child node contains a single pointer to its parent, and the child
node is retrieved by hashing it's path rather ie  hash-f(path) ===
child-node-storage-id, (eg sha1(path) == child-node-storage-id )
rather than navigating from the parent.

In my exposure to use cases for social media and cloud like systems I
have observed several things. Listing child nodes from a parent
becomes rarer the closer the application moves towards social media.
Direct pointer based access derived from a hierarchical url is the
main use case. For non hierarchal pointer, searching becomes dominant.
Occasionally listing is required but only as a variant of search.
Scans or sorted scans as nearly always avoided as being impractical.

In social media content, the cardinality of all parents is low making
finding children of parents amenable to an inverted index approach. Ie
the parent becomes a search keyword term.

(the closer to Enterprise Content Management the application is the
inverse of the above is true)

If I understood all of that correctly, my question:

In circumstances where the use cases are more to the social media end
of the spectrum, will it be possible to invert the child-parent
relationship within Oak to be a pointer from the child with an
optional, additional inverted index should finding children of a
specific parent be required ?

Apologies if this is already covered and I missed it. The archives at
Markmail didn't turn anything up.

Thanks
Ian

Re: Question: wide hierachies

Posted by Ian Boston <ie...@tfd.co.uk>.

Hi Stefan,
comments inline:

On 18 October 2012 02:46, Stefan Guggisberg <st...@gmail.com> wrote:
> hi ian,
>
> On Sun, Oct 14, 2012 at 2:49 AM, Ian Boston <ie...@tfd.co.uk> wrote:
>> Hi,
>> IIUC Oak has the potential to handle wide hierarchies where there are
>> millions of child nodes to a parent and hence support some of the use
>> cases commonly found in Social Media and cloud applications. This
>> question is about the implementation to support that.
>>
>> IIUC from reading the code child nodes are accessed from the parent
>> node, and to avoid the situation in JR2 where child nodes were
>> represented as an array of objects stored with the parent node, child
>> nodes in Oak are represented in a internal tree structure where the
>> width of the tree is controlled to ensure concurrency and performance
>> is maintained. (did I understand that correctly ?).
>
> in JR2 the list of child node entries (name-id pairs) is stored within the
> parent node. while this is very efficient for small to medium sized child
> node lists (up to a couple of k entries) read and write performance suffers
> when the list grows very large.
>
> in the current MicroKernel implementation we use an htree based intermediate
> tree structure for storing large child node collections.

phew, thats what I understood.

>
>>
>> In choosing this approach, was a possible alternative considered where
>> the child node contains a single pointer to its parent, and the child
>> node is retrieved by hashing it's path rather ie  hash-f(path) ===
>> child-node-storage-id, (eg sha1(path) == child-node-storage-id )
>> rather than navigating from the parent.
>
> the data model of the current MicroKernel implementation is essentially
> just a DAG of node objects. the versioning model is git-like, i.e.
> every change results in a new immutable tree.
>
> i don't see how child->parent relationships could be implemented with
> this model. IIUC with path-based hashing you'd sacrifice MVCC
> or am i missing something?

Hmm, good point, your right, you cant MVCC the tree structure, only
the data within the tree, since the path -> pointer transformation is
immutable, thats a pity.

Thanks for explaining.
Ian


>
>>
>> In my exposure to use cases for social media and cloud like systems I
>> have observed several things. Listing child nodes from a parent
>> becomes rarer the closer the application moves towards social media.
>> Direct pointer based access derived from a hierarchical url is the
>> main use case. For non hierarchal pointer, searching becomes dominant.
>> Occasionally listing is required but only as a variant of search.
>> Scans or sorted scans as nearly always avoided as being impractical.
>>
>> In social media content, the cardinality of all parents is low making
>> finding children of parents amenable to an inverted index approach. Ie
>> the parent becomes a search keyword term.
>>
>> (the closer to Enterprise Content Management the application is the
>> inverse of the above is true)
>>
>> If I understood all of that correctly, my question:
>>
>> In circumstances where the use cases are more to the social media end
>> of the spectrum, will it be possible to invert the child-parent
>> relationship within Oak to be a pointer from the child with an
>> optional, additional inverted index should finding children of a
>> specific parent be required ?
>
> WRT the MicroKernel implementation:
>
> IMO that's unfortunately not possible with the current git-like
> approach (content-based identifiers, DAG of immutable versions).
>
> cheers
> stefan
>
>>
>> Apologies if this is already covered and I missed it. The archives at
>> Markmail didn't turn anything up.
>>
>> Thanks
>> Ian

Re: Question: wide hierachies

Posted by Stefan Guggisberg <st...@gmail.com>.

hi ian,

On Sun, Oct 14, 2012 at 2:49 AM, Ian Boston <ie...@tfd.co.uk> wrote:
> Hi,
> IIUC Oak has the potential to handle wide hierarchies where there are
> millions of child nodes to a parent and hence support some of the use
> cases commonly found in Social Media and cloud applications. This
> question is about the implementation to support that.
>
> IIUC from reading the code child nodes are accessed from the parent
> node, and to avoid the situation in JR2 where child nodes were
> represented as an array of objects stored with the parent node, child
> nodes in Oak are represented in a internal tree structure where the
> width of the tree is controlled to ensure concurrency and performance
> is maintained. (did I understand that correctly ?).

in JR2 the list of child node entries (name-id pairs) is stored within the
parent node. while this is very efficient for small to medium sized child
node lists (up to a couple of k entries) read and write performance suffers
when the list grows very large.

in the current MicroKernel implementation we use an htree based intermediate
tree structure for storing large child node collections.

>
> In choosing this approach, was a possible alternative considered where
> the child node contains a single pointer to its parent, and the child
> node is retrieved by hashing it's path rather ie  hash-f(path) ===
> child-node-storage-id, (eg sha1(path) == child-node-storage-id )
> rather than navigating from the parent.

the data model of the current MicroKernel implementation is essentially
just a DAG of node objects. the versioning model is git-like, i.e.
every change results in a new immutable tree.

i don't see how child->parent relationships could be implemented with
this model. IIUC with path-based hashing you'd sacrifice MVCC
or am i missing something?

>
> In my exposure to use cases for social media and cloud like systems I
> have observed several things. Listing child nodes from a parent
> becomes rarer the closer the application moves towards social media.
> Direct pointer based access derived from a hierarchical url is the
> main use case. For non hierarchal pointer, searching becomes dominant.
> Occasionally listing is required but only as a variant of search.
> Scans or sorted scans as nearly always avoided as being impractical.
>
> In social media content, the cardinality of all parents is low making
> finding children of parents amenable to an inverted index approach. Ie
> the parent becomes a search keyword term.
>
> (the closer to Enterprise Content Management the application is the
> inverse of the above is true)
>
> If I understood all of that correctly, my question:
>
> In circumstances where the use cases are more to the social media end
> of the spectrum, will it be possible to invert the child-parent
> relationship within Oak to be a pointer from the child with an
> optional, additional inverted index should finding children of a
> specific parent be required ?

WRT the MicroKernel implementation:

IMO that's unfortunately not possible with the current git-like
approach (content-based identifiers, DAG of immutable versions).

cheers
stefan

>
> Apologies if this is already covered and I missed it. The archives at
> Markmail didn't turn anything up.
>
> Thanks
> Ian