Posted to oak-dev@jackrabbit.apache.org by Ian Boston <ie...@tfd.co.uk> on 2015/07/08 18:18:01 UTC

/oak:index (DocumentNodeStore)

Hi,
I am confused about how /oak:index works and why it is needed in a MongoDB
setting, which has native database indexes that appear to cover the same
functionality. Could the Oak Query engine use DB indexes directly for all
indexes that are built into Oak, and Lucene indexes for all custom indexes?

I am asking this because in MongoDB I observe that 60% of the size of the
nodes collection is attributable to /oak:index, and that this 60% inflates
every non-sparse MongoDB index by about 3x. An _id + _modified compound
index in MongoDB comes out at about 70GB for 100M documents (in part due to
the size of _id). Without the duplication caused by /oak:index it could be
closer to 25GB. Disk space is cheap, but MongoDB working set RAM is not,
and neither is page fault IO.
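
The figures above can be cross-checked with some rough arithmetic. The
sketch below assumes that MongoDB index size grows linearly with the number
of documents, and that the 100M documents include the /oak:index entries;
both are assumptions made for illustration, not measurements.

```python
# Back-of-the-envelope check of the figures quoted in the message.
# Assumption: index size scales linearly with the number of documents.
# Assumption: the 100M documents include the documents under /oak:index.

index_fraction = 0.60          # share of the nodes collection under /oak:index
observed_index_gb = 70.0       # _id + _modified compound index, as reported
total_documents = 100_000_000  # as reported

content_fraction = 1.0 - index_fraction

# How many MongoDB documents exist per real content document.
inflation = 1.0 / content_fraction

# Documents that would remain if /oak:index lived elsewhere.
content_documents = total_documents * content_fraction

# Index size if only the content documents were indexed.
estimated_index_gb = observed_index_gb * content_fraction

print(f"inflation factor: {inflation:.1f}x")
print(f"content documents: {content_documents:,.0f}")
print(f"estimated index size without /oak:index: {estimated_index_gb:.0f} GB")
```

Under these assumptions the estimate lands in the same range as the "closer
to 25GB" figure, and the inflation factor is in line with "about 3x".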

I fully understand why TarMK needs /oak:index, but I can't understand
(conceptually) the need to implement an index inside a database table.
It's like trying to implement an inverted index in an RDBMS table, which,
as everyone who has ever tried (or used) that approach knows, doesn't scale
nearly as far as Lucene bitmaps.

Could /oak:index be replaced by something that doesn't generate
Documents/db rows as fast as it does ?

Best Regards
Ian

Re: /oak:index (DocumentNodeStore)

Posted by Norberto Leite <no...@norbertoleite.com>.

A collection per index (or a separate one for indexes only), especially for
the asynchronous ones, will translate into a big benefit if the following
hold:
- when querying on index nodes we don't need to get all related node
documents (which is happening)
- the write operations are distinct between indexes and nodes (which I
think is also happening)

N.

On Thu, Jul 9, 2015 at 11:33 AM, Thomas Mueller <mu...@adobe.com> wrote:

> Hi,
>
> Using MongoDB indexes directly doesn't work because of the MVCC model.
> What we could do is add special collections (basically one collection per
> index). This would require some work, which then would need to be
> repeated for RDBMK. It would be quite some work.
>
> > I observe that 60% of the size of the nodes collection is attributable
> >to /oak:index
>
> Could you try to find out which index(es) are responsible for that? There
> would be multiple ways to reduce the number of nodes:
>
> 0) remove unused indexes
> 1) convert some indexes to Lucene property indexes
> 2) convert to unique index if possible (as this uses less space)
> 3) add a feature to only index a subset of the keys (only index what we
> need)
> 4) store the last x levels of the index structure as properties instead
> of as nodes
>
>
> 3) and 4) would require changes in Oak. For 4), the change should reduce
> the number of nodes, but might cause merge conflicts (not sure). With
> level = 1, it would be:
>
>   /content/products/a @color=red
>   /content/products/b @color=red
>
>   /oak:index/color/red/content
>   /oak:index/color/red/content/products @a = true, @b = true
>
> instead of
>
>   /oak:index/color/red/content
>   /oak:index/color/red/content/products
>   /oak:index/color/red/content/products/a @match = true
>   /oak:index/color/red/content/products/b @match = true
>
> With level > 1, it would require some escaping magic, but we could save
> some more nodes, and basically it would be:
>
> level = 2:
>
>   /oak:index/color/red/content @products_a = true, @products_b = true
>
>
> level = 3:
>
>   /oak:index/color/red @content_products_a = true, @content_products_b = true
>
>
>
>
> Regards,
> Thomas

Re: /oak:index (DocumentNodeStore)

Posted by Ian Boston <ie...@tfd.co.uk>.

Hi,

On 9 July 2015 at 10:33, Thomas Mueller <mu...@adobe.com> wrote:

> Hi,
>
> Using MongoDB indexes directly doesn't work because of the MVCC model.
> What we could do is add special collections (basically one collection per
> index). This would require some work, which then would need to be
> repeated for RDBMK. It would be quite some work.
>

ok, understood.


>
> > I observe that 60% of the size of the nodes collection is attributable
> >to /oak:index
>
> Could you try to find out which index(es) are responsible for that?


Marcel and Chetan have been working on the repository I was observing. I am
sure they can point you to the details offline, if you are not aware of them
already. They were able to remove about 25% of the 60% under /oak:index,
but IIUC most of the remainder is not local customisations, and perhaps
40% of what remains is not local customisations and must be synchronous,
which indicates a 1:2 ratio between real content nodes and MongoDB
documents before any MongoDB indexes are considered. That ratio was the
motivation for asking the question. Chetan thought I should discuss it on
oak-dev.

Marcel and Chetan, who are far more knowledgeable than I am in this area,
have executed 0) and 1) below.

Best Regards
Ian



> There
> would be multiple ways to reduce the number of nodes:
>
> 0) remove unused indexes
> 1) convert some indexes to Lucene property indexes
> 2) convert to unique index if possible (as this uses less space)
> 3) add a feature to only index a subset of the keys (only index what we
> need)
> 4) store the last x levels of the index structure as properties instead
> of as nodes
>
>
> 3) and 4) would require changes in Oak. For 4), the change should reduce
> the number of nodes, but might cause merge conflicts (not sure). With
> level = 1, it would be:
>
>   /content/products/a @color=red
>   /content/products/b @color=red
>
>   /oak:index/color/red/content
>   /oak:index/color/red/content/products @a = true, @b = true
>
> instead of
>
>   /oak:index/color/red/content
>   /oak:index/color/red/content/products
>   /oak:index/color/red/content/products/a @match = true
>   /oak:index/color/red/content/products/b @match = true
>
> With level > 1, it would require some escaping magic, but we could save
> some more nodes, and basically it would be:
>
> level = 2:
>
>   /oak:index/color/red/content @products_a = true, @products_b = true
>
>
> level = 3:
>
>   /oak:index/color/red @content_products_a = true, @content_products_b = true
>
>
>
>
> Regards,
> Thomas

Re: /oak:index (DocumentNodeStore)

Posted by Thomas Mueller <mu...@adobe.com>.

Hi,

Using MongoDB indexes directly doesn't work because of the MVCC model.
What we could do is add special collections (basically one collection per
index). This would require some work, which then would need to be
repeated for RDBMK. It would be quite some work.

> I observe that 60% of the size of the nodes collection is attributable
>to /oak:index

Could you try to find out which index(es) are responsible for that? There
would be multiple ways to reduce the number of nodes:

0) remove unused indexes
1) convert some indexes to Lucene property indexes
2) convert to unique index if possible (as this uses less space)
3) add a feature to only index a subset of the keys (only index what we
need)
4) store the last x levels of the index structure as properties instead
of as nodes


3) and 4) would require changes in Oak. For 4), the change should reduce
the number of nodes, but might cause merge conflicts (not sure). With
level = 1, it would be:

  /content/products/a @color=red
  /content/products/b @color=red

  /oak:index/color/red/content
  /oak:index/color/red/content/products @a = true, @b = true

instead of

  /oak:index/color/red/content
  /oak:index/color/red/content/products
  /oak:index/color/red/content/products/a @match = true
  /oak:index/color/red/content/products/b @match = true

With level > 1, it would require some escaping magic, but we could save
some more nodes, and basically it would be:

level = 2:

  /oak:index/color/red/content @products_a = true, @products_b = true


level = 3:

  /oak:index/color/red @content_products_a = true, @content_products_b = true
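
The folding described above can be sketched as follows. This is only an
illustration of the proposal, not Oak code: the function names
`index_entries` and `nodes_below_value` are hypothetical, and the "escaping
magic" is deliberately left out (path segments are joined with a plain "_",
which would be ambiguous for names that already contain "_").

```python
from collections import defaultdict


def index_entries(index_root, value, content_paths, level=0):
    """Map content paths to {index node path: {property name: True}}.

    level=0 reproduces the current layout: one index node per content
    node, marked with @match. level=n folds the last n path segments
    into a property name on an ancestor index node.
    """
    entries = defaultdict(dict)
    for path in content_paths:
        segments = path.strip("/").split("/")
        if level > len(segments):
            raise ValueError(f"level {level} is deeper than path {path}")
        if level == 0:
            kept, prop = segments, "match"
        else:
            kept, folded = segments[:-level], segments[-level:]
            # Naive join; a real implementation would need escaping here.
            prop = "_".join(folded)
        node = "/".join([index_root, value] + kept)
        entries[node][prop] = True
    return dict(entries)


def nodes_below_value(entries, index_root, value):
    """Count distinct nodes below <index_root>/<value>, ancestors included."""
    prefix = f"{index_root}/{value}"
    nodes = set()
    for node in entries:
        relative = node[len(prefix):].strip("/")
        parts = relative.split("/") if relative else []
        for i in range(1, len(parts) + 1):
            nodes.add("/".join(parts[:i]))
    return len(nodes)


paths = ["/content/products/a", "/content/products/b"]
for level in range(4):
    entries = index_entries("/oak:index/color", "red", paths, level)
    count = nodes_below_value(entries, "/oak:index/color", "red")
    print(level, count, entries)
```

For the two example paths, the number of nodes below /oak:index/color/red
drops as the level increases, at the cost of more properties per node,
which is where the possible merge conflicts would come from.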




Regards,
Thomas






Re: /oak:index (DocumentNodeStore)

Posted by Ian Boston <ie...@tfd.co.uk>.

Hi Marcel,
Thanks for the response, that makes sense.

I assume that there are already more than 64 indexes in /oak:index before
any custom ones are added, which makes it impossible to remove /oak:index
for MongoDB. With that many it's going to be impractical for all RDBMSs.

Would there be any benefit in moving /oak:index out of the main document
collection so that any MongoDB indexes in the collection of no relevance to
/oak:index don't get bloated ?
or, more generally
Is there a different way of storing the data in /oak:index so that it
doesn't result in so many MongoDB documents ?


Best Regards
Ian

On 9 July 2015 at 08:15, Marcel Reutegger <mr...@adobe.com> wrote:

> Hi Ian,
>
> there are mainly two reasons why we cannot use DocumentStore
> based indexes for this purpose:
>
> - MongoDB only supports a limited number of indexes (64 per
>   collection) and applications usually have a need for more
>   indexes.
>
> - Data in Oak is multi-versioned. It must be possible to query
>   nodes at a specific revision of the tree.
>
> Lucene indexes are more efficient, but are only updated
> asynchronously. Whether this is acceptable usually depends on
> application requirements. Experience so far shows that many indexes
> can be asynchronous, because there was no hard requirement
> for synchronous index updates.
>
> Regards
>  Marcel
>

Re: /oak:index (DocumentNodeStore)

Posted by Ian Boston <ie...@tfd.co.uk>.

On 9 July 2015 at 09:16, Chetan Mehrotra <ch...@gmail.com> wrote:

> On Thu, Jul 9, 2015 at 12:45 PM, Marcel Reutegger <mr...@adobe.com>
> wrote:
> > - Data in Oak is multi-versioned. It must be possible to query
> >   nodes at a specific revision of the tree.
>
> To add: that also makes it difficult to use Mongo indexes, as the
> index itself is versioned. So instead of just indexing property 'foo'
> you need to index it for every revision.
>

Won't compound indexes work?

{ _id : 1, _modified: 1, _revision: 1 } ?

They are bigger.
_id is 211 bytes per entry on average
_modified, _id is 233 bytes
_revision, _modified, _id is probably close to 400 bytes, as _revision is a
string.

I guess the way to tell is to generate the index on a test database and
see what impact it has.

Best Regards
Ian

>
> Chetan Mehrotra
>

Re: /oak:index (DocumentNodeStore)

Posted by Chetan Mehrotra <ch...@gmail.com>.

On Thu, Jul 9, 2015 at 12:45 PM, Marcel Reutegger <mr...@adobe.com> wrote:
> - Data in Oak is multi-versioned. It must be possible to query
>   nodes at a specific revision of the tree.

To add: that also makes it difficult to use Mongo indexes, as the
index itself is versioned. So instead of just indexing property 'foo'
you need to index it for every revision.

Chetan Mehrotra

Re: /oak:index (DocumentNodeStore)

Posted by Julian Reschke <ju...@gmx.de>.

On 2015-07-09 09:15, Marcel Reutegger wrote:
> Hi Ian,
>
> there are mainly two reasons why we cannot use DocumentStore
> based indexes for this purpose:
>
> - MongoDB only supports a limited number of indexes (64 per
>    collection) and applications usually have a need for more
>    indexes.
>
> - Data in Oak is multi-versioned. It must be possible to query
>    nodes at a specific revision of the tree.
>
> Lucene indexes are more efficient, but are only updated
> asynchronously. Whether this is acceptable usually depends on
> application requirements. Experience so far shows that many indexes
> can be asynchronous, because there was no hard requirement
> for synchronous index updates.
>
> Regards
>   Marcel

Do the above considerations also apply to the UUID index?

Best regards, Julian

Re: /oak:index (DocumentNodeStore)

Posted by Marcel Reutegger <mr...@adobe.com>.

Hi Ian,

there are mainly two reasons why we cannot use DocumentStore
based indexes for this purpose:

- MongoDB only supports a limited number of indexes (64 per
  collection) and applications usually have a need for more
  indexes. 

- Data in Oak is multi-versioned. It must be possible to query
  nodes at a specific revision of the tree.
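
The multi-version point can be illustrated with a small model. This is a
simplified sketch, not Oak's actual storage layout: revisions are modelled
as increasing integers, and `VersionedProperty` and `query` are
hypothetical names. It shows that the answer to "which nodes have
color = red" depends on the revision being read, which a native index over
a single current value cannot express.

```python
from bisect import bisect_right


class VersionedProperty:
    """Keeps every committed value of one property, keyed by revision.

    Revisions are plain increasing integers here, purely for illustration.
    """

    def __init__(self):
        self._revisions = []  # sorted revision numbers
        self._values = []     # value written at the matching revision

    def write(self, revision, value):
        if self._revisions and revision <= self._revisions[-1]:
            raise ValueError("revisions must be written in increasing order")
        self._revisions.append(revision)
        self._values.append(value)

    def read(self, revision):
        """Return the value visible at `revision`, or None if not yet set."""
        i = bisect_right(self._revisions, revision)
        return self._values[i - 1] if i else None


def query(nodes, value, revision):
    """Paths whose property equals `value` as seen from `revision`."""
    return sorted(p for p, prop in nodes.items() if prop.read(revision) == value)


color = {
    "/content/products/a": VersionedProperty(),
    "/content/products/b": VersionedProperty(),
}
color["/content/products/a"].write(1, "red")
color["/content/products/b"].write(1, "red")
color["/content/products/a"].write(2, "blue")  # a later change to node a

print(query(color, "red", 1))  # both nodes are red at revision 1
print(query(color, "red", 2))  # only node b is still red at revision 2
```

A reader at revision 1 and a reader at revision 2 get different results
from the same data, so an index would have to carry the revision as part of
each entry.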

Lucene indexes are more efficient, but are only updated
asynchronously. Whether this is acceptable usually depends on
application requirements. Experience so far shows that many indexes
can be asynchronous, because there was no hard requirement
for synchronous index updates.

Regards
 Marcel
