Posted to oak-dev@jackrabbit.apache.org by Ian Boston <ie...@tfd.co.uk> on 2015/06/12 12:01:07 UTC

MongoDB collections in MongoDocumentStore

Hi,
Is there a fundamental reason why the data stored in MongoDB for
MongoDocumentStore can't be stored in more than the 3 MongoDB collections
currently used?

I am thinking that the collection name could be a fn(key). What problems
would that cause elsewhere?
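
To illustrate, a minimal sketch of the kind of mapping I have in mind (the
scheme, names and fallback are hypothetical, not a proposal for the actual
implementation):

// Hypothetical sketch only: route a document to a collection based on the
// first character of its path, e.g. "2:/apps/foo" -> "nodes_a".
static String collectionFor(String key) {
    String path = key.substring(key.indexOf(":/") + 2); // drop the depth prefix
    char c = path.isEmpty() ? '_' : Character.toLowerCase(path.charAt(0));
    return (c >= 'a' && c <= 'z') ? "nodes_" + c : "nodes_root";
}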

Best Regards
Ian

Re: MongoDB collections in MongoDocumentStore

Posted by Norberto Leite <no...@norbertoleite.com>.
There are plenty of improvements :)
I'm actively working on some implementation details that should benefit the
overall performance of MongoMK.
Switching to / supporting WiredTiger would be beneficial, but there are
other improvements being discussed.

N.

Re: MongoDB collections in MongoDocumentStore

Posted by Ian Boston <ie...@tfd.co.uk>.
Hi Norberto,

Thank you. That saved me a lot of time, and I learnt something in the
process.

So in your opinion, is there anything that can or should be done in the
DocumentNodeStore, from a schema point of view, to improve the read or write
performance of Oak on MongoDB without resorting to sharding or upgrading to
3.0 and WiredTiger?
I am interested in JCR nodes, not including blobs.

Best Regards
Ian

Re: MongoDB collections in MongoDocumentStore

Posted by Norberto Leite <no...@norbertoleite.com>.
Hi Ian,

indexes are bound to a collection. That means that if you have a large
collection, its indexes will be correspondingly large. In the case of *_id*,
which is the primary key of every collection in MongoDB, the index size is
proportional to the number of documents the collection contains.
Spreading a large data set across different collections makes those indexes
individually smaller but, in combination, larger (we need to account for the
overhead of each index entry and the header information that composes each
index).
Also take into account that every time you switch between collections to
perform different queries (there are no joins in MongoDB) you will need to
reload into memory the index structures of all the individual collections
affected by your query, which comes with some penalties if you do not have
enough space in RAM for the full amount.
That said, with the MMAPv1 storage engine (the current default for both 3.0
and 2.6) MongoDB handles all information in one single big file per database
(although spread across different extents on disk). With WiredTiger this is
broken down into individual files per collection and per index structure.
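
A rough way to see this for yourself is to compare totalIndexSize across the
collections involved; here is a sketch using the 2.6-era Java driver (the
database and collection names are assumed for illustration):

import com.mongodb.CommandResult;
import com.mongodb.DB;
import com.mongodb.MongoClient;

public class IndexSizes {
    public static void main(String[] args) {
        // Sketch: sum totalIndexSize (from collStats) over hypothetical
        // split collections and compare with a single "nodes" collection.
        MongoClient client = new MongoClient("localhost");
        DB db = client.getDB("oak");
        long combined = 0;
        for (char c = 'a'; c <= 'z'; c++) {
            CommandResult stats = db.getCollection("nodes_" + c).getStats();
            combined += ((Number) stats.get("totalIndexSize")).longValue();
        }
        long single = ((Number) db.getCollection("nodes").getStats()
                .get("totalIndexSize")).longValue();
        System.out.println("split: " + combined + ", single: " + single);
    }
}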

Bottom line: there would be a marginal benefit to insert rates if you broke
the JCR nodes collection into different collections, because each insert
would have smaller index and data structures to traverse and update, but
there would be a lot more inefficiency on the query side, since you would be
page-faulting more often during the traversals required on both the indexes
and the collection data.

So yes, Chetan is right in stating that the actual size occupied by the
indexes would not be smaller; it would actually increase.

What is important to mention is that sharding takes care of this by
spreading the load between instances. This immediately reduces the size of
the data that each individual shard has to handle (smaller data collections
= smaller indexes) and allows the workload to be parallelised while serving
query requests.

Another aspect to consider is that fragmentation of the data set will affect
reads and writes in the long term. I'm going to be delivering a talk soon at
http://www.connectcon.ch/2015/en.html (if you are interested in attending)
where I address how to detect and handle these situations in JCR
implementations.

To complete the description: the concurrency control mechanism (often
referred to as locking) is more granular in the 3.0 MMAPv1 implementation,
going from database level to collection level.


N.

Re: MongoDB collections in MongoDocumentStore

Posted by Ian Boston <ie...@tfd.co.uk>.
Hi Norberto,

Thank you for the feedback on the questions. I see you work as an
Evangelist for MongoDB, so you will probably know the answers and can save
me time. I agree it's not worth doing anything about concurrency, even if
logs indicate there is contention on locks in 2.6, as the added complexity
would make reads worse. If an upgrade to 3.0 has been done, anything
collection-based is a waste of time due to the availability of WiredTiger.

Could you confirm that separating one large collection into a number of
smaller collections will not reduce the size of the indexes that have to be
consulted for queries of the form that Chetan shared earlier?

I'll try and clarify that question. DocumentNodeStore has 1 collection,
"nodes", containing all Documents. Some queries are only interested in a key
space representing a certain part of the "nodes" collection, eg
n:/largelystatic/**. If those Documents were stored in nodes_x, and
count(nodes_x) <= 0.001*count(nodes), would there be any performance
advantage, or does MongoDB, under the covers, treat all collections as a
single massive collection from an index and query point of view?

If you have any pointers to how 2.6 scales relative to collection size,
number of collections and index size, that would help me understand more
about its behaviour.

Best Regards
Ian

Re: MongoDB collections in MongoDocumentStore

Posted by Norberto Leite <no...@norbertoleite.com>.
Hi Ian,

Your proposal would not be very efficient.
The concurrency control mechanism that 2.6 (the currently supported version)
offers, although not negligible, would not be that beneficial for the write
load. The reading part, which we can assume is the bulk of the workload that
JCR generates, is not affected by it.
One needs to consider that every read from JCR would turn into a complex M/R
operation, which is designed to span the full set of documents in a given
collection, and would need to recurse over all affected collections. Not
very efficient.

The existing mechanism is far simpler and more efficient.
With the upcoming support for WiredTiger, the concurrency control (a
potential issue) becomes totally irrelevant.

Also, don't forget that you cannot predict the number of child nodes that a
given system will create to define its content tree.
If you do have a large number of documents nested at a specific level, you
would need to treat that collection separately (when needing to scale, shard
just that collection and not the others), bringing in more operational
complexity.

What could be a good discussion point is separating the blobs collection
into its own database, given the flexibility that JCR offers when treating
these 2 different data types.
Actually, this reminded me that I had a pending jira request to submit on
this matter <https://issues.apache.org/jira/browse/OAK-2984>.
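
Roughly, the idea is this (database names made up for illustration; this is
not how MongoDocumentStore is wired today):

import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

public class SeparateDatabases {
    public static void main(String[] args) {
        // Illustrative sketch: keep node documents and blob chunks in
        // separate databases, so their working sets and (under MMAPv1)
        // database-level locks are isolated from each other.
        MongoClient client = new MongoClient("localhost");
        DBCollection nodes = client.getDB("oak").getCollection("nodes");
        DBCollection blobs = client.getDB("oak-blobs").getCollection("blobs");
    }
}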

As Chetan is mentioning, sharding comes into play once we have to scale the
write throughput of the system.

N.

Re: MongoDB collections in MongoDocumentStore

Posted by Chetan Mehrotra <ch...@gmail.com>.
On Fri, Jun 12, 2015 at 7:32 PM, Ian Boston <ie...@tfd.co.uk> wrote:
> Initially I was thinking about the locking behaviour, but I realise 2.6.*
> still locks at a database level, and that only changes to a collection
> level in 3.0 with MMAPv1, and to document ("row") level if you switch to
> WiredTiger [1].

I initially thought the same, and then we benchmarked the throughput by
placing the BlobStore in a separate database (OAK-1153), but we did not
observe any significant gains. So that approach was not pursued further. If
we have some benchmark which can demonstrate that write throughput increases
if we _shard_ the node collection into a separate database on the same
server, then we can look further there.

Chetan Mehrotra

Re: MongoDB collections in MongoDocumentStore

Posted by Ian Boston <ie...@tfd.co.uk>.
On 12 June 2015 at 14:13, Chetan Mehrotra <ch...@gmail.com> wrote:

> On Fri, Jun 12, 2015 at 5:20 PM, Ian Boston <ie...@tfd.co.uk> wrote:
> > Are all queries expected to query all keys within a collection as it is
> > now, or is there some logical structure to the querying?
>
> Not sure if I get your question. The queries are always for immediate
> children. For example, for 1:/a the query is like
>
> $query: { _id: { $gt: "2:/a/", $lt: "2:/a0" } }
>


So, knowing that /a and all its children were in collection nodes_a, you
would only need to query nodes_a?

But if /a was stored in nodes_root and its children were stored in
nodes_[a-z] (26 collections), then you would need to map-reduce over all 26
collections?
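
Something like the following is what I imagine the fan-out would require
(collection names purely hypothetical, and plain driver queries rather than
a real M/R job):

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;
import java.util.ArrayList;
import java.util.List;

public class FanOutChildren {
    public static void main(String[] args) {
        // Hypothetical: run the children range query against each of
        // nodes_a..nodes_z and merge the results on the client.
        MongoClient client = new MongoClient("localhost");
        DB db = client.getDB("oak");
        DBObject range = new BasicDBObject("_id",
                new BasicDBObject("$gt", "2:/a/").append("$lt", "2:/a0"));
        List<DBObject> children = new ArrayList<DBObject>();
        for (char c = 'a'; c <= 'z'; c++) {
            DBCursor cursor = db.getCollection("nodes_" + c).find(range);
            while (cursor.hasNext()) {
                children.add(cursor.next());
            }
        }
        // 26 index lookups plus a client-side merge, instead of one query.
    }
}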

Initially I was thinking about the locking behaviour, but I realise 2.6.*
still locks at a database level, and that only changes to a collection
level in 3.0 with MMAPv1, and to document ("row") level if you switch to
WiredTiger [1].

Even so, would increasing the number of collections have an impact on query
costs? ie, put /oak:index in its own collection and isolate its indexes?

Best Regards
Ian


1 http://www.wiredtiger.com/



>
> Chetan Mehrotra
>

Re: MongoDB collections in MongoDocumentStore

Posted by Chetan Mehrotra <ch...@gmail.com>.
On Fri, Jun 12, 2015 at 5:20 PM, Ian Boston <ie...@tfd.co.uk> wrote:
> Are all queries expected to query all keys within a collection as it is
> now, or is there some logical structure to the querying?

Not sure if I get your question. The queries are always for immediate
children. For example, for 1:/a the query is like

$query: { _id: { $gt: "2:/a/", $lt: "2:/a0" } }
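
For reference, the same query through the Java driver would look roughly
like this (a sketch; the connection details are assumed):

DBCollection nodes = new MongoClient("localhost").getDB("oak").getCollection("nodes");
DBObject query = new BasicDBObject("_id",
        new BasicDBObject("$gt", "2:/a/").append("$lt", "2:/a0"));
DBCursor children = nodes.find(query); // immediate children of 1:/a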

Chetan Mehrotra

Re: MongoDB collections in MongoDocumentStore

Posted by Ian Boston <ie...@tfd.co.uk>.
Hi,

On 12 June 2015 at 11:07, Chetan Mehrotra <ch...@gmail.com> wrote:

> On Fri, Jun 12, 2015 at 3:31 PM, Ian Boston <ie...@tfd.co.uk> wrote:
> > I am thinking that the collection name could be a fn(key). What problems
> > would that cause elsewhere?
>
> One potential problem is when querying for children. If 2:/a/b and
> 2:/a/c are mapped to different collections, then querying for the children
> of 1:/a would become tricky
>

Are all queries expected to query all keys within a collection as it is
now, or is there some logical structure to the querying?

If there is no logical structure to the queries that hit MongoDB, then would
a map-reduce type query hitting all collections work?

Best Regards
Ian



>
> Chetan Mehrotra
>

Re: MongoDB collections in MongoDocumentStore

Posted by Chetan Mehrotra <ch...@gmail.com>.
On Fri, Jun 12, 2015 at 3:31 PM, Ian Boston <ie...@tfd.co.uk> wrote:
> I am thinking that the collection name could be a fn(key). What problems
> would that cause elsewhere?

One potential problem is when querying for children. If 2:/a/b and
2:/a/c are mapped to different collections, then querying for the children
of 1:/a would become tricky.

Chetan Mehrotra