Posted to oak-dev@jackrabbit.apache.org by Julian Reschke <ju...@gmx.de> on 2015/02/03 14:55:35 UTC
DocumentMK: improving the query API
The DocumentMK (formerly "MongoMK") uses the DocumentStore API
(org.apache.jackrabbit.oak.plugins.document) for persistence. We
currently have three implementations of this API:
1) MemoryDocumentStore (mainly for testing),
2) MongoDocumentStore, and
3) RDBDocumentStore (only in trunk for now).
In theory, the DocumentMK code should be persistence-agnostic; in
practice it has a few hardwired optimizations for Mongo. These are used
for recovery and maintenance tasks.
Mongo-specific optimizations are mainly there because of the way the
DocumentStore API handles queries:
    /**
     * Get a list of documents where the key is greater than a start value and
     * less than an end value <em>and</em> the given "indexed property" is
     * greater or equals the specified value.
     * <p>
     * The indexed property can either be a {@link Long} value, in which case
     * numeric comparison applies, or a {@link Boolean} value, in which case
     * "false" is mapped to "0" and "true" is mapped to "1".
     * <p>
     * The returned documents are sorted by key and are immutable.
     *
     * @param <T> the document type
     * @param collection the collection
     * @param fromKey the start value (excluding)
     * @param toKey the end value (excluding)
     * @param indexedProperty the name of the indexed property (optional)
     * @param startValue the minimum value of the indexed property
     * @param limit the maximum number of entries to return
     * @return the list (possibly empty)
     */
    @Nonnull
    <T extends Document> List<T> query(Collection<T> collection,
                                       String fromKey,
                                       String toKey,
                                       String indexedProperty,
                                       long startValue,
                                       int limit);
So the following criteria can be used to constrain a query:
a) range of IDs
b) a single greater-or-equals condition
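For illustration, the two constraints above amount to something like the following sketch, using a TreeMap as a stand-in for the store (an exclusive key range via subMap, plus one greater-or-equals check on a single indexed property):

```java
import java.util.*;

// Illustration only: what the current query() can express, evaluated
// over an in-memory sorted map instead of a real DocumentStore.
final class QuerySketch {
    static List<String> query(NavigableMap<String, Long> docs,
                              String fromKey, String toKey,
                              long startValue, int limit) {
        List<String> result = new ArrayList<>();
        // subMap(fromKey, false, toKey, false): both bounds exclusive,
        // matching the Javadoc above.
        for (Map.Entry<String, Long> e
                : docs.subMap(fromKey, false, toKey, false).entrySet()) {
            if (e.getValue() >= startValue) { // single >= condition
                result.add(e.getKey());
                if (result.size() >= limit) {
                    break;
                }
            }
        }
        return result;
    }
}
```

Anything beyond this shape (a different operator, several conditions at once) cannot be pushed down through the current API.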
In the maintenance tasks however we need additional constraints, such as:
- a condition other than greater-or-equals
- a conjunction of multiple constraints
Also, for big result sets the response type (a list) is sub-optimal
because a store might contain large NodeDocuments. Finally, there are
filter criteria that are hard/impossible to express declaratively.
Marcel and I chatted about this, and here are two API improvements we
could do; these are independent, and add some complexity - in the
optimal case we'll find out that doing one of these two would be sufficient.
Proposal #1: improve declarative constraints
Add a variant of query() such as:
   <T extends Document> List<T> query(Collection<T> collection,
                                      List<Constraint> constraints,
                                      int limit);
This would return all documents where all of the listed constraints are
true (we currently do not seem to have a use case for a disjunction). A
constraint would apply to an indexed property (such as "_id") and would
allow the common comparisons, plus an "in" clause.
This would be straightforward to support both in the Mongo- and
RDBDocumentStore.
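A Constraint could look roughly like the sketch below (names and shape are illustrative, not the actual Oak API): one indexed property, one operator, and the value(s) to compare against, with the query matching a document only if all constraints hold.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of Proposal #1's Constraint (illustrative, not the real API).
enum Op { EQUALS, LESS_THAN, LESS_OR_EQUALS, GREATER_THAN, GREATER_OR_EQUALS, IN }

final class Constraint {
    final String property;     // e.g. "_id", "_modified"
    final Op op;
    final List<Object> values; // one value, or several for IN

    Constraint(String property, Op op, Object... values) {
        this.property = property;
        this.op = op;
        this.values = Arrays.asList(values);
    }

    // Numeric evaluation, mirroring the Long semantics of the current
    // query() method; a real store would instead translate this into a
    // MongoDB query document or an SQL WHERE clause.
    boolean matches(long actual) {
        long first = ((Number) values.get(0)).longValue();
        switch (op) {
            case EQUALS:            return actual == first;
            case LESS_THAN:         return actual < first;
            case LESS_OR_EQUALS:    return actual <= first;
            case GREATER_THAN:      return actual > first;
            case GREATER_OR_EQUALS: return actual >= first;
            case IN:
                for (Object v : values) {
                    if (((Number) v).longValue() == actual) {
                        return true;
                    }
                }
                return false;
            default:
                return false;
        }
    }
}
```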
Proposal #2: add Java-based filtering and "sparse" documents
This would add a "QueryFilter" parameter to queries. A filter would have
- an optional way of selecting certain properties, and
- an accept(Document) method
Advantages:
- if the filter only selects certain properties (say "_id",
"_deletedOnce", and "_modified"), the persistence may not need to fetch
the complete document representation from storage (in RDB, this would be
true for any system property that has its own column)
- the accept method could have "arbitrary" complexity and would be
responsible for generating the result set; for instance, it might only
build a list of Strings containing the identifiers of matching documents
(which would be sufficient for a subsequent delete operation).
Note: Proposal #2 is more flexible, but as it's only partly declarative
it makes it impossible to pass the selection constraints down to the
persistence.
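A possible shape for such a filter, with a small example accumulating only matching ids (all names here are illustrative assumptions, and a Map stands in for a sparse NodeDocument):

```java
import java.util.*;

// Sketch of Proposal #2's QueryFilter. The property selection lets the
// store fetch "sparse" documents; accept() runs in Java and may
// accumulate its own result instead of returning full documents.
interface QueryFilter<T> {
    // Properties the filter needs, or null for the complete document.
    Set<String> selectedProperties();

    boolean accept(T document);
}

// Example: collect only the ids of documents that were deleted once.
final class DeletedOnceIdFilter implements QueryFilter<Map<String, Object>> {
    final List<String> ids = new ArrayList<>();

    @Override
    public Set<String> selectedProperties() {
        // Enough for the store to skip the full document representation.
        return new HashSet<>(Arrays.asList("_id", "_deletedOnce", "_modified"));
    }

    @Override
    public boolean accept(Map<String, Object> doc) {
        if (Boolean.TRUE.equals(doc.get("_deletedOnce"))) {
            ids.add((String) doc.get("_id"));
            return true;
        }
        return false;
    }
}
```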
Feedback appreciated...
Re: DocumentMK: improving the query API
Posted by Marcel Reutegger <mr...@adobe.com>.
Hi,
On 17/02/15 11:42, "Julian Reschke" <ju...@greenbytes.de> wrote:
>On 2015-02-17 09:49, Marcel Reutegger wrote:
>>I also don't particularly like the List as return value. we used it
>> for other methods and it always turns out to be problematic to find
>> a reasonable number for the limit (aka batch size). the number depends
>> very much on the size of the returned documents.
>
>Well, there are two questions here: List vs Iterator, and what type to
>actually use.
If we want to go with Proposal #1, my preference would be a
CloseableIterator (extends Iterator, Closeable).
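A hypothetical shape for the closeable iterator mentioned above (names assumed): an Iterator the caller closes when done, so the store can release underlying resources (a MongoDB cursor, a JDBC ResultSet) even if not all results were consumed.

```java
import java.io.Closeable;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch, not the actual Oak API.
interface CloseableIterator<T> extends Iterator<T>, Closeable {
    @Override
    void close(); // narrowed from Closeable: no checked IOException
}

// Minimal in-memory implementation for illustration.
final class ListCloseableIterator<T> implements CloseableIterator<T> {
    private final Iterator<T> delegate;
    private boolean closed;

    ListCloseableIterator(List<T> list) {
        this.delegate = list.iterator();
    }

    @Override
    public boolean hasNext() {
        return !closed && delegate.hasNext();
    }

    @Override
    public T next() {
        return delegate.next();
    }

    @Override
    public void close() {
        closed = true; // a real store would release its cursor here
    }
}
```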
>>we could also implement a combination of both. something like this:
>>
>> <T extends Document> void query(Collection<T> collection,
>> List<Constraint> constraints,
>> ResultCollector<T> collector);
>>
>> interface ResultCollector<T extends Document> {
>> public boolean collect(T document);
>> }
>>
>>
>> Advantages:
>>
>> - no need for limit and closeable. a client either collects
>> all results or interrupts by returning false in collect(). this
>> indicates to the DocumentStore that resources can be freed.
>
>But it lacks the declarative part of "limit" (it's useful to be able to
>tell the DB upfront how many results we want to see).
If needed we could add this to the signature as well...
Regards
Marcel
Re: DocumentMK: improving the query API
Posted by Julian Reschke <ju...@greenbytes.de>.
On 2015-02-17 09:49, Marcel Reutegger wrote:
> Hi,
>
> On 03/02/15 14:55, "Julian Reschke" <ju...@gmx.de> wrote:
>> Marcel and I chatted about this, and here are two API improvements we
>> could do; these are independent, and add some complexity - in the
>> optimal case we'll find out that doing one of these two would be
>> sufficient.
>>
>> Proposal #1: improve declarative constraints
>>
>> Add a variant of query() such as:
>>
>> <T extends Document> List<T> query(Collection<T> collection,
>> List<Constraint> constraints,
>> int limit);
>>
>> This would return all documents where all of the listed constraints are
>> true (we currently do not seem to have a use case for a disjunction). A
>> constraint would apply to an indexed property (such as "_id") and would
>> allow the common comparisons, plus an "in" clause.
>>
>> This would be straightforward to support both in the Mongo- and
>> RDBDocumentStore.
>
> even though not strictly required, above method signature does not
> have a start id for paging through a bigger set of matching documents.
> the start id for the next batch needs to be added as a constraint, just
> like any other regular constraint. From a client POV, I would probably
> prefer an explicit parameter.
OK.
> I also don't particularly like the List as return value. we used it
> for other methods and it always turns out to be problematic to find
> a reasonable number for the limit (aka batch size). the number depends
> very much on the size of the returned documents.
Well, there are two questions here: List vs Iterator, and what type to
actually use.
>> Proposal #2: add Java-based filtering and "sparse" documents
>>
>> This would add a "QueryFilter" parameter to queries. A filter would have
>>
>> - an optional way of selecting certain properties, and
>> - an accept(Document) method
>>
>> Advantages:
>>
>> - if the filter only selects certain properties (say "_id",
>> "_deletedOnce", and "_modified"), the persistence may not need to fetch
>> the complete document representation from storage (in RDB, this would be
>> true for any system property that has its own column)
>>
>> - the accept method could have "arbitrary" complexity and would be
>> responsible for generating the result set; for instance, it might only
>> build a list of Strings containing the identifiers of matching documents
>> (which would be sufficient for a subsequent delete operation).
>>
>>
>> Note: Proposal #2 is more flexible, but as it's only partly declarative
>> it makes it impossible to pass the selection constraints down to the
>> persistence.
>
> I think this is a major drawback of this approach. depending on the
> selectivity of the filter, we may have to read a lot of documents
> from the store just to find out they don't match.
Indeed.
> we could also implement a combination of both. something like this:
>
> <T extends Document> void query(Collection<T> collection,
> List<Constraint> constraints,
> ResultCollector<T> collector);
>
> interface ResultCollector<T extends Document> {
> public boolean collect(T document);
> }
>
>
> Advantages:
>
> - no need for limit and closeable. a client either collects
> all results or interrupts by returning false in collect(). this
> indicates to the DocumentStore that resources can be freed.
But it lacks the declarative part of "limit" (it's useful to be able to
tell the DB upfront how many results we want to see).
> Drawback:
>
> - does not work well with clients exposing results through
> an iterator (pull vs. push).
Indeed.
Best regards, Julian
Re: DocumentMK: improving the query API
Posted by Marcel Reutegger <mr...@adobe.com>.
Hi,
On 03/02/15 14:55, "Julian Reschke" <ju...@gmx.de> wrote:
>Marcel and I chatted about this, and here are two API improvements we
>could do; these are independent, and add some complexity - in the
>optimal case we'll find out that doing one of these two would be
>sufficient.
>
>Proposal #1: improve declarative constraints
>
>Add a variant of query() such as:
>
> <T extends Document> List<T> query(Collection<T> collection,
> List<Constraint> constraints,
> int limit);
>
>This would return all documents where all of the listed constraints are
>true (we currently do not seem to have a use case for a disjunction). A
>constraint would apply to an indexed property (such as "_id") and would
>allow the common comparisons, plus an "in" clause.
>
>This would be straightforward to support both in the Mongo- and
>RDBDocumentStore.
even though not strictly required, above method signature does not
have a start id for paging through a bigger set of matching documents.
the start id for the next batch needs to be added as a constraint, just
like any other regular constraint. From a client POV, I would probably
prefer an explicit parameter.
I also don't particularly like the List as return value. we used it
for other methods and it always turns out to be problematic to find
a reasonable number for the limit (aka batch size). the number depends
very much on the size of the returned documents.
>Proposal #2: add Java-based filtering and "sparse" documents
>
>This would add a "QueryFilter" parameter to queries. A filter would have
>
>- an optional way of selecting certain properties, and
>- an accept(Document) method
>
>Advantages:
>
>- if the filter only selects certain properties (say "_id",
>"_deletedOnce", and "_modified"), the persistence may not need to fetch
>the complete document representation from storage (in RDB, this would be
>true for any system property that has its own column)
>
>- the accept method could have "arbitrary" complexity and would be
>responsible for generating the result set; for instance, it might only
>build a list of Strings containing the identifiers of matching documents
>(which would be sufficient for a subsequent delete operation).
>
>
>Note: Proposal #2 is more flexible, but as it's only partly declarative
>it makes it impossible to pass the selection constraints down to the
>persistence.
I think this is a major drawback of this approach. depending on the
selectivity of the filter, we may have to read a lot of documents
from the store just to find out they don't match.
we could also implement a combination of both. something like this:
  <T extends Document> void query(Collection<T> collection,
                                  List<Constraint> constraints,
                                  ResultCollector<T> collector);

  interface ResultCollector<T extends Document> {
      public boolean collect(T document);
  }
Advantages:
- no need for limit and closeable. a client either collects
all results or interrupts by returning false in collect(). this
indicates to the DocumentStore that resources can be freed.
Drawback:
- does not work well with clients exposing results through
an iterator (pull vs. push).
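To illustrate the push style sketched above, a collector could gather only document ids and interrupt the query once a batch is full, which is all a subsequent delete operation needs (names are illustrative; a Map stands in for a document):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the proposed push-style callback (illustrative names).
interface ResultCollector<T> {
    // Returning false tells the store to stop and free its resources.
    boolean collect(T document);
}

// Example: accumulate ids for a later delete, stopping after batchSize.
final class IdBatchCollector implements ResultCollector<Map<String, Object>> {
    final List<String> ids = new ArrayList<>();
    private final int batchSize;

    IdBatchCollector(int batchSize) {
        this.batchSize = batchSize;
    }

    @Override
    public boolean collect(Map<String, Object> doc) {
        ids.add((String) doc.get("_id"));
        return ids.size() < batchSize; // interrupt when the batch is full
    }
}
```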
Regards
Marcel
Re: DocumentStore API extensibility, was: DocumentMK: improving the query API
Posted by Chetan Mehrotra <ch...@gmail.com>.
On Fri, Feb 13, 2015 at 5:06 PM, Julian Reschke <ju...@gmx.de> wrote:
> As a short-term change, I'd like to get rid of CachingDocumentStore, move
> its single method into DocumentStore, and allow that method to return null.
+1
Chetan Mehrotra
DocumentStore API extensibility, was: DocumentMK: improving the query API
Posted by Julian Reschke <ju...@gmx.de>.
On 2015-02-03 14:55, Julian Reschke wrote:
>...
...and, while at it...
As long as we only have a few hardwired implementations of
DocumentStore, we might as well clean up the API a bit more.
We currently have one extension, CachingDocumentStore, implemented by
Mongo and RDB.
Extension interfaces get tricky once you have too many; they also
interfere badly with wrappers (such as the LoggingDocumentStoreWrapper).
This is something we could address with a more dynamic approach such as
Sling's Adaptable or java.sql's Wrapper interfaces (but none of these
could be used here, right?).
As a short-term change, I'd like to get rid of CachingDocumentStore,
move its single method into DocumentStore, and allow that method to
return null.
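The short-term change could look roughly like this sketch (the method name and the CacheStats stand-in are assumptions, not taken from this thread):

```java
// Sketch: the single method of CachingDocumentStore moved into
// DocumentStore itself, now allowed to return null.
final class CacheStats { }

interface DocumentStore {
    // Returns cache statistics, or null if this store has no cache
    // (e.g. MemoryDocumentStore); wrappers such as
    // LoggingDocumentStoreWrapper can then delegate unconditionally.
    CacheStats getCacheStats();
}

final class MemoryStoreSketch implements DocumentStore {
    @Override
    public CacheStats getCacheStats() {
        return null; // no cache to report on
    }
}
```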
WDYT?