Posted to oak-dev@jackrabbit.apache.org by Julian Reschke <ju...@gmx.de> on 2015/02/03 14:55:35 UTC

DocumentMK: improving the query API

The DocumentMK (formerly "MongoMK") uses the DocumentStore API 
(org.apache.jackrabbit.oak.plugins.document) for persistence. We 
currently have three implementations of this API:

1) MemoryDocumentStore (mainly for testing),
2) MongoDocumentStore, and
3) RDBDocumentStore (only in trunk for now).

In theory, the DocumentMK code should be persistence-agnostic; in 
practice it has a few hardwired optimizations for Mongo. These are used 
for recovery and maintenance tasks.

Mongo-specific optimizations are mainly there because of the way the 
DocumentStore API handles queries:

   /**
    * Get a list of documents where the key is greater than a start value
    * and less than an end value <em>and</em> the given "indexed property"
    * is greater or equals the specified value.
    * <p>
    * The indexed property can either be a {@link Long} value, in which
    * case numeric comparison applies, or a {@link Boolean} value, in which
    * case "false" is mapped to "0" and "true" is mapped to "1".
    * <p>
    * The returned documents are sorted by key and are immutable.
    *
    * @param <T> the document type
    * @param collection the collection
    * @param fromKey the start value (excluding)
    * @param toKey the end value (excluding)
    * @param indexedProperty the name of the indexed property (optional)
    * @param startValue the minimum value of the indexed property
    * @param limit the maximum number of entries to return
    * @return the list (possibly empty)
    */
   @Nonnull
   <T extends Document> List<T> query(Collection<T> collection,
                                      String fromKey,
                                      String toKey,
                                      String indexedProperty,
                                      long startValue,
                                      int limit);

So the following criteria can be used to constrain a query:

a) range of IDs
b) a single greater-or-equals condition
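
To make the current contract concrete, here is a minimal in-memory sketch of these two constraints (the map-based store and all names are made up for illustration; this is not the actual implementation):

```java
import java.util.*;
import java.util.stream.Collectors;

// Minimal in-memory sketch of the query() semantics above: a key-range
// scan (fromKey/toKey exclusive) plus one optional greater-or-equals
// check on an "indexed property". Names and types are illustrative only.
public class QuerySketch {

    static List<String> query(NavigableMap<String, Map<String, Long>> docs,
                              String fromKey, String toKey,
                              String indexedProperty, long startValue,
                              int limit) {
        return docs.subMap(fromKey, false, toKey, false).entrySet().stream()
                .filter(e -> {
                    if (indexedProperty == null) {
                        return true; // the property constraint is optional
                    }
                    Long v = e.getValue().get(indexedProperty);
                    return v != null && v >= startValue;
                })
                .limit(limit)
                .map(Map.Entry::getKey) // results come out sorted by key
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        NavigableMap<String, Map<String, Long>> docs = new TreeMap<>();
        docs.put("1:/a", Collections.singletonMap("_modified", 5L));
        docs.put("1:/b", Collections.singletonMap("_modified", 9L));
        docs.put("1:/c", Collections.singletonMap("_modified", 2L));

        // keys strictly between "1:/" and "1:/z" with _modified >= 5
        System.out.println(query(docs, "1:/", "1:/z", "_modified", 5, 10));
        // prints [1:/a, 1:/b]
    }
}
```

A real store would of course translate both constraints into the backend query instead of filtering in memory.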

In the maintenance tasks, however, we need additional constraints, such as:

- a condition other than greater-or-equals
- a conjunction of multiple constraints

Also, for big result sets the return type (a list) is sub-optimal, 
because a store might contain large NodeDocuments. Finally, there are 
filter criteria that are hard or impossible to express declaratively.

Marcel and I chatted about this, and here are two API improvements we 
could do; these are independent, and add some complexity - in the 
optimal case we'll find out that doing one of these two would be sufficient.


Proposal #1: improve declarative constraints

Add a variant of query() such as:

   <T extends Document> List<T> query(Collection<T> collection,
                                      List<Constraint> constraints,
                                      int limit);

This would return all documents where all of the listed constraints are 
true (we currently do not seem to have a use case for a disjunction). A 
constraint would apply to an indexed property (such as "_id") and would 
allow the common comparisons, plus an "in" clause.

This would be straightforward to support both in the Mongo- and 
RDBDocumentStore.
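
A hypothetical sketch of what such a Constraint could look like, evaluated in memory (the names Op, Constraint.matches and matchesAll are assumptions, not part of any actual API):

```java
import java.util.*;

// Hypothetical sketch of the proposed declarative Constraint: each
// constraint pairs an indexed property with a comparison, and a query
// matches only when ALL constraints hold (conjunction).
public class ConstraintSketch {

    enum Op { EQUALS, GREATER_OR_EQUALS, LESS_THAN, IN }

    static final class Constraint {
        final String property;
        final Op op;
        final List<Long> values; // one value for comparisons, several for IN

        Constraint(String property, Op op, Long... values) {
            this.property = property;
            this.op = op;
            this.values = Arrays.asList(values);
        }

        boolean matches(Map<String, Long> doc) {
            Long v = doc.get(property);
            if (v == null) {
                return false;
            }
            switch (op) {
                case EQUALS:            return v.equals(values.get(0));
                case GREATER_OR_EQUALS: return v >= values.get(0);
                case LESS_THAN:         return v < values.get(0);
                case IN:                return values.contains(v);
                default:                return false;
            }
        }
    }

    // conjunction: a document matches only if every constraint accepts it
    static boolean matchesAll(Map<String, Long> doc, List<Constraint> constraints) {
        return constraints.stream().allMatch(c -> c.matches(doc));
    }

    public static void main(String[] args) {
        Map<String, Long> doc = new HashMap<>();
        doc.put("_modified", 100L);
        doc.put("_deletedOnce", 1L);

        List<Constraint> cs = Arrays.asList(
                new Constraint("_modified", Op.LESS_THAN, 200L),
                new Constraint("_deletedOnce", Op.EQUALS, 1L));

        System.out.println(matchesAll(doc, cs)); // prints true
    }
}
```

The point of keeping constraints declarative is that Mongo could translate each one into a query operator and RDB into a WHERE clause, so the conjunction is evaluated entirely in the backend.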


Proposal #2: add Java-based filtering and "sparse" documents

This would add a "QueryFilter" parameter to queries. A filter would have

- an optional way of selecting certain properties, and
- an accept(Document) method

Advantages:

- if the filter only selects certain properties (say "_id", 
"_deletedOnce", and "_modified"), the persistence may not need to fetch 
the complete document representation from storage (in RDB, this would be 
true for any system property that has its own column)

- the accept method could have "arbitrary" complexity and would be 
responsible for generating the result set; for instance, it might only 
build a list of Strings containing the identifiers of matching documents 
(which would be sufficient for a subsequent delete operation).
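
A sketch of what such a QueryFilter could look like (the interface and method names are assumptions; the projection set mirrors the "_id"/"_deletedOnce"/"_modified" example above):

```java
import java.util.*;

// Hypothetical sketch of the QueryFilter idea. The filter both projects
// a subset of properties (so the store can fetch "sparse" documents) and
// decides programmatically which documents to accept.
public class FilterSketch {

    interface QueryFilter {
        // properties the store needs to load; empty means "all"
        Set<String> selectedProperties();
        // arbitrary Java-side predicate on the (possibly sparse) document
        boolean accept(Map<String, Object> document);
    }

    static List<String> collectIds(List<Map<String, Object>> docs, QueryFilter filter) {
        List<String> ids = new ArrayList<>();
        for (Map<String, Object> doc : docs) {
            if (filter.accept(doc)) {
                // only ids are kept, e.g. for a subsequent delete
                ids.add((String) doc.get("_id"));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        QueryFilter deletedOnce = new QueryFilter() {
            public Set<String> selectedProperties() {
                return new HashSet<>(Arrays.asList("_id", "_deletedOnce", "_modified"));
            }
            public boolean accept(Map<String, Object> doc) {
                return Boolean.TRUE.equals(doc.get("_deletedOnce"));
            }
        };

        List<Map<String, Object>> docs = new ArrayList<>();
        Map<String, Object> d1 = new HashMap<>();
        d1.put("_id", "1:/a"); d1.put("_deletedOnce", true);
        Map<String, Object> d2 = new HashMap<>();
        d2.put("_id", "1:/b"); d2.put("_deletedOnce", false);
        docs.add(d1); docs.add(d2);

        System.out.println(collectIds(docs, deletedOnce)); // prints [1:/a]
    }
}
```

The projection is the declarative part a store could still push down; the accept() logic necessarily runs in Java.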


Note: Proposal #2 is more flexible, but as it's only partly declarative 
it makes it impossible to pass the selection constraints down to the 
persistence.

Feedback appreciated...

Re: DocumentMK: improving the query API

Posted by Marcel Reutegger <mr...@adobe.com>.
Hi,

On 17/02/15 11:42, "Julian Reschke" <ju...@greenbytes.de> wrote:
>On 2015-02-17 09:49, Marcel Reutegger wrote:
>>I also don't particularly like the List as return value. we used it
>> for other methods and it always turns out to be problematic to find
>> a reasonable number for the limit (aka batch size). the number depends
>> very much on the size of the returned documents.
>
>Well, there are two questions here: List vs Iterator, and what type to
>actually use.

If we want to go with Proposal #1, my preference would be a
CloseableIterator (extends Iterator, Closeable).

>>we could also implement a combination of both. something like this:
>>
>> <T extends Document> void query(Collection<T> collection,
>>                                  List<Constraint> constraints,
>>                                  ResultCollector<T> collector);
>>
>> interface ResultCollector<T extends Document> {
>>      public boolean collect(T document);
>> }
>>
>>
>> Advantages:
>>
>> - no need for limit and closeable. a client either collects
>> all results or interrupts by returning false in collect(). this
>> indicates to the DocumentStore that resources can be freed.
>
>But it lacks the declarative part of "limit" (it's useful to be able to
>tell the DB upfront how many results we want to see).

If needed we could add this to the signature as well...

Regards
 Marcel


Re: DocumentMK: improving the query API

Posted by Julian Reschke <ju...@greenbytes.de>.
On 2015-02-17 09:49, Marcel Reutegger wrote:
> Hi,
>
> On 03/02/15 14:55, "Julian Reschke" <ju...@gmx.de> wrote:
>> Marcel and I chatted about this, and here are two API improvements we
>> could do; these are independent, and add some complexity - in the
>> optimal case we'll find out that doing one of these two would be
>> sufficient.
>>
>> Proposal #1: improve declarative constraints
>>
>> Add a variant of query() such as:
>>
>>    <T extends Document> List<T> query(Collection<T> collection,
>>                                       List<Constraint> constraints,
>>                                       int limit);
>>
>> This would return all documents where all of the listed constraints are
>> true (we currently do not seem to have a use case for a disjunction). A
>> constraint would apply to an indexed property (such as "_id") and would
>> allow the common comparisons, plus an "in" clause.
>>
>> This would be straightforward to support both in the Mongo- and
>> RDBDocumentStore.
>
> even though not strictly required, the above method signature does not
> have a start id for paging through a bigger set of matching documents.
> the start id for the next batch needs to be added as a constraint, just
> like any other regular constraint. From a client POV, I would probably
> prefer an explicit parameter.

OK.

> I also don't particularly like the List as return value. we used it
> for other methods and it always turns out to be problematic to find
> a reasonable number for the limit (aka batch size). the number depends
> very much on the size of the returned documents.

Well, there are two questions here: List vs Iterator, and what type to 
actually use.

>> Proposal #2: add Java-based filtering and "sparse" documents
>>
>> This would add a "QueryFilter" parameter to queries. A filter would have
>>
>> - an optional way of selecting certain properties, and
>> - an accept(Document) method
>>
>> Advantages:
>>
>> - if the filter only selects certain properties (say "_id",
>> "_deletedOnce", and "_modified"), the persistence may not need to fetch
>> the complete document representation from storage (in RDB, this would be
>> true for any system property that has its own column)
>>
>> - the accept method could have "arbitrary" complexity and would be
>> responsible for generating the result set; for instance, it might only
>> build a list of Strings containing the identifiers of matching documents
>> (which would be sufficient for a subsequent delete operation).
>>
>>
>> Note: Proposal #2 is more flexible, but as it's only partly declarative
>> it makes it impossible to pass the selection constraints down to the
>> persistence.
>
> I think this is a major drawback of this approach. depending on the
> selectivity of the filter, we may have to read a lot of documents
> from the store just to find out they don't match.

Indeed.

> we could also implement a combination of both. something like this:
>
> <T extends Document> void query(Collection<T> collection,
>                                  List<Constraint> constraints,
>                                  ResultCollector<T> collector);
>
> interface ResultCollector<T extends Document> {
>      public boolean collect(T document);
> }
>
>
> Advantages:
>
> - no need for limit and closeable. a client either collects
> all results or interrupts by returning false in collect(). this
> indicates to the DocumentStore that resources can be freed.

But it lacks the declarative part of "limit" (it's useful to be able to 
tell the DB upfront how many results we want to see).

> Drawback:
>
> - does not work well with clients exposing results through
> an iterator (pull vs. push).

Indeed.

Best regards, Julian






Re: DocumentMK: improving the query API

Posted by Marcel Reutegger <mr...@adobe.com>.
Hi,

On 03/02/15 14:55, "Julian Reschke" <ju...@gmx.de> wrote:
>Marcel and I chatted about this, and here are two API improvements we
>could do; these are independent, and add some complexity - in the
>optimal case we'll find out that doing one of these two would be
>sufficient.
>
>Proposal #1: improve declarative constraints
>
>Add a variant of query() such as:
>
>   <T extends Document> List<T> query(Collection<T> collection,
>                                      List<Constraint> constraints,
>                                      int limit);
>
>This would return all documents where all of the listed constraints are
>true (we currently do not seem to have a use case for a disjunction). A
>constraint would apply to an indexed property (such as "_id") and would
>allow the common comparisons, plus an "in" clause.
>
>This would be straightforward to support both in the Mongo- and
>RDBDocumentStore.

even though not strictly required, the above method signature does not
have a start id for paging through a bigger set of matching documents.
the start id for the next batch needs to be added as a constraint, just
like any other regular constraint. From a client POV, I would probably
prefer an explicit parameter.

I also don't particularly like the List as return value. we used it
for other methods and it always turns out to be problematic to find
a reasonable number for the limit (aka batch size). the number depends
very much on the size of the returned documents.

>Proposal #2: add Java-based filtering and "sparse" documents
>
>This would add a "QueryFilter" parameter to queries. A filter would have
>
>- an optional way of selecting certain properties, and
>- an accept(Document) method
>
>Advantages:
>
>- if the filter only selects certain properties (say "_id",
>"_deletedOnce", and "_modified"), the persistence may not need to fetch
>the complete document representation from storage (in RDB, this would be
>true for any system property that has its own column)
>
>- the accept method could have "arbitrary" complexity and would be
>responsible for generating the result set; for instance, it might only
>build a list of Strings containing the identifiers of matching documents
>(which would be sufficient for a subsequent delete operation).
>
>
>Note: Proposal #2 is more flexible, but as it's only partly declarative
>it makes it impossible to pass the selection constraints down to the
>persistence.

I think this is a major drawback of this approach. depending on the
selectivity of the filter, we may have to read a lot of documents
from the store just to find out they don't match.


we could also implement a combination of both. something like this:

<T extends Document> void query(Collection<T> collection,
                                List<Constraint> constraints,
                                ResultCollector<T> collector);

interface ResultCollector<T extends Document> {
    public boolean collect(T document);
}


Advantages:

- no need for limit and closeable. a client either collects
all results or interrupts by returning false in collect(). this
indicates to the DocumentStore that resources can be freed.

Drawback:

- does not work well with clients exposing results through
an iterator (pull vs. push).
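
For illustration, the push-style collector could be driven like this (hypothetical code, not the proposed API itself); returning false from collect() takes the place of an explicit limit:

```java
import java.util.*;

// End-to-end sketch of the push-style ResultCollector idea above.
// Returning false from collect() stops iteration, so no explicit
// limit parameter or close() call is needed.
public class CollectorSketch {

    interface ResultCollector<T> {
        boolean collect(T document); // false = stop, resources can be freed
    }

    static void query(List<String> matchingDocs, ResultCollector<String> collector) {
        for (String doc : matchingDocs) {
            if (!collector.collect(doc)) {
                return; // client interrupted; the store frees its resources here
            }
        }
    }

    public static void main(String[] args) {
        List<String> results = new ArrayList<>();
        int limit = 2;
        // collect at most 'limit' documents, then interrupt
        query(Arrays.asList("1:/a", "1:/b", "1:/c"),
                doc -> {
                    results.add(doc);
                    return results.size() < limit;
                });
        System.out.println(results); // prints [1:/a, 1:/b]
    }
}
```

The counter in the lambda emulates a limit on the client side only; the store does not learn upfront how many results are wanted.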


Regards

 Marcel


Re: DocumentStore API extensibility, was: DocumentMK: improving the query API

Posted by Chetan Mehrotra <ch...@gmail.com>.
On Fri, Feb 13, 2015 at 5:06 PM, Julian Reschke <ju...@gmx.de> wrote:
> As a short-term change, I'd like to get rid of CachingDocumentStore, move
> its single method into DocumentStore, and allow that method to return null.

+1

Chetan Mehrotra

DocumentStore API extensibility, was: DocumentMK: improving the query API

Posted by Julian Reschke <ju...@gmx.de>.
On 2015-02-03 14:55, Julian Reschke wrote:
>...

...and, while at it...

As long as we only have a few hardwired implementations of 
DocumentStore, we might as well clean up the API a bit more.

We currently have one extension, CachingDocumentStore, implemented by 
Mongo and RDB.

Extension interfaces get tricky once you have too many; they also 
interfere badly with wrappers (such as the LoggingDocumentStoreWrapper). 
This is something we could address with a more dynamic approach such as 
Sling's Adaptable or java.sql's Wrapper interfaces (but none of these 
could be used here, right?).
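
A sketch of how a java.sql.Wrapper-style lookup could work for store extensions (all names here are hypothetical), including forwarding through a decorator:

```java
// Sketch of a java.sql.Wrapper-style dynamic lookup for DocumentStore
// extensions (hypothetical; the interface names are made up). A wrapper
// such as a logging decorator forwards unwrap() to its delegate, so
// extension interfaces keep working through wrappers.
public class UnwrapSketch {

    interface Store {
        default <E> E unwrap(Class<E> extension) {
            return extension.isInstance(this) ? extension.cast(this) : null;
        }
    }

    interface CachingStore extends Store {
        long getCacheSize();
    }

    // a decorator that forwards extension lookups to the wrapped store
    static class LoggingStore implements Store {
        final Store delegate;
        LoggingStore(Store delegate) { this.delegate = delegate; }
        public <E> E unwrap(Class<E> extension) {
            return delegate.unwrap(extension);
        }
    }

    public static void main(String[] args) {
        Store store = new LoggingStore((CachingStore) () -> 42L);
        CachingStore caching = store.unwrap(CachingStore.class);
        System.out.println(caching == null ? -1 : caching.getCacheSize()); // prints 42
    }
}
```

java.sql.Wrapper works the same way (unwrap/isWrapperFor), which is what keeps JDBC extension lookups functional through connection-pool wrappers.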

As a short-term change, I'd like to get rid of CachingDocumentStore, 
move its single method into DocumentStore, and allow that method to 
return null.

WDYT?