
Posted to users@jackrabbit.apache.org by Kevin Müller <Ke...@cib.de> on 2012/01/13 11:01:07 UTC

Bulk load of several nodes

Hi,

I think in some cases it would be useful to get all or some properties of a set of nodes in one database query.
Can somebody tell me if something like this is planned for future releases?

One example for this use case would be:
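// Assumed context: qm is a QueryManager, res is a Map<String, Map<String, String>>.
Query query = qm.createQuery("//*[@prop='some_value']/(@prop2|@prop3)", Query.XPATH);
for (RowIterator it = query.execute().getRows(); it.hasNext(); ) {
        Row row = it.nextRow();
        // Collect the two selected properties of this hit.
        Map<String, String> map = new HashMap<String, String>();
        for (String key : Arrays.asList("prop2", "prop3")) {
                Value val = row.getValue(key);
                // Keep the key even when the value is missing.
                map.put(key, val != null ? val.getString() : null);
        }
        res.put(row.getPath(), map);
}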

Wouldn't it be nice if this could be done with one database roundtrip? Right now (2.2.10) there are at least n roundtrips (n == number of results in the query), it seems to me.
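
To make the difference concrete, here is a minimal sketch of the two access patterns. CountingStore, load and loadAll are hypothetical names invented for this illustration and are not Jackrabbit APIs; the store only counts how often it is called, which stands in for database roundtrips.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a database-backed persistence layer.
// It is not a Jackrabbit class; it only counts simulated roundtrips.
class CountingStore {
    private final Map<String, Map<String, String>> bundles = new LinkedHashMap<>();
    private int roundtrips = 0;

    void put(String path, Map<String, String> properties) {
        bundles.put(path, properties);
    }

    // Per-node access: every call costs one simulated roundtrip.
    Map<String, String> load(String path) {
        roundtrips++;
        return bundles.get(path);
    }

    // Bulk access: the whole list costs one simulated roundtrip.
    Map<String, Map<String, String>> loadAll(List<String> paths) {
        roundtrips++;
        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        for (String path : paths) {
            result.put(path, bundles.get(path));
        }
        return result;
    }

    int roundtrips() {
        return roundtrips;
    }
}

class BulkLoadSketch {
    // Fills the store with n nodes and returns their paths (the "query hits").
    static List<String> fill(CountingStore store, int n) {
        List<String> paths = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            String path = "/node" + i;
            Map<String, String> props = new LinkedHashMap<>();
            props.put("prop2", "a" + i);
            props.put("prop3", "b" + i);
            store.put(path, props);
            paths.add(path);
        }
        return paths;
    }

    // Loads every hit individually and reports the roundtrip count.
    static int roundtripsPerNode(int n) {
        CountingStore store = new CountingStore();
        for (String path : fill(store, n)) {
            store.load(path);
        }
        return store.roundtrips();
    }

    // Loads all hits with one bulk call and reports the roundtrip count.
    static int roundtripsBulk(int n) {
        CountingStore store = new CountingStore();
        store.loadAll(fill(store, n));
        return store.roundtrips();
    }

    public static void main(String[] args) {
        System.out.println("per node: " + roundtripsPerNode(100));
        System.out.println("bulk:     " + roundtripsBulk(100));
    }
}
```

With 100 hits, the per-node variant counts 100 simulated roundtrips and the bulk variant counts 1. Whether a real persistence manager can offer such a bulk call is exactly the open question of this mail.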

Regards,
Kevin Müller

Re: Bulk load of several nodes

Posted by Kevin Müller <Ke...@cib.de>.

Are there any plans to extend this in the near future?
When a Lucene query that should return some properties produces 100 hits and none of them are cached, I think it can make quite a difference performance-wise whether there are 100 DB queries or 1, can't it?

Regards, Kevin

-----Original Message-----
From: Alexander Klimetschek [mailto:aklimets@adobe.com]
Sent: Tuesday, 17 January 2012 23:54
To: users@jackrabbit.apache.org
Subject: Re: Bulk load of several nodes

On 16.01.12 13:51, "Lukas Kahwe Smith" <ml...@pooteeweet.org> wrote:
>check out http://java.net/jira/browse/JSR_333-38
>
>the necessary changes have already been applied in Jackrabbit 2.3.x

I don't know if they actually map to a similar batch method on the
persistence manager (= database access in the OP's case). In any case, it
does not change the situation for JCR queries - the nodes will already be
fetched (individually) when you ask for a row or its path or its node.

The above method was probably introduced for remote clients, such as
JCR-RMI or SPI/dav, which are all way above the persistence manager layer.

Cheers,
Alex

-- 
Alexander Klimetschek
Developer // Adobe (Day) // Berlin - Basel

Re: Bulk load of several nodes

Posted by Kevin Müller <Ke...@cib.de>.

That's what I was aiming for, although from the description it's still unclear to me whether:
1. the new method really leads to fewer database accesses (one would expect so)
2. it is used by QueryIterator without extra work for the API user
3. it's possible to construct the path array from the Rows of a query without DB roundtrips

But I might just try that out, thanks for the hint!

Regards, Kevin

-----Original Message-----
From: Lukas Kahwe Smith [mailto:mls@pooteeweet.org]
Sent: Monday, 16 January 2012 13:51
To: users@jackrabbit.apache.org
Subject: Re: Bulk load of several nodes


On Jan 16, 2012, at 13:46 , Kevin Müller wrote:

> Thanks for your answer Alex.
> 
> "No, there will be one query. This acts against the lucene search index.
> For each result (= row = node) based on the search index, the node will be
> loaded (= fetched from the persistence manager).
> That last step can be done lazily - i.e. only a number X of results is
> fetched at the beginning, the rest will be fetched when you iterate that
> far through the results (see resultFetchSize [0])."
> 
> The second step is what I was talking about: not the Lucene query, but the SQL query that fetches the actual data for each node (I'm using a DatabasePersistenceManager). One separate query is executed for each result node.
> 
> "Now for the nodes themselves: if you use a bundle persistence manager, nodes
> are stored as "node bundles", which consist of all properties (except for
> larger binaries in the data store). Thus if a node is fetched from the
> persistence manager, it will already have all properties in memory."
> 
> Yes, that means I can get all properties of ONE node in ONE database query, but I still can't get the properties of N nodes in ONE query. If we communicate with a non-local database and get, say, 100 search results, this could be quite a speedup, I imagine ...


check out http://java.net/jira/browse/JSR_333-38

the necessary changes have already been applied in Jackrabbit 2.3.x

regards,
Lukas Kahwe Smith
mls@pooteeweet.org

Re: Bulk load of several nodes

Posted by Alexander Klimetschek <ak...@adobe.com>.

On 16.01.12 13:51, "Lukas Kahwe Smith" <ml...@pooteeweet.org> wrote:
>check out http://java.net/jira/browse/JSR_333-38
>
>the necessary changes have already been applied in Jackrabbit 2.3.x

I don't know if they actually map to a similar batch method on the
persistence manager (= database access in the OP's case). In any case, it
does not change the situation for JCR queries - the nodes will already be
fetched (individually) when you ask for a row or its path or its node.

The above method was probably introduced for remote clients, such as
JCR-RMI or SPI/dav, which are all way above the persistence manager layer.

Cheers,
Alex

-- 
Alexander Klimetschek
Developer // Adobe (Day) // Berlin - Basel

Re: Bulk load of several nodes

Posted by Lukas Kahwe Smith <ml...@pooteeweet.org>.

On Jan 16, 2012, at 13:46 , Kevin Müller wrote:

> Thanks for your answer Alex.
> 
> "No, there will be one query. This acts against the lucene search index.
> For each result (= row = node) based on the search index, the node will be
> loaded (= fetched from the persistence manager).
> That last step can be done lazily - i.e. only a number X of results is
> fetched at the beginning, the rest will be fetched when you iterate that
> far through the results (see resultFetchSize [0])."
> 
> The second step is what I was talking about: not the Lucene query, but the SQL query that fetches the actual data for each node (I'm using a DatabasePersistenceManager). One separate query is executed for each result node.
> 
> "Now for the nodes themselves: if you use a bundle persistence manager, nodes
> are stored as "node bundles", which consist of all properties (except for
> larger binaries in the data store). Thus if a node is fetched from the
> persistence manager, it will already have all properties in memory."
> 
> Yes, that means I can get all properties of ONE node in ONE database query, but I still can't get the properties of N nodes in ONE query. If we communicate with a non-local database and get, say, 100 search results, this could be quite a speedup, I imagine ...


check out http://java.net/jira/browse/JSR_333-38

the necessary changes have already been applied in Jackrabbit 2.3.x

regards,
Lukas Kahwe Smith
mls@pooteeweet.org

Re: Bulk load of several nodes

Posted by Alexander Klimetschek <ak...@adobe.com>.

On 16.01.12 13:46, "Kevin Müller" <Ke...@cib.de> wrote:
>"No, there will be one query. This acts against the lucene search index.
>For each result (= row = node) based on the search index, the node will be
>loaded (= fetched from the persistence manager).
>That last step can be done lazily - i.e. only a number X of results is
>fetched at the beginning, the rest will be fetched when you iterate that
>far through the results (see resultFetchSize [0])."
>
>The second step is what I was talking about, not the Lucene query but the
>SQL query that fetches the actual data for each node (I'm using a
>DatabasePersistenceManager). One separate query is executed for each
>result node.

Ah, I see. JCR query vs. database PM query :-). Well, that's the
architecture of Jackrabbit. Some notes:

- make sure to use a BundleDbPersistenceManager [0] (the name
DatabasePersistenceManager does not sound like you do) - bundle
persistence managers are highly recommended performance-wise, as I
mentioned before.
- Jackrabbit will cache node bundles (after they have been loaded from the
PM). This can be configured [1] and might be more important than a bulk
query - if the db has a good index, individually fetching the nodes might
be fast already (but I don't know exactly).

[0] http://wiki.apache.org/jackrabbit/PersistenceManagerFAQ
[1] http://wiki.apache.org/jackrabbit/CacheManager
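
As a rough illustration of why a bundle cache can matter more than a bulk query, the following sketch puts a small LRU cache in front of a loader. BundleCacheSketch and its entry-count limit are assumptions made for this example; Jackrabbit's own cache is configured as described in [1] and is not implemented this way.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical LRU cache in front of a slow loader (e.g. a database read).
// Not a Jackrabbit class; the entry-count bound is an assumption of this sketch.
class BundleCacheSketch<K, V> {
    private final Function<K, V> loader;
    private final Map<K, V> cache;
    private int loads = 0;

    BundleCacheSketch(final int maxEntries, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder = true keeps the least recently used entry first.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Returns the cached value, or loads and caches it on a miss.
    V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            loads++;
            value = loader.apply(key);
            cache.put(key, value);
        }
        return value;
    }

    // Number of times the underlying loader was actually called.
    int loads() {
        return loads;
    }
}
```

Repeated reads of the same paths only reach the loader once while the entries stay cached, so a query whose hits were loaded recently causes no database access at all.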

HTH,
Alex

-- 
Alexander Klimetschek
Developer // Adobe (Day) // Berlin - Basel

Re: Bulk load of several nodes

Posted by Kevin Müller <Ke...@cib.de>.

Thanks for your answer Alex.

"No, there will be one query. This acts against the lucene search index.
For each result (= row = node) based on the search index, the node will be
loaded (= fetched from the persistence manager).
That last step can be done lazily - i.e. only a number X of results is
fetched at the beginning, the rest will be fetched when you iterate that
far through the results (see resultFetchSize [0])."

The second step is what I was talking about: not the Lucene query, but the SQL query that fetches the actual data for each node (I'm using a DatabasePersistenceManager). One separate query is executed for each result node.

"Now for the nodes themselves: if you use a bundle persistence manager, nodes
are stored as "node bundles", which consist of all properties (except for
larger binaries in the data store). Thus if a node is fetched from the
persistence manager, it will already have all properties in memory."

Yes, that means I can get all properties of ONE node in ONE database query, but I still can't get the properties of N nodes in ONE query. If we communicate with a non-local database and get, say, 100 search results, this could be quite a speedup, I imagine ...

-----Original Message-----
From: Alexander Klimetschek [mailto:aklimets@adobe.com]
Sent: Monday, 16 January 2012 13:25
To: users@jackrabbit.apache.org
Subject: Re: Bulk load of several nodes

On 13.01.12 11:01, "Kevin Müller" <Ke...@cib.de> wrote:

>Hi,
>
>I think in some cases it would be useful to get all or some properties of
>a set of nodes in one database query.
>Can somebody tell me if something like this is planned for future
>releases?
>
>One example for this use case would be:
>for (RowIterator it =
>qm.createQuery("//*[@prop='some_value']/(@prop2|@prop3)",
>Query.XPATH).execute().getRows(); it.hasNext(); ) {
>        Row row = it.nextRow();
>        Map map = new HashMap();
>        for (String key : Arrays.asList("prop2", "prop3")) {
>                Value val = row.getValue(key);
>                map.put(key, val != null ? val.getString() : null);
>        }
>        res.put(row.getPath(), map);
>}
>
>Wouldn't it be nice if this could be done with one database roundtrip -
>right now (2.2.10) there are at least n roundtrips (n == number of
>results in query) it seems to me.

No, there will be one query. This acts against the Lucene search index.
For each result (= row = node) based on the search index, the node will be
loaded (= fetched from the persistence manager). This needs to be done not
only for returning the node, but also for checking ACLs (i.e. if it can be
put in the result, because the user has read access). Note that the search
results do not store the results in any way other than using the plain JCR
nodes - the Row interface is just a wrapper around the Node in
Jackrabbit's search implementation.

That last step can be done lazily - i.e. only a number X of results is
fetched at the beginning, the rest will be fetched when you iterate that
far through the results (see resultFetchSize [0]).

Now for the nodes themselves: if you use a bundle persistence manager, nodes
are stored as "node bundles", which consist of all properties (except for
larger binaries in the data store). Thus if a node is fetched from the
persistence manager, it will already have all properties in memory.

[0] http://wiki.apache.org/jackrabbit/Search

HTH,
Alex

-- 
Alexander Klimetschek
Developer // Adobe (Day) // Berlin - Basel

Re: Bulk load of several nodes

Posted by Alexander Klimetschek <ak...@adobe.com>.

On 13.01.12 11:01, "Kevin Müller" <Ke...@cib.de> wrote:

>Hi,
>
>I think in some cases it would be useful to get all or some properties of
>a set of nodes in one database query.
>Can somebody tell me if something like this is planned for future
>releases?
>
>One example for this use case would be:
>for (RowIterator it =
>qm.createQuery("//*[@prop='some_value']/(@prop2|@prop3)",
>Query.XPATH).execute().getRows(); it.hasNext(); ) {
>        Row row = it.nextRow();
>        Map map = new HashMap();
>        for (String key : Arrays.asList("prop2", "prop3")) {
>                Value val = row.getValue(key);
>                map.put(key, val != null ? val.getString() : null);
>        }
>        res.put(row.getPath(), map);
>}
>
>Wouldn't it be nice if this could be done with one database roundtrip -
>right now (2.2.10) there are at least n roundtrips (n == number of
>results in query) it seems to me.

No, there will be one query. This acts against the Lucene search index.
For each result (= row = node) based on the search index, the node will be
loaded (= fetched from the persistence manager). This needs to be done not
only for returning the node, but also for checking ACLs (i.e. if it can be
put in the result, because the user has read access). Note that the search
results do not store the results in any way other than using the plain JCR
nodes - the Row interface is just a wrapper around the Node in
Jackrabbit's search implementation.

That last step can be done lazily - i.e. only a number X of results is
fetched at the beginning, the rest will be fetched when you iterate that
far through the results (see resultFetchSize [0]).

Now for the nodes themselves: if you use a bundle persistence manager, nodes
are stored as "node bundles", which consist of all properties (except for
larger binaries in the data store). Thus if a node is fetched from the
persistence manager, it will already have all properties in memory.

[0] http://wiki.apache.org/jackrabbit/Search
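
The lazy fetching described above can be pictured with the following sketch. LazyResultSketch is a hypothetical class written for this illustration, not Jackrabbit code; in particular, loading a further fetchSize hits each time the iterator runs out is an assumption, since the mail only says that the rest is fetched once you iterate that far.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical result iterator: the index delivers all hit paths up front,
// but the "nodes" are only loaded in batches while iterating.
class LazyResultSketch implements Iterator<String> {
    private final List<String> hits;               // paths found by the index
    private final int fetchSize;                   // batch size (assumed name)
    private final List<String> loaded = new ArrayList<>();
    private int position = 0;

    LazyResultSketch(List<String> hits, int fetchSize) {
        this.hits = hits;
        this.fetchSize = fetchSize;
        fetchMore(); // only the first batch is loaded at the beginning
    }

    @Override
    public boolean hasNext() {
        return position < hits.size();
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        if (position >= loaded.size()) {
            fetchMore(); // iteration reached the end of what was loaded
        }
        return loaded.get(position++);
    }

    // Simulates loading the next batch of nodes from the persistence manager.
    private void fetchMore() {
        int end = Math.min(hits.size(), loaded.size() + fetchSize);
        for (int i = loaded.size(); i < end; i++) {
            loaded.add("node:" + hits.get(i));
        }
    }

    // How many nodes have been loaded so far.
    int loadedCount() {
        return loaded.size();
    }
}
```

A caller that reads only the first few rows therefore pays for at most one batch of node loads, regardless of how many hits the index returned.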

HTH,
Alex

-- 
Alexander Klimetschek
Developer // Adobe (Day) // Berlin - Basel