Posted to users@jackrabbit.apache.org by Nils Breunese <N....@vpro.nl> on 2016/11/04 16:50:35 UTC

Performance issues with QueryResultImpl with larger offset values

Hello,

I just joined this mailing list and this is my first post.

We are having some performance issues and believe some of them can be traced to Jackrabbit's org.apache.jackrabbit.core.query.lucene.QueryResultImpl class. We have updated to Jackrabbit 2.10.3 to be able to enable the 'sizeEstimate' option [0] and got some performance improvement out of that, but we still have an issue with queries with large offset values. An offset of 12000 causes QueryResultImpl to build an offsetNodes list with 12000 entries, which, when sizeEstimate is enabled, is discarded immediately afterwards. We'd love to see the performance difference with an implementation that just does skip(offset) before getting the resultNodes from the query hits. Would it make sense to have that as the default implementation, at least with sizeEstimate enabled? Should I create a JIRA issue for this?
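
To make the difference concrete, here is a minimal, self-contained sketch of the two strategies. It is not Jackrabbit code: the class name OffsetSketch, the methods collectThenDiscard and skipThenCollect, and the integer "hits" are all assumptions made for illustration. It only models the allocation pattern described above and performs no access checks.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.IntStream;

public class OffsetSketch {

    // Size of the throwaway list built by the last collectThenDiscard call.
    public static int discardedEntries = 0;

    // Stand-in for query hits: an iterator over n integer "node ids" (hypothetical).
    public static Iterator<Integer> hits(int n) {
        return IntStream.range(0, n).iterator();
    }

    // Models the behavior described above: the first `offset` hits are stored
    // in a list that is never used for the returned page.
    public static List<Integer> collectThenDiscard(Iterator<Integer> hits, int offset, int limit) {
        List<Integer> offsetNodes = new ArrayList<>();
        while (offsetNodes.size() < offset && hits.hasNext()) {
            offsetNodes.add(hits.next());
        }
        discardedEntries = offsetNodes.size();
        return take(hits, limit);
    }

    // Models the proposed behavior: advance past `offset` hits without storing them.
    public static List<Integer> skipThenCollect(Iterator<Integer> hits, int offset, int limit) {
        for (int skipped = 0; skipped < offset && hits.hasNext(); skipped++) {
            hits.next();
        }
        return take(hits, limit);
    }

    // Collects up to `limit` of the remaining hits into the result page.
    private static List<Integer> take(Iterator<Integer> hits, int limit) {
        List<Integer> resultNodes = new ArrayList<>();
        while (resultNodes.size() < limit && hits.hasNext()) {
            resultNodes.add(hits.next());
        }
        return resultNodes;
    }

    public static void main(String[] args) {
        List<Integer> a = collectThenDiscard(hits(20000), 12000, 3);
        List<Integer> b = skipThenCollect(hits(20000), 12000, 3);
        System.out.println(a + " " + b + " discarded=" + discardedEntries);
    }
}
```

Both methods return the same page; the first one just allocates a list of `offset` entries along the way. Whether that allocation is what dominates the query time in a real repository is something only a measurement can tell.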

I have created a gist [1] which demonstrates this issue. You'll need to set a breakpoint to see it, though, since the issue is in the internal state of the QueryResultImpl class.

I'd like to modify the QueryResultImpl class to see if there is indeed a big performance gain to be had there for us. The use of QueryResultImpl seems to be buried pretty deeply, though. QueryResultImpl is an abstract class with two concrete subclasses: org.apache.jackrabbit.core.query.lucene.SingleColumnQueryResult and org.apache.jackrabbit.core.query.lucene.MultiColumnQueryResult, of which only the first seems to be used, and it is explicitly created by org.apache.jackrabbit.core.query.lucene.QueryImpl#execute. So the use of SingleColumnQueryResult, and therefore also of its parent class QueryResultImpl, is hardcoded in org.apache.jackrabbit.core.query.lucene.QueryImpl.

Instances of org.apache.jackrabbit.core.query.lucene.QueryImpl (which implements ExecutableQuery) are created by SearchIndex#createExecutableQuery. SearchIndex is the only implementation of the QueryHandler interface shipping in Jackrabbit, and the handler is a member of the SearchManager class. The SearchManager constructor gets its handler by calling QueryHandlerFactory#getQueryHandler. The two implementations of QueryHandlerFactory are WorkspaceConfig and RepositoryConfig, both of which have a QueryHandlerFactory as a member...?!

Our workspace and repository XML files currently have org.apache.jackrabbit.core.query.lucene.SearchIndex configured as the SearchIndex, with org.apache.jackrabbit.core.query.QueryImpl as the queryClass. So, we'd have to change the SearchIndex in the configuration to a class which doesn't create instances of org.apache.jackrabbit.core.query.lucene.QueryImpl, because those create instances of SingleColumnQueryResult, which are QueryResultImpl implementations. That's a lot of classes to redo for just this one change in QueryResultImpl.

If we'd like to quickly evaluate the performance difference for our setup, is our best bet to either put a patched QueryResultImpl on the classpath or make a custom Jackrabbit build?

Thanks, Nils.

[0] https://issues.apache.org/jira/browse/JCR-3858
[1] https://gist.github.com/breun/7d2072b3b6ae8c2a66e3057a603ebcdc

Re: Performance issues with QueryResultImpl with larger offset values

Posted by Nils Breunese <N....@vpro.nl>.
> On 11 Nov 2016, at 13:21, Ard Schrijvers <a....@onehippo.com> wrote:
> 
> On Fri, Nov 4, 2016 at 5:50 PM, Nils Breunese <N....@vpro.nl> wrote:
>> Hello,
>> 
>> I just joined this mailing list and this is my first post.
>> 
>> We are having some performance issues and believe some of them can be traced into Jackrabbit's org.apache.jackrabbit.core.query.lucene.QueryResultImpl class. We have updated to Jackrabbit 2.10.3 to be able to enable the 'sizeEstimate' option [0] and got some performance improvement out of that, but we still have an issue with queries with large offset values. An offset of 12000 causes QueryResultImpl to build an offsetNodes list with 12000 entries, which when using sizeEstimate is immediately discarded afterwards. We'd love to see the performance difference with an implementation that just does skip(offset) before getting the resultNodes from the query hits.
> 
> I've already replied in the JIRA issue as well, but the line above is
> exactly where the reasoning fails: it would bypass authorization
> completely (a search hit from Lucene does not mean the current JCR
> session has read access to the node).

In our use case the current JCR session always has read access to the node, but I understand this is not generally true. The performance difference is so dramatic that for now I guess we'll have to use a patched QueryResultImpl on our end.
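
To illustrate the trade-off, here is a minimal, self-contained sketch. It is not Jackrabbit code: blindSkip, accessAwareSkip and the canRead predicate are hypothetical names, and integer ids stand in for search hits. It assumes the returned page itself is still access-checked in both variants, so it only shows how the page shifts when unreadable hits are counted towards the offset; a variant that skipped the check on the page as well would additionally return nodes the session cannot read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntPredicate;

public class AccessAwareSkipSketch {

    // Blind skip: drops the first `offset` raw hits without asking whether the
    // session can read them, then fills the page with readable hits.
    public static List<Integer> blindSkip(List<Integer> hits, IntPredicate canRead,
                                          int offset, int limit) {
        List<Integer> page = new ArrayList<>();
        for (int i = offset; i < hits.size() && page.size() < limit; i++) {
            int id = hits.get(i);
            if (canRead.test(id)) {
                page.add(id);
            }
        }
        return page;
    }

    // Access-aware skip: only hits the session can read count towards the offset.
    public static List<Integer> accessAwareSkip(List<Integer> hits, IntPredicate canRead,
                                                int offset, int limit) {
        List<Integer> page = new ArrayList<>();
        int skipped = 0;
        for (int id : hits) {
            if (!canRead.test(id)) {
                continue; // unreadable hits are ignored entirely
            }
            if (skipped < offset) {
                skipped++;
                continue;
            }
            if (page.size() == limit) {
                break;
            }
            page.add(id);
        }
        return page;
    }

    public static void main(String[] args) {
        List<Integer> hits = Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
        IntPredicate everything = id -> true;      // our use case: all hits readable
        IntPredicate oddOnly = id -> id % 2 == 1;  // a session with restricted access

        System.out.println(blindSkip(hits, everything, 2, 2));
        System.out.println(accessAwareSkip(hits, everything, 2, 2));
        System.out.println(blindSkip(hits, oddOnly, 2, 2));
        System.out.println(accessAwareSkip(hits, oddOnly, 2, 2));
    }
}
```

When every hit is readable, as in our setup, both variants return the same page. As soon as some hits are unreadable, the blind variant returns a different page for the same offset, which is why it cannot simply become the general default.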

Nils.

Re: Performance issues with QueryResultImpl with larger offset values

Posted by Ard Schrijvers <a....@onehippo.com>.
Hey,

On Fri, Nov 4, 2016 at 5:50 PM, Nils Breunese <N....@vpro.nl> wrote:
> Hello,
>
> I just joined this mailing list and this is my first post.
>
> We are having some performance issues and believe some of them can be traced into Jackrabbit's org.apache.jackrabbit.core.query.lucene.QueryResultImpl class. We have updated to Jackrabbit 2.10.3 to be able to enable the 'sizeEstimate' option [0] and got some performance improvement out of that, but we still have an issue with queries with large offset values. An offset of 12000 causes QueryResultImpl to build an offsetNodes list with 12000 entries, which when using sizeEstimate is immediately discarded afterwards. We'd love to see the performance difference with an implementation that just does skip(offset) before getting the resultNodes from the query hits.

I've already replied in the JIRA issue as well, but the line above is
exactly where the reasoning fails: it would bypass authorization
completely (a search hit from Lucene does not mean the current JCR
session has read access to the node).

HTH,

Regards Ard






Re: Performance issues with QueryResultImpl with larger offset values

Posted by Nils Breunese <N....@vpro.nl>.
I have created a patch which seems to make a big difference in performance for us (query times went from tens of seconds down to under 2 seconds) and created a JIRA issue for this: https://issues.apache.org/jira/browse/JCR-4057

Nils.
