Posted to dev@jena.apache.org by "Stephen Allen (JIRA)" <ji...@apache.org> on 2015/08/06 01:33:05 UTC
[jira] [Commented] (JENA-999) Poor jena-text query performance when a bound subject is used

    [ https://issues.apache.org/jira/browse/JENA-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14659197#comment-14659197 ] 

Stephen Allen commented on JENA-999:
------------------------------------

I'm having some difficulties, which may stem from my understanding of how property functions work.  The code I recently committed [1] works well for some queries but fails for others.  It seems that some queries build a new property function object for every iteration of the LHS of the query, so even with the code attempting to cache the Lucene results, it ends up issuing many Lucene queries.

I thought I could work around this by checking in the {{build()}} method whether or not the search term was bound [2] at that point, then doing the caching there instead of in the {{exec()}} method.  But since a new property function instance is constructed on each query iteration instead of a single one at plan construction time, that doesn't seem to work either.
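
A minimal sketch of the effect, assuming nothing about Jena's actual classes: {{searchIndex}}, {{PerInstanceCaching}} and {{SharedCaching}} are hypothetical stand-ins, and the "index" is a hard-coded map rather than Lucene.  It only illustrates why a cache held in an instance field does not help when a fresh instance is built per iteration, while a cache owned by something longer-lived (such as a per-execution context) does.

```java
import java.util.HashMap;
import java.util.Map;

public class TextSearchCacheSketch {
    /** Counts how many times the simulated index was queried. */
    static int indexQueries = 0;

    /** Hypothetical stand-in for a Lucene search: subject URI -> score. */
    static Map<String, Float> searchIndex(String queryString) {
        indexQueries++;
        Map<String, Float> hits = new HashMap<>();
        hits.put("urn:test1", 1.0f);
        hits.put("urn:test2", 0.5f);
        return hits;
    }

    /** Caches hits in an instance field; only useful if the instance is reused. */
    static class PerInstanceCaching {
        private Map<String, Float> hits;

        boolean matches(String queryString, String subject) {
            if (hits == null) {
                hits = searchIndex(queryString);
            }
            return hits.containsKey(subject); // hash lookup for the bound subject
        }
    }

    /** Caches hits in a map supplied by the caller, keyed by the query string. */
    static class SharedCaching {
        private final Map<String, Map<String, Float>> cache;

        SharedCaching(Map<String, Map<String, Float>> cache) {
            this.cache = cache;
        }

        boolean matches(String queryString, String subject) {
            return cache.computeIfAbsent(queryString, TextSearchCacheSketch::searchIndex)
                        .containsKey(subject);
        }
    }

    /** Builds a new instance per bound subject, as described above. */
    static int runPerInstance(String[] subjects) {
        indexQueries = 0;
        for (String s : subjects) {
            new PerInstanceCaching().matches("test", s);
        }
        return indexQueries;
    }

    /** Also builds a new instance per subject, but the cache outlives the instances. */
    static int runShared(String[] subjects) {
        indexQueries = 0;
        Map<String, Map<String, Float>> context = new HashMap<>();
        for (String s : subjects) {
            new SharedCaching(context).matches("test", s);
        }
        return indexQueries;
    }

    public static void main(String[] args) {
        String[] subjects = {"urn:test1", "urn:test2", "urn:test3"};
        System.out.println("per-instance cache: " + runPerInstance(subjects) + " index queries"); // 3
        System.out.println("shared cache: " + runShared(subjects) + " index queries");           // 1
    }
}
```

With three bound subjects the per-instance variant queries the index three times and the shared variant once, which is the difference I am seeing between queries that rebuild the property function and those that do not.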

Is this the intended behavior for property functions?  If it is, then I may have to fall back to caching the results in the Context object, with all of the downsides outlined in the issue description.

I've attached a test program that shows this behavior.  It runs well unless it is inside of a union clause.

[1] https://github.com/apache/jena/blob/master/jena-text/src/main/java/org/apache/jena/query/text/TextQueryPF.java
[2] The current code has the "feature" of allowing the search term to be bound by the query itself.  I don't know how useful this is in the real world.



> Poor jena-text query performance when a bound subject is used
> -------------------------------------------------------------
>
>                 Key: JENA-999
>                 URL: https://issues.apache.org/jira/browse/JENA-999
>             Project: Apache Jena
>          Issue Type: Improvement
>            Reporter: Stephen Allen
>            Assignee: Stephen Allen
>            Priority: Minor
>         Attachments: jena-text benchmarks.png
>
>
> When executing a jena-text query, the performance is terrible if the subject is already bound to a variable.  This is because, on every iteration, the current code executes a new Lucene query that does not have the subject/entity bound and then iterates through the Lucene results to join against the subject.  This is quite inefficient.
> Example query:
> {code}
> select *
> where {
>   ?s rdf:type <http://example.org/Entity> .
>   ?s text:query ( rdfs:label "test" ) .
> }
> {code}
> This would be quite slow if there were a lot of entities in the system.
> Two potential solutions present themselves:
> # Craft a more explicit Lucene query that specifies the entity URI, so that the results coming back from Lucene are much smaller.  However, this would cause problems with the score not being correct across multiple iterations.  Additionally, we are still potentially running a lot of Lucene queries, each of which probably has a non-negligible constant cost (parsing the query string, etc).
> # Execute the more general Lucene query the first time it is encountered, then cache the results somewhere.  From there, we can perform a hash table lookup against those cached results.
> I would like to pursue option 2, but there is a problem.  Because jena-text is implemented as a property function instead of a query op in and of itself (like QueryIterMinus is for example), we have to find a place to stash the lucene results.  I believe this can be done by placing it in the ExecutionContext object, using the lucene query as a cache key.  Updates provide a slightly troubling case because you could have an update request like:
> {code}
> insert data { <urn:test1> rdf:type <http://example.org/Entity> ; rdfs:label "test" } ;
> delete { ?s ?p ?o }
> where { ?s rdf:type <http://example.org/Entity> ; text:query ( rdfs:label "test" ) ; ?p ?o . } ;
> insert data { <urn:test2> rdf:type <http://example.org/Entity> ; rdfs:label "test" } ;
> delete { ?s ?p ?o }
> where { ?s rdf:type <http://example.org/Entity> ; text:query ( rdfs:label "test" ) ; ?p ?o . }
> {code}
> The end result should be an empty database.  But if the ExecutionContext were the same for both delete queries, you would be using the cached results from the first delete query in the second delete query, which would result in {{<urn:test2>}} not being deleted properly.
> If the ExecutionContext is indeed shared between the two update queries in the situation above, I think this can be solved by making the cache key for the Lucene result set a combination of both the Lucene query and the QueryIterRoot or BindingRoot.  I need to investigate this.  Alternatively, if there were a way to be notified when a query has finished executing, we could clear the cache in the ExecutionContext at that point.
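
The composite cache key idea from the description can be sketched as follows.  This is only an illustration under stated assumptions: {{CacheKey}} and {{ScopedCacheKeySketch}} are hypothetical names, plain {{Object}} instances stand in for whatever per-operation root (QueryIterRoot or BindingRoot) turns out to be usable, the "index" is an in-memory set, and whether Jena really provides a distinct root per update operation is exactly the part that still needs investigating.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ScopedCacheKeySketch {
    /** Cache key: the query string plus the identity of a per-operation root object. */
    static final class CacheKey {
        final String queryString;
        final Object executionRoot; // compared by identity, not equals()

        CacheKey(String queryString, Object executionRoot) {
            this.queryString = queryString;
            this.executionRoot = executionRoot;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof CacheKey)) {
                return false;
            }
            CacheKey other = (CacheKey) o;
            return queryString.equals(other.queryString) && executionRoot == other.executionRoot;
        }

        @Override
        public int hashCode() {
            return 31 * queryString.hashCode() + System.identityHashCode(executionRoot);
        }
    }

    // Simulated text index: every inserted subject is assumed to have the label "test",
    // so the query string is used only as part of the cache key.
    private final Set<String> index = new HashSet<>();
    // Cache shared across operations, like a shared ExecutionContext would be.
    private final Map<CacheKey, Set<String>> cache = new HashMap<>();

    void insert(String subject) {
        index.add(subject);
    }

    /** Returns cached hits for (query, root), snapshotting the index on first use. */
    Set<String> search(String queryString, Object executionRoot) {
        return cache.computeIfAbsent(new CacheKey(queryString, executionRoot),
                                     k -> new HashSet<>(index));
    }

    void deleteMatches(String queryString, Object executionRoot) {
        index.removeAll(search(queryString, executionRoot));
    }

    int size() {
        return index.size();
    }

    public static void main(String[] args) {
        ScopedCacheKeySketch db = new ScopedCacheKeySketch();
        Object firstDelete = new Object();
        Object secondDelete = new Object();
        db.insert("urn:test1");
        db.deleteMatches("test", firstDelete);
        db.insert("urn:test2");
        db.deleteMatches("test", secondDelete);
        System.out.println("remaining entities: " + db.size()); // 0
    }
}
```

If both deletes were passed the same root object, the second delete would reuse the first delete's cached hits and leave {{<urn:test2>}} behind, which is the stale-cache case described above.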


