Posted to users@jena.apache.org by Osma Suominen <os...@aalto.fi> on 2012/08/15 11:31:59 UTC
Re: LARQ prefix search results missing hits
Hi Paolo!
Thanks for your reply and sorry for the delay.
I tested this again with today's svn snapshot and it's still a problem.
However, after digging a bit further I found this in
jena-larq/src/main/java/org/apache/jena/larq/LARQ.java:
--clip--
// The number of results returned by default
public static final int NUM_RESULTS = 1000 ; // should we increase this? -- PC
--clip--
I changed NUM_RESULTS to 100000 (added two zeros), built and installed
my modified LARQ with mvn install (NB this required tweaking arq.ver and
tdb.ver in jena-larq/pom.xml to match the current svn versions), rebuilt
Fuseki and now the problem is gone!
I would suggest that this constant be increased to something larger than
1000. Based on the code comment, you seem to have had similar thoughts
sometime in the past :)
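
To illustrate why a larger cap changes the outcome, here is a minimal sketch in plain Java. It is a toy model, not LARQ or Lucene code: the class name, the sample labels and the "index order" scan are all assumptions made for illustration (a real Lucene search ranks hits by score rather than by position). The point is only that a cap applied before a later filter can drop hits the filter would have kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Toy model (not LARQ or Lucene code) of a hit cap applied before a later filter. */
final class TruncatedPrefixDemo {

    /** Returns at most maxHits labels starting with prefix, scanning in index order. */
    static List<String> indexLookup(List<String> index, String prefix, int maxHits) {
        List<String> hits = new ArrayList<>();
        for (String label : index) {
            if (hits.size() >= maxHits) break; // the cap silently drops the rest
            if (label.toLowerCase(Locale.ROOT).startsWith(prefix)) hits.add(label);
        }
        return hits;
    }

    /** Keeps only hits starting with the longer prefix, in the role of FILTER(REGEX(...)). */
    static List<String> filter(List<String> hits, String prefix) {
        List<String> kept = new ArrayList<>();
        for (String h : hits) {
            if (h.toLowerCase(Locale.ROOT).startsWith(prefix)) kept.add(h);
        }
        return kept;
    }

    /** Hypothetical index: many "ab..." labels followed by two "ar..." labels. */
    static List<String> sampleIndex() {
        List<String> index = new ArrayList<>();
        for (int i = 0; i < 2000; i++) index.add("Abstract term " + i);
        index.add("Arab");
        index.add("Archive");
        return index;
    }

    public static void main(String[] args) {
        List<String> index = sampleIndex();
        int cap = 1000; // mirrors the NUM_RESULTS value quoted above
        System.out.println(filter(indexLookup(index, "ar", cap), "ar").size());    // prints 2
        System.out.println(filter(indexLookup(index, "a", cap), "ar").size());     // prints 0
        System.out.println(filter(indexLookup(index, "a", 100000), "ar").size());  // prints 2
    }
}
```

In this model the shorter prefix fills the cap with other labels before "Arab" is reached, so the filter has nothing left to keep. Raising the cap restores the two hits, but only until the data outgrows the new cap.
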
Thanks,
Osma
15.07.2012 11:21, Paolo Castagna wrote:
> Hi Osma,
> first of all, thanks for sharing your experience and clearly describing
> your problem.
> Further comments inline.
>
> On 13/07/12 14:13, Osma Suominen wrote:
>> Hello!
>>
>> I'm trying to use a Fuseki SPARQL endpoint together with LARQ to
>> create a system for accessing SKOS thesauri. The user interface
>> includes an autocompletion widget. The idea is to use the LARQ index
>> to make fast prefix queries on the concept labels.
>>
>> However, I've noticed that in some situations I get fewer results from
>> the index than what I'd expect. This seems to happen when the LARQ
>> part of the query internally produces many hits, such as when doing a
>> single character prefix query (e.g. ?lit pf:textMatch 'a*').
>>
>> I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on 2012-07-10 and
>> LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the LARQ
>> dependency to pom.xml and running mvn package. Other than this issue,
>> Fuseki and LARQ queries seem to work fine. I'm using Ubuntu Linux
>> 12.04 LTS amd64 with OpenJDK 1.6.0_24 installed from the standard
>> Ubuntu packages.
>>
>>
>> Steps to repeat:
>>
>> 1. package Fuseki with LARQ, as described above
>>
>> 2. start Fuseki with the attached configuration file, i.e.
>> ./fuseki-server --config=larq-config.ttl
>>
>> 3. I'm using the STW thesaurus as an easily accessible example data
>> set (though the problem was originally found with other data sets):
>> - download http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
>> - unzip so you have stw.rdf
>>
>> 4. load the thesaurus file into the endpoint:
>> ./s-put http://localhost:3030/ds/data default stw.rdf
>>
>> 5. build the LARQ index, e.g. this way:
>> - kill Fuseki
>> - rm -r /tmp/lucene
>> - start Fuseki again, so the index will be built
>>
>> 6. Make SPARQL queries from the web interface at http://localhost:3030
>>
>> First try this SPARQL query:
>>
>> PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
>> PREFIX pf:<http://jena.hpl.hp.com/ARQ/property#>
>> SELECT DISTINCT * WHERE {
>> ?lit pf:textMatch "ar*" .
>> ?conc skos:prefLabel ?lit .
>> FILTER(REGEX(?lit, '^ar.*', 'i'))
>> } ORDER BY ?lit
>>
>> I get 120 hits, including "Arab"@en.
>>
>> Now try the same query, but change the pf:textMatch argument to "a*".
>> This way I get only 32 results, not including "Arab"@en, even though
>> the shorter prefix query should match a superset of what was matched
>> by the first query (the regex should still filter it down to the same
>> result set).
>>
>>
>> This issue is not just about single character prefix queries. With
>> enough data sets loaded into the same index, this happens with longer
>> prefix queries as well.
>>
>> I think that the problem might be related to Lucene's default
>> limitation of a maximum of 1024 clauses in boolean queries (and thus
>> prefix query matches), as described in the Lucene FAQ:
>> http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F
>>
>
> Yes, I think your hypothesis might be correct (I've not verified it yet).
>
>> In case this is the problem, is there any way to tell LARQ to use a
>> higher BooleanQuery.setMaxClauseCount() value so that this limit is
>> not triggered? I find it a bit disturbing that hits are silently being
>> lost. I couldn't see any special output in the Fuseki log.
>
> Not sure about this.
>
> Paolo
>
>>
>> Am I doing something wrong? If this is a genuine problem in LARQ, I
>> can of course make a bug report.
>>
>>
>> Thanks and best regards,
>> Osma Suominen
>>
>
>
--
Osma Suominen | Osma.Suominen@aalto.fi | +358 40 5255 882
Aalto University, Department of Media Technology, Semantic Computing
Research Group
Room 2541, Otaniementie 17, Espoo, Finland; P.O. Box 15500, FI-00076
Aalto, Finland
Re: LARQ prefix search results missing hits
Posted by Paolo Castagna <ca...@gmail.com>.
Hi Osma,
thanks for your help and feedback.
Does your problem go away without changing the code and using:
?lit pf:textMatch ( 'a*' 100000 )
Adding a couple of zeros is not a problem...
However, I think this would just shift the problem, wouldn't it?
Paolo
Re: LARQ prefix search results missing hits
Posted by Paolo Castagna <ca...@gmail.com>.
Hi Osma
On 20/08/12 11:10, Osma Suominen wrote:
> Hi Paolo!
>
> Thanks for your quick reply.
>
> 17.08.2012 20:16, Paolo Castagna wrote:
>> Does your problem go away without changing the code and using:
>> ?lit pf:textMatch ( 'a*' 100000 )
>
> I tested this but it didn't help. If I use a parameter less than 1000
> then I get even fewer hits, but values above 1000 don't have any effect.
Right.
> I think the problem is this line in IndexLARQ.java:
>
> TopDocs topDocs = searcher.search(query, (Filter)null, LARQ.NUM_RESULTS ) ;
>
> As you can see the parameter for maximum number of hits is taken
> directly from the NUM_RESULTS constant. The value specified in the query
> has no effect on this level.
Correct.
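
The observation above (limits below 1000 reduce the hits, limits above 1000 change nothing) is what one would expect if a fixed cap is applied in the search call and the per-query limit only afterwards. A minimal sketch of that arithmetic follows. It is a toy model, not LARQ code: the method names and the figure of 5000 matches are assumptions for illustration, and only the value 1000 comes from the thread.

```java
/** Toy model (not LARQ code) of a per-query limit applied after a fixed search cap. */
final class LimitClampDemo {
    static final int NUM_RESULTS = 1000; // value quoted in the thread

    /** Hits that survive when the search stage is capped first and the query limit second. */
    static int effectiveHits(int totalMatches, int requestedLimit) {
        int fromSearch = Math.min(totalMatches, NUM_RESULTS); // fixed cap in the search call
        return Math.min(fromSearch, requestedLimit);          // per-query limit applied later
    }

    /** Hypothetical alternative: hand the requested limit to the search stage itself. */
    static int hitsWithLimitPassedThrough(int totalMatches, int requestedLimit) {
        return Math.min(totalMatches, requestedLimit);
    }

    public static void main(String[] args) {
        int total = 5000; // assumed number of index matches for 'a*'
        System.out.println(effectiveHits(total, 500));                 // prints 500
        System.out.println(effectiveHits(total, 100000));              // prints 1000
        System.out.println(hitsWithLimitPassedThrough(total, 100000)); // prints 5000
    }
}
```
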
>> Adding a couple of zeros is not a problem...
>> However, I think this would just shift the problem, wouldn't it?
>
> You're right, it would just shift the problem but a sufficiently large
> value could be used that never caused problems in practice. Maybe you
> could consider NUM_RESULTS = Integer.MAX_VALUE ? :)
Many search use cases drive a UI for people, and
often only the first few results are necessary.
Try hitting 'next >>' repeatedly on Google: how many results can you get?
;-)
Anyway, I increased the NUM_RESULTS constant.
> Or maybe LARQ should use another variant of Lucene's
> IndexSearcher.search(), one which takes a Collector object instead of
> the integer n parameter. E.g. this:
> http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/core/org/apache/lucene/search/IndexSearcher.html#search%28org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20org.apache.lucene.search.Collector%29
Yes. That would be the thing to use if we want to retrieve all the
results from Lucene.
More thinking is necessary here...
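
The difference between the two search styles discussed above can be sketched in plain Java. This is a toy model and deliberately does not use the Lucene API: `searchTopN`, `searchAll` and the array of matching document ids are hypothetical stand-ins for a top-n search and a callback-based (collector-style) search.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

/** Toy model (not the Lucene API) contrasting a top-n search with a callback-based one. */
final class CollectAllDemo {

    /** Hypothetical posting list: ids of the documents matching a query. */
    static int[] matchingDocs(int count) {
        int[] docs = new int[count];
        for (int i = 0; i < count; i++) docs[i] = i;
        return docs;
    }

    /** Top-n style: the caller must guess n up front, and extra matches are dropped. */
    static List<Integer> searchTopN(int[] matches, int n) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < matches.length && i < n; i++) out.add(matches[i]);
        return out;
    }

    /** Collector style: every match is handed to the callback, so there is no n to guess. */
    static void searchAll(int[] matches, IntConsumer collector) {
        for (int doc : matches) collector.accept(doc);
    }

    public static void main(String[] args) {
        int[] matches = matchingDocs(2500);
        System.out.println(searchTopN(matches, 1000).size()); // prints 1000

        List<Integer> all = new ArrayList<>();
        searchAll(matches, doc -> all.add(doc));
        System.out.println(all.size());                        // prints 2500
    }
}
```

The trade-off is that a callback sees hits without any ranking or size bound, so the caller decides what to keep. That suits the "return everything" case but gives up the bounded cost of a top-n search.
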
In the meantime, you can find a LARQ SNAPSHOT here:
https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-larq/1.0.1-SNAPSHOT/
Paolo
>
>
> Thanks,
> Osma
>
>
Re: LARQ prefix search results missing hits
Posted by Paolo Castagna <ca...@gmail.com>.
Hi Osma
On 28/08/12 14:22, Osma Suominen wrote:
> Hi Paolo!
>
> Thanks a lot for the fix! I have tested the latest snapshot and it now
> works as expected. At least until I add lots of new data and hit the new
> limit :)
>
>
> You're of course right about the search use case. I think the problem
> here is that the LARQ index can be used for two very different use cases:
>
> A. Traditional IR, in which the user cares about only the first few
> results. Lucene is obviously very good at this, though full advantage
> (especially for non-English languages) of it can only be achieved by
> using specific Analyzer implementations, which appears not to be
> supported in LARQ, at least not without writing some Java code.
>
> B. Speeding up queries on literals for e.g. autocomplete search. While
> this can be done without a text index using FILTER(REGEX()), the queries
> tend to be quite slow, as the filter is applied only afterwards. In this
> case it is important that the text index returns all possible hits, not
> just the first ones.
>
> I have no idea which is the more important use case for LARQ, but I'm
> currently only interested in B because of the requirements of the
> application I'm building (ONKI Light, a SKOS vocabulary browser for
> SPARQL endpoints).
Do you have any ideas/proposals to make LARQ good for both these
use cases?
> Currently the benefits of LARQ (at least for the out-of-the-box
> configuration for Fuseki+LARQ) for both A and B are somewhat diminished
> by these limitations:
>
> 1. The index is global and contains data from all named graphs mixed up.
> This means that when you have many named graphs with different data (as
> I do), and try to query only one graph, the LARQ query part will still
> return hits from all the other graphs, slowing down later parts of the
> query.
Yep.
I thought about this a while ago, but I haven't actually tried to implement
it. The changes to the index are trivial. The most
difficult part is perhaps on the property function side, but
maybe that is easy as well.
I think this could be a good contribution, if you need it.
> 2. Similarly, the index does not allow filtering by language on the
> query level. With multilingual data, you cannot make a query matching
> e.g. only English labels but will get hits from all the other languages
> as well.
Yep.
I have no proposal for this, but I understand the user need.
> 3. The default implementation also doesn't store much context for the
> literal, meaning that you cannot restrict the search only to e.g.
> skos:prefLabel literal values in skos:Concept type resources. This will
> again increase the number of hits returned by the index internally.
I am not sure I follow this, or that I completely agree with you.
What you say is true, but LARQ provides a property function and you
can use it together with other triple patterns:
{
?l pf:textMatch '...' .
?s skos:prefLabel ?l .
?s rdf:type skos:Concept .
}
Now, we can argue about what a clever optimizer should/could do,
but from the user's point of view, this is quite good and
powerful, and it gets you what you want, doesn't it?
The syntax is very easy to remember and the property function
very easy to use.
The Lucene index can be kept quite simple and small.
> There may also be problems with prefix queries if you happen to hit the
> default BooleanQuery limit of 1024 clauses, but I haven't yet had this
> problem myself with LARQ. Another problem for use case B might be that
> the default Lucene StandardAnalyzer, which LARQ seems to use, filters
> common English stop words from the index and the query, which might
> interfere with the exact matching required for B.
Yep.
Any ideas/proposals?
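
The stop-word concern can be illustrated with a small sketch. It is a toy tokenizer in plain Java, not Lucene's StandardAnalyzer: the class name, the splitting rule and the short stop list are assumptions for illustration, and the list a real analyzer uses may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analyzer (not Lucene's StandardAnalyzer) showing how stop-word removal changes tokens. */
final class StopWordDemo {
    // Small illustrative stop list; the list a real analyzer uses may differ.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    /** Lower-cases the text, splits on non-alphanumerics and optionally drops stop words. */
    static List<String> analyze(String text, boolean removeStopWords) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) continue;
            if (removeStopWords && STOP_WORDS.contains(raw)) continue;
            tokens.add(raw);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Balance of Payments", true));  // prints [balance, payments]
        System.out.println(analyze("The Balance of Payments", false)); // prints [the, balance, of, payments]
    }
}
```

With stop words removed, a prefix lookup for "the" has no token to match for this label, even though the label itself starts with "The". That is the kind of mismatch that matters for use case B.
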
> To be fair, other SPARQL text index implementations are not that good
> for prefix searches either. Virtuoso [1] requires at least 4 character
> prefixes to be specified (this can be changed by recompiling). AFAICT
> the 4store text index [2] doesn't support prefix queries at all, as the
> index structure requires whole words to be used (though possibly some
> creative use of subqueries and FILTER(REGEX()) could be used to still
> get some benefit of the index).
It's good that you provide feedback; maybe with your help we can further
improve LARQ. :-)
Paolo
>
> Osma
>
> [1]
> http://docs.openlinksw.com/virtuoso/sparqlextensions.html#rdfsparqlrulefulltext
>
> [2] http://4store.org/trac/wiki/TextIndexing
>
Re: LARQ prefix search results missing hits
Posted by Paolo Castagna <ca...@gmail.com>.
Apologies, this was a mistake.
Paolo
On 10 September 2012 23:07, Paolo Castagna <ca...@gmail.com> wrote:
> Hi Osma
>
> On 28/08/12 14:22, Osma Suominen wrote:
>> Hi Paolo!
>>
>> Thanks a lot for the fix! I have tested the latest snapshot and it now
>> works as expected. At least until I add lots of new data and hit the new
>> limit :)
>>
>>
>> You're of course right about the search use case. I think the problem
>> here is that the LARQ index can be used for two very different use cases:
>>
>> A. Traditional IR, in which the user cares about only the first few
>> results. Lucene is obviously very good at this, though full advantage
>> (especially for non-English languages) of it can only be achieved by
>> using specific Analyzer implementations, which appears not to be
>> supported in LARQ, at least not without writing some Java code.
>>
>> B. Speeding up queries on literals for e.g. autocomplete search. While
>> this can be done without a text index using FILTER(REGEX()), the queries
>> tend to be quite slow, as the filter is applied only afterwards. In this
>> case it is important that the text index returns all possible hits, not
>> just the first ones.
>>
>> I have no idea which is the more important use case for LARQ, but I'm
>> currently only interested in B because of the requirements of the
>> application I'm building (ONKI Light, a SKOS vocabulary browser for
>> SPARQL endpoints).
>
> Do you have any idea/proposal to make LARQ be good for both these
> use cases?
>
>> Currently the benefits of LARQ (at least for the out-of-the-box
>> configuration for Fuseki+LARQ) for both A and B are somewhat diminished
>> by these limitations:
>>
>> 1. The index is global and contains data from all named graphs mixed up.
>> This means that when you have many named graphs with different data (as
>> I do), and try to query only one graph, the LARQ query part will still
>> return hits from all the other graphs, slowing down later parts of the
>> query.
>
> Yep.
>
> I though about this while ago, but I haven't actually tried to implement
> it. The changes to the index are trivial. The most
> difficult part perhaps is on the property function side, but
> maybe it's easy that as well.
>
> I think this could be a good contribution, if you need it.
>
>> 2. Similarly, the index does not allow filtering by language on the
>> query level. With multilingual data, you cannot make a query matching
>> e.g. only English labels but will get hits from all the other languages
>> as well.
>
> Yep.
>
> I have no proposal for this, but I understand the user need.
>
>> 3. The default implementation also doesn't store much context for the
>> literal, meaning that you cannot restrict the search only to e.g.
>> skos:prefLabel literal values in skos:Concept type resources. This will
>> again increase the number of hits returned by the index internally.
>
> I am not sure I follow this or I completely agree with you.
>
> What you say is true, but LARQ provides a property function and you
> can use it together with other triple patterns:
>
> {
> ?l pf:textMatch '...' .
> ?s skos:prefLabel ?l .
> ?s rdf:type skos:Concept .
> }
>
> Now, we can argue about what a clever optimizer should/could do,
> but from the user's point of view this is quite good and
> powerful and it gets you what you want, doesn't it?
>
> The syntax is very easy to remember and the property function
> very easy to use.
>
> The Lucene index can be kept quite simple and small.
>
>>
>> There may also be problems with prefix queries if you happen to hit the
>> default BooleanQuery limit of 1024 clauses, but I haven't yet had this
>> problem myself with LARQ. Another problem for use case B might be that
>> the default Lucene StandardAnalyzer, which LARQ seems to use, filters
>> common English stop words from the index and the query, which might
>> interfere with the exact matching required for B.
>>
>> To be fair, other SPARQL text index implementations are not that good
>> for prefix searches either. Virtuoso [1] requires at least 4 character
>> prefixes to be specified (this can be changed by recompiling). AFAICT
>> the 4store text index [2] doesn't support prefix queries at all, as the
>> index structure requires whole words to be used (though possibly some
>> creative use of subqueries and FILTER(REGEX()) could be used to still
>> get some benefit of the index).
>>
>> Osma
>>
>> [1]
>> http://docs.openlinksw.com/virtuoso/sparqlextensions.html#rdfsparqlrulefulltext
>>
>> [2] http://4store.org/trac/wiki/TextIndexing
>>
>> 26.08.2012 22:49, Paolo Castagna wrote:
>>> Hi Osma
>>>
>>> On 20/08/12 11:10, Osma Suominen wrote:
>>>> Hi Paolo!
>>>>
>>>> Thanks for your quick reply.
>>>>
>>>> 17.08.2012 20:16, Paolo Castagna wrote:
>>>>> Does your problem go away without changing the code and using:
>>>>> ?lit pf:textMatch ( 'a*' 100000 )
>>>>
>>>> I tested this but it didn't help. If I use a parameter less than 1000
>>>> then I get even fewer hits, but values above 1000 don't have any effect.
>>>
>>> Right.
>>>
>>>> I think the problem is this line in IndexLARQ.java:
>>>>
>>>> TopDocs topDocs = searcher.search(query, (Filter)null,
>>>> LARQ.NUM_RESULTS ) ;
>>>>
>>>> As you can see the parameter for maximum number of hits is taken
>>>> directly from the NUM_RESULTS constant. The value specified in the query
>>>> has no effect on this level.
>>>
>>> Correct.
>>>
>>>>> It's not a problem adding a couple of '0'...
>>>>> However, I am thinking that this would just shift the problem, wouldn't
>>>>> it?
>>>>
>>>> You're right, it would just shift the problem but a sufficiently large
>>>> value could be used that never caused problems in practice. Maybe you
>>>> could consider NUM_RESULTS = Integer.MAX_VALUE ? :)
>>>
>>> Many search use cases are about driving a UI for people, and
>>> often only the first few results are necessary.
>>>
>>> Try to keep hitting 'next >>' on Google: how many results can you get?
>>>
>>> ;-)
>>>
>>> Anyway, I increased the NUM_RESULTS constant.
>>>
>>>> Or maybe LARQ should use another variant of Lucene's
>>>> IndexSearcher.search(), one which takes a Collector object instead of
>>>> the integer n parameter. E.g. this:
>>>> http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/core/org/apache/lucene/search/IndexSearcher.html#search%28org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20org.apache.lucene.search.Collector%29
>>>>
>>>
>>> Yes. That would be the thing to use if we want to retrieve all the
>>> results from Lucene.
>>>
>>> More thinking is necessary here...
>>>
>>> In the meantime, you can find a LARQ SNAPSHOT here:
>>> https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-larq/1.0.1-SNAPSHOT/
>>>
>>>
>>> Paolo
>>>
>>>>
>>>>
>>>> Thanks,
>>>> Osma
>>>>
>>>>
>>>>> On 15/08/12 10:31, Osma Suominen wrote:
>>>>>> Hi Paolo!
>>>>>>
>>>>>> Thanks for your reply and sorry for the delay.
>>>>>>
>>>>>> I tested this again with today's svn snapshot and it's still a
>>>>>> problem.
>>>>>>
>>>>>> However, after digging a bit further I found this in
>>>>>> jena-larq/src/main/java/org/apache/jena/larq/LARQ.java:
>>>>>>
>>>>>> --clip--
>>>>>> // The number of results returned by default
>>>>>> public static final int NUM_RESULTS = 1000 ; // should we increase this? -- PC
>>>>>> --clip--
>>>>>>
>>>>>> I changed NUM_RESULTS to 100000 (added two zeros), built and installed
>>>>>> my modified LARQ with mvn install (NB this required tweaking arq.ver
>>>>>> and tdb.ver in jena-larq/pom.xml to match the current svn versions),
>>>>>> rebuilt Fuseki and now the problem is gone!
>>>>>>
>>>>>> I would suggest that this constant be increased to something larger
>>>>>> than 1000. Based on the code comment, you seem to have had similar
>>>>>> thoughts sometime in the past :)
>>>>>>
>>>>>> Thanks,
>>>>>> Osma
>>>>>>
>>>>>>
>>>>>> 15.07.2012 11:21, Paolo Castagna kirjoitti:
>>>>>>> Hi Osma,
>>>>>>> first of all, thanks for sharing your experience and clearly
>>>>>>> describing
>>>>>>> your problem.
>>>>>>> Further comments inline.
>>>>>>>
>>>>>>> On 13/07/12 14:13, Osma Suominen wrote:
>>>>>>>> Hello!
>>>>>>>>
>>>>>>>> I'm trying to use a Fuseki SPARQL endpoint together with LARQ to
>>>>>>>> create a system for accessing SKOS thesauri. The user interface
>>>>>>>> includes an autocompletion widget. The idea is to use the LARQ index
>>>>>>>> to make fast prefix queries on the concept labels.
>>>>>>>>
>>>>>>>> However, I've noticed that in some situations I get fewer results
>>>>>>>> from
>>>>>>>> the index than what I'd expect. This seems to happen when the LARQ
>>>>>>>> part of the query internally produces many hits, such as when
>>>>>>>> doing a
>>>>>>>> single character prefix query (e.g. ?lit pf:textMatch 'a*').
>>>>>>>>
>>>>>>>> I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on
>>>>>>>> 2012-07-10 and
>>>>>>>> LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the
>>>>>>>> LARQ
>>>>>>>> dependency to pom.xml and running mvn package. Other than this
>>>>>>>> issue,
>>>>>>>> Fuseki and LARQ queries seem to work fine. I'm using Ubuntu Linux
>>>>>>>> 12.04 LTS amd64 with OpenJDK 1.6.0_24 installed from the standard
>>>>>>>> Ubuntu packages.
>>>>>>>>
>>>>>>>>
>>>>>>>> Steps to repeat:
>>>>>>>>
>>>>>>>> 1. package Fuseki with LARQ, as described above
>>>>>>>>
>>>>>>>> 2. start Fuseki with the attached configuration file, i.e.
>>>>>>>> ./fuseki-server --config=larq-config.ttl
>>>>>>>>
>>>>>>>> 3. I'm using the STW thesaurus as an easily accessible example data
>>>>>>>> set (though the problem was originally found with other data sets):
>>>>>>>> - download
>>>>>>>> http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
>>>>>>>> - unzip so you have stw.rdf
>>>>>>>>
>>>>>>>> 4. load the thesaurus file into the endpoint:
>>>>>>>> ./s-put http://localhost:3030/ds/data default stw.rdf
>>>>>>>>
>>>>>>>> 6. build the LARQ index, e.g. this way:
>>>>>>>> - kill Fuseki
>>>>>>>> - rm -r /tmp/lucene
>>>>>>>> - start Fuseki again, so the index will be built
>>>>>>>>
>>>>>>>> 7. Make SPARQL queries from the web interface at
>>>>>>>> http://localhost:3030
>>>>>>>>
>>>>>>>> First try this SPARQL query:
>>>>>>>>
>>>>>>>> PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
>>>>>>>> PREFIX pf:<http://jena.hpl.hp.com/ARQ/property#>
>>>>>>>> SELECT DISTINCT * WHERE {
>>>>>>>> ?lit pf:textMatch "ar*" .
>>>>>>>> ?conc skos:prefLabel ?lit .
>>>>>>>> FILTER(REGEX(?lit, '^ar.*', 'i'))
>>>>>>>> } ORDER BY ?lit
>>>>>>>>
>>>>>>>> I get 120 hits, including "Arab"@en.
>>>>>>>>
>>>>>>>> Now try the same query, but change the pf:textMatch argument to
>>>>>>>> "a*".
>>>>>>>> This way I get only 32 results, not including "Arab"@en, even though
>>>>>>>> the shorter prefix query should match a superset of what was matched
>>>>>>>> by the first query (the regex should still filter it down to the
>>>>>>>> same
>>>>>>>> result set).
>>>>>>>>
>>>>>>>>
>>>>>>>> This issue is not just about single character prefix queries. With
>>>>>>>> enough data sets loaded into the same index, this happens with
>>>>>>>> longer
>>>>>>>> prefix queries as well.
>>>>>>>>
>>>>>>>> I think that the problem might be related to Lucene's default
>>>>>>>> limitation of a maximum of 1024 clauses in boolean queries (and thus
>>>>>>>> prefix query matches), as described in the Lucene FAQ:
>>>>>>>> http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> Yes, I think your hypothesis might be correct (I've not verified it
>>>>>>> yet).
>>>>>>>
>>>>>>>> In case this is the problem, is there any way to tell LARQ to use a
>>>>>>>> higher BooleanQuery.setMaxClauseCount() value so that this limit is
>>>>>>>> not triggered? I find it a bit disturbing that hits are silently
>>>>>>>> being
>>>>>>>> lost. I couldn't see any special output on the Fuseki log.
>>>>>>>
>>>>>>> Not sure about this.
>>>>>>>
>>>>>>> Paolo
>>>>>>>
>>>>>>>>
>>>>>>>> Am I doing something wrong? If this is a genuine problem in LARQ, I
>>>>>>>> can of course make a bug report.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks and best regards,
>>>>>>>> Osma Suominen
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
Re: LARQ prefix search results missing hits
Posted by Paolo Castagna <ca...@gmail.com>.
Hi Osma
On 28/08/12 14:22, Osma Suominen wrote:
> Hi Paolo!
>
> Thanks a lot for the fix! I have tested the latest snapshot and it now
> works as expected. At least until I add lots of new data and hit the new
> limit :)
>
>
> You're of course right about the search use case. I think the problem
> here is that the LARQ index can be used for two very different use cases:
>
> A. Traditional IR, in which the user cares about only the first few
> results. Lucene is obviously very good at this, though full advantage
> (especially for non-English languages) of it can only be achieved by
> using specific Analyzer implementations, which appears not to be
> supported in LARQ, at least not without writing some Java code.
>
> B. Speeding up queries on literals for e.g. autocomplete search. While
> this can be done without a text index using FILTER(REGEX()), the queries
> tend to be quite slow, as the filter is applied only afterwards. In this
> case it is important that the text index returns all possible hits, not
> just the first ones.
>
> I have no idea which is the more important use case for LARQ, but I'm
> currently only interested in B because of the requirements of the
> application I'm building (ONKI Light, a SKOS vocabulary browser for
> SPARQL endpoints).
Do you have any idea/proposal to make LARQ be good for both these
use cases?
> Currently the benefits of LARQ (at least for the out-of-the-box
> configuration for Fuseki+LARQ) for both A and B are somewhat diminished
> by these limitations:
>
> 1. The index is global and contains data from all named graphs mixed up.
> This means that when you have many named graphs with different data (as
> I do), and try to query only one graph, the LARQ query part will still
> return hits from all the other graphs, slowing down later parts of the
> query.
Yep.
I thought about this a while ago, but I haven't actually tried to implement
it. The changes to the index are trivial. The most
difficult part perhaps is on the property function side, but
maybe that is easy as well.
I think this could be a good contribution, if you need it.
> 2. Similarly, the index does not allow filtering by language on the
> query level. With multilingual data, you cannot make a query matching
> e.g. only English labels but will get hits from all the other languages
> as well.
Yep.
I have no proposal for this, but I understand the user need.
> 3. The default implementation also doesn't store much context for the
> literal, meaning that you cannot restrict the search only to e.g.
> skos:prefLabel literal values in skos:Concept type resources. This will
> again increase the number of hits returned by the index internally.
I am not sure I follow this, or that I completely agree with you.
What you say is true, but LARQ provides a property function and you
can use it together with other triple patterns:
{
?l pf:textMatch '...' .
?s skos:prefLabel ?l .
?s rdf:type skos:Concept .
}
Now, we can argue about what a clever optimizer should/could do,
but from the user's point of view this is quite good and
powerful and it gets you what you want, doesn't it?
The syntax is very easy to remember and the property function
very easy to use.
The Lucene index can be kept quite simple and small.
>
> There may also be problems with prefix queries if you happen to hit the
> default BooleanQuery limit of 1024 clauses, but I haven't yet had this
> problem myself with LARQ. Another problem for use case B might be that
> the default Lucene StandardAnalyzer, which LARQ seems to use, filters
> common English stop words from the index and the query, which might
> interfere with the exact matching required for B.
>
> To be fair, other SPARQL text index implementations are not that good
> for prefix searches either. Virtuoso [1] requires at least 4 character
> prefixes to be specified (this can be changed by recompiling). AFAICT
> the 4store text index [2] doesn't support prefix queries at all, as the
> index structure requires whole words to be used (though possibly some
> creative use of subqueries and FILTER(REGEX()) could be used to still
> get some benefit of the index).
>
> Osma
>
> [1]
> http://docs.openlinksw.com/virtuoso/sparqlextensions.html#rdfsparqlrulefulltext
>
> [2] http://4store.org/trac/wiki/TextIndexing
>
Re: LARQ prefix search results missing hits
Posted by Osma Suominen <os...@aalto.fi>.
Hi Paolo!
31.08.2012 21:58, Paolo Castagna kirjoitti:
>> A. Traditional IR, in which the user cares about only the first few
>> results. Lucene is obviously very good at this, though full advantage
>> (especially for non-English languages) of it can only be achieved by
>> using specific Analyzer implementations, which appears not to be
>> supported in LARQ, at least not without writing some Java code.
>>
>> B. Speeding up queries on literals for e.g. autocomplete search. While
>> this can be done without a text index using FILTER(REGEX()), the queries
>> tend to be quite slow, as the filter is applied only afterwards. In this
>> case it is important that the text index returns all possible hits, not
>> just the first ones.
[...]
> Do you have any idea/proposal to make LARQ be good for both these
> use cases?
For A, I think LARQ is quite good already, though I note that the
current implementation is hardcoded to use Lucene StandardAnalyzer which
is pretty good for English text, fine for most European languages, but
maybe not that great for some other languages. Making it configurable to
support other Analyzers such as different language stemmers might be
useful. 4store allows a German stemmer to be used, for example [1].
For B, see below.
>> 1. The index is global and contains data from all named graphs mixed up.
>> This means that when you have many named graphs with different data (as
>> I do), and try to query only one graph, the LARQ query part will still
>> return hits from all the other graphs, slowing down later parts of the
>> query.
>
> Yep.
>
> I thought about this a while ago, but I haven't actually tried to implement
> it. The changes to the index are trivial. The most
> difficult part perhaps is on the property function side, but
> maybe that is easy as well.
>
> I think this could be a good contribution, if you need it.
This would be good for my application as it would speed up queries,
sometimes by a lot I think. But I'm not that familiar with the Jena
codebase so I won't volunteer to implement it...
>> 2. Similarly, the index does not allow filtering by language on the
>> query level. With multilingual data, you cannot make a query matching
>> e.g. only English labels but will get hits from all the other languages
>> as well.
>
> Yep.
>
> I have no proposal for this, but I understand the user need.
I tried a single line change to LARQ.java to support querying by
language. Patch attached.
I tested this with the STW thesaurus dataset mentioned in the beginning
of this thread. This query against the current unpatched LARQ searches
for all concepts whose English language skos:prefLabel begins with A:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT DISTINCT * WHERE {
?lit pf:textMatch "a*" .
?conc skos:prefLabel ?lit .
FILTER(REGEX(?lit, '^a.*', 'i') && langMatches(LANG(?lit), 'en'))
} ORDER BY ?lit
I benchmarked this query a few dozen times using apachebench. It takes
at minimum 35 ms on my machine.
With the patch applied, I can instead use this query:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT DISTINCT * WHERE {
?lit pf:textMatch "+a* +lang:en" .
?conc skos:prefLabel ?lit .
FILTER(REGEX(?lit, '^a.*', 'i'))
} ORDER BY ?lit
Note that I no longer need to filter the results by language as the
index only provides hits with the correct language tag. This query now
takes 25 ms, so it's about 30% faster than the original. The Lucene index
size went from 4352 kB to 4444 kB, a 2% increase.
I admit this is quite a small dataset, but I haven't yet had time to
test with larger ones.
What do you think?
A possible refinement would be to support a syntax where the language
tag is taken from the literal in the query, e.g.
?lit pf:textMatch "a*"@en .
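To make the idea concrete, here is a minimal sketch in plain Java. It does not reproduce the attached patch and does not call Lucene or LARQ; the class LangAwareIndex, its Entry type and the sample labels are hypothetical. It only assumes that each index entry stores the language tag next to the literal, so that a query such as "+a* +lang:en" can require both conditions inside the index instead of filtering by language afterwards.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model, not the attached patch: each index entry keeps the literal's
// lexical form plus its language tag, so a query can require both at once.
class LangAwareIndex {

    static final class Entry {
        final String lex;
        final String lang;
        Entry(String lex, String lang) { this.lex = lex; this.lang = lang; }
    }

    // Roughly what "+a* +lang:en" asks for: prefix match AND language match.
    // Passing lang == null means "any language", like the unpatched index.
    static List<String> search(List<Entry> index, String prefix, String lang) {
        List<String> hits = new ArrayList<>();
        for (Entry e : index) {
            boolean prefixOk = e.lex.toLowerCase().startsWith(prefix.toLowerCase());
            boolean langOk = lang == null || e.lang.equalsIgnoreCase(lang);
            if (prefixOk && langOk) hits.add(e.lex);
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Entry> index = Arrays.asList(
            new Entry("Arab", "en"),
            new Entry("Araber", "de"),
            new Entry("Accounting", "en"),
            new Entry("Arbeit", "de"));
        System.out.println(search(index, "a", null)); // all four labels
        System.out.println(search(index, "a", "en")); // only the English ones
    }
}
```

The fewer hits the index hands back, the fewer bindings the rest of the SPARQL query has to join and filter, which is where the speed-up would come from.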
>> 3. The default implementation also doesn't store much context for the
>> literal, meaning that you cannot restrict the search only to e.g.
>> skos:prefLabel literal values in skos:Concept type resources. This will
>> again increase the number of hits returned by the index internally.
>
> I am not sure I follow this or I completely agree with you.
>
> What you say is true, but LARQ provides a property function and you
> can use it together with other triple patterns:
>
> {
> ?l pf:textMatch '...' .
> ?s skos:prefLabel ?l .
> ?s rdf:type skos:Concept .
> }
>
> Now, we can argue about what a clever optimizer should/could do,
> but from the user's point of view this is quite good and
> powerful and it gets you what you want, doesn't it?
>
> The syntax is very easy to remember and the property function
> very easy to use.
>
> The Lucene index can be kept quite simple and small.
You're right here, the syntax is perfectly fine. It is only an
optimization issue.
>> There may also be problems with prefix queries if you happen to hit the
>> default BooleanQuery limit of 1024 clauses, but I haven't yet had this
>> problem myself with LARQ. Another problem for use case B might be that
>> the default Lucene StandardAnalyzer, which LARQ seems to use, filters
>> common English stop words from the index and the query, which might
>> interfere with the exact matching required for B.
>
> Yep.
>
> Any ideas/proposals?
For the BooleanQuery issue, I would suggest adding this somewhere in the
LARQ code:
BooleanQuery.setMaxClauseCount(newMax)
where newMax is a sufficiently large value (could be 100000 or
Integer.MAX_VALUE).
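Why the limit matters can be sketched in plain Java. This is a toy model and does not call Lucene. It assumes the prefix query is expanded into one clause per matching index term; whether that actually happens in LARQ depends on the rewrite method of the Lucene version in use, which has not been checked here. The class PrefixExpansion and the sample terms are hypothetical.

```java
import java.util.Arrays;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy model of prefix expansion with a clause limit; it does not call Lucene.
class PrefixExpansion {

    // Collects every indexed term that starts with prefix; fails once more
    // than maxClauses terms match, mimicking a TooManyClauses-style error.
    static SortedSet<String> expand(SortedSet<String> terms, String prefix, int maxClauses) {
        SortedSet<String> clauses = new TreeSet<>();
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) break; // sorted order: past the prefix range
            if (clauses.size() == maxClauses) {
                throw new IllegalStateException("too many clauses for prefix '" + prefix + "'");
            }
            clauses.add(term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>(Arrays.asList("arab", "arable", "area", "bank", "art"));
        System.out.println(expand(terms, "ar", 10)); // [arab, arable, area, art]
        try {
            expand(terms, "ar", 3); // four terms match, the limit is three
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under that assumption, short prefixes over a large vocabulary are exactly the queries that would reach the limit first, which is the autocomplete case B.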
For the other issues, I think use case B would benefit a lot if there
was a way to make the field "index" in the Lucene index use a simpler
Analyzer such as SimpleAnalyzer or TokenAnalyzer. Or alternatively,
perhaps the "lex" field could be processed with another analyzer. For my
application, something like LowerCaseKeywordAnalyzer would be perfect,
but it doesn't exist in the Lucene distribution. A quick web search
finds many such implementations though.
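The difference between the two styles of analysis can be illustrated with a small sketch in plain Java. These are toy functions, not Lucene classes: wordTokens only approximates word-based analysis with stop word removal, lowerCaseKeyword stands in for the hypothetical LowerCaseKeywordAnalyzer, and the stop word list is a short illustrative subset rather than Lucene's actual list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzers, not Lucene classes: they only show which terms end up indexed.
class AnalyzerSketch {

    // A few English stop words for illustration; not Lucene's actual list.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "of", "the", "to"));

    // Word-based analysis: lowercase, split on non-letters/digits, drop stop words.
    static List<String> wordTokens(String literal) {
        List<String> tokens = new ArrayList<>();
        for (String t : literal.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) tokens.add(t);
        }
        return tokens;
    }

    // Keyword-style analysis: the whole literal becomes one lowercased term.
    static List<String> lowerCaseKeyword(String literal) {
        return Arrays.asList(literal.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String label = "The Bank of Finland";
        System.out.println(wordTokens(label));       // [bank, finland]
        System.out.println(lowerCaseKeyword(label)); // [the bank of finland]
    }
}
```

With word tokens, a prefix such as "b*" matches this label through the term "bank" even though the label does not start with "b", while "the*" matches nothing because the stop word was dropped. With a single lowercased term, a prefix query lines up with the start of the label, which is what an autocomplete widget needs.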
(BTW, I don't quite understand why there are both "index" and "lex" fields
in the index; I think one field should be enough for both retrieving
exact strings and for performing text searches using keywords).
-Osma
[1] http://4store.org/trac/wiki/TextIndexing
--
Osma Suominen | Osma.Suominen@aalto.fi | +358 40 5255 882
Aalto University, Department of Media Technology, Semantic Computing
Research Group
Room 2541, Otaniementie 17, Espoo, Finland; P.O. Box 15500, FI-00076
Aalto, Finland
Re: LARQ prefix search results missing hits
Posted by Osma Suominen <os...@aalto.fi>.
Hi Paolo!
Thanks a lot for the fix! I have tested the latest snapshot and it now
works as expected. At least until I add lots of new data and hit the new
limit :)
You're of course right about the search use case. I think the problem
here is that the LARQ index can be used for two very different use cases:
A. Traditional IR, in which the user cares about only the first few
results. Lucene is obviously very good at this, though full advantage
(especially for non-English languages) of it can only be achieved by
using specific Analyzer implementations, which appears not to be
supported in LARQ, at least not without writing some Java code.
B. Speeding up queries on literals for e.g. autocomplete search. While
this can be done without a text index using FILTER(REGEX()), the queries
tend to be quite slow, as the filter is applied only afterwards. In this
case it is important that the text index returns all possible hits, not
just the first ones.
I have no idea which is the more important use case for LARQ, but I'm
currently only interested in B because of the requirements of the
application I'm building (ONKI Light, a SKOS vocabulary browser for
SPARQL endpoints).
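Why case B needs all hits can be shown with a toy model in plain Java. It is not LARQ code: the class CappedPrefixSearch and the sample labels are hypothetical, the cap stands in for NUM_RESULTS, and hits are taken in list order for simplicity, whereas a real Lucene search returns the top-scoring documents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model, not LARQ code: an "index" that returns at most maxHits prefix
// matches, followed by a stricter filter applied afterwards (like FILTER(REGEX())).
class CappedPrefixSearch {

    // Returns up to maxHits labels starting with prefix (case-insensitive), in list order.
    static List<String> indexSearch(List<String> labels, String prefix, int maxHits) {
        List<String> hits = new ArrayList<>();
        String p = prefix.toLowerCase();
        for (String label : labels) {
            if (hits.size() >= maxHits) break; // the cap silently drops later hits
            if (label.toLowerCase().startsWith(p)) hits.add(label);
        }
        return hits;
    }

    // Applies the stricter prefix as a post-filter on whatever the index returned.
    static List<String> postFilter(List<String> hits, String prefix) {
        List<String> out = new ArrayList<>();
        String p = prefix.toLowerCase();
        for (String h : hits) {
            if (h.toLowerCase().startsWith(p)) out.add(h);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("Abacus", "Accounting", "Agriculture", "Arab", "Architecture");
        int cap = 3; // stands in for NUM_RESULTS
        System.out.println(postFilter(indexSearch(labels, "ar", cap), "ar")); // [Arab, Architecture]
        System.out.println(postFilter(indexSearch(labels, "a", cap), "ar"));  // []
    }
}
```

The broader prefix fills the cap with other labels before "Arab" is reached, so the later filter has nothing left to keep. That is the same pattern as the "a*" query returning fewer results than "ar*".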
Currently the benefits of LARQ (at least for the out-of-the-box
configuration for Fuseki+LARQ) for both A and B are somewhat diminished
by these limitations:
1. The index is global and contains data from all named graphs mixed up.
This means that when you have many named graphs with different data (as
I do), and try to query only one graph, the LARQ query part will still
return hits from all the other graphs, slowing down later parts of the
query.
2. Similarly, the index does not allow filtering by language on the
query level. With multilingual data, you cannot make a query matching
e.g. only English labels but will get hits from all the other languages
as well.
3. The default implementation also doesn't store much context for the
literal, meaning that you cannot restrict the search only to e.g.
skos:prefLabel literal values in skos:Concept type resources. This will
again increase the number of hits returned by the index internally.
There may also be problems with prefix queries if you happen to hit the
default BooleanQuery limit of 1024 clauses, but I haven't yet had this
problem myself with LARQ. Another problem for use case B might be that
the default Lucene StandardAnalyzer, which LARQ seems to use, filters
common English stop words from the index and the query, which might
interfere with the exact matching required for B.
To be fair, other SPARQL text index implementations are not that good
for prefix searches either. Virtuoso [1] requires at least 4 character
prefixes to be specified (this can be changed by recompiling). AFAICT
the 4store text index [2] doesn't support prefix queries at all, as the
index structure requires whole words to be used (though possibly some
creative use of subqueries and FILTER(REGEX()) could be used to still
get some benefit of the index).
Osma
[1]
http://docs.openlinksw.com/virtuoso/sparqlextensions.html#rdfsparqlrulefulltext
[2] http://4store.org/trac/wiki/TextIndexing
26.08.2012 22:49, Paolo Castagna wrote:
> Hi Osma
>
> On 20/08/12 11:10, Osma Suominen wrote:
>> Hi Paolo!
>>
>> Thanks for your quick reply.
>>
>> 17.08.2012 20:16, Paolo Castagna wrote:
>>> Does your problem go away without changing the code and using:
>>> ?lit pf:textMatch ( 'a*' 100000 )
>>
>> I tested this but it didn't help. If I use a parameter less than 1000
>> then I get even fewer hits, but values above 1000 don't have any effect.
>
> Right.
>
>> I think the problem is this line in IndexLARQ.java:
>>
>> TopDocs topDocs = searcher.search(query, (Filter)null, LARQ.NUM_RESULTS ) ;
>>
>> As you can see the parameter for maximum number of hits is taken
>> directly from the NUM_RESULTS constant. The value specified in the query
>> has no effect on this level.
>
> Correct.
>
>>> It's not a problem adding a couple of '0'...
>>> However, I am thinking that this would just shift the problem, wouldn't it?
>>
>> You're right, it would just shift the problem but a sufficiently large
>> value could be used that never caused problems in practice. Maybe you
>> could consider NUM_RESULTS = Integer.MAX_VALUE ? :)
>
> Many search use cases are about driving a UI for people, and
> often only the first few results are necessary.
>
> Try to keep hitting 'next >>' on Google: how many results can you get?
>
> ;-)
>
> Anyway, I increased the NUM_RESULTS constant.
>
>> Or maybe LARQ should use another variant of Lucene's
>> IndexSearcher.search(), one which takes a Collector object instead of
>> the integer n parameter. E.g. this:
>> http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/core/org/apache/lucene/search/IndexSearcher.html#search%28org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20org.apache.lucene.search.Collector%29
>
> Yes. That would be the thing to use if we want to retrieve all the
> results from Lucene.
>
> More thinking is necessary here...
>
> In the meantime, you can find a LARQ SNAPSHOT here:
> https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-larq/1.0.1-SNAPSHOT/
>
> Paolo
>
>>
>>
>> Thanks,
>> Osma
>>
>>
>>> On 15/08/12 10:31, Osma Suominen wrote:
>>>> Hi Paolo!
>>>>
>>>> Thanks for your reply and sorry for the delay.
>>>>
>>>> I tested this again with today's svn snapshot and it's still a problem.
>>>>
>>>> However, after digging a bit further I found this in
>>>> jena-larq/src/main/java/org/apache/jena/larq/LARQ.java:
>>>>
>>>> --clip--
>>>> // The number of results returned by default
>>>> public static final int NUM_RESULTS = 1000 ; // should we increase this? -- PC
>>>> --clip--
>>>>
>>>> I changed NUM_RESULTS to 100000 (added two zeros), built and installed
>>>> my modified LARQ with mvn install (NB this required tweaking arq.ver
>>>> and tdb.ver in jena-larq/pom.xml to match the current svn versions),
>>>> rebuilt Fuseki and now the problem is gone!
>>>>
>>>> I would suggest that this constant be increased to something larger
>>>> than 1000. Based on the code comment, you seem to have had similar
>>>> thoughts sometime in the past :)
>>>>
>>>> Thanks,
>>>> Osma
>>>>
>>>>
>>>> 15.07.2012 11:21, Paolo Castagna wrote:
>>>>> Hi Osma,
>>>>> first of all, thanks for sharing your experience and clearly describing
>>>>> your problem.
>>>>> Further comments inline.
>>>>>
>>>>> On 13/07/12 14:13, Osma Suominen wrote:
>>>>>> Hello!
>>>>>>
>>>>>> I'm trying to use a Fuseki SPARQL endpoint together with LARQ to
>>>>>> create a system for accessing SKOS thesauri. The user interface
>>>>>> includes an autocompletion widget. The idea is to use the LARQ index
>>>>>> to make fast prefix queries on the concept labels.
>>>>>>
>>>>>> However, I've noticed that in some situations I get fewer results from
>>>>>> the index than what I'd expect. This seems to happen when the LARQ
>>>>>> part of the query internally produces many hits, such as when doing a
>>>>>> single character prefix query (e.g. ?lit pf:textMatch 'a*').
>>>>>>
>>>>>> I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on 2012-07-10 and
>>>>>> LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the LARQ
>>>>>> dependency to pom.xml and running mvn package. Other than this issue,
>>>>>> Fuseki and LARQ queries seem to work fine. I'm using Ubuntu Linux
>>>>>> 12.04 LTS amd64 with OpenJDK 1.6.0_24 installed from the standard
>>>>>> Ubuntu packages.
>>>>>>
>>>>>>
>>>>>> Steps to repeat:
>>>>>>
>>>>>> 1. package Fuseki with LARQ, as described above
>>>>>>
>>>>>> 2. start Fuseki with the attached configuration file, i.e.
>>>>>> ./fuseki-server --config=larq-config.ttl
>>>>>>
>>>>>> 3. I'm using the STW thesaurus as an easily accessible example data
>>>>>> set (though the problem was originally found with other data sets):
>>>>>> - download http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
>>>>>> - unzip so you have stw.rdf
>>>>>>
>>>>>> 4. load the thesaurus file into the endpoint:
>>>>>> ./s-put http://localhost:3030/ds/data default stw.rdf
>>>>>>
>>>>>> 5. build the LARQ index, e.g. this way:
>>>>>> - kill Fuseki
>>>>>> - rm -r /tmp/lucene
>>>>>> - start Fuseki again, so the index will be built
>>>>>>
>>>>>> 6. Make SPARQL queries from the web interface at http://localhost:3030
>>>>>>
>>>>>> First try this SPARQL query:
>>>>>>
>>>>>> PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
>>>>>> PREFIX pf:<http://jena.hpl.hp.com/ARQ/property#>
>>>>>> SELECT DISTINCT * WHERE {
>>>>>> ?lit pf:textMatch "ar*" .
>>>>>> ?conc skos:prefLabel ?lit .
>>>>>> FILTER(REGEX(?lit, '^ar.*', 'i'))
>>>>>> } ORDER BY ?lit
>>>>>>
>>>>>> I get 120 hits, including "Arab"@en.
>>>>>>
>>>>>> Now try the same query, but change the pf:textMatch argument to "a*".
>>>>>> This way I get only 32 results, not including "Arab"@en, even though
>>>>>> the shorter prefix query should match a superset of what was matched
>>>>>> by the first query (the regex should still filter it down to the same
>>>>>> result set).
>>>>>>
>>>>>>
>>>>>> This issue is not just about single character prefix queries. With
>>>>>> enough data sets loaded into the same index, this happens with longer
>>>>>> prefix queries as well.
>>>>>>
>>>>>> I think that the problem might be related to Lucene's default
>>>>>> limitation of a maximum of 1024 clauses in boolean queries (and thus
>>>>>> prefix query matches), as described in the Lucene FAQ:
>>>>>> http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> Yes, I think your hypothesis might be correct (I've not verified it
>>>>> yet).
>>>>>
>>>>>> In case this is the problem, is there any way to tell LARQ to use a
>>>>>> higher BooleanQuery.setMaxClauseCount() value so that this limit is
>>>>>> not triggered? I find it a bit disturbing that hits are silently being
>>>>>> lost. I couldn't see any special output on the Fuseki log.
>>>>>
>>>>> Not sure about this.
>>>>>
>>>>> Paolo
>>>>>
>>>>>>
>>>>>> Am I doing something wrong? If this is a genuine problem in LARQ, I
>>>>>> can of course make a bug report.
>>>>>>
>>>>>>
>>>>>> Thanks and best regards,
>>>>>> Osma Suominen
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>
>
--
Osma Suominen | Osma.Suominen@aalto.fi | +358 40 5255 882
Aalto University, Department of Media Technology, Semantic Computing
Research Group
Room 2541, Otaniementie 17, Espoo, Finland; P.O. Box 15500, FI-00076
Aalto, Finland
Re: LARQ prefix search results missing hits
Posted by Osma Suominen <os...@aalto.fi>.
Hi Paolo!
Thanks for your quick reply.
17.08.2012 20:16, Paolo Castagna wrote:
> Does your problem go away without changing the code and using:
> ?lit pf:textMatch ( 'a*' 100000 )
I tested this but it didn't help. If I use a parameter less than 1000
then I get even fewer hits, but values above 1000 don't have any effect.
I think the problem is this line in IndexLARQ.java:
TopDocs topDocs = searcher.search(query, (Filter)null, LARQ.NUM_RESULTS ) ;
As you can see the parameter for maximum number of hits is taken
directly from the NUM_RESULTS constant. The value specified in the query
has no effect on this level.
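To make the observed behaviour concrete, here is a minimal sketch in plain Java, with no Lucene or LARQ dependency. It only models what I see from the outside: the index search is always capped at NUM_RESULTS, and the limit given in the query is applied to that already capped list. The class name, the helper methods and the figure of 5000 matches are assumptions for illustration, not LARQ code.

```java
import java.util.ArrayList;
import java.util.List;

class TopNCapSketch {
    // Stand-in for LARQ.NUM_RESULTS as quoted in this thread.
    static final int NUM_RESULTS = 1000;

    // Hypothetical helper: pretends the index matched totalMatches documents
    // and returns at most cap of them, as a top-N search would.
    static List<Integer> topN(int totalMatches, int cap) {
        List<Integer> hits = new ArrayList<>();
        for (int doc = 0; doc < totalMatches && hits.size() < cap; doc++) {
            hits.add(doc);
        }
        return hits;
    }

    // Models the observed behaviour: the index is always searched with
    // NUM_RESULTS, and the per-query limit only trims that capped list.
    static int hitsSeenByQuery(int totalMatches, int requestedLimit) {
        List<Integer> capped = topN(totalMatches, NUM_RESULTS);
        return Math.min(capped.size(), requestedLimit);
    }

    public static void main(String[] args) {
        int totalMatches = 5000; // assumed number of labels matching 'a*'
        System.out.println(hitsSeenByQuery(totalMatches, 500));
        System.out.println(hitsSeenByQuery(totalMatches, 1000));
        System.out.println(hitsSeenByQuery(totalMatches, 100000));
    }
}
```

In this model a limit below 1000 reduces the hits further, while any limit above 1000 is indistinguishable from 1000, which matches what I measured. Any FILTER applied afterwards can only see the hits that survived the cap.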
> It's not a problem adding a couple of '0'...
> However, I am thinking that this would just shift the problem, wouldn't it?
You're right, it would just shift the problem but a sufficiently large
value could be used that never caused problems in practice. Maybe you
could consider NUM_RESULTS = Integer.MAX_VALUE ? :)
Or maybe LARQ should use another variant of Lucene's
IndexSearcher.search(), one which takes a Collector object instead of
the integer n parameter. E.g. this:
http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/core/org/apache/lucene/search/IndexSearcher.html#search%28org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20org.apache.lucene.search.Collector%29
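As a rough illustration of the collector idea, here is a sketch in plain Java that does not use Lucene at all. The search method is a hypothetical stand-in for an index lookup: instead of returning the best n hits, it hands every matching document id to a callback, so nothing is cut off. The names CollectAllSketch, search and collectAll are made up for this example, and a real Lucene Collector has more methods than this single callback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.IntConsumer;

class CollectAllSketch {
    // Hypothetical stand-in for an index search: every document whose label
    // starts with the prefix is passed to the collector, with no upper bound.
    static void search(String[] labels, String prefix, IntConsumer collector) {
        for (int doc = 0; doc < labels.length; doc++) {
            if (labels[doc].toLowerCase(Locale.ROOT).startsWith(prefix)) {
                collector.accept(doc);
            }
        }
    }

    // Gathers all matching document ids into a list.
    static List<Integer> collectAll(String[] labels, String prefix) {
        List<Integer> hits = new ArrayList<>();
        search(labels, prefix, doc -> hits.add(doc));
        return hits;
    }

    public static void main(String[] args) {
        String[] labels = {"Arab", "Banking", "Agriculture", "art"};
        System.out.println(collectAll(labels, "a"));
        System.out.println(collectAll(labels, "ar"));
    }
}
```

With this shape the hits for "ar" are always a subset of the hits for "a", which is the property the current top-N search loses once the cap is reached. The price is that memory use grows with the number of matches.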
Thanks,
Osma
> On 15/08/12 10:31, Osma Suominen wrote:
>> Hi Paolo!
>>
>> Thanks for your reply and sorry for the delay.
>>
>> I tested this again with today's svn snapshot and it's still a problem.
>>
>> However, after digging a bit further I found this in
>> jena-larq/src/main/java/org/apache/jena/larq/LARQ.java:
>>
>> --clip--
>> // The number of results returned by default
>> public static final int NUM_RESULTS = 1000 ; // should we increase this? -- PC
>> --clip--
>>
>> I changed NUM_RESULTS to 100000 (added two zeros), built and installed
>> my modified LARQ with mvn install (NB this required tweaking arq.ver
>> and tdb.ver in jena-larq/pom.xml to match the current svn versions),
>> rebuilt Fuseki and now the problem is gone!
>>
>> I would suggest that this constant be increased to something larger
>> than 1000. Based on the code comment, you seem to have had similar
>> thoughts sometime in the past :)
>>
>> Thanks,
>> Osma
>>
>>
>> 15.07.2012 11:21, Paolo Castagna wrote:
>>> Hi Osma,
>>> first of all, thanks for sharing your experience and clearly describing
>>> your problem.
>>> Further comments inline.
>>>
>>> On 13/07/12 14:13, Osma Suominen wrote:
>>>> Hello!
>>>>
>>>> I'm trying to use a Fuseki SPARQL endpoint together with LARQ to
>>>> create a system for accessing SKOS thesauri. The user interface
>>>> includes an autocompletion widget. The idea is to use the LARQ index
>>>> to make fast prefix queries on the concept labels.
>>>>
>>>> However, I've noticed that in some situations I get fewer results from
>>>> the index than what I'd expect. This seems to happen when the LARQ
>>>> part of the query internally produces many hits, such as when doing a
>>>> single character prefix query (e.g. ?lit pf:textMatch 'a*').
>>>>
>>>> I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on 2012-07-10 and
>>>> LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the LARQ
>>>> dependency to pom.xml and running mvn package. Other than this issue,
>>>> Fuseki and LARQ queries seem to work fine. I'm using Ubuntu Linux
>>>> 12.04 LTS amd64 with OpenJDK 1.6.0_24 installed from the standard
>>>> Ubuntu packages.
>>>>
>>>>
>>>> Steps to repeat:
>>>>
>>>> 1. package Fuseki with LARQ, as described above
>>>>
>>>> 2. start Fuseki with the attached configuration file, i.e.
>>>> ./fuseki-server --config=larq-config.ttl
>>>>
>>>> 3. I'm using the STW thesaurus as an easily accessible example data
>>>> set (though the problem was originally found with other data sets):
>>>> - download http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
>>>> - unzip so you have stw.rdf
>>>>
>>>> 4. load the thesaurus file into the endpoint:
>>>> ./s-put http://localhost:3030/ds/data default stw.rdf
>>>>
>>>> 5. build the LARQ index, e.g. this way:
>>>> - kill Fuseki
>>>> - rm -r /tmp/lucene
>>>> - start Fuseki again, so the index will be built
>>>>
>>>> 6. Make SPARQL queries from the web interface at http://localhost:3030
>>>>
>>>> First try this SPARQL query:
>>>>
>>>> PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
>>>> PREFIX pf:<http://jena.hpl.hp.com/ARQ/property#>
>>>> SELECT DISTINCT * WHERE {
>>>> ?lit pf:textMatch "ar*" .
>>>> ?conc skos:prefLabel ?lit .
>>>> FILTER(REGEX(?lit, '^ar.*', 'i'))
>>>> } ORDER BY ?lit
>>>>
>>>> I get 120 hits, including "Arab"@en.
>>>>
>>>> Now try the same query, but change the pf:textMatch argument to "a*".
>>>> This way I get only 32 results, not including "Arab"@en, even though
>>>> the shorter prefix query should match a superset of what was matched
>>>> by the first query (the regex should still filter it down to the same
>>>> result set).
>>>>
>>>>
>>>> This issue is not just about single character prefix queries. With
>>>> enough data sets loaded into the same index, this happens with longer
>>>> prefix queries as well.
>>>>
>>>> I think that the problem might be related to Lucene's default
>>>> limitation of a maximum of 1024 clauses in boolean queries (and thus
>>>> prefix query matches), as described in the Lucene FAQ:
>>>> http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F
>>>>
>>>>
>>>
>>> Yes, I think your hypothesis might be correct (I've not verified it
>>> yet).
>>>
>>>> In case this is the problem, is there any way to tell LARQ to use a
>>>> higher BooleanQuery.setMaxClauseCount() value so that this limit is
>>>> not triggered? I find it a bit disturbing that hits are silently being
>>>> lost. I couldn't see any special output on the Fuseki log.
>>>
>>> Not sure about this.
>>>
>>> Paolo
>>>
>>>>
>>>> Am I doing something wrong? If this is a genuine problem in LARQ, I
>>>> can of course make a bug report.
>>>>
>>>>
>>>> Thanks and best regards,
>>>> Osma Suominen
>>>>
>>>
>>>
>>
>>
>
--
Osma Suominen | Osma.Suominen@aalto.fi | +358 40 5255 882
Aalto University, Department of Media Technology, Semantic Computing
Research Group
Room 2541, Otaniementie 17, Espoo, Finland; P.O. Box 15500, FI-00076
Aalto, Finland