Posted to users@jena.apache.org by Osma Suominen <os...@aalto.fi> on 2012/08/15 11:31:59 UTC

Re: LARQ prefix search results missing hits

Hi Paolo!

Thanks for your reply and sorry for the delay.

I tested this again with today's svn snapshot and it's still a problem.

However, after digging a bit further I found this in 
jena-larq/src/main/java/org/apache/jena/larq/LARQ.java:

--clip--
     // The number of results returned by default
     public static final int NUM_RESULTS             = 1000 ; // should we increase this? -- PC
--clip--

I changed NUM_RESULTS to 100000 (added two zeros), built and installed 
my modified LARQ with mvn install (NB this required tweaking arq.ver and 
tdb.ver in jena-larq/pom.xml to match the current svn versions), rebuilt 
Fuseki and now the problem is gone!

I would suggest that this constant be increased to something larger than 
1000. Based on the code comment, you seem to have had similar thoughts 
sometime in the past :)

Thanks,
Osma


15.07.2012 11:21, Paolo Castagna wrote:
> Hi Osma,
> first of all, thanks for sharing your experience and clearly describing
> your problem.
> Further comments inline.
>
> On 13/07/12 14:13, Osma Suominen wrote:
>> Hello!
>>
>> I'm trying to use a Fuseki SPARQL endpoint together with LARQ to
>> create a system for accessing SKOS thesauri. The user interface
>> includes an autocompletion widget. The idea is to use the LARQ index
>> to make fast prefix queries on the concept labels.
>>
>> However, I've noticed that in some situations I get fewer results from
>> the index than what I'd expect. This seems to happen when the LARQ
>> part of the query internally produces many hits, such as when doing a
>> single character prefix query (e.g. ?lit pf:textMatch 'a*').
>>
>> I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on 2012-07-10 and
>> LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the LARQ
>> dependency to pom.xml and running mvn package. Other than this issue,
>> Fuseki and LARQ queries seem to work fine. I'm using Ubuntu Linux
>> 12.04 LTS amd64 with OpenJDK 1.6.0_24 installed from the standard
>> Ubuntu packages.
>>
>>
>> Steps to repeat:
>>
>> 1. package Fuseki with LARQ, as described above
>>
>> 2. start Fuseki with the attached configuration file, i.e.
>>     ./fuseki-server --config=larq-config.ttl
>>
>> 3. I'm using the STW thesaurus as an easily accessible example data
>> set (though the problem was originally found with other data sets):
>>     - download http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
>>     - unzip so you have stw.rdf
>>
>> 4. load the thesaurus file into the endpoint:
>>     ./s-put http://localhost:3030/ds/data default stw.rdf
>>
>> 5. build the LARQ index, e.g. this way:
>>     - kill Fuseki
>>     - rm -r /tmp/lucene
>>     - start Fuseki again, so the index will be built
>>
>> 6. Make SPARQL queries from the web interface at http://localhost:3030
>>
>> First try this SPARQL query:
>>
>> PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
>> PREFIX pf:<http://jena.hpl.hp.com/ARQ/property#>
>> SELECT DISTINCT * WHERE {
>>    ?lit pf:textMatch "ar*" .
>>    ?conc skos:prefLabel ?lit .
>>    FILTER(REGEX(?lit, '^ar.*', 'i'))
>> } ORDER BY ?lit
>>
>> I get 120 hits, including "Arab"@en.
>>
>> Now try the same query, but change the pf:textMatch argument to "a*".
>> This way I get only 32 results, not including "Arab"@en, even though
>> the shorter prefix query should match a superset of what was matched
>> by the first query (the regex should still filter it down to the same
>> result set).
>>
>>
>> This issue is not just about single character prefix queries. With
>> enough data sets loaded into the same index, this happens with longer
>> prefix queries as well.
>>
>> I think that the problem might be related to Lucene's default
>> limitation of a maximum of 1024 clauses in boolean queries (and thus
>> prefix query matches), as described in the Lucene FAQ:
>> http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F
>>
>
> Yes, I think your hypothesis might be correct (I've not verified it yet).
>
>> In case this is the problem, is there any way to tell LARQ to use a
>> higher BooleanQuery.setMaxClauseCount() value so that this limit is
>> not triggered? I find it a bit disturbing that hits are silently being
>> lost. I couldn't see any special output on the Fuseki log.
>
> Not sure about this.
>
> Paolo
>
>>
>> Am I doing something wrong? If this is a genuine problem in LARQ, I
>> can of course make a bug report.
>>
>>
>> Thanks and best regards,
>> Osma Suominen
>>
>
>


-- 
Osma Suominen | Osma.Suominen@aalto.fi | +358 40 5255 882
Aalto University, Department of Media Technology, Semantic Computing 
Research Group
Room 2541, Otaniementie 17, Espoo, Finland; P.O. Box 15500, FI-00076 
Aalto, Finland

Re: LARQ prefix search results missing hits

Posted by Paolo Castagna <ca...@gmail.com>.
Hi Osma,
thanks for your help and feedback.

Does your problem go away without changing the code and using:
?lit pf:textMatch ( 'a*' 100000 )

It's not a problem adding a couple of '0'...
However, I am thinking that this would just shift the problem, wouldn't it?

Paolo

On 15/08/12 10:31, Osma Suominen wrote:
> Hi Paolo!
>
> Thanks for your reply and sorry for the delay.
>
> I tested this again with today's svn snapshot and it's still a problem.
>
> However, after digging a bit further I found this in
> jena-larq/src/main/java/org/apache/jena/larq/LARQ.java:
>
> --clip--
>     // The number of results returned by default
>      public static final int NUM_RESULTS             = 1000 ; // should we increase this? -- PC
> --clip--
>
> I changed NUM_RESULTS to 100000 (added two zeros), built and installed
> my modified LARQ with mvn install (NB this required tweaking arq.ver
> and tdb.ver in jena-larq/pom.xml to match the current svn versions),
> rebuilt Fuseki and now the problem is gone!
>
> I would suggest that this constant be increased to something larger
> than 1000. Based on the code comment, you seem to have had similar
> thoughts sometime in the past :)
>
> Thanks,
> Osma
>
>
> 15.07.2012 11:21, Paolo Castagna wrote:
>> Hi Osma,
>> first of all, thanks for sharing your experience and clearly describing
>> your problem.
>> Further comments inline.
>>
>> On 13/07/12 14:13, Osma Suominen wrote:
>>> Hello!
>>>
>>> I'm trying to use a Fuseki SPARQL endpoint together with LARQ to
>>> create a system for accessing SKOS thesauri. The user interface
>>> includes an autocompletion widget. The idea is to use the LARQ index
>>> to make fast prefix queries on the concept labels.
>>>
>>> However, I've noticed that in some situations I get fewer results from
>>> the index than what I'd expect. This seems to happen when the LARQ
>>> part of the query internally produces many hits, such as when doing a
>>> single character prefix query (e.g. ?lit pf:textMatch 'a*').
>>>
>>> I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on 2012-07-10 and
>>> LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the LARQ
>>> dependency to pom.xml and running mvn package. Other than this issue,
>>> Fuseki and LARQ queries seem to work fine. I'm using Ubuntu Linux
>>> 12.04 LTS amd64 with OpenJDK 1.6.0_24 installed from the standard
>>> Ubuntu packages.
>>>
>>>
>>> Steps to repeat:
>>>
>>> 1. package Fuseki with LARQ, as described above
>>>
>>> 2. start Fuseki with the attached configuration file, i.e.
>>>     ./fuseki-server --config=larq-config.ttl
>>>
>>> 3. I'm using the STW thesaurus as an easily accessible example data
>>> set (though the problem was originally found with other data sets):
>>>     - download http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
>>>     - unzip so you have stw.rdf
>>>
>>> 4. load the thesaurus file into the endpoint:
>>>     ./s-put http://localhost:3030/ds/data default stw.rdf
>>>
>>> 5. build the LARQ index, e.g. this way:
>>>     - kill Fuseki
>>>     - rm -r /tmp/lucene
>>>     - start Fuseki again, so the index will be built
>>>
>>> 6. Make SPARQL queries from the web interface at http://localhost:3030
>>>
>>> First try this SPARQL query:
>>>
>>> PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
>>> PREFIX pf:<http://jena.hpl.hp.com/ARQ/property#>
>>> SELECT DISTINCT * WHERE {
>>>    ?lit pf:textMatch "ar*" .
>>>    ?conc skos:prefLabel ?lit .
>>>    FILTER(REGEX(?lit, '^ar.*', 'i'))
>>> } ORDER BY ?lit
>>>
>>> I get 120 hits, including "Arab"@en.
>>>
>>> Now try the same query, but change the pf:textMatch argument to "a*".
>>> This way I get only 32 results, not including "Arab"@en, even though
>>> the shorter prefix query should match a superset of what was matched
>>> by the first query (the regex should still filter it down to the same
>>> result set).
>>>
>>>
>>> This issue is not just about single character prefix queries. With
>>> enough data sets loaded into the same index, this happens with longer
>>> prefix queries as well.
>>>
>>> I think that the problem might be related to Lucene's default
>>> limitation of a maximum of 1024 clauses in boolean queries (and thus
>>> prefix query matches), as described in the Lucene FAQ:
>>> http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F
>>>
>>>
>>
>> Yes, I think your hypothesis might be correct (I've not verified it
>> yet).
>>
>>> In case this is the problem, is there any way to tell LARQ to use a
>>> higher BooleanQuery.setMaxClauseCount() value so that this limit is
>>> not triggered? I find it a bit disturbing that hits are silently being
>>> lost. I couldn't see any special output on the Fuseki log.
>>
>> Not sure about this.
>>
>> Paolo
>>
>>>
>>> Am I doing something wrong? If this is a genuine problem in LARQ, I
>>> can of course make a bug report.
>>>
>>>
>>> Thanks and best regards,
>>> Osma Suominen
>>>
>>
>>
>
>


Re: LARQ prefix search results missing hits

Posted by Paolo Castagna <ca...@gmail.com>.
Hi Osma

On 20/08/12 11:10, Osma Suominen wrote:
> Hi Paolo!
> 
> Thanks for your quick reply.
> 
> 17.08.2012 20:16, Paolo Castagna wrote:
>> Does your problem go away without changing the code and using:
>> ?lit pf:textMatch ( 'a*' 100000 )
> 
> I tested this but it didn't help. If I use a parameter less than 1000
> then I get even fewer hits, but values above 1000 don't have any effect.

Right.

> I think the problem is this line in IndexLARQ.java:
> 
> TopDocs topDocs = searcher.search(query, (Filter)null, LARQ.NUM_RESULTS ) ;
> 
> As you can see the parameter for maximum number of hits is taken
> directly from the NUM_RESULTS constant. The value specified in the query
> has no effect on this level.

Correct.

>> It's not a problem adding a couple of '0'...
>> However, I am thinking that this would just shift the problem, wouldn't it?
> 
> You're right, it would just shift the problem but a sufficiently large
> value could be used that never caused problems in practice. Maybe you
> could consider NUM_RESULTS = Integer.MAX_VALUE ? :)

A lot of search use cases are used to drive a UI for people, and
often only the first few results are necessary.

Try continuing to hit 'next >>' on Google: how many results can you get?

;-)

Anyway, I increased the NUM_RESULTS constant.

> Or maybe LARQ should use another variant of Lucene's
> IndexSearcher.search(), one which takes a Collector object instead of
> the integer n parameter. E.g. this:
> http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/core/org/apache/lucene/search/IndexSearcher.html#search%28org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20org.apache.lucene.search.Collector%29

Yes. That would be the thing to use if we want to retrieve all the
results from Lucene.

More thinking is necessary here...
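
For reference, a minimal sketch of such a hit-collecting Collector
against the Lucene 3.x API linked above (the class name is an
illustrative assumption, not LARQ code):

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

public class AllDocsCollector extends Collector {
    private final List<Integer> docs = new ArrayList<Integer>();
    private int docBase = 0;

    public void setScorer(Scorer scorer) { }  // scores are not needed

    public void collect(int doc) {
        docs.add(docBase + doc);              // keep every matching doc id
    }

    public void setNextReader(IndexReader reader, int docBase) {
        this.docBase = docBase;               // ids are segment-relative
    }

    public boolean acceptsDocsOutOfOrder() { return true; }

    public List<Integer> getDocs() { return docs; }
}

It would be called as searcher.search(query, new AllDocsCollector())
instead of capping the hit list at LARQ.NUM_RESULTS.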

In the meantime, you can find a LARQ SNAPSHOT here:
https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-larq/1.0.1-SNAPSHOT/

Paolo

> 
> 
> Thanks,
> Osma
> 
> 
>> On 15/08/12 10:31, Osma Suominen wrote:
>>> Hi Paolo!
>>>
>>> Thanks for your reply and sorry for the delay.
>>>
>>> I tested this again with today's svn snapshot and it's still a problem.
>>>
>>> However, after digging a bit further I found this in
>>> jena-larq/src/main/java/org/apache/jena/larq/LARQ.java:
>>>
>>> --clip--
>>>      // The number of results returned by default
>>>      public static final int NUM_RESULTS             = 1000 ; // should we increase this? -- PC
>>> --clip--
>>>
>>> I changed NUM_RESULTS to 100000 (added two zeros), built and installed
>>> my modified LARQ with mvn install (NB this required tweaking arq.ver
>>> and tdb.ver in jena-larq/pom.xml to match the current svn versions),
>>> rebuilt Fuseki and now the problem is gone!
>>>
>>> I would suggest that this constant be increased to something larger
>>> than 1000. Based on the code comment, you seem to have had similar
>>> thoughts sometime in the past :)
>>>
>>> Thanks,
>>> Osma
>>>
>>>
>>> 15.07.2012 11:21, Paolo Castagna wrote:
>>>> Hi Osma,
>>>> first of all, thanks for sharing your experience and clearly describing
>>>> your problem.
>>>> Further comments inline.
>>>>
>>>> On 13/07/12 14:13, Osma Suominen wrote:
>>>>> Hello!
>>>>>
>>>>> I'm trying to use a Fuseki SPARQL endpoint together with LARQ to
>>>>> create a system for accessing SKOS thesauri. The user interface
>>>>> includes an autocompletion widget. The idea is to use the LARQ index
>>>>> to make fast prefix queries on the concept labels.
>>>>>
>>>>> However, I've noticed that in some situations I get fewer results from
>>>>> the index than what I'd expect. This seems to happen when the LARQ
>>>>> part of the query internally produces many hits, such as when doing a
>>>>> single character prefix query (e.g. ?lit pf:textMatch 'a*').
>>>>>
>>>>> I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on 2012-07-10 and
>>>>> LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the LARQ
>>>>> dependency to pom.xml and running mvn package. Other than this issue,
>>>>> Fuseki and LARQ queries seem to work fine. I'm using Ubuntu Linux
>>>>> 12.04 LTS amd64 with OpenJDK 1.6.0_24 installed from the standard
>>>>> Ubuntu packages.
>>>>>
>>>>>
>>>>> Steps to repeat:
>>>>>
>>>>> 1. package Fuseki with LARQ, as described above
>>>>>
>>>>> 2. start Fuseki with the attached configuration file, i.e.
>>>>>      ./fuseki-server --config=larq-config.ttl
>>>>>
>>>>> 3. I'm using the STW thesaurus as an easily accessible example data
>>>>> set (though the problem was originally found with other data sets):
>>>>>      - download http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
>>>>>      - unzip so you have stw.rdf
>>>>>
>>>>> 4. load the thesaurus file into the endpoint:
>>>>>      ./s-put http://localhost:3030/ds/data default stw.rdf
>>>>>
>>>>> 5. build the LARQ index, e.g. this way:
>>>>>      - kill Fuseki
>>>>>      - rm -r /tmp/lucene
>>>>>      - start Fuseki again, so the index will be built
>>>>>
>>>>> 6. Make SPARQL queries from the web interface at http://localhost:3030
>>>>>
>>>>> First try this SPARQL query:
>>>>>
>>>>> PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
>>>>> PREFIX pf:<http://jena.hpl.hp.com/ARQ/property#>
>>>>> SELECT DISTINCT * WHERE {
>>>>>     ?lit pf:textMatch "ar*" .
>>>>>     ?conc skos:prefLabel ?lit .
>>>>>     FILTER(REGEX(?lit, '^ar.*', 'i'))
>>>>> } ORDER BY ?lit
>>>>>
>>>>> I get 120 hits, including "Arab"@en.
>>>>>
>>>>> Now try the same query, but change the pf:textMatch argument to "a*".
>>>>> This way I get only 32 results, not including "Arab"@en, even though
>>>>> the shorter prefix query should match a superset of what was matched
>>>>> by the first query (the regex should still filter it down to the same
>>>>> result set).
>>>>>
>>>>>
>>>>> This issue is not just about single character prefix queries. With
>>>>> enough data sets loaded into the same index, this happens with longer
>>>>> prefix queries as well.
>>>>>
>>>>> I think that the problem might be related to Lucene's default
>>>>> limitation of a maximum of 1024 clauses in boolean queries (and thus
>>>>> prefix query matches), as described in the Lucene FAQ:
>>>>> http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F
>>>>>
>>>>>
>>>>>
>>>>
>>>> Yes, I think your hypothesis might be correct (I've not verified it
>>>> yet).
>>>>
>>>>> In case this is the problem, is there any way to tell LARQ to use a
>>>>> higher BooleanQuery.setMaxClauseCount() value so that this limit is
>>>>> not triggered? I find it a bit disturbing that hits are silently being
>>>>> lost. I couldn't see any special output on the Fuseki log.
>>>>
>>>> Not sure about this.
>>>>
>>>> Paolo
>>>>
>>>>>
>>>>> Am I doing something wrong? If this is a genuine problem in LARQ, I
>>>>> can of course make a bug report.
>>>>>
>>>>>
>>>>> Thanks and best regards,
>>>>> Osma Suominen
>>>>>
>>>>
>>>>
>>>
>>>
>>
> 
> 


Re: LARQ prefix search results missing hits

Posted by Paolo Castagna <ca...@gmail.com>.
Hi Osma

On 28/08/12 14:22, Osma Suominen wrote:
> Hi Paolo!
> 
> Thanks a lot for the fix! I have tested the latest snapshot and it now
> works as expected. At least until I add lots of new data and hit the new
> limit :)
> 
> 
> You're of course right about the search use case. I think the problem
> here is that the LARQ index can be used for two very different use cases:
> 
> A. Traditional IR, in which the user cares about only the first few
> results. Lucene is obviously very good at this, though full advantage
> (especially for non-English languages) of it can only be achieved by
> using specific Analyzer implementations, which appears not to be
> supported in LARQ, at least not without writing some Java code.
> 
> B. Speeding up queries on literals for e.g. autocomplete search. While
> this can be done without a text index using FILTER(REGEX()), the queries
> tend to be quite slow, as the filter is applied only afterwards. In this
> case it is important that the text index returns all possible hits, not
> just the first ones.
> 
> I have no idea which is the more important use case for LARQ, but I'm
> currently only interested in B because of the requirements of the
> application I'm building (ONKI Light, a SKOS vocabulary browser for
> SPARQL endpoints).

Do you have any idea/proposal to make LARQ good for both of these
use cases?

> Currently the benefits of LARQ (at least for the out-of-the-box
> configuration for Fuseki+LARQ) for both A and B are somewhat diminished
> by these limitations:
> 
> 1. The index is global and contains data from all named graphs mixed up.
> This means that when you have many named graphs with different data (as
> I do), and try to query only one graph, the LARQ query part will still
> return hits from all the other graphs, slowing down later parts of the
> query.

Yep.

I thought about this a while ago, but I haven't actually tried to
implement it. The changes to the index are trivial. The most
difficult part is perhaps on the property function side, but
maybe that's easy as well.

I think this could be a good contribution, if you need it.
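
As a rough sketch of the index side (not existing LARQ code; the
"graph" field name is an assumption, while "index" is the field LARQ
already uses for the analyzed literal), each indexed literal could
simply carry the URI of the graph it came from:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class GraphAwareDocuments {
    // Index a literal together with its named graph URI, so searches
    // can later be restricted to one graph via a TermQuery on "graph".
    public static Document toDocument(String lexicalForm, String graphUri) {
        Document doc = new Document();
        doc.add(new Field("index", lexicalForm,
                          Field.Store.NO, Field.Index.ANALYZED));
        doc.add(new Field("graph", graphUri,
                          Field.Store.YES, Field.Index.NOT_ANALYZED));
        return doc;
    }
}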

> 2. Similarly, the index does not allow filtering by language on the
> query level. With multilingual data, you cannot make a query matching
> e.g. only English labels but will get hits from all the other languages
> as well.

Yep.

I have no proposal for this, but I understand the user need.

> 3. The default implementation also doesn't store much context for the
> literal, meaning that you cannot restrict the search only to e.g.
> skos:prefLabel literal values in skos:Concept type resources. This will
> again increase the number of hits returned by the index internally.

I am not sure I follow this, or that I completely agree with you.

What you say is true, but LARQ provides a property function and you
can use it together with other triple patterns:

 {
   ?l pf:textMatch '...' .
   ?s skos:prefLabel ?l .
   ?s rdf:type skos:Concept .
 }

Now, we can argue about what a clever optimizer should/could do,
but from the user's point of view, this is quite good and
powerful, and it gets you what you want, doesn't it?

The syntax is very easy to remember and the property function
very easy to use.

The Lucene index can be kept quite simple and small.

> There may also be problems with prefix queries if you happen to hit the
> default BooleanQuery limit of 1024 clauses, but I haven't yet had this
> problem myself with LARQ. Another problem for use case B might be that
> the default Lucene StandardAnalyzer, which LARQ seems to use, filters
> common English stop words from the index and the query, which might
interfere with the exact matching required for B.

Yep.

Any ideas/proposals?

> To be fair, other SPARQL text index implementations are not that good
> for prefix searches either. Virtuoso [1] requires at least 4 character
> prefixes to be specified (this can be changed by recompiling). AFAICT
> the 4store text index [2] doesn't support prefix queries at all, as the
> index structure requires whole words to be used (though possibly some
> creative use of subqueries and FILTER(REGEX()) could be used to still
> get some benefit of the index).

It's good to provide feedback; maybe with your help we can further
improve LARQ. :-)

Paolo

> 
> Osma
> 
> [1]
> http://docs.openlinksw.com/virtuoso/sparqlextensions.html#rdfsparqlrulefulltext
> 
> [2] http://4store.org/trac/wiki/TextIndexing
> 
> 26.08.2012 22:49, Paolo Castagna wrote:
>> Hi Osma
>>
>> On 20/08/12 11:10, Osma Suominen wrote:
>>> Hi Paolo!
>>>
>>> Thanks for your quick reply.
>>>
>>> 17.08.2012 20:16, Paolo Castagna wrote:
>>>> Does your problem go away without changing the code and using:
>>>> ?lit pf:textMatch ( 'a*' 100000 )
>>>
>>> I tested this but it didn't help. If I use a parameter less than 1000
>>> then I get even fewer hits, but values above 1000 don't have any effect.
>>
>> Right.
>>
>>> I think the problem is this line in IndexLARQ.java:
>>>
>>> TopDocs topDocs = searcher.search(query, (Filter)null,
>>> LARQ.NUM_RESULTS ) ;
>>>
>>> As you can see the parameter for maximum number of hits is taken
>>> directly from the NUM_RESULTS constant. The value specified in the query
>>> has no effect on this level.
>>
>> Correct.
>>
>>>> It's not a problem adding a couple of '0'...
>>>> However, I am thinking that this would just shift the problem, wouldn't it?
>>>
>>> You're right, it would just shift the problem but a sufficiently large
>>> value could be used that never caused problems in practice. Maybe you
>>> could consider NUM_RESULTS = Integer.MAX_VALUE ? :)
>>
>> A lot of search use cases are used to drive a UI for people, and
>> often only the first few results are necessary.
>>
>> Try continuing to hit 'next >>' on Google: how many results can you get?
>>
>> ;-)
>>
>> Anyway, I increased the NUM_RESULTS constant.
>>
>>> Or maybe LARQ should use another variant of Lucene's
>>> IndexSearcher.search(), one which takes a Collector object instead of
>>> the integer n parameter. E.g. this:
>>> http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/core/org/apache/lucene/search/IndexSearcher.html#search%28org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20org.apache.lucene.search.Collector%29
>>>
>>
>> Yes. That would be the thing to use if we want to retrieve all the
>> results from Lucene.
>>
>> More thinking is necessary here...
>>
>> In the meantime, you can find a LARQ SNAPSHOT here:
>> https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-larq/1.0.1-SNAPSHOT/
>>
>>
>> Paolo
>>
>>>
>>>
>>> Thanks,
>>> Osma
>>>
>>>
>>>> On 15/08/12 10:31, Osma Suominen wrote:
>>>>> Hi Paolo!
>>>>>
>>>>> Thanks for your reply and sorry for the delay.
>>>>>
>>>>> I tested this again with today's svn snapshot and it's still a
>>>>> problem.
>>>>>
>>>>> However, after digging a bit further I found this in
>>>>> jena-larq/src/main/java/org/apache/jena/larq/LARQ.java:
>>>>>
>>>>> --clip--
>>>>>       // The number of results returned by default
>>>>>       public static final int NUM_RESULTS             = 1000 ; // should we increase this? -- PC
>>>>> --clip--
>>>>>
>>>>> I changed NUM_RESULTS to 100000 (added two zeros), built and installed
>>>>> my modified LARQ with mvn install (NB this required tweaking arq.ver
>>>>> and tdb.ver in jena-larq/pom.xml to match the current svn versions),
>>>>> rebuilt Fuseki and now the problem is gone!
>>>>>
>>>>> I would suggest that this constant be increased to something larger
>>>>> than 1000. Based on the code comment, you seem to have had similar
>>>>> thoughts sometime in the past :)
>>>>>
>>>>> Thanks,
>>>>> Osma
>>>>>
>>>>>
>>>>> 15.07.2012 11:21, Paolo Castagna wrote:
>>>>>> Hi Osma,
>>>>>> first of all, thanks for sharing your experience and clearly
>>>>>> describing
>>>>>> your problem.
>>>>>> Further comments inline.
>>>>>>
>>>>>> On 13/07/12 14:13, Osma Suominen wrote:
>>>>>>> Hello!
>>>>>>>
>>>>>>> I'm trying to use a Fuseki SPARQL endpoint together with LARQ to
>>>>>>> create a system for accessing SKOS thesauri. The user interface
>>>>>>> includes an autocompletion widget. The idea is to use the LARQ index
>>>>>>> to make fast prefix queries on the concept labels.
>>>>>>>
>>>>>>> However, I've noticed that in some situations I get fewer results from
>>>>>>> the index than what I'd expect. This seems to happen when the LARQ
>>>>>>> part of the query internally produces many hits, such as when
>>>>>>> doing a
>>>>>>> single character prefix query (e.g. ?lit pf:textMatch 'a*').
>>>>>>>
>>>>>>> I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on
>>>>>>> 2012-07-10 and
>>>>>>> LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the
>>>>>>> LARQ
>>>>>>> dependency to pom.xml and running mvn package. Other than this
>>>>>>> issue,
>>>>>>> Fuseki and LARQ queries seem to work fine. I'm using Ubuntu Linux
>>>>>>> 12.04 LTS amd64 with OpenJDK 1.6.0_24 installed from the standard
>>>>>>> Ubuntu packages.
>>>>>>>
>>>>>>>
>>>>>>> Steps to repeat:
>>>>>>>
>>>>>>> 1. package Fuseki with LARQ, as described above
>>>>>>>
>>>>>>> 2. start Fuseki with the attached configuration file, i.e.
>>>>>>>       ./fuseki-server --config=larq-config.ttl
>>>>>>>
>>>>>>> 3. I'm using the STW thesaurus as an easily accessible example data
>>>>>>> set (though the problem was originally found with other data sets):
>>>>>>>       - download
>>>>>>> http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
>>>>>>>       - unzip so you have stw.rdf
>>>>>>>
>>>>>>> 4. load the thesaurus file into the endpoint:
>>>>>>>       ./s-put http://localhost:3030/ds/data default stw.rdf
>>>>>>>
>>>>>>> 5. build the LARQ index, e.g. this way:
>>>>>>>       - kill Fuseki
>>>>>>>       - rm -r /tmp/lucene
>>>>>>>       - start Fuseki again, so the index will be built
>>>>>>>
>>>>>>> 6. Make SPARQL queries from the web interface at
>>>>>>> http://localhost:3030
>>>>>>>
>>>>>>> First try this SPARQL query:
>>>>>>>
>>>>>>> PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
>>>>>>> PREFIX pf:<http://jena.hpl.hp.com/ARQ/property#>
>>>>>>> SELECT DISTINCT * WHERE {
>>>>>>>      ?lit pf:textMatch "ar*" .
>>>>>>>      ?conc skos:prefLabel ?lit .
>>>>>>>      FILTER(REGEX(?lit, '^ar.*', 'i'))
>>>>>>> } ORDER BY ?lit
>>>>>>>
>>>>>>> I get 120 hits, including "Arab"@en.
>>>>>>>
>>>>>>> Now try the same query, but change the pf:textMatch argument to
>>>>>>> "a*".
>>>>>>> This way I get only 32 results, not including "Arab"@en, even though
>>>>>>> the shorter prefix query should match a superset of what was matched
>>>>>>> by the first query (the regex should still filter it down to the
>>>>>>> same
>>>>>>> result set).
>>>>>>>
>>>>>>>
>>>>>>> This issue is not just about single character prefix queries. With
>>>>>>> enough data sets loaded into the same index, this happens with
>>>>>>> longer
>>>>>>> prefix queries as well.
>>>>>>>
>>>>>>> I think that the problem might be related to Lucene's default
>>>>>>> limitation of a maximum of 1024 clauses in boolean queries (and thus
>>>>>>> prefix query matches), as described in the Lucene FAQ:
>>>>>>> http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Yes, I think your hypothesis might be correct (I've not verified it
>>>>>> yet).
>>>>>>
>>>>>>> In case this is the problem, is there any way to tell LARQ to use a
>>>>>>> higher BooleanQuery.setMaxClauseCount() value so that this limit is
>>>>>>> not triggered? I find it a bit disturbing that hits are silently
>>>>>>> being
>>>>>>> lost. I couldn't see any special output on the Fuseki log.
>>>>>>
>>>>>> Not sure about this.
>>>>>>
>>>>>> Paolo
>>>>>>
>>>>>>>
>>>>>>> Am I doing something wrong? If this is a genuine problem in LARQ, I
>>>>>>> can of course make a bug report.
>>>>>>>
>>>>>>>
>>>>>>> Thanks and best regards,
>>>>>>> Osma Suominen
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
> 
> 


Re: LARQ prefix search results missing hits

Posted by Paolo Castagna <ca...@gmail.com>.
Apologies, this was a mistake.

Paolo

On 10 September 2012 23:07, Paolo Castagna <ca...@gmail.com> wrote:
> Hi Osma
>
> On 28/08/12 14:22, Osma Suominen wrote:
>> Hi Paolo!
>>
>> Thanks a lot for the fix! I have tested the latest snapshot and it now
>> works as expected. At least until I add lots of new data and hit the new
>> limit :)
>>
>>
>> You're of course right about the search use case. I think the problem
>> here is that the LARQ index can be used for two very different use cases:
>>
>> A. Traditional IR, in which the user cares about only the first few
>> results. Lucene is obviously very good at this, though full advantage
>> (especially for non-English languages) of it can only be achieved by
>> using specific Analyzer implementations, which appears not to be
>> supported in LARQ, at least not without writing some Java code.
>>
>> B. Speeding up queries on literals for e.g. autocomplete search. While
>> this can be done without a text index using FILTER(REGEX()), the queries
>> tend to be quite slow, as the filter is applied only afterwards. In this
>> case it is important that the text index returns all possible hits, not
>> just the first ones.
>>
>> I have no idea which is the more important use case for LARQ, but I'm
>> currently only interested in B because of the requirements of the
>> application I'm building (ONKI Light, a SKOS vocabulary browser for
>> SPARQL endpoints).
>
> Do you have any idea/proposal to make LARQ good for both of these
> use cases?
>
>> Currently the benefits of LARQ (at least for the out-of-the-box
>> configuration for Fuseki+LARQ) for both A and B are somewhat diminished
>> by these limitations:
>>
>> 1. The index is global and contains data from all named graphs mixed up.
>> This means that when you have many named graphs with different data (as
>> I do), and try to query only one graph, the LARQ query part will still
>> return hits from all the other graphs, slowing down later parts of the
>> query.
>
> Yep.
>
> I thought about this a while ago, but I haven't actually tried to
> implement it. The changes to the index are trivial. The most
> difficult part is perhaps on the property function side, but
> maybe that's easy as well.
>
> I think this could be a good contribution, if you need it.
>
>> 2. Similarly, the index does not allow filtering by language on the
>> query level. With multilingual data, you cannot make a query matching
>> e.g. only English labels but will get hits from all the other languages
>> as well.
>
> Yep.
>
> I have no proposal for this, but I understand the user need.
>
>> 3. The default implementation also doesn't store much context for the
>> literal, meaning that you cannot restrict the search only to e.g.
>> skos:prefLabel literal values in skos:Concept type resources. This will
>> again increase the number of hits returned by the index internally.
>
> I am not sure I follow this, or that I completely agree with you.
>
> What you say is true, but LARQ provides a property function and you
> can use it together with other triple patterns:
>
>  {
>    ?l pf:textMatch '...' .
>    ?s skos:prefLabel ?l .
>    ?s rdf:type skos:Concept .
>  }
>
> Now, we can argue about what a clever optimizer should/could do,
> but from the user's point of view, this is quite good and
> powerful, and it gets you what you want, doesn't it?
>
> The syntax is very easy to remember and the property function
> very easy to use.
>
> The Lucene index can be kept quite simple and small.
>
>>
>> There may also be problems with prefix queries if you happen to hit the
>> default BooleanQuery limit of 1024 clauses, but I haven't yet had this
>> problem myself with LARQ. Another problem for use case B might be that
>> the default Lucene StandardAnalyzer, which LARQ seems to use, filters
>> common English stop words from the index and the query, which might
>> interfere with the exact matching required for B.
>>
>> To be fair, other SPARQL text index implementations are not that good
>> for prefix searches either. Virtuoso [1] requires at least 4 character
>> prefixes to be specified (this can be changed by recompiling). AFAICT
>> the 4store text index [2] doesn't support prefix queries at all, as the
>> index structure requires whole words to be used (though possibly some
>> creative use of subqueries and FILTER(REGEX()) could be used to still
>> get some benefit of the index).
>>
>> Osma
>>
>> [1]
>> http://docs.openlinksw.com/virtuoso/sparqlextensions.html#rdfsparqlrulefulltext
>>
>> [2] http://4store.org/trac/wiki/TextIndexing
>>
>> 26.08.2012 22:49, Paolo Castagna wrote:
>>> Hi Osma
>>>
>>> On 20/08/12 11:10, Osma Suominen wrote:
>>>> Hi Paolo!
>>>>
>>>> Thanks for your quick reply.
>>>>
>>>> 17.08.2012 20:16, Paolo Castagna wrote:
>>>>> Does your problem go away without changing the code and using:
>>>>> ?lit pf:textMatch ( 'a*' 100000 )
>>>>
>>>> I tested this but it didn't help. If I use a parameter less than 1000
>>>> then I get even fewer hits, but values above 1000 don't have any effect.
>>>
>>> Right.
>>>
>>>> I think the problem is this line in IndexLARQ.java:
>>>>
>>>> TopDocs topDocs = searcher.search(query, (Filter)null,
>>>> LARQ.NUM_RESULTS ) ;
>>>>
>>>> As you can see the parameter for maximum number of hits is taken
>>>> directly from the NUM_RESULTS constant. The value specified in the query
>>>> has no effect on this level.
>>>
>>> Correct.
>>>
>>>>> It's not a problem adding a couple of '0'...
>>>>> However, I am thinking that this would just shift the problem, wouldn't it?
>>>>
>>>> You're right, it would just shift the problem but a sufficiently large
>>>> value could be used that never caused problems in practice. Maybe you
>>>> could consider NUM_RESULTS = Integer.MAX_VALUE ? :)
>>>
>>> A lot of search use cases are used to drive a UI for people, and
>>> often only the first few results are necessary.
>>>
>>> Try continuing to hit 'next >>' on Google: how many results can you get?
>>>
>>> ;-)
>>>
>>> Anyway, I increased the NUM_RESULTS constant.
>>>
>>>> Or maybe LARQ should use another variant of Lucene's
>>>> IndexSearcher.search(), one which takes a Collector object instead of
>>>> the integer n parameter. E.g. this:
>>>> http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/core/org/apache/lucene/search/IndexSearcher.html#search%28org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20org.apache.lucene.search.Collector%29
>>>>
>>>
>>> Yes. That would be the thing to use if we want to retrieve all the
>>> results from Lucene.
>>>
>>> More thinking is necessary here...
>>>
>>> In the meantime, you can find a LARQ SNAPSHOT here:
>>> https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-larq/1.0.1-SNAPSHOT/
>>>
>>>
>>> Paolo
>>>
>>>>
>>>>
>>>> Thanks,
>>>> Osma
>>>>
>>>>
>>>>> On 15/08/12 10:31, Osma Suominen wrote:
>>>>>> Hi Paolo!
>>>>>>
>>>>>> Thanks for your reply and sorry for the delay.
>>>>>>
>>>>>> I tested this again with today's svn snapshot and it's still a
>>>>>> problem.
>>>>>>
>>>>>> However, after digging a bit further I found this in
>>>>>> jena-larq/src/main/java/org/apache/jena/larq/LARQ.java:
>>>>>>
>>>>>> --clip--
>>>>>>       // The number of results returned by default
>>>>>>       public static final int NUM_RESULTS             = 1000 ; // should we increase this? -- PC
>>>>>> --clip--
>>>>>>
>>>>>> I changed NUM_RESULTS to 100000 (added two zeros), built and installed
>>>>>> my modified LARQ with mvn install (NB this required tweaking arq.ver
>>>>>> and tdb.ver in jena-larq/pom.xml to match the current svn versions),
>>>>>> rebuilt Fuseki and now the problem is gone!
>>>>>>
>>>>>> I would suggest that this constant be increased to something larger
>>>>>> than 1000. Based on the code comment, you seem to have had similar
>>>>>> thoughts sometime in the past :)
>>>>>>
>>>>>> Thanks,
>>>>>> Osma
>>>>>>
>>>>>>
>>>>>> 15.07.2012 11:21, Paolo Castagna wrote:
>>>>>>> Hi Osma,
>>>>>>> first of all, thanks for sharing your experience and clearly
>>>>>>> describing
>>>>>>> your problem.
>>>>>>> Further comments inline.
>>>>>>>
>>>>>>> On 13/07/12 14:13, Osma Suominen wrote:
>>>>>>>> Hello!
>>>>>>>>
>>>>>>>> I'm trying to use a Fuseki SPARQL endpoint together with LARQ to
>>>>>>>> create a system for accessing SKOS thesauri. The user interface
>>>>>>>> includes an autocompletion widget. The idea is to use the LARQ index
>>>>>>>> to make fast prefix queries on the concept labels.
>>>>>>>>
>>>>>>>> However, I've noticed that in some situations I get fewer results from
>>>>>>>> the index than what I'd expect. This seems to happen when the LARQ
>>>>>>>> part of the query internally produces many hits, such as when
>>>>>>>> doing a
>>>>>>>> single character prefix query (e.g. ?lit pf:textMatch 'a*').
>>>>>>>>
>>>>>>>> I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on
>>>>>>>> 2012-07-10 and
>>>>>>>> LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the
>>>>>>>> LARQ
>>>>>>>> dependency to pom.xml and running mvn package. Other than this
>>>>>>>> issue,
>>>>>>>> Fuseki and LARQ queries seem to work fine. I'm using Ubuntu Linux
>>>>>>>> 12.04 LTS amd64 with OpenJDK 1.6.0_24 installed from the standard
>>>>>>>> Ubuntu packages.
>>>>>>>>
>>>>>>>>
>>>>>>>> Steps to repeat:
>>>>>>>>
>>>>>>>> 1. package Fuseki with LARQ, as described above
>>>>>>>>
>>>>>>>> 2. start Fuseki with the attached configuration file, i.e.
>>>>>>>>       ./fuseki-server --config=larq-config.ttl
>>>>>>>>
>>>>>>>> 3. I'm using the STW thesaurus as an easily accessible example data
>>>>>>>> set (though the problem was originally found with other data sets):
>>>>>>>>       - download
>>>>>>>> http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
>>>>>>>>       - unzip so you have stw.rdf
>>>>>>>>
>>>>>>>> 4. load the thesaurus file into the endpoint:
>>>>>>>>       ./s-put http://localhost:3030/ds/data default stw.rdf
>>>>>>>>
>>>>>>>> 5. build the LARQ index, e.g. this way:
>>>>>>>>       - kill Fuseki
>>>>>>>>       - rm -r /tmp/lucene
>>>>>>>>       - start Fuseki again, so the index will be built
>>>>>>>>
>>>>>>>> 6. Make SPARQL queries from the web interface at
>>>>>>>> http://localhost:3030
>>>>>>>>
>>>>>>>> First try this SPARQL query:
>>>>>>>>
>>>>>>>> PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
>>>>>>>> PREFIX pf:<http://jena.hpl.hp.com/ARQ/property#>
>>>>>>>> SELECT DISTINCT * WHERE {
>>>>>>>>      ?lit pf:textMatch "ar*" .
>>>>>>>>      ?conc skos:prefLabel ?lit .
>>>>>>>>      FILTER(REGEX(?lit, '^ar.*', 'i'))
>>>>>>>> } ORDER BY ?lit
>>>>>>>>
>>>>>>>> I get 120 hits, including "Arab"@en.
>>>>>>>>
>>>>>>>> Now try the same query, but change the pf:textMatch argument to
>>>>>>>> "a*".
>>>>>>>> This way I get only 32 results, not including "Arab"@en, even though
>>>>>>>> the shorter prefix query should match a superset of what was matched
>>>>>>>> by the first query (the regex should still filter it down to the
>>>>>>>> same
>>>>>>>> result set).
>>>>>>>>
>>>>>>>>
>>>>>>>> This issue is not just about single character prefix queries. With
>>>>>>>> enough data sets loaded into the same index, this happens with
>>>>>>>> longer
>>>>>>>> prefix queries as well.
>>>>>>>>
>>>>>>>> I think that the problem might be related to Lucene's default
>>>>>>>> limitation of a maximum of 1024 clauses in boolean queries (and thus
>>>>>>>> prefix query matches), as described in the Lucene FAQ:
>>>>>>>> http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> Yes, I think your hypothesis might be correct (I've not verified it
>>>>>>> yet).
>>>>>>>
>>>>>>>> In case this is the problem, is there any way to tell LARQ to use a
>>>>>>>> higher BooleanQuery.setMaxClauseCount() value so that this limit is
>>>>>>>> not triggered? I find it a bit disturbing that hits are silently
>>>>>>>> being
>>>>>>>> lost. I couldn't see any special output on the Fuseki log.
>>>>>>>
>>>>>>> Not sure about this.
>>>>>>>
>>>>>>> Paolo
>>>>>>>
>>>>>>>>
>>>>>>>> Am I doing something wrong? If this is a genuine problem in LARQ, I
>>>>>>>> can of course make a bug report.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks and best regards,
>>>>>>>> Osma Suominen
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>

Re: LARQ prefix search results missing hits

Posted by Osma Suominen <os...@aalto.fi>.
Hi Paolo!

31.08.2012 21:58, Paolo Castagna wrote:

>> A. Traditional IR, in which the user cares about only the first few
>> results. Lucene is obviously very good at this, though full advantage
>> (especially for non-English languages) of it can only be achieved by
>> using specific Analyzer implementations, which appears not to be
>> supported in LARQ, at least not without writing some Java code.
>>
>> B. Speeding up queries on literals for e.g. autocomplete search. While
>> this can be done without a text index using FILTER(REGEX()), the queries
>> tend to be quite slow, as the filter is applied only afterwards. In this
>> case it is important that the text index returns all possible hits, not
>> just the first ones.
[...]
> Do you have any idea/proposal to make LARQ good for both of these
> use cases?

For A, I think LARQ is quite good already, though I note that the
current implementation is hardcoded to use Lucene's StandardAnalyzer,
which is pretty good for English text, fine for most European languages,
but maybe not that great for some other languages. Making it configurable
to support other Analyzers, such as different language stemmers, might be
useful. 4store allows a German stemmer to be used, for example [1].
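
On the Lucene side the swap itself is straightforward; a minimal
sketch (not LARQ API; it assumes Lucene 3.1 as in the API docs linked
earlier in this thread, GermanAnalyzer from the analyzers contrib
module, and the /tmp/lucene index location used above):

import java.io.File;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class GermanIndexWriter {
    // Open the on-disk index with a German stemming analyzer instead
    // of the hardcoded StandardAnalyzer.
    public static IndexWriter open() throws Exception {
        return new IndexWriter(
                FSDirectory.open(new File("/tmp/lucene")),
                new GermanAnalyzer(Version.LUCENE_31),
                IndexWriter.MaxFieldLength.UNLIMITED);
    }
}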

For B, see below.

>> 1. The index is global and contains data from all named graphs mixed up.
>> This means that when you have many named graphs with different data (as
>> I do), and try to query only one graph, the LARQ query part will still
>> return hits from all the other graphs, slowing down later parts of the
>> query.
>
> Yep.
>
> I thought about this a while ago, but I haven't actually tried to
> implement it. The changes to the index are trivial. The most
> difficult part is perhaps on the property function side, but
> maybe that's easy as well.
>
> I think this could be a good contribution, if you need it.

This would be good for my application as it would speed up queries,
sometimes by a lot, I think. But I'm not that familiar with the Jena
codebase, so I won't volunteer to implement it...

>> 2. Similarly, the index does not allow filtering by language on the
>> query level. With multilingual data, you cannot make a query matching
>> e.g. only English labels but will get hits from all the other languages
>> as well.
>
> Yep.
>
> I have no proposal for this, but I understand the user need.

I tried a single line change to LARQ.java to support querying by 
language. Patch attached.
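
The patch is attached rather than inlined; purely as a hypothetical
illustration of the idea, the indexing side could store the language
tag as an extra, untokenized field along these lines:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class LangField {
    // Hypothetical sketch, not the attached patch itself: store the
    // literal's language tag so a query can require "+a* +lang:en".
    public static void addLangField(Document doc, String langTag) {
        if (langTag != null && langTag.length() > 0)
            doc.add(new Field("lang", langTag,
                              Field.Store.YES, Field.Index.NOT_ANALYZED));
    }
}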

I tested this with the STW thesaurus dataset mentioned at the beginning
of this thread. This query against the current, unpatched LARQ searches
for all concepts whose English-language skos:prefLabel begins with "a":

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT DISTINCT * WHERE {
    ?lit pf:textMatch "a*" .
    ?conc skos:prefLabel ?lit .
    FILTER(REGEX(?lit, '^a.*', 'i') && langMatches(LANG(?lit), 'en'))
} ORDER BY ?lit

I benchmarked this query a few dozen times using apachebench. It takes 
at minimum 35 ms on my machine.

With the patch applied, I can instead use this query:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
SELECT DISTINCT * WHERE {
    ?lit pf:textMatch "+a* +lang:en" .
    ?conc skos:prefLabel ?lit .
    FILTER(REGEX(?lit, '^a.*', 'i'))
} ORDER BY ?lit

Note that I no longer need to filter the results by language, as the 
index only returns hits with the correct language tag. This query now 
takes 25 ms, so it's about 30% faster than the original. The Lucene 
index size went from 4352 kB to 4444 kB, a 2% increase.

I admit this is quite a small dataset; I haven't yet had time to test 
with larger ones.

What do you think?

A possible refinement would be to support a syntax where the language 
tag is taken from the literal in the query, e.g.
    ?lit pf:textMatch "a*"@en .
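
Implementation-wise this could be a small addition to the property 
function (just a sketch of the idea with assumed variable names, not 
working LARQ code):

    // Sketch: if the query literal carries a language tag, fold it into
    // the Lucene query string as an extra required clause.
    String lang = match.getLiteralLanguage() ;
    if ( lang != null && ! lang.equals("") )
        queryString = "+(" + queryString + ") +lang:" + lang ;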


>> 3. The default implementation also doesn't store much context for the
>> literal, meaning that you cannot restrict the search only to e.g.
>> skos:prefLabel literal values in skos:Concept type resources. This will
>> again increase the number of hits returned by the index internally.
>
> I am not sure I follow this, or that I completely agree with you.
>
> What you say is true, but LARQ provides a property function and you
> can use it together with other triple patterns:
>
>   {
>     ?l pf:textMatch '...' .
>     ?s skos:prefLabel ?l .
>     ?s rdf:type skos:Concept .
>   }
>
> Now, we can argue about what a clever optimizer should/could do,
> but from the user's point of view, this is quite good and
> powerful, and it gets you what you want. Doesn't it?
>
> The syntax is very easy to remember and the property function
> very easy to use.
>
> The Lucene index can be kept quite simple and small.

You're right here, the syntax is perfectly fine. It is only an 
optimization issue.

>> There may also be problems with prefix queries if you happen to hit the
>> default BooleanQuery limit of 1024 clauses, but I haven't yet had this
>> problem myself with LARQ. Another problem for use case B might be that
>> the default Lucene StandardAnalyzer, which LARQ seems to use, filters
>> common English stop words from the index and the query, which might
>> interfere with the exact matching required for B.
>
> Yep.
>
> Any ideas/proposals?

For the BooleanQuery issue, I would suggest adding this somewhere in the 
LARQ code:
	BooleanQuery.setMaxClauseCount(newMax)
where newMax is a sufficiently large value (could be 100000 or 
Integer.MAX_VALUE).
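
For example (the placement is just my suggestion, not existing LARQ 
code), it could be done once at initialization time:

    // Suggested placement only: raise Lucene's BooleanQuery clause
    // limit once, when LARQ initializes, so that prefix queries
    // expanding to more than 1024 terms do not fail or lose hits.
    static {
        BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE) ;
    }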

For the other issues, I think use case B would benefit a lot if there 
were a way to make the field "index" in the Lucene index use a simpler 
Analyzer such as SimpleAnalyzer or TokenAnalyzer. Or alternatively, 
perhaps the "lex" field could be processed with another analyzer. For my 
application, something like LowerCaseKeywordAnalyzer would be perfect, 
but it doesn't exist in the Lucene distribution. A quick web search 
finds many such implementations though.
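
For the record, such an analyzer is only a few lines with the Lucene 
3.x API. A sketch (untested, and the class name is my own):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.KeywordTokenizer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.util.Version;

    // Treat the whole literal as a single token and lowercase it, so
    // prefix queries match the full string, case-insensitively.
    public class LowerCaseKeywordAnalyzer extends Analyzer {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new LowerCaseFilter(Version.LUCENE_31,
                                       new KeywordTokenizer(reader));
        }
    }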

(BTW, I don't quite understand why there are both "index" and "lex" 
fields in the index; I think one field should be enough both for 
retrieving exact strings and for performing text searches using 
keywords.)

-Osma

[1] http://4store.org/trac/wiki/TextIndexing


-- 
Osma Suominen | Osma.Suominen@aalto.fi | +358 40 5255 882
Aalto University, Department of Media Technology, Semantic Computing 
Research Group
Room 2541, Otaniementie 17, Espoo, Finland; P.O. Box 15500, FI-00076 
Aalto, Finland

Re: LARQ prefix search results missing hits

Posted by Osma Suominen <os...@aalto.fi>.
Hi Paolo!

Thanks a lot for the fix! I have tested the latest snapshot and it now 
works as expected. At least until I add lots of new data and hit the new 
limit :)


You're of course right about the search use case. I think the problem 
here is that the LARQ index can be used for two very different use cases:

A. Traditional IR, in which the user cares about only the first few 
results. Lucene is obviously very good at this, though taking full 
advantage of it (especially for non-English languages) requires using 
specific Analyzer implementations, which appears not to be supported in 
LARQ, at least not without writing some Java code.

B. Speeding up queries on literals for e.g. autocomplete search. While 
this can be done without a text index using FILTER(REGEX()), the queries 
tend to be quite slow, as the filter is applied only afterwards. In this 
case it is important that the text index returns all possible hits, not 
just the first ones.

I have no idea which is the more important use case for LARQ, but I'm 
currently only interested in B because of the requirements of the 
application I'm building (ONKI Light, a SKOS vocabulary browser for 
SPARQL endpoints).


Currently, the benefits of LARQ for both A and B (at least in the 
out-of-the-box Fuseki+LARQ configuration) are somewhat diminished by 
these limitations:

1. The index is global and contains data from all named graphs mixed up. 
This means that when you have many named graphs with different data (as 
I do), and try to query only one graph, the LARQ query part will still 
return hits from all the other graphs, slowing down later parts of the 
query.

2. Similarly, the index does not allow filtering by language on the 
query level. With multilingual data, you cannot make a query matching 
e.g. only English labels but will get hits from all the other languages 
as well.

3. The default implementation also doesn't store much context for the 
literal, meaning that you cannot restrict the search only to e.g. 
skos:prefLabel literal values in skos:Concept type resources. This will 
again increase the number of hits returned by the index internally.

There may also be problems with prefix queries if you happen to hit the 
default BooleanQuery limit of 1024 clauses, but I haven't yet had this 
problem myself with LARQ. Another problem for use case B might be that 
the default Lucene StandardAnalyzer, which LARQ seems to use, filters 
common English stop words from the index and the query, which might 
interfere with the exact matching required for B.

To be fair, other SPARQL text index implementations are not that good 
for prefix searches either. Virtuoso [1] requires prefixes of at least 
4 characters (this can be changed by recompiling). AFAICT the 4store 
text index [2] doesn't support prefix queries at all, as the index 
structure requires whole words to be used (though possibly some 
creative use of subqueries and FILTER(REGEX()) could still extract some 
benefit from the index).

Osma

[1] 
http://docs.openlinksw.com/virtuoso/sparqlextensions.html#rdfsparqlrulefulltext
[2] http://4store.org/trac/wiki/TextIndexing

26.08.2012 22:49, Paolo Castagna wrote:
> Hi Osma
>
> On 20/08/12 11:10, Osma Suominen wrote:
>> Hi Paolo!
>>
>> Thanks for your quick reply.
>>
>> 17.08.2012 20:16, Paolo Castagna wrote:
>>> Does your problem go away without changing the code and using:
>>> ?lit pf:textMatch ( 'a*' 100000 )
>>
>> I tested this but it didn't help. If I use a parameter less than 1000
>> then I get even fewer hits, but values above 1000 don't have any effect.
>
> Right.
>
>> I think the problem is this line in IndexLARQ.java:
>>
>> TopDocs topDocs = searcher.search(query, (Filter)null, LARQ.NUM_RESULTS ) ;
>>
>> As you can see, the parameter for the maximum number of hits is taken
>> directly from the NUM_RESULTS constant. The value specified in the query
>> has no effect at this level.
>
> Correct.
>
>>> It's not a problem adding a couple of '0'...
> However, I am thinking that this would just shift the problem, wouldn't it?
>>
>> You're right, it would just shift the problem but a sufficiently large
>> value could be used that never caused problems in practice. Maybe you
>> could consider NUM_RESULTS = Integer.MAX_VALUE ? :)
>
> A lot of search use cases are used to drive a UI for people, and
> often only the first few results are necessary.
>
> Try continuing to hit 'next >>' on Google; how many results can you get?
>
> ;-)
>
> Anyway, I increased the NUM_RESULTS constant.
>
>> Or maybe LARQ should use another variant of Lucene's
>> IndexSearcher.search(), one which takes a Collector object instead of
>> the integer n parameter. E.g. this:
>> http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/core/org/apache/lucene/search/IndexSearcher.html#search%28org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20org.apache.lucene.search.Collector%29
>
> Yes. That would be the thing to use if we want to retrieve all the
> results from Lucene.
>
> More thinking is necessary here...
>
> In the meantime, you can find a LARQ SNAPSHOT here:
> https://repository.apache.org/content/groups/snapshots/org/apache/jena/jena-larq/1.0.1-SNAPSHOT/
>
> Paolo
>
>>
>>
>> Thanks,
>> Osma
>>
>>
>>> On 15/08/12 10:31, Osma Suominen wrote:
>>>> Hi Paolo!
>>>>
>>>> Thanks for your reply and sorry for the delay.
>>>>
>>>> I tested this again with today's svn snapshot and it's still a problem.
>>>>
>>>> However, after digging a bit further I found this in
>>>> jena-larq/src/main/java/org/apache/jena/larq/LARQ.java:
>>>>
>>>> --clip--
>>>>       // The number of results returned by default
>>>>       public static final int NUM_RESULTS             = 1000 ; // should
>>>> we increase this? -- PC
>>>> --clip--
>>>>
>>>> I changed NUM_RESULTS to 100000 (added two zeros), built and installed
>>>> my modified LARQ with mvn install (NB this required tweaking arq.ver
>>>> and tdb.ver in jena-larq/pom.xml to match the current svn versions),
>>>> rebuilt Fuseki and now the problem is gone!
>>>>
>>>> I would suggest that this constant be increased to something larger
>>>> than 1000. Based on the code comment, you seem to have had similar
>>>> thoughts sometime in the past :)
>>>>
>>>> Thanks,
>>>> Osma
>>>>
>>>>
>>>> 15.07.2012 11:21, Paolo Castagna wrote:
>>>>> Hi Osma,
>>>>> first of all, thanks for sharing your experience and clearly describing
>>>>> your problem.
>>>>> Further comments inline.
>>>>>
>>>>> On 13/07/12 14:13, Osma Suominen wrote:
>>>>>> Hello!
>>>>>>
>>>>>> I'm trying to use a Fuseki SPARQL endpoint together with LARQ to
>>>>>> create a system for accessing SKOS thesauri. The user interface
>>>>>> includes an autocompletion widget. The idea is to use the LARQ index
>>>>>> to make fast prefix queries on the concept labels.
>>>>>>
>>>>>> However, I've noticed that in some situations I get fewer results from
>>>>>> the index than I'd expect. This seems to happen when the LARQ
>>>>>> part of the query internally produces many hits, such as when doing a
>>>>>> single character prefix query (e.g. ?lit pf:textMatch 'a*').
>>>>>>
>>>>>> I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on 2012-07-10 and
>>>>>> LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the LARQ
>>>>>> dependency to pom.xml and running mvn package. Other than this issue,
>>>>>> Fuseki and LARQ queries seem to work fine. I'm using Ubuntu Linux
>>>>>> 12.04 LTS amd64 with OpenJDK 1.6.0_24 installed from the standard
>>>>>> Ubuntu packages.
>>>>>>
>>>>>>
>>>>>> Steps to repeat:
>>>>>>
>>>>>> 1. package Fuseki with LARQ, as described above
>>>>>>
>>>>>> 2. start Fuseki with the attached configuration file, i.e.
>>>>>>       ./fuseki-server --config=larq-config.ttl
>>>>>>
>>>>>> 3. I'm using the STW thesaurus as an easily accessible example data
>>>>>> set (though the problem was originally found with other data sets):
>>>>>>       - download http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
>>>>>>       - unzip so you have stw.rdf
>>>>>>
>>>>>> 4. load the thesaurus file into the endpoint:
>>>>>>       ./s-put http://localhost:3030/ds/data default stw.rdf
>>>>>>
>>>>>> 5. build the LARQ index, e.g. this way:
>>>>>>       - kill Fuseki
>>>>>>       - rm -r /tmp/lucene
>>>>>>       - start Fuseki again, so the index will be built
>>>>>>
>>>>>> 6. Make SPARQL queries from the web interface at http://localhost:3030
>>>>>>
>>>>>> First try this SPARQL query:
>>>>>>
>>>>>> PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
>>>>>> PREFIX pf:<http://jena.hpl.hp.com/ARQ/property#>
>>>>>> SELECT DISTINCT * WHERE {
>>>>>>      ?lit pf:textMatch "ar*" .
>>>>>>      ?conc skos:prefLabel ?lit .
>>>>>>      FILTER(REGEX(?lit, '^ar.*', 'i'))
>>>>>> } ORDER BY ?lit
>>>>>>
>>>>>> I get 120 hits, including "Arab"@en.
>>>>>>
>>>>>> Now try the same query, but change the pf:textMatch argument to "a*".
>>>>>> This way I get only 32 results, not including "Arab"@en, even though
>>>>>> the shorter prefix query should match a superset of what was matched
>>>>>> by the first query (the regex should still filter it down to the same
>>>>>> result set).
>>>>>>
>>>>>>
>>>>>> This issue is not just about single character prefix queries. With
>>>>>> enough data sets loaded into the same index, this happens with longer
>>>>>> prefix queries as well.
>>>>>>
>>>>>> I think that the problem might be related to Lucene's default
>>>>>> limitation of a maximum of 1024 clauses in boolean queries (and thus
>>>>>> prefix query matches), as described in the Lucene FAQ:
>>>>>> http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> Yes, I think your hypothesis might be correct (I've not verified it
>>>>> yet).
>>>>>
>>>>>> In case this is the problem, is there any way to tell LARQ to use a
>>>>>> higher BooleanQuery.setMaxClauseCount() value so that this limit is
>>>>>> not triggered? I find it a bit disturbing that hits are silently being
>>>>>> lost. I couldn't see any special output on the Fuseki log.
>>>>>
>>>>> Not sure about this.
>>>>>
>>>>> Paolo
>>>>>
>>>>>>
>>>>>> Am I doing something wrong? If this is a genuine problem in LARQ, I
>>>>>> can of course make a bug report.
>>>>>>
>>>>>>
>>>>>> Thanks and best regards,
>>>>>> Osma Suominen
>>>>>>


-- 
Osma Suominen | Osma.Suominen@aalto.fi | +358 40 5255 882
Aalto University, Department of Media Technology, Semantic Computing 
Research Group
Room 2541, Otaniementie 17, Espoo, Finland; P.O. Box 15500, FI-00076 
Aalto, Finland

Re: LARQ prefix search results missing hits

Posted by Osma Suominen <os...@aalto.fi>.
Hi Paolo!

Thanks for your quick reply.

17.08.2012 20:16, Paolo Castagna wrote:
> Does your problem go away without changing the code and using:
> ?lit pf:textMatch ( 'a*' 100000 )

I tested this but it didn't help. If I use a parameter less than 1000 
then I get even fewer hits, but values above 1000 don't have any effect.

I think the problem is this line in IndexLARQ.java:

TopDocs topDocs = searcher.search(query, (Filter)null, LARQ.NUM_RESULTS ) ;

As you can see, the parameter for the maximum number of hits is taken 
directly from the NUM_RESULTS constant. The value specified in the query 
has no effect at this level.
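
A one-line fix might be to honour the per-query limit at this point. A 
sketch (assuming the limit requested in the query is available here in 
a variable, which I'll call limit):

    // Sketch only: fetch at least as many hits from Lucene as the query
    // asked for, instead of always capping at LARQ.NUM_RESULTS; any
    // query-level limit can still be applied downstream.
    TopDocs topDocs = searcher.search(query, (Filter)null,
                                      Math.max(limit, LARQ.NUM_RESULTS)) ;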

> It's not a problem adding a couple of '0'...
> However, I am thinking that this would just shift the problem, wouldn't it?

You're right, it would just shift the problem but a sufficiently large 
value could be used that never caused problems in practice. Maybe you 
could consider NUM_RESULTS = Integer.MAX_VALUE ? :)

Or maybe LARQ should use another variant of Lucene's 
IndexSearcher.search(), one which takes a Collector object instead of 
the integer n parameter. E.g. this:
http://lucene.apache.org/core/old_versioned_docs/versions/3_1_0/api/core/org/apache/lucene/search/IndexSearcher.html#search%28org.apache.lucene.search.Query,%20org.apache.lucene.search.Filter,%20org.apache.lucene.search.Collector%29
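
For the record, a minimal sketch of what that could look like with the 
Lucene 3.x Collector API, collecting every matching document id with no 
cap and no scoring (a fragment: searcher and query are as in the line 
quoted above, plus the obvious java.util and org.apache.lucene imports):

    // A Collector that simply gathers all matching Lucene doc ids.
    final List<Integer> docIds = new ArrayList<Integer>() ;
    searcher.search(query, (Filter)null, new Collector() {
        private int docBase = 0 ;
        @Override public void setScorer(Scorer scorer) { }  // scores unused
        @Override public void collect(int doc) { docIds.add(docBase + doc) ; }
        @Override public void setNextReader(IndexReader reader, int base)
            { this.docBase = base ; }
        @Override public boolean acceptsDocsOutOfOrder() { return true ; }
    }) ;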

Thanks,
Osma


> On 15/08/12 10:31, Osma Suominen wrote:
>> Hi Paolo!
>>
>> Thanks for your reply and sorry for the delay.
>>
>> I tested this again with today's svn snapshot and it's still a problem.
>>
>> However, after digging a bit further I found this in
>> jena-larq/src/main/java/org/apache/jena/larq/LARQ.java:
>>
>> --clip--
>>      // The number of results returned by default
>>      public static final int NUM_RESULTS             = 1000 ; // should
>> we increase this? -- PC
>> --clip--
>>
>> I changed NUM_RESULTS to 100000 (added two zeros), built and installed
>> my modified LARQ with mvn install (NB this required tweaking arq.ver
>> and tdb.ver in jena-larq/pom.xml to match the current svn versions),
>> rebuilt Fuseki and now the problem is gone!
>>
>> I would suggest that this constant be increased to something larger
>> than 1000. Based on the code comment, you seem to have had similar
>> thoughts sometime in the past :)
>>
>> Thanks,
>> Osma
>>
>>
>> 15.07.2012 11:21, Paolo Castagna kirjoitti:
>>> Hi Osma,
>>> first of all, thanks for sharing your experience and clearly describing
>>> your problem.
>>> Further comments inline.
>>>
>>> On 13/07/12 14:13, Osma Suominen wrote:
>>>> Hello!
>>>>
>>>> I'm trying to use a Fuseki SPARQL endpoint together with LARQ to
>>>> create a system for accessing SKOS thesauri. The user interface
>>>> includes an autocompletion widget. The idea is to use the LARQ index
>>>> to make fast prefix queries on the concept labels.
>>>>
>>>> However, I've noticed that in some situations I get fewer results from
>>>> the index than I'd expect. This seems to happen when the LARQ
>>>> part of the query internally produces many hits, such as when doing a
>>>> single character prefix query (e.g. ?lit pf:textMatch 'a*').
>>>>
>>>> I'm using Fuseki 0.2.4-SNAPSHOT taken from SVN trunk on 2012-07-10 and
>>>> LARQ 1.0.0-incubating. I compiled Fuseki with LARQ by adding the LARQ
>>>> dependency to pom.xml and running mvn package. Other than this issue,
>>>> Fuseki and LARQ queries seem to work fine. I'm using Ubuntu Linux
>>>> 12.04 LTS amd64 with OpenJDK 1.6.0_24 installed from the standard
>>>> Ubuntu packages.
>>>>
>>>>
>>>> Steps to repeat:
>>>>
>>>> 1. package Fuseki with LARQ, as described above
>>>>
>>>> 2. start Fuseki with the attached configuration file, i.e.
>>>>      ./fuseki-server --config=larq-config.ttl
>>>>
>>>> 3. I'm using the STW thesaurus as an easily accessible example data
>>>> set (though the problem was originally found with other data sets):
>>>>      - download http://zbw.eu/stw/versions/latest/download/stw.rdf.zip
>>>>      - unzip so you have stw.rdf
>>>>
>>>> 4. load the thesaurus file into the endpoint:
>>>>      ./s-put http://localhost:3030/ds/data default stw.rdf
>>>>
>>>> 5. build the LARQ index, e.g. this way:
>>>>      - kill Fuseki
>>>>      - rm -r /tmp/lucene
>>>>      - start Fuseki again, so the index will be built
>>>>
>>>> 6. Make SPARQL queries from the web interface at http://localhost:3030
>>>>
>>>> First try this SPARQL query:
>>>>
>>>> PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
>>>> PREFIX pf:<http://jena.hpl.hp.com/ARQ/property#>
>>>> SELECT DISTINCT * WHERE {
>>>>     ?lit pf:textMatch "ar*" .
>>>>     ?conc skos:prefLabel ?lit .
>>>>     FILTER(REGEX(?lit, '^ar.*', 'i'))
>>>> } ORDER BY ?lit
>>>>
>>>> I get 120 hits, including "Arab"@en.
>>>>
>>>> Now try the same query, but change the pf:textMatch argument to "a*".
>>>> This way I get only 32 results, not including "Arab"@en, even though
>>>> the shorter prefix query should match a superset of what was matched
>>>> by the first query (the regex should still filter it down to the same
>>>> result set).
>>>>
>>>>
>>>> This issue is not just about single character prefix queries. With
>>>> enough data sets loaded into the same index, this happens with longer
>>>> prefix queries as well.
>>>>
>>>> I think that the problem might be related to Lucene's default
>>>> limitation of a maximum of 1024 clauses in boolean queries (and thus
>>>> prefix query matches), as described in the Lucene FAQ:
>>>> http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_a_TooManyClauses_exception.3F
>>>>
>>>>
>>>
>>> Yes, I think your hypothesis might be correct (I've not verified it
>>> yet).
>>>
>>>> In case this is the problem, is there any way to tell LARQ to use a
>>>> higher BooleanQuery.setMaxClauseCount() value so that this limit is
>>>> not triggered? I find it a bit disturbing that hits are silently being
>>>> lost. I couldn't see any special output on the Fuseki log.
>>>
>>> Not sure about this.
>>>
>>> Paolo
>>>
>>>>
>>>> Am I doing something wrong? If this is a genuine problem in LARQ, I
>>>> can of course make a bug report.
>>>>
>>>>
>>>> Thanks and best regards,
>>>> Osma Suominen
>>>>


-- 
Osma Suominen | Osma.Suominen@aalto.fi | +358 40 5255 882
Aalto University, Department of Media Technology, Semantic Computing 
Research Group
Room 2541, Otaniementie 17, Espoo, Finland; P.O. Box 15500, FI-00076 
Aalto, Finland