Posted to users@jena.apache.org by Vincent Ventresque <vi...@ens-lyon.fr> on 2018/07/19 12:07:03 UTC

fuseki text:query : strange results + Lucene configuration

Hello,

I've just subscribed to the users@jena.apache.org list, and I apologize 
if this mail is not sent properly.

I'm trying to use Fuseki text:query and have run into several
issues. Here are my questions:

1) Does text:query require a minimum number of characters to be efficient?

2) Is performance linked to the number of fields indexed?

3) In order to retrieve strings containing hyphens, should I use 
KeywordTokenizer in config file?

~~~ 1) Does text:query require a minimum number of characters to be efficient? ~~~

I've noticed that a query on indexed predicates (foaf:familyName and
foaf:givenName) returns more results when there are more characters in
the search string:

SELECT * WHERE {

?uriBnF text:query ( foaf:familyName "roussea*" ) .

?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .

?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom .

optional {?uriBnF bio:birth ?dateNaissance }

}

I was expecting that "Rousseau" + "Jean-Jacques" would be in the results.

=> if  $MY_STRING = "j*", I get 0 results

=> if  $MY_STRING = "je*", I get 17 results, including "Jean-Claude" &
"Jean-Baptiste" BUT not "Jean-Jacques"

=> if  $MY_STRING = "jea*", I get 27 results, including "Jean-Jacques"

I don't know anything about Lucene, but this looks very strange to me: I
expected the opposite (fewer letters = longer result list).


~~~ 2) Is performance linked to the number of fields indexed? ~~~

If I change the configuration and index only foaf:givenName, and provide 
a constant for foaf:familyName, the query returns more results :

SELECT * WHERE {

?uriBnF foaf:familyName "Rousseau" .

?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .

?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom .

optional {?uriBnF bio:birth ?dateNaissance }

}

=> if  $MY_STRING = "j*", I get 7 results, whereas the first query
returned 0 results.


~~~ 3) In order to retrieve strings containing hyphens, should I use KeywordTokenizer in config file? ~~~

With the same query, if $MY_STRING = "jean-ja*" :

a) with simple configuration (cf. below), I get 0 results

b) with KeywordTokenizer config (cf. below), I get "Jean-Jacques"

Is this the right way to get "Jean-Jacques"?
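
A rough way to picture why the two configurations behave differently, as a
toy Python model rather than Lucene itself. It assumes that the simple
configuration falls back to Lucene's StandardAnalyzer, which lowercases the
value and splits "Jean-Jacques" at the hyphen, and that a wildcard pattern is
compared with each indexed token without being tokenized itself. The
functions standard_like_tokens, keyword_tokens and prefix_matches are made up
for this illustration.

```python
import re


def standard_like_tokens(value):
    # Crude stand-in for a StandardAnalyzer-style index:
    # lowercase the value and split it on anything that is not a word character.
    return re.findall(r"\w+", value.lower())


def keyword_tokens(value):
    # Stand-in for KeywordTokenizer + LowerCaseFilter: the whole value is one token.
    # (The ASCII folding of the real configuration is left out of this toy model.)
    return [value.lower()]


def prefix_matches(pattern, tokens):
    # A pattern such as "jean-ja*" is treated as a plain prefix test on each token.
    prefix = pattern.rstrip("*").lower()
    return any(token.startswith(prefix) for token in tokens)


name = "Jean-Jacques"
for pattern in ("jea*", "jean-ja*"):
    print(pattern,
          "| standard-like:", prefix_matches(pattern, standard_like_tokens(name)),
          "| keyword:", prefix_matches(pattern, keyword_tokens(name)))
```

In this model "jea*" matches under both indexes, while "jean-ja*" only
matches when the whole value is kept as a single token, because no token of
the split index starts with "jean-ja".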


Thanks in advance

VV



=============== SIMPLE CONFIGURATION ===================

@prefix :        <#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:    <http://jena.apache.org/text#> .
@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .



[] rdf:type fuseki:Server ;
    .


## Initialize TDB --------------------------------

[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

## Initialize text query -------------------------------------
[] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
# A TextDataset is a regular dataset with a text index.
text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
# Lucene index
text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .

## ---------------------------------------------------------------
## This URI must be fixed - it's used to assemble the text dataset.

:text_dataset rdf:type     text:TextDataset ;
#    text:dataset   <#dataset> ;
     text:dataset :tdb_dataset_readwrite ;
#    text:index     <#indexLucene> ;
     text:index :My_Lucene_index ;
     .

# A TDB dataset used for RDF storage ------------------------------
:tdb_dataset_readwrite
         a             tdb:DatasetTDB ;
         tdb:location  "$_BnF_text" ;
.

# Text index description ------------------------------------------
#<#indexLucene> a text:TextIndexLucene ;
:My_Lucene_index a text:TextIndexLucene ;
     text:directory <file:$_Lucene> ;
     text:entityMap <#entMap> ;
     .

# Mapping in the index ---------------------------------------------
# URI stored in field "uri"
<#entMap> a text:EntityMap ;
     text:entityField      "uri" ;
     text:defaultField     "familyName" ;
     text:map (
          [ text:field "familyName" ; text:predicate foaf:familyName ]
          [ text:field "givenName" ; text:predicate foaf:givenName ]
          ) .

:service_tdb_all  a                   fuseki:Service ;
         rdfs:label                    "TDB BnF_text" ;
         fuseki:dataset               :text_dataset ;
         fuseki:name                   "BnF_text" ;
         fuseki:serviceQuery           "query" , "sparql" ;
         fuseki:serviceReadGraphStore  "get" ;
         fuseki:serviceReadWriteGraphStore "data" ;
         fuseki:serviceUpdate          "update" ;
         fuseki:serviceUpload          "upload" .


=========== KEYWORD TOKENIZER CONFIGURATION ================

@prefix :        <#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
@prefix text:    <http://jena.apache.org/text#> .
@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .



[] rdf:type fuseki:Server ;

    .


## Initialize TDB --------------------------------

[] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
tdb:GraphTDB    rdfs:subClassOf  ja:Model .

## Initialize text query -------------------------------------
[] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
# A TextDataset is a regular dataset with a text index.
text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
# Lucene index
text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .

## ---------------------------------------------------------------


:text_dataset rdf:type     text:TextDataset ;
#    text:dataset   <#dataset> ;
     text:dataset :tdb_dataset_readwrite ;
#    text:index     <#indexLucene> ;
     text:index :My_Lucene_index ;
     .

# A TDB dataset used for RDF storage ------------------------------
:tdb_dataset_readwrite
         a             tdb:DatasetTDB ;
         tdb:location  "$_BnF_text" ;
.

# Text index description ------------------------------------------
#<#indexLucene> a text:TextIndexLucene ;
:My_Lucene_index a text:TextIndexLucene ;
     text:directory <file:$_Lucene> ;
     text:entityMap <#entMap> ;
     .

# Mapping in the index ---------------------------------------------
# URI stored in field "uri"
<#entMap> a text:EntityMap ;
     text:entityField      "uri" ;
     text:defaultField     "givenName" ;
     text:map (

          [ text:field "familyName" ; text:predicate foaf:familyName ;
          text:analyzer [ a text:ConfigurableAnalyzer ;
                text:tokenizer text:KeywordTokenizer ;
                text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
              ] ]
          [ text:field "givenName" ; text:predicate foaf:givenName ;
         text:analyzer [ a text:ConfigurableAnalyzer ;
          text:tokenizer text:KeywordTokenizer ;
          text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
          ] ]
          ) .

:service_tdb_all  a                   fuseki:Service ;
         rdfs:label                    "TDB BnF_text" ;
         fuseki:dataset               :text_dataset ; ### works for text index
         fuseki:name                   "BnF_text" ;
         fuseki:serviceQuery           "query" , "sparql" ;
         fuseki:serviceReadGraphStore  "get" ;
         fuseki:serviceReadWriteGraphStore "data" ;
         fuseki:serviceUpdate          "update" ;
         fuseki:serviceUpload          "upload" .

Re: fuseki text:query : strange results + Lucene configuration

Posted by Vincent Ventresque <vi...@ens-lyon.fr>.
 > Just to be sure, you can try to execute some very generic queries
 > (e.g. "*a*") and count the results.

Thanks, I'll do that when I have a moment

 > The downside of using a high limit (and the reason the default is
 > "only" 10000) is that jena-text/Lucene allocates an array of that size
 > to hold the results before actually executing the query against the
 > index. [...] All these resources (CPU time and memory) are wasted if the
 > index then returns only a small number of results.

That's very important, thanks a lot. I think we're going to offer 2
options, the default being a combination like

?auth text:query ( foaf:familyName "roussea*" ) .
?edition dcterms:contributor ?auth ;
   text:query ( dcterms:title '*rêverie*' ) .

and a more "fuzzy" one to broaden the search, with text:query on
givenName. Using named graphs, the default option gives good results
(between 0.2 and 1.5 s depending on the retrieved editions).

Vincent

  

On 12/09/2018 at 15:23, Osma Suominen wrote:
> Hi Vincent!
>
> Vincent Ventresque wrote on 12.09.2018 at 15:53:
>> What do you think about this solution :
>>
>> ?uriBnF text:query ( foaf:givenName "*J*" 2000000 ) . ?uriBnF 
>> text:query ( foaf:familyName "roussea*" ) . ?uriBnF foaf:familyName 
>> ?nom .  ?uriBnF foaf:givenName ?prenom
>>
>> It returns all the expected results and takes only 1.7 second (with 
>> default configuration, RAM 2Gb).
>
> Sounds good to me!
>
>> Knowing I have 1.71 M givenName values, it's reasonable to expect all
>> the results with a limit = 2 000 000, isn't it? It is, I think, the most
>> important question: am I sure to get all the results if I use a
>> limit > total properties indexed?
>
> Yes, I think this is the case. If you have 1.71M triples with the 
> givenName property, the text index should never return more than 1.71M 
> results on a givenName property. So a limit of 2M should be enough in 
> your case.
>
> Just to be sure, you can try to execute some very generic queries 
> (e.g. "*a*") and count the results.
>
> The downside of using a high limit (and the reason the default is 
> "only" 10000) is that jena-text/Lucene allocates an array of that size 
> to hold the results before actually executing the query against the 
> index. With a large limit value such as 2M, that takes some time - 
> probably most of the 1.7 seconds. You can experiment with how the 
> query execution time changes if you change only the limit value. Also 
> the array will need some memory, maybe in the range of tens or perhaps 
> even hundreds of MB for a limit of 2M. All these resources (CPU time 
> and memory) are wasted if the index then returns only a small number 
> of results. The memory will of course be freed soon after the query by 
> the garbage collector.
>
>> N.B. : I like the idea of using only text:query because it's case 
>> insensitive AND allows fuzzy queries. It's particularly important for 
>> our use case (we want to find author + edition with incomplete 
>> information, such as "1 word in title + 1 word in familyName + 
>> givenName initial + one of these words is not fully legible"). But 
>> you're right, a combination of text:query + regex or contains is very 
>> fast (see example below).
> Great that you tried this approach as well and it is fast.
>
> -Osma
>

-----------------------------------

Hi Osma,


Thanks again, it's very helpful.

 > Either you get less results than expected or the query will take a
 > long time, or both

What do you think about this solution:

?uriBnF text:query ( foaf:givenName "*J*" 2000000 ) .
?uriBnF text:query ( foaf:familyName "roussea*" ) .
?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom

It returns all the expected results and takes only 1.7 seconds (with
default configuration, 2 GB of RAM).

Knowing I have 1.71 M givenName values, it's reasonable to expect all the
results with a limit = 2 000 000, isn't it? It is, I think, the most
important question: am I sure to get all the results if I use a limit >
total properties indexed?
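
A toy Python model of why the limit matters, under stated assumptions: each
text:query is answered on its own, only the first `limit` hits of the index
are kept, and the two hit lists are then joined on the subject URI outside
the index. The data, the hit order (plain list order here, whereas Lucene
orders hits by score) and the helper names text_hits, join_on_subject and
rousseaus_with_j are invented for the illustration; this is not jena-text
code.

```python
def text_hits(index, predicate, limit):
    """Return at most `limit` subjects whose indexed value satisfies `predicate`."""
    hits = [subject for subject, value in index if predicate(value)]
    return hits[:limit]


def join_on_subject(left, right):
    """Keep the subjects present in both hit lists (the join done outside the index)."""
    right_set = set(right)
    return [subject for subject in left if subject in right_set]


# Hypothetical data: 1000 people, every tenth one is called Rousseau,
# and every given name contains a "j".
family = [(f"p{i}", "rousseau" if i % 10 == 0 else "martin") for i in range(1000)]
given = [(f"p{i}", "jean") for i in range(1000)]


def rousseaus_with_j(given_limit):
    given_hits = text_hits(given, lambda v: "j" in v, given_limit)
    family_hits = text_hits(family, lambda v: v.startswith("roussea"), 10_000)
    return join_on_subject(given_hits, family_hits)


print(len(rousseaus_with_j(50)))    # prints 5: most Rousseaus were cut off before the join
print(len(rousseaus_with_j(1000)))  # prints 100: the limit covers every indexed given name
```

In the model, once the limit is at least the number of indexed givenName
values, truncation can no longer drop a hit, which is the reasoning behind
choosing 2 000 000 for 1.71 M values. It also shows why a narrower pattern
can return more joined results: it produces fewer hits, so less is cut off.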

N.B. : I like the idea of using only text:query because it's case 
insensitive AND allows fuzzy queries. It's particularly important for 
our use case (we want to find author + edition with incomplete 
information, such as "1 word in title + 1 word in familyName + givenName 
initial + one of these words is not fully legible"). But you're right, a 
combination of text:query + regex or contains is very fast (see example 
below).

Vincent


---------------------------------

?uriBnF text:query ( foaf:familyName "roussea*" ) ;
   foaf:givenName ?prenom
   filter(contains(?prenom, "J"))           # case sensitive
   ?uriBnF foaf:familyName ?nom  .

=> 37ms for 130 entries

-----

?uriBnF text:query ( foaf:familyName "roussea*" ) ;
   foaf:givenName ?prenom
   filter(regex(?prenom, "j", "i"))          # case insensitive
   ?uriBnF foaf:familyName ?nom  .

=> 55ms for 133 entries
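
The small difference between the two counts (130 vs 133) is what one would
expect if a few given names contain only a lower-case "j". A short Python
sketch of the two filters, with invented names: SPARQL's contains() and
regex(..., "i") are only mimicked here by Python's `in` operator and
re.search with re.IGNORECASE.

```python
import re

# Invented sample values; not taken from the BnF data.
given_names = ["Jean-Jacques", "Marie-josephe", "Pierre", "benjamin", "Jules"]

# Mimics filter(contains(?prenom, "J")): case-sensitive substring test.
contains_J = [name for name in given_names if "J" in name]

# Mimics filter(regex(?prenom, "j", "i")): case-insensitive match anywhere.
regex_j_i = [name for name in given_names if re.search("j", name, re.IGNORECASE)]

print(contains_J)
print(regex_j_i)
```

Here the case-sensitive test keeps 2 of the 5 names and the case-insensitive
one keeps 4, the extra ones being those written with a lower-case "j".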






On 12/09/2018 at 14:12, Osma Suominen wrote:
> Hi Vincent!
>
> Jena-text with the Lucene backend indexes each triple as a separate 
> Lucene document. This means that you cannot combine givenName and 
> familyName in the same query - from the Lucene perspective, the 
> givenName appears in one document where familyName appears in another 
> document, and querying for both (using AND) will just give you an 
> empty result. So what you are doing is the correct way. The problem is 
> just that some of the query patterns, such as "*J*", will return a 
> very large number of results. This pushes the limits of jena-text as 
> you've discovered. Either you get less results than expected or the 
> query will take a long time, or both.
>
> It might make sense to use only one text:query for the more 
> restrictive part (familyName "roussea*" in this case) and then use a 
> FILTER with some string matching (STRSTARTS or CONTAINS or REGEX) to 
> further limit the results.
>
> The Elasticsearch backend of jena-text is different though. It will 
> combine different indexed properties of the same subject within the 
> same Elasticsearch/Lucene document. So an AND query with both 
> givenName and familyName is possible when using that backend.
>
> -Osma
>
> Vincent Ventresque wrote on 12.09.2018 at 15:06:
>> Hello Rob
>>
>>
>> Thank you for all these elements.
>>
>>  > there is a limit on the results returned from each text search so 
>> when these are *separately executed and joined together* you may only 
>> get a subset of the full results
>>
>> Could you please explain what would be a 'non-separate' query? Do you 
>> mean :
>>
>> ?s text:query ( "givenName:\"*J*\" AND familyName:\"Roussea\"" ) ?
>>
>> I made 2 separate triples (1st = givenName + 2nd = familyName) 
>> because I had read that "when a query is to involve two or more 
>> properties then it expressed at the SPARQL level, as it were, versus 
>> in Lucene's query language" 
>> (https://jena.apache.org/documentation/query/text-query.html#queries-across-multiple-fields). 
>>
>>
>> Vincent
>>
>>
>>
>>
>> On 12/09/2018 at 11:52, Rob Vesse wrote:
>>> Well the order of triple patterns shouldn't matter too much when you 
>>> have a pure BGP (albeit the optimiser might pick a bad order in some 
>>> cases)
>>>
>>> But we aren't talking about pure BGPs here, having the text:query 
>>> triples results in the BGP being broken up into joins of several 
>>> property functions with the regular triple patterns interspersed 
>>> through those.  So if we take your query and run it through Jena's 
>>> algebra compiler (you can do this online at 
>>> http://sparql.org/validate/query) we get the following:
>>>
>>>    1 (base <http://example/base/>
>>>    2   (prefix ((rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>)
>>>    3            (owl: <http://www.w3.org/2002/07/owl#>)
>>>    4            (apf: <http://jena.hpl.hp.com/ARQ/property#>)
>>>    5            (xsd: <http://www.w3.org/2001/XMLSchema#>)
>>>    6            (fn: <http://www.w3.org/2005/xpath-functions#>)
>>>    7            (rdfs: <http://www.w3.org/2000/01/rdf-schema#>)
>>>    8            (text: <http://jena.apache.org/text#>)
>>>    9            (foaf: <http://xmlns.com/foaf/0.1/>)
>>>   10            (dc: <http://purl.org/dc/elements/1.1/>))
>>>   11     (sequence
>>>   12       (propfunc text:query
>>>   13         ?uriBnF (foaf:givenName "$MY_STRING")
>>>   14         (propfunc text:query
>>>   15           ?uriBnF (foaf:familyName "roussea*")
>>>   16           (table unit)))
>>>   17       (bgp
>>>   18         (triple ?uriBnF foaf:familyName ?nom)
>>>   19         (triple ?uriBnF foaf:givenName ?prenom)
>>>   20       ))))
>>>
>>> So first it's doing the text search on your parameter (lines 12-13),
>>> then joining that to the text search on your surname (lines 14-15) via
>>> substituting bindings from your first text search and then finally
>>> joining that with the plain BGP (lines 17-19).
>>>
>>> So in this case the ordering of your property functions in the query 
>>> is going to make a difference to the evaluation. As I think Osma 
>>> already pointed out there is a limit on the results returned from 
>>> each text search so when these are separately executed and joined 
>>> together you may only get a subset of the full results that your 
>>> text index holds.
>>>
>>> Rob
>>>
>>> On 12/09/2018, 09:55, "Vincent Ventresque" 
>>> <vi...@ens-lyon.fr> wrote:
>>>
>>>      Hi Lorenz,
>>>      Thanks for your reply.
>>>      > for me it sounds more like you've found a bug
>>>      I'm not able to tell, just beginning to use Fuseki + Lucene.
>>>      > I'm just referring to "Order of triple patterns in a BGP" here
>>>      Could you please give a raw text URL for "Order of triple patterns
>>>      in a BGP" (seems that the 'here' in your mail had a formatted link
>>>      but I didn't receive the url in my mailbox).
>>>      > The order of triple patterns in a BGP shouldn't matter
>>>      I thought that it was better (for performance/speed) to begin with
>>>      1) constants and 2) variables having few solutions in the dataset.
>>>      I've read something about SPARQL optimization and algebra, but can't
>>>      remember where. But maybe you're talking about the logic itself
>>>      (A+B = B+A)?
>>>      N.B. I find these questions very interesting, but I'm no SPARQL
>>>      specialist (nor a logician).
>>>      Cheers,
>>>      Vincent
>>>      On 12/09/2018 at 10:32, Lorenz B. wrote:
>>>      > Hi "VV",
>>>      >
>>>      > well, for me it sounds more like you've found a bug and are now
>>>      > doing a workaround. Or at least something is strange and I'm just
>>>      > referring to "Order of triple patterns in a BGP" here.
>>>      >
>>>      > The order of triple patterns in a BGP shouldn't matter - as far as
>>>      > I know it's always a good old join on the intermediate result of
>>>      > the evaluation of the triple patterns.
>>>      >
>>>      > Indeed, the limit of the text index lookup matters as the internal
>>>      > ordering by Lucene is based on some Information Retrieval measure
>>>      > (close to TF-IDF probably with default settings).
>>>      >
>>>      > But I guess, Osma and Andy will give you a better and more correct
>>>      > answer.
>>>      >
>>>      >
>>>      > Cheers,
>>>      > Lorenz
>>>      >
>>>      >> Hello Osma,
>>>      >>
>>>      >>
>>>      >> Thank you very much for your reply, you solved the problem! I've
>>>      >> made a few tests, both the order and the limit are important (see
>>>      >> below).
>>>      >>
>>>      >> Just one more question: since the "Roussea*" matches are less
>>>      >> numerous than the "*J*" ones, I thought it would be more efficient
>>>      >> to begin with "Roussea*". Can you explain why it's the opposite?
>>>      >>
>>>      >> Best,
>>>      >>
>>>      >> VV.
>>>      >>
>>>      >>
>>>      >> 1) --------- changing only the order --------------------------
>>>      >>
>>>      >> ?uriBnF text:query ( foaf:givenName "*J*" ) .
>>>      >> ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>>>      >> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
>>>      >>
>>>      >>  => 3 "Jean-Marie Rousseau" ... (even if I add a limit = 100 000
>>>      >> or 2 000 000)
>>>      >>
>>>      >> 2) --------- changing order + limit = 100 000 --------------------------
>>>      >>
>>>      >> ?uriBnF text:query ( foaf:givenName "*J*" 100000 ) .
>>>      >>  ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>>>      >>  ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
>>>      >>
>>>      >>  => 54 entries but not "Jean-Jacques" !
>>>      >>
>>>      >> 3) --------- changing order + limit = 1 000 000 --------------------------
>>>      >>
>>>      >>  ?uriBnF text:query ( foaf:givenName "*J*" 1000000 ) .
>>>      >>  ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>>>      >>  ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
>>>      >>
>>>      >> => 135 entries, including the 4 "Jean-Jacques", in  1.7 second
>>>      >>
>>>      >> 4) --------- test using filters (strstarts + contains) --------------------------
>>>      >>
>>>      >> ?uriBnF foaf:familyName ?nom
>>>      >> filter(strstarts(?nom, "Roussea"))
>>>      >> ?uriBnF foaf:givenName ?prenom
>>>      >> filter(contains(?prenom, "J"))
>>>      >>
>>>      >> => 129 entries, 27 seconds [fewer results than
>>>      >> "text:query ( foaf:givenName "*J*" 1000000)" because contains is
>>>      >> case-sensitive?]
>>>      >>
>>>      >> -----------------------------------------------------
>>>      >>
>>>      >> More infos about the dataset :
>>>      >>
>>>      >> # 3 fields are indexed ( foaf:name + foaf:givenName are in the
>>>      >> same named graph )
>>>      >>
>>>      >> -- dcterms:title = ± 9.45 M.
>>>      >>
>>>      >> -- foaf:givenName = ± 1.71 M.
>>>      >>
>>>      >> -- foaf:familyName = ± 1.78 M.
>>>      >>
>>>      >> # config file :
>>>      >>
>>>      >> ----------------
>>>      >>
>>>      >> text:storeValues true ;
>>>      >>     text:queryParser text:AnalyzingQueryParser ;
>>>      >>     text:map (
>>>      >>         [ text:field "title" ; text:predicate dcterms:title ;
>>>      >>           text:analyzer [ a text:ConfigurableAnalyzer ;
>>>      >>             text:tokenizer text:KeywordTokenizer ;
>>>      >>             text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>>>      >>           ] ]
>>>      >>         [ text:field "familyName" ; text:predicate foaf:familyName ;
>>>      >>           text:analyzer [ a text:ConfigurableAnalyzer ;
>>>      >>             text:tokenizer text:KeywordTokenizer ;
>>>      >>             text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>>>      >>           ] ]
>>>      >>         [ text:field "givenName" ; text:predicate foaf:givenName ;
>>>      >>           text:analyzer [ a text:ConfigurableAnalyzer ;
>>>      >>             text:tokenizer text:KeywordTokenizer ;
>>>      >>             text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>>>      >>           ] ]
>>>      >>         ) .
>>>      >>
>>>      >>
>>>      >>
>>>      >>
>>>      >>
>>>      >>
>>>      >> On 10/09/2018 at 18:58, Osma Suominen wrote:
>>>      >>> Hello Vincent,
>>>      >>>
>>>      >>> The results you get don't seem quite right. As you say, with a
>>>      >>> shorter query one would expect more results.
>>>      >>>
>>>      >>> One thing to do would be to check what results you get if you run
>>>      >>> the queries individually. I think combining the two separate
>>>      >>> jena-text queries (for foaf:familyName and foaf:givenName) may be
>>>      >>> part of the problem here... So if you execute only the "roussea*"
>>>      >>> part of the query, do you get the expected number of results? What
>>>      >>> about if you only execute one of the givenName queries with no
>>>      >>> restriction on familyName?
>>>      >>>
>>>      >>> Does it make a difference if you change the order of the familyName
>>>      >>> and givenName clauses?
>>>      >>>
>>>      >>> One thing to consider is that Lucene queries always have a limit on
>>>      >>> the number of results. With jena-text you can specify it as an
>>>      >>> additional parameter, but if you leave it out, it will default to
>>>      >>> 10000. My guess is that the givenName queries may generate more
>>>      >>> than 10000 results, and the results will then be cut off. This may
>>>      >>> mean that you get many Jeans and Jacques's and Johns etc. but many
>>>      >>> of the J. Rousseaus get cut off from the list. Try adding a large
>>>      >>> limit parameter (say 100000 or more) to the text:query functions to
>>>      >>> see if it helps. Like this:
>>>      >>>
>>>      >>>     ?uriBnF text:query ( foaf:givenName "*J*" 100000 )
>>>      >>>
>>>      >>> jena-text is not very good at combining multiple criteria. You can
>>>      >>> do it with separate queries as you've done, but internally the
>>>      >>> queries will run separately and the results will only be combined
>>>      >>> in Jena, outside Lucene.
>>>      >>>
>>>      >>> -Osma
>>>      >>>
>>>      >>>
>>>      >>>
>>>      >>> Vincent Ventresque wrote on 10.09.2018 at 13:03:
>>>      >>>> Hello,
>>>      >>>>
>>>      >>>>
>>>      >>>> I've made new tests with a slightly different dataset and
>>>      >>>> configuration, the problem is the same.
>>>      >>>>
>>>      >>>> --- Could you please tell me if these results are normal (I
>>>      >>>> expected a bigger list with fewer letters)?
>>>      >>>>
>>>      >>>> ?uriBnF text:query ( foaf:givenName "*J*" ) => 3 entries
>>>      >>>>
>>>      >>>> ?uriBnF text:query ( foaf:givenName "*Ja*" ) => 1 entry
>>>      >>>>
>>>      >>>> ?uriBnF text:query ( foaf:givenName "*Je*" ) => 11 entries
>>>      >>>>
>>>      >>>> ?uriBnF text:query ( foaf:givenName "*-J*" ) => 11 entries
>>>      >>>>
>>>      >>>> ?uriBnF text:query ( foaf:givenName "*Jea*" ) => 12 entries
>>>      >>>>
>>>      >>>> ?uriBnF text:query ( foaf:givenName "*Jac*" ) => 13 entries
>>>      >>>>
>>>      >>>> Here is the complete query :
>>>      >>>>
>>>      >>>> SELECT * WHERE {
>>>      >>>>   ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>>>      >>>>   ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
>>>      >>>>   ?uriBnF foaf:familyName ?nom . ?uriBnF foaf:givenName ?prenom
>>>      >>>> }
>>>      >>>>
>>>      >>>> N.B.: the dataset is quite large: 1.78 M family names indexed, and
>>>      >>>> 1.71 M given names. I have 4 distinct "Jean-Jacques Rousseau" in
>>>      >>>> the data, 713 family names containing "roussea", including 224
>>>      >>>> compound given names.
>>>      >>>>
>>>      >>>> --- Do you know where to find more documentation about Lucene
>>>      >>>> configuration (I read jena.apache.org page + , and also found
>>>      >>>> useful explanations on Skosmos wiki
>>>      >>>> https://github.com/NatLibFi/Skosmos ), especially about tokenizers?
>>>      >>>>
>>>      >>>>
>>>      >>>> Thanks in advance,
>>>      >>>>
>>>      >>>> VV
>>>      >>>>
>>>      >>>>
>>>      >>>>
>>>      >>>>
>>>      >>>>
>>>      >>>>
>>>      >>>>
>>>      >>>> Le 19/07/2018 à 14:07, Vincent Ventresque a écrit :
>>>      >>>>> Hello,
>>>      >>>>>
>>>      >>>>> I've just subscribed to the users@jena.apache.org list, 
>>> and I
>>>      >>>>> apologize if this mail is not sent properly.
>>>      >>>>>
>>>      >>>>> I'm trying to use Fuseki text:query, and have encountered 
>>> several
>>>      >>>>> issues. Here are my questions
>>>      >>>>>
>>>      >>>>> 1) Does text:query require a minimum number of characters 
>>> to be
>>>      >>>>> efficient?
>>>      >>>>>
>>>      >>>>> 2) Is performance linked to the number of fields indexed?
>>>      >>>>>
>>>      >>>>> 3) In order to retrieve strings containing hyphens, 
>>> should I use
>>>      >>>>> KeywordTokenizer in config file?
>>>      >>>>>
>>>      >>>>> ~~~ 1) Does text:query require a minimum number of 
>>> characters to be
>>>      >>>>> efficient? ~~~~~~~~~~~~~
>>>      >>>>>
>>>      >>>>> I've noticed that a query on indexed predicates 
>>> (foaf:familyName
>>>      >>>>> and foaf:givenName) returns more results when there are more
>>>      >>>>> characters in the string :
>>>      >>>>>
>>>      >>>>> SELECT * WHERE {
>>>      >>>>>
>>>      >>>>> ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>>>      >>>>>
>>>      >>>>> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
>>>      >>>>>
>>>      >>>>> ?uriBnF foaf:familyName ?nom . ?uriBnF foaf:givenName 
>>> ?prenom .
>>>      >>>>>
>>>      >>>>> optional {?uriBnF bio:birth ?dateNaissance }
>>>      >>>>>
>>>      >>>>> }
>>>      >>>>>
>>>      >>>>> I was expecting that "Rousseau" + "Jean-Jacques" would be 
>>> in the
>>>      >>>>> results.
>>>      >>>>>
>>>      >>>>> => if  $MY_STRING = "j*", I get 0 result
>>>      >>>>>
>>>      >>>>> => if  $MY_STRING = "je*", I get 17 results, including
>>>      >>>>> "Jean-Claude" & "Jean-Baptiste" BUT not "Jean-Jacques"
>>>      >>>>>
>>>      >>>>> => if  $MY_STRING = "jea*", I get 27 results, including 
>>> "Jean-Jacques"
>>>      >>>>>
>>>      >>>>> I don't know anything about Lucene, but it looks very 
>>> strange to me
>>>      >>>>> : I expected the contrary (fewer letters = bigger results 
>>> list).
>>>      >>>>>
>>>      >>>>>
>>>      >>>>> ~~~ 2) Is performance linked to the number of fields 
>>> indexed?
>>>      >>>>> ~~~~~~~~~~~~~~~~~~~~~~~
>>>      >>>>>
>>>      >>>>> If I change the configuration and index only 
>>> foaf:givenName, and
>>>      >>>>> provide a constant for foaf:familyName, the query returns 
>>> more
>>>      >>>>> results :
>>>      >>>>>
>>>      >>>>> SELECT * WHERE {
>>>      >>>>>
>>>      >>>>> ?uriBnF foaf:familyName "Rousseau" .
>>>      >>>>>
>>>      >>>>> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
>>>      >>>>>
>>>      >>>>> ?uriBnF foaf:familyName ?nom . ?uriBnF foaf:givenName 
>>> ?prenom .
>>>      >>>>>
>>>      >>>>> optional {?uriBnF bio:birth ?dateNaissance }
>>>      >>>>>
>>>      >>>>> }
>>>      >>>>>
>>>      >>>>> => if  $MY_STRING = "j*", I get 7 results, whereas the 
>>> first query
>>>      >>>>> returned 0 result.
>>>      >>>>>
>>>      >>>>>
>>>      >>>>> ~~~ 3) In order to retrieve containing hyphens, should I use
>>>      >>>>> KeywordTokenizer in config file? ~~~~~~~~~~~~~
>>>      >>>>>
>>>      >>>>> With the same query, if $MY_STRING = "jean-ja*" :
>>>      >>>>>
>>>      >>>>> a) with simple configuration (cf. below), I get 0 result
>>>      >>>>>
>>>      >>>>> b) with KeywordTokenizer config (cf. below), I get 
>>> "Jean-Jacques"
>>>      >>>>>
>>>      >>>>> Is it the right way to get "Jean-Jacques"?
>>>      >>>>>
>>>      >>>>>
>>>      >>>>> Thanks in advance
>>>      >>>>>
>>>      >>>>> VV
>>>      >>>>>
>>>      >>>>>
>>>      >>>>>
>>>      >>>>> =============== SIMPLE CONFIGURATION ===================
>>>      >>>>>
>>>      >>>>> @prefix :        <#> .
>>>      >>>>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>>      >>>>> @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
>>>      >>>>> @prefix tdb: <http://jena.hpl.hp.com/2008/tdb#> .
>>>      >>>>> @prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>>      >>>>> @prefix text: <http://jena.apache.org/text#> .
>>>      >>>>> @prefix fuseki: <http://jena.apache.org/fuseki#> .
>>>      >>>>> @prefix foaf: <http://xmlns.com/foaf/0.1/> .
>>>      >>>>> @prefix dcterms: <http://purl.org/dc/terms/> .
>>>      >>>>>
>>>      >>>>>
>>>      >>>>>
>>>      >>>>> [] rdf:type fuseki:Server ;
>>>      >>>>>    .
>>>      >>>>>
>>>      >>>>>
>>>      >>>>> ## Initialize TDB --------------------------------
>>>      >>>>>
>>>      >>>>> [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
>>>      >>>>> tdb:DatasetTDB  rdfs:subClassOf ja:RDFDataset .
>>>      >>>>> tdb:GraphTDB    rdfs:subClassOf ja:Model .
>>>      >>>>>
>>>      >>>>> ## Initialize text query 
>>> -------------------------------------
>>>      >>>>> [] ja:loadClass "org.apache.jena.query.text.TextQuery" .
>>>      >>>>> # A TextDataset is a regular dataset with a text index.
>>>      >>>>> text:TextDataset rdfs:subClassOf ja:RDFDataset .
>>>      >>>>> # Lucene index
>>>      >>>>> text:TextIndexLucene rdfs:subClassOf   text:TextIndex .
>>>      >>>>>
>>>      >>>>> ## 
>>> ---------------------------------------------------------------
>>>      >>>>> ## This URI must be fixed - it's used to assemble the 
>>> text dataset.
>>>      >>>>>
>>>      >>>>> :text_dataset rdf:type text:TextDataset ;
>>>      >>>>> #    text:dataset <#dataset> ;
>>>      >>>>>     text:dataset :tdb_dataset_readwrite ;
>>>      >>>>> #    text:index <#indexLucene> ;
>>>      >>>>>     text:index :My_Lucene_index ;
>>>      >>>>>     .
>>>      >>>>>
>>>      >>>>> # A TDB dataset used for RDF storage 
>>> ------------------------------
>>>      >>>>> :tdb_dataset_readwrite
>>>      >>>>>         a tdb:DatasetTDB ;
>>>      >>>>>         tdb:location  "$_BnF_text" ;
>>>      >>>>> .
>>>      >>>>>
>>>      >>>>> # Text index description 
>>> ------------------------------------------
>>>      >>>>> #<#indexLucene> a text:TextIndexLucene ;
>>>      >>>>> :My_Lucene_index a text:TextIndexLucene ;
>>>      >>>>>     text:directory <file:$_Lucene> ;
>>>      >>>>>     text:entityMap <#entMap> ;
>>>      >>>>>     .
>>>      >>>>>
>>>      >>>>> # Mapping in the index 
>>> ---------------------------------------------
>>>      >>>>> # URI stored in field "uri"
>>>      >>>>> <#entMap> a text:EntityMap ;
>>>      >>>>>     text:entityField      "uri" ;
>>>      >>>>>     text:defaultField "familyName" ;
>>>      >>>>>     text:map (
>>>      >>>>>          [ text:field "familyName" ; text:predicate 
>>> foaf:familyName ]
>>>      >>>>>          [ text:field "givenName" ; text:predicate 
>>> foaf:givenName ]
>>>      >>>>>          ) .
>>>      >>>>>
>>>      >>>>> :service_tdb_all a                   fuseki:Service ;
>>>      >>>>> rdfs:label                    "TDB BnF_text" ;
>>>      >>>>>         fuseki:dataset :text_dataset ;
>>>      >>>>> fuseki:name "BnF_text" ;
>>>      >>>>> fuseki:serviceQuery "query" , "sparql" ;
>>>      >>>>> fuseki:serviceReadGraphStore "get" ;
>>>      >>>>> fuseki:serviceReadWriteGraphStore "data" ;
>>>      >>>>> fuseki:serviceUpdate "update" ;
>>>      >>>>> fuseki:serviceUpload "upload" .
>>>      >>>>>
>>>      >>>>>
>>>      >>>>> =========== KEYWORD TOKENIZER CONFIGURATION ================
>>>      >>>>>
>>>      >>>>> @prefix :        <#> .
>>>      >>>>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>>      >>>>> @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
>>>      >>>>> @prefix tdb: <http://jena.hpl.hp.com/2008/tdb#> .
>>>      >>>>> @prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>>      >>>>> @prefix text: <http://jena.apache.org/text#> .
>>>      >>>>> @prefix fuseki: <http://jena.apache.org/fuseki#> .
>>>      >>>>> @prefix foaf: <http://xmlns.com/foaf/0.1/> .
>>>      >>>>> @prefix dcterms: <http://purl.org/dc/terms/> .
>>>      >>>>>
>>>      >>>>>
>>>      >>>>>
>>>      >>>>> [] rdf:type fuseki:Server ;
>>>      >>>>>
>>>      >>>>>    .
>>>      >>>>>
>>>      >>>>>
>>>      >>>>> ## Initialize TDB --------------------------------
>>>      >>>>>
>>>      >>>>> [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
>>>      >>>>> tdb:DatasetTDB  rdfs:subClassOf ja:RDFDataset .
>>>      >>>>> tdb:GraphTDB    rdfs:subClassOf ja:Model .
>>>      >>>>>
>>>      >>>>> ## Initialize text query 
>>> -------------------------------------
>>>      >>>>> [] ja:loadClass "org.apache.jena.query.text.TextQuery" .
>>>      >>>>> # A TextDataset is a regular dataset with a text index.
>>>      >>>>> text:TextDataset rdfs:subClassOf ja:RDFDataset .
>>>      >>>>> # Lucene index
>>>      >>>>> text:TextIndexLucene rdfs:subClassOf   text:TextIndex .
>>>      >>>>>
>>>      >>>>> ## 
>>> ---------------------------------------------------------------
>>>      >>>>>
>>>      >>>>>
>>>      >>>>> :text_dataset rdf:type text:TextDataset ;
>>>      >>>>> #    text:dataset <#dataset> ;
>>>      >>>>>     text:dataset :tdb_dataset_readwrite ;
>>>      >>>>> #    text:index <#indexLucene> ;
>>>      >>>>>     text:index :My_Lucene_index ;
>>>      >>>>>     .
>>>      >>>>>
>>>      >>>>> # A TDB dataset used for RDF storage 
>>> ------------------------------
>>>      >>>>> :tdb_dataset_readwrite
>>>      >>>>>         a tdb:DatasetTDB ;
>>>      >>>>>         tdb:location  "$_BnF_text" ;
>>>      >>>>> .
>>>      >>>>>
>>>      >>>>> # Text index description 
>>> ------------------------------------------
>>>      >>>>> #<#indexLucene> a text:TextIndexLucene ;
>>>      >>>>> :My_Lucene_index a text:TextIndexLucene ;
>>>      >>>>>     text:directory <file:$_Lucene> ;
>>>      >>>>>     text:entityMap <#entMap> ;
>>>      >>>>>     .
>>>      >>>>>
>>>      >>>>> # Mapping in the index 
>>> ---------------------------------------------
>>>      >>>>> # URI stored in field "uri"
>>>      >>>>> <#entMap> a text:EntityMap ;
>>>      >>>>>     text:entityField      "uri" ;
>>>      >>>>>     text:defaultField "givenName" ;
>>>      >>>>>     text:map (
>>>      >>>>>
>>>      >>>>>          [ text:field "familyName" ; text:predicate 
>>> foaf:familyName ;
>>>      >>>>>          text:analyzer [ a text:ConfigurableAnalyzer ;
>>>      >>>>>                text:tokenizer text:KeywordTokenizer ;
>>>      >>>>>                text:filters (text:ASCIIFoldingFilter
>>>      >>>>> text:LowerCaseFilter)
>>>      >>>>>              ] ]
>>>      >>>>>          [ text:field "givenName" ; text:predicate 
>>> foaf:givenName ;
>>>      >>>>>         text:analyzer [ a text:ConfigurableAnalyzer ;
>>>      >>>>>          text:tokenizer text:KeywordTokenizer ;
>>>      >>>>>          text:filters (text:ASCIIFoldingFilter 
>>> text:LowerCaseFilter)
>>>      >>>>>          ] ]
>>>      >>>>>          ) .
>>>      >>>>>
>>>      >>>>> :service_tdb_all a                   fuseki:Service ;
>>>      >>>>> rdfs:label                    "TDB BnF_text" ;
>>>      >>>>>         fuseki:dataset :text_dataset ; ### works for the text index
>>>      >>>>> fuseki:name "BnF_text" ;
>>>      >>>>> fuseki:serviceQuery "query" , "sparql" ;
>>>      >>>>> fuseki:serviceReadGraphStore "get" ;
>>>      >>>>> fuseki:serviceReadWriteGraphStore "data" ;
>>>      >>>>> fuseki:serviceUpdate "update" ;
>>>      >>>>> fuseki:serviceUpload "upload" .
>>>      >>>>>

Re: fuseki text:query : strange results + Lucene configuration

Posted by Osma Suominen <os...@helsinki.fi>.
Hi Vincent!

Vincent Ventresque wrote on 12.09.2018 at 15:53:
> What do you think about this solution :
> 
> ?uriBnF text:query ( foaf:givenName "*J*" 2000000 ) . ?uriBnF text:query 
> ( foaf:familyName "roussea*" ) . ?uriBnF foaf:familyName ?nom .  ?uriBnF 
> foaf:givenName ?prenom
> 
> It returns all the expected results and takes only 1.7 seconds (with 
> default configuration, RAM 2Gb).

Sounds good to me!

> Knowing I have 1.71 M givenName, it's reasonable to expect all the 
> results with a limit = 2 000 000, isn't it? It is, I think, the most 
> important question : am I sure to get all the results if I use a limit > 
> total properties indexed?

Yes, I think this is the case. If you have 1.71M triples with the 
givenName property, the text index should never return more than 1.71M 
results on a givenName property. So a limit of 2M should be enough in 
your case.

Just to be sure, you can try to execute some very generic queries (e.g. 
"*a*") and count the results.

The downside of using a high limit (and the reason the default is "only" 
10000) is that jena-text/Lucene allocates an array of that size to hold 
the results before actually executing the query against the index. With 
a large limit value such as 2M, that takes some time - probably most of 
the 1.7 seconds. You can experiment with how the query execution time 
changes if you change only the limit value. Also the array will need 
some memory, maybe in the range of tens or perhaps even hundreds of MB 
for a limit of 2M. All these resources (CPU time and memory) are wasted 
if the index then returns only a small number of results. The memory 
will of course be freed soon after the query by the garbage collector.
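
As a rough sketch of the check suggested above (assuming the same text: and foaf: prefixes used elsewhere in this thread, and assuming the configured query parser accepts leading wildcards, as the "*J*" queries suggest), a count query could look like this:

```sparql
PREFIX text: <http://jena.apache.org/text#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Count what the text index returns for a very generic pattern.
# The third element of the argument list is the result limit;
# 2000000 is the value discussed in this thread, not a recommended default.
SELECT (COUNT(*) AS ?hits)
WHERE {
  ?s text:query ( foaf:givenName "*a*" 2000000 ) .
}
```

If ?hits stays well below the limit, nothing should have been cut off; if it is equal to the limit, the results were probably truncated. Re-running the same query while changing only the limit (for example 10000, 100000, 2000000) should show how much of the execution time comes from the limit itself.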

> N.B. : I like the idea of using only text:query because it's case 
> insensitive AND allows fuzzy queries. It's particularly important for 
> our use case (we want to find author + edition with incomplete 
> information, such as "1 word in title + 1 word in familyName + givenName 
> initial + one of these words is not fully legible"). But you're right, a 
> combination of text:query + regex or contains is very fast (see example 
> below).

Great that you tried this approach as well and that it is fast.

-Osma

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: fuseki text:query : strange results + Lucene configuration

Posted by Vincent Ventresque <vi...@ens-lyon.fr>.
Hi Osma,


Thanks again, it's very helpful.

 > Either you get fewer results than expected or the query will take a long time, or both

What do you think about this solution :

?uriBnF text:query ( foaf:givenName "*J*" 2000000 ) . ?uriBnF text:query 
( foaf:familyName "roussea*" ) . ?uriBnF foaf:familyName ?nom .  ?uriBnF 
foaf:givenName ?prenom

It returns all the expected results and takes only 1.7 seconds (with 
the default configuration and 2 GB of RAM).

Given that I have 1.71 M givenName values, it's reasonable to expect all 
the results with a limit of 2 000 000, isn't it? This is, I think, the 
most important question: can I be sure to get all the results if I use a 
limit greater than the total number of indexed values?

N.B. : I like the idea of using only text:query because it's case 
insensitive AND allows fuzzy queries. It's particularly important for 
our use case (we want to find author + edition with incomplete 
information, such as "1 word in title + 1 word in familyName + givenName 
initial + one of these words is not fully legible"). But you're right, a 
combination of text:query + regex or contains is very fast (see example 
below).

Vincent


---------------------------------

?uriBnF text:query ( foaf:familyName "roussea*" ) ;
   foaf:givenName ?prenom
   filter(contains(?prenom, "J"))           # case sensitive
   ?uriBnF foaf:familyName ?nom  .

=> 37ms for 130 entries

-----

?uriBnF text:query ( foaf:familyName "roussea*" ) ;
   foaf:givenName ?prenom
   filter(regex(?prenom, "j", "i"))          # case insensitive
   ?uriBnF foaf:familyName ?nom  .

=> 55ms for 133 entries
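
For reference, a self-contained version of the pattern above might look like the sketch below. The PREFIX lines and the SELECT clause are assumptions added to make the query complete, and LCASE + CONTAINS is shown as a possible alternative to the regex "i" flag; it has not been timed against this dataset.

```sparql
PREFIX text: <http://jena.apache.org/text#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?uriBnF ?nom ?prenom
WHERE {
  # Restrictive part first: one text index lookup on the family name.
  ?uriBnF text:query ( foaf:familyName "roussea*" ) ;
          foaf:familyName ?nom ;
          foaf:givenName  ?prenom .
  # Case-insensitive substring test done in SPARQL, outside Lucene.
  FILTER ( CONTAINS( LCASE( STR(?prenom) ), "j" ) )
}
```

Only the family-name lookup goes through the text index here, so the default limit of 10000 applies to the "roussea*" matches alone (713 family names in this dataset), and the given-name test runs on that small intermediate result.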




  

On 12/09/2018 at 14:12, Osma Suominen wrote:
> Hi Vincent!
>
> Jena-text with the Lucene backend indexes each triple as a separate 
> Lucene document. This means that you cannot combine givenName and 
> familyName in the same query - from the Lucene perspective, the 
> givenName appears in one document where familyName appears in another 
> document, and querying for both (using AND) will just give you an 
> empty result. So what you are doing is the correct way. The problem is 
> just that some of the query patterns, such as "*J*", will return a 
> very large number of results. This pushes the limits of jena-text as 
> you've discovered. Either you get fewer results than expected or the 
> query will take a long time, or both.
>
> It might make sense to use only one text:query for the more 
> restrictive part (familyName "roussea*" in this case) and then use a 
> FILTER with some string matching (STRSTARTS or CONTAINS or REGEX) to 
> further limit the results.
>
> The Elasticsearch backend of jena-text is different though. It will 
> combine different indexed properties of the same subject within the 
> same Elasticsearch/Lucene document. So an AND query with both 
> givenName and familyName is possible when using that backend.
>
> -Osma
>
> Vincent Ventresque wrote on 12.09.2018 at 15:06:
>> Hello Rob
>>
>>
>> Thank you for all these elements.
>>
>>  > there is a limit on the results returned from each text search so 
>> when these are *separately executed and joined together* you may only 
>> get a subset of the full results
>>
>> Could you please explain what would be a 'non-separate' query? Do you 
>> mean :
>>
>> ?s text:query ( "givenName:\"*J*\" AND familyName:\"Roussea\"" ) ?
>>
>> I made 2 separate triples (1st = givenName + 2nd = familyName) 
>> because I had read that "when a query is to involve two or more 
>> properties then it is expressed at the SPARQL level, as it were, versus 
>> in Lucene's query language" 
>> (https://jena.apache.org/documentation/query/text-query.html#queries-across-multiple-fields). 
>>
>>
>> Vincent
>>
>>
>>
>>
>> On 12/09/2018 at 11:52, Rob Vesse wrote:
>>> Well the order of triple patterns shouldn't matter too much when you 
>>> have a pure BGP (albeit the optimiser might pick a bad order in some 
>>> cases)
>>>
>>> But we aren't talking about pure BGPs here, having the text:query 
>>> triples results in the BGP being broken up into joins of several 
>>> property functions with the regular triple patterns interspersed 
>>> through those.  So if we take your query and run it through Jena's 
>>> algebra compiler (you can do this online at 
>>> http://sparql.org/validate/query) we get the following:
>>>
>>>    1 (base <http://example/base/>
>>>    2   (prefix ((rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>)
>>>    3            (owl: <http://www.w3.org/2002/07/owl#>)
>>>    4            (apf: <http://jena.hpl.hp.com/ARQ/property#>)
>>>    5            (xsd: <http://www.w3.org/2001/XMLSchema#>)
>>>    6            (fn: <http://www.w3.org/2005/xpath-functions#>)
>>>    7            (rdfs: <http://www.w3.org/2000/01/rdf-schema#>)
>>>    8            (text: <http://jena.apache.org/text#>)
>>>    9            (foaf: <http://xmlns.com/foaf/0.1/>)
>>>   10            (dc: <http://purl.org/dc/elements/1.1/>))
>>>   11     (sequence
>>>   12       (propfunc text:query
>>>   13         ?uriBnF (foaf:givenName "$MY_STRING")
>>>   14         (propfunc text:query
>>>   15           ?uriBnF (foaf:familyName "roussea*")
>>>   16           (table unit)))
>>>   17       (bgp
>>>   18         (triple ?uriBnF foaf:familyName ?nom)
>>>   19         (triple ?uriBnF foaf:givenName ?prenom)
>>>   20       ))))
>>>
>>> So first it's doing the text search on your parameter (lines 12-13), 
>>> then joining that to text search on your surname (lines 14-15) via 
>>> substituting binds from your first text search and then finally 
>>> joining that with the plain BGP (lines 17-19).
>>>
>>> So in this case the ordering of your property functions in the query 
>>> is going to make a difference to the evaluation.  As I think Osma 
>>> already pointed out there is a limit on the results returned from 
>>> each text search so when these are separately executed and joined 
>>> together you may only get a subset of the full results that your 
>>> text index holds.
>>>
>>> Rob
>>>
>>> On 12/09/2018, 09:55, "Vincent Ventresque" 
>>> <vi...@ens-lyon.fr> wrote:
>>>
>>>      Hi Lorenz,
>>>      Thanks for your reply.
>>>      > for me it sounds more like you've found a bug
>>>      I'm not able to tell, just beginning to use Fuseki + Lucene.
>>>      > I'm just referring to "Order of triple patterns in a BGP" here
>>>      Could you please give a raw text URL for "Order of triple 
>>> patterns in a
>>>      BGP" (seems that the 'here' in your mail had a formatted link 
>>> but I
>>>      didn't receive the url in my mailbox).
>>>      > The order of triple patterns in a BGP shouldn't matter
>>>      I thought that it was better (for performance/speed) to begin 
>>> with 1)
>>>      constants and 2) variables having few solutions in the dataset. 
>>> I've
>>>      read something about Sparql optimization and algebra, but can't 
>>> remember
>>>      where. But maybe you're talking about the logics itself (A+B = 
>>> B+A)?
>>>      N.B. I find these questions very interesting, but I'm no Sparql
>>>      specialist (neither a logician).
>>>      Cheers,
>>>      Vincent
>>>      On 12/09/2018 at 10:32, Lorenz B. wrote:
>>>      > Hi "VV",
>>>      >
>>>      > well, for me it sounds more like you've found a bug and are 
>>> now doing a
>>>      > workaround. Or at least something is strange and I'm just 
>>> referring to
>>>      > "Order of triple patterns in a BGP" here.
>>>      >
>>>      > The order of triple patterns in a BGP shouldn't matter - as 
>>> far as I
>>>      > know it's always a good old join on the intermediate result 
>>> of the
>>>      > evaluation of the triple patterns.
>>>      >
>>>      > Indeed, the limit of the text index lookup matters as the 
>>> internal
>>>      > ordering by Lucene is based on some Information Retrieval 
>>> measure (close
>>>      > to TF-IDF probably with default settings).
>>>      >
>>>      > But I guess, Osma and Andy will give you a better and more 
>>> correct answer.
>>>      >
>>>      >
>>>      > Cheers,
>>>      > Lorenz
>>>      >
>>>      >> Hello Osma,
>>>      >>
>>>      >>
>>>      >> Thank you very much for your reply, you solved the problem! 
>>> I've made
>>>      >> a few tests, both the order and the limit are important (see 
>>> below).
>>>      >>
>>>      >> Just one more question : I thought that the "Roussea*" being 
>>> less
>>>      >> numerous than the "*J*", it would be more efficient to begin 
>>> with the
>>>      >> "Roussea*". Can you explain why it's the contrary?
>>>      >>
>>>      >> Best,
>>>      >>
>>>      >> VV.
>>>      >>
>>>      >>
>>>      >> 1) --------- changing only the order --------------------------
>>>      >>
>>>      >> ?uriBnF text:query ( foaf:givenName "*J*" ) .
>>>      >> ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>>>      >> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
>>>      >>
>>>      >>  => 3 "Jean-Marie Rousseau" ... (even if I add a limit = 100 
>>> 000 or 2
>>>      >> 000 000)
>>>      >>
>>>      >> 2) --------- changing order + limit = 100 000 
>>> --------------------------
>>>      >>
>>>      >> ?uriBnF text:query ( foaf:givenName "*J*" 100000 ) .
>>>      >>  ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>>>      >>  ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
>>>      >>
>>>      >>  => 54 entries but not "Jean-Jacques" !
>>>      >>
>>>      >> 3) --------- changing order + limit = 1 000 000
>>>      >> --------------------------
>>>      >>
>>>      >>  ?uriBnF text:query ( foaf:givenName "*J*" 1000000 ) .
>>>      >>  ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>>>      >>  ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
>>>      >>
>>>      >> => 135 entries, including the 4 "Jean-Jacques", in  1.7 second
>>>      >>
>>>      >> 4) --------- test using filters (strstarts + contains)
>>>      >> --------------------------
>>>      >>
>>>      >> ?uriBnF foaf:familyName ?nom
>>>      >> filter(strstarts(?nom, "Roussea"))
>>>      >> ?uriBnF foaf:givenName ?prenom
>>>      >> filter(contains(?prenom, "J"))
>>>      >>
>>>      >> => 129 entries, 27 seconds [fewer results than
>>>      >> "text:query ( foaf:givenName "*J*" 1000000)" because 
>>> contains = case
>>>      >> sensitive ?]
>>>      >>
>>>      >> -----------------------------------------------------
>>>      >>
>>>      >> More infos about the dataset :
>>>      >>
>>>      >> # 3 fields are indexed ( foaf:name + foaf:givenName are in 
>>> the same
>>>      >> named graph )
>>>      >>
>>>      >> -- dcterms:title = +/- 9.45 M.
>>>      >>
>>>      >> -- foaf:givenName = +/- 1.71 M.
>>>      >>
>>>      >> -- foaf:familyName = +/- 1.78 M.
>>>      >>
>>>      >> # config file :
>>>      >>
>>>      >> ----------------
>>>      >>
>>>      >> text:storeValues true ;
>>>      >>     text:queryParser text:AnalyzingQueryParser ;
>>>      >>     text:map (
>>>      >>         [ text:field "title" ; text:predicate dcterms:title ;
>>>      >>         text:analyzer [ a text:ConfigurableAnalyzer ;
>>>      >>          text:tokenizer text:KeywordTokenizer ;
>>>      >>          text:filters (text:ASCIIFoldingFilter 
>>> text:LowerCaseFilter)
>>>      >>          ] ]
>>>      >>          [ text:field "familyName" ; text:predicate 
>>> foaf:familyName ;
>>>      >>         text:analyzer [ a text:ConfigurableAnalyzer ;
>>>      >>          text:tokenizer text:KeywordTokenizer ;
>>>      >>          text:filters (text:ASCIIFoldingFilter 
>>> text:LowerCaseFilter)
>>>      >>          ] ]
>>>      >>          [ text:field "givenName" ; text:predicate 
>>> foaf:givenName ;
>>>      >>         text:analyzer [ a text:ConfigurableAnalyzer ;
>>>      >>          text:tokenizer text:KeywordTokenizer ;
>>>      >>          text:filters (text:ASCIIFoldingFilter 
>>> text:LowerCaseFilter)
>>>      >>          ] ]
>>>      >>
>>>      >>          ) .
>>>      >>
>>>      >>
>>>      >>
>>>      >>
>>>      >>
>>>      >>
>>>      >> On 10/09/2018 at 18:58, Osma Suominen wrote:
>>>      >>> Hello Vincent,
>>>      >>>
>>>      >>> The results you get don't seem quite right. As you say, with a
>>>      >>> shorter query one would expect more results.
>>>      >>>
>>>      >>> One thing to do would be to check what results you get if 
>>> you run the
>>>      >>> queries individually. I think combining the two separate 
>>> jena-text
>>>      >>> queries (for foaf:familyName and foaf:givenName) may be 
>>> part of the
>>>      >>> problem here... So if you execute only the "roussea*" part 
>>> of the
>>>      >>> query, do you get the expected number of results? What 
>>> about if you
>>>      >>> only execute one of the givenName queries with no 
>>> restriction on
>>>      >>> familyName?
>>>      >>>
>>>      >>> Does it make a difference if you change the order of the 
>>> familyName
>>>      >>> and givenName clauses?
>>>      >>>
>>>      >>> One thing to consider is that Lucene queries always have a 
>>> limit on
>>>      >>> the number of results. With jena-text you can specify it as an
>>>      >>> additional parameter, but if you leave it out, it will 
>>> default to
>>>      >>> 10000. My guess is that the givenName queries may generate 
>>> more
>>>      >>> results than 10000, and the results will then be cut off. 
>>> This may
>>>      >>> mean that you get many Jeans and Jacques's and Johns etc. 
>>> but many of
>>>      >>> the J. Rousseaus get cut off from the list. Try adding a 
>>> large limit
>>>      >>> parameter (say 100000 or more) to the text:query functions 
>>> to see if
>>>      >>> it helps. Like this:
>>>      >>>
>>>      >>>     ?uriBnF text:query ( foaf:givenName "*J*" 100000 )
>>>      >>>
>>>      >>> jena-text is not very good at combining multiple criteria. 
>>> You can do
>>>      >>> it with separate queries as you've done, but internally the 
>>> queries
>>>      >>> will run separately and the results will only be combined 
>>> in Jena,
>>>      >>> outside Lucene.
>>>      >>>
>>>      >>> -Osma
>>>      >>>
>>>      >>>
>>>      >>>
>>>      >>> Vincent Ventresque wrote on 10.09.2018 at 13:03:
>>>      >>>> Hello,
>>>      >>>>
>>>      >>>>
>>>      >>>> I've made new tests with a slightly different dataset and
>>>      >>>> configuration, the problem is the same.
>>>      >>>>
>>>      >>>> --- Could you please tell me if these results are normal 
>>> (I expected
>>>      >>>> a bigger list with fewer letters)?
>>>      >>>>
>>>      >>>> ?uriBnF text:query ( foaf:givenName "*J*" ) => 3 entries
>>>      >>>>
>>>      >>>> ?uriBnF text:query ( foaf:givenName "*Ja*" ) => 1 entries
>>>      >>>>
>>>      >>>> ?uriBnF text:query ( foaf:givenName "*Je*" ) => 11 entries
>>>      >>>>
>>>      >>>> ?uriBnF text:query ( foaf:givenName "*-J*" ) => 11 entries
>>>      >>>>
>>>      >>>> ?uriBnF text:query ( foaf:givenName "*Jea*" ) => 12 entries
>>>      >>>>
>>>      >>>> ?uriBnF text:query ( foaf:givenName "*Jac*" ) => 13 entries
>>>      >>>>
>>>      >>>> Here is the complete query :
>>>      >>>>
>>>      >>>> SELECT * WHERE { ?uriBnF text:query ( foaf:familyName 
>>> "roussea*" ) .
>>>      >>>> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
>>>      >>>>
>>>      >>>> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName 
>>> ?prenom }
>>>      >>>>
>>>      >>>> N.B. : the dataset is quite large : 1,78 M family names 
>>> indexed, and
>>>      >>>> 1,71 M given names. I have 4 distinct "Jean-Jacques 
>>> Rousseau" in the
>>>      >>>> data, 713 family names containing "roussea", including 224 
>>> compound
>>>      >>>> given names.
>>>      >>>>
>>>      >>>> --- Do you know where to find more documentation about Lucene
>>>      >>>> configuration (I read jena.apache.org page + , and also 
>>> found useful
>>>      >>>> explanations on Skosmos wiki 
>>> https://github.com/NatLibFi/Skosmos ),
>>>      >>>> especially about tokenizers  ?
>>>      >>>>
>>>      >>>>
>>>      >>>> Thanks in advance,
>>>      >>>>
>>>      >>>> VV
>>>      >>>>
>>>      >>>>
>>>      >>>>
>>>      >>>>
>>>      >>>>
>>>      >>>>
>>>      >>>>
>>>      >>>> On 19/07/2018 at 14:07, Vincent Ventresque wrote:
>>>      >>>>> Hello,
>>>      >>>>>
>>>      >>>>> I've just subscribed to the users@jena.apache.org list, 
>>> and I
>>>      >>>>> apologize if this mail is not sent properly.
>>>      >>>>>
>>>      >>>>> I'm trying to use Fuseki text:query, and have encountered 
>>> several
>>>      >>>>> issues. Here are my questions
>>>      >>>>>
>>>      >>>>> 1) Does text:query require a minimum number of characters 
>>> to be
>>>      >>>>> efficient?
>>>      >>>>>
>>>      >>>>> 2) Is performance linked to the number of fields indexed?
>>>      >>>>>
>>>      >>>>> 3) In order to retrieve strings containing hyphens, 
>>> should I use
>>>      >>>>> KeywordTokenizer in config file?
>>>      >>>>>
>>>      >>>>> ~~~ 1) Does text:query require a minimum number of 
>>> characters to be
>>>      >>>>> efficient? ~~~~~~~~~~~~~
>>>      >>>>>
>>>      >>>>> I've noticed that a query on indexed predicates 
>>> (foaf:familyName
>>>      >>>>> and foaf:givenName) returns more results when there are more
>>>      >>>>> characters in the string :
>>>      >>>>>
>>>      >>>>> SELECT * WHERE {
>>>      >>>>>
>>>      >>>>> ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>>>      >>>>>
>>>      >>>>> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
>>>      >>>>>
>>>      >>>>> ?uriBnF foaf:familyName ?nom . ?uriBnF foaf:givenName 
>>> ?prenom .
>>>      >>>>>
>>>      >>>>> optional {?uriBnF bio:birth ?dateNaissance }
>>>      >>>>>
>>>      >>>>> }
>>>      >>>>>
>>>      >>>>> I was expecting that "Rousseau" + "Jean-Jacques" would be 
>>> in the
>>>      >>>>> results.
>>>      >>>>>
>>>      >>>>> => if  $MY_STRING = "j*", I get 0 results
>>>      >>>>>
>>>      >>>>> => if  $MY_STRING = "je*", I get 17 results, including
>>>      >>>>> "Jean-Claude" & "Jean-Baptiste" BUT not "Jean-Jacques"
>>>      >>>>>
>>>      >>>>> => if  $MY_STRING = "jea*", I get 27 results, including 
>>> "Jean-Jacques"
>>>      >>>>>
>>>      >>>>> I don't know anything about Lucene, but it looks very 
>>> strange to me
>>>      >>>>> : I expected the contrary (fewer letters = bigger results 
>>> list).
>>>      >>>>>
>>>      >>>>>
>>>      >>>>> ~~~ 2) Is performance linked to the number of fields 
>>> indexed?
>>>      >>>>> ~~~~~~~~~~~~~~~~~~~~~~~
>>>      >>>>>
>>>      >>>>> If I change the configuration and index only 
>>> foaf:givenName, and
>>>      >>>>> provide a constant for foaf:familyName, the query returns 
>>> more
>>>      >>>>> results :
>>>      >>>>>
>>>      >>>>> SELECT * WHERE {
>>>      >>>>>
>>>      >>>>> ?uriBnF foaf:familyName "Rousseau" .
>>>      >>>>>
>>>      >>>>> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
>>>      >>>>>
>>>      >>>>> ?uriBnF foaf:familyName ?nom . ?uriBnF foaf:givenName 
>>> ?prenom .
>>>      >>>>>
>>>      >>>>> optional {?uriBnF bio:birth ?dateNaissance }
>>>      >>>>>
>>>      >>>>> }
>>>      >>>>>
>>>      >>>>> => if  $MY_STRING = "j*", I get 7 results, whereas the 
>>> first query
>>>      >>>>> returned 0 results.
>>>      >>>>>
>>>      >>>>>
>>>      >>>>> ~~~ 3) In order to retrieve strings containing hyphens, should I use
>>>      >>>>> KeywordTokenizer in config file? ~~~~~~~~~~~~~
>>>      >>>>>
>>>      >>>>> With the same query, if $MY_STRING = "jean-ja*" :
>>>      >>>>>
>>>      >>>>> a) with the simple configuration (cf. below), I get 0 results
>>>      >>>>>
>>>      >>>>> b) with KeywordTokenizer config (cf. below), I get 
>>> "Jean-Jacques"
>>>      >>>>>
>>>      >>>>> Is it the right way to get "Jean-Jacques"?
>>>      >>>>>
>>>      >>>>>
>>>      >>>>> Thanks in advance
>>>      >>>>>
>>>      >>>>> VV
>>>      >>>>>
>>>      >>>>>
>>>      >>>>>
>>>      >>>>> =============== SIMPLE CONFIGURATION ===================
>>>      >>>>>
>>>      >>>>> @prefix :        <#> .
>>>      >>>>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>>      >>>>> @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
>>>      >>>>> @prefix tdb: <http://jena.hpl.hp.com/2008/tdb#> .
>>>      >>>>> @prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>>      >>>>> @prefix text: <http://jena.apache.org/text#> .
>>>      >>>>> @prefix fuseki: <http://jena.apache.org/fuseki#> .
>>>      >>>>> @prefix foaf: <http://xmlns.com/foaf/0.1/> .
>>>      >>>>> @prefix dcterms: <http://purl.org/dc/terms/> .
>>>      >>>>>
>>>      >>>>>
>>>      >>>>>
>>>      >>>>> [] rdf:type fuseki:Server ;
>>>      >>>>>    .
>>>      >>>>>
>>>      >>>>>
>>>      >>>>> ## Initialize TDB --------------------------------
>>>      >>>>>
>>>      >>>>> [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
>>>      >>>>> tdb:DatasetTDB  rdfs:subClassOf ja:RDFDataset .
>>>      >>>>> tdb:GraphTDB    rdfs:subClassOf ja:Model .
>>>      >>>>>
>>>      >>>>> ## Initialize text query 
>>> -------------------------------------
>>>      >>>>> [] ja:loadClass "org.apache.jena.query.text.TextQuery" .
>>>      >>>>> # A TextDataset is a regular dataset with a text index.
>>>      >>>>> text:TextDataset rdfs:subClassOf   ja:RDFDataset .
>>>      >>>>> # Lucene index
>>>      >>>>> text:TextIndexLucene rdfs:subClassOf   text:TextIndex .
>>>      >>>>>
>>>      >>>>> ## 
>>> ---------------------------------------------------------------
>>>      >>>>> ## This URI must be fixed - it's used to assemble the 
>>> text dataset.
>>>      >>>>>
>>>      >>>>> :text_dataset rdf:type text:TextDataset ;
>>>      >>>>> #    text:dataset   <#dataset> ;
>>>      >>>>>     text:dataset :tdb_dataset_readwrite ;
>>>      >>>>> #    text:index <#indexLucene> ;
>>>      >>>>>     text:index :My_Lucene_index ;
>>>      >>>>>     .
>>>      >>>>>
>>>      >>>>> # A TDB dataset used for RDF storage 
>>> ------------------------------
>>>      >>>>> :tdb_dataset_readwrite
>>>      >>>>>         a             tdb:DatasetTDB ;
>>>      >>>>>         tdb:location  "$_BnF_text" ;
>>>      >>>>> .
>>>      >>>>>
>>>      >>>>> # Text index description ------------------------------------------
>>>      >>>>> #<#indexLucene> a text:TextIndexLucene ;
>>>      >>>>> :My_Lucene_index a text:TextIndexLucene ;
>>>      >>>>>     text:directory <file:$_Lucene> ;
>>>      >>>>>     text:entityMap <#entMap> ;
>>>      >>>>>     .
>>>      >>>>>
>>>      >>>>> # Mapping in the index ---------------------------------------------
>>>      >>>>> # URI stored in field "uri"
>>>      >>>>> <#entMap> a text:EntityMap ;
>>>      >>>>>     text:entityField      "uri" ;
>>>      >>>>>     text:defaultField     "familyName" ;
>>>      >>>>>     text:map (
>>>      >>>>>          [ text:field "familyName" ; text:predicate foaf:familyName ]
>>>      >>>>>          [ text:field "givenName" ; text:predicate foaf:givenName ]
>>>      >>>>>          ) .
>>>      >>>>>
>>>      >>>>> :service_tdb_all a                   fuseki:Service ;
>>>      >>>>>         rdfs:label                    "TDB BnF_text" ;
>>>      >>>>>         fuseki:dataset                :text_dataset ;
>>>      >>>>>         fuseki:name                   "BnF_text" ;
>>>      >>>>>         fuseki:serviceQuery           "query" , "sparql" ;
>>>      >>>>>         fuseki:serviceReadGraphStore  "get" ;
>>>      >>>>>         fuseki:serviceReadWriteGraphStore "data" ;
>>>      >>>>>         fuseki:serviceUpdate          "update" ;
>>>      >>>>>         fuseki:serviceUpload          "upload" .
>>>      >>>>>
>>>      >>>>>
>>>      >>>>> =========== KEYWORD TOKENIZER CONFIGURATION ================
>>>      >>>>>
>>>      >>>>> @prefix :        <#> .
>>>      >>>>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>>      >>>>> @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
>>>      >>>>> @prefix tdb: <http://jena.hpl.hp.com/2008/tdb#> .
>>>      >>>>> @prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>>      >>>>> @prefix text: <http://jena.apache.org/text#> .
>>>      >>>>> @prefix fuseki: <http://jena.apache.org/fuseki#> .
>>>      >>>>> @prefix foaf: <http://xmlns.com/foaf/0.1/> .
>>>      >>>>> @prefix dcterms: <http://purl.org/dc/terms/> .
>>>      >>>>>
>>>      >>>>>
>>>      >>>>>
>>>      >>>>> [] rdf:type fuseki:Server ;
>>>      >>>>>
>>>      >>>>>    .
>>>      >>>>>
>>>      >>>>>
>>>      >>>>> ## Initialize TDB --------------------------------
>>>      >>>>>
>>>      >>>>> [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
>>>      >>>>> tdb:DatasetTDB  rdfs:subClassOf ja:RDFDataset .
>>>      >>>>> tdb:GraphTDB    rdfs:subClassOf ja:Model .
>>>      >>>>>
>>>      >>>>> ## Initialize text query -------------------------------------
>>>      >>>>> [] ja:loadClass "org.apache.jena.query.text.TextQuery" .
>>>      >>>>> # A TextDataset is a regular dataset with a text index.
>>>      >>>>> text:TextDataset rdfs:subClassOf   ja:RDFDataset .
>>>      >>>>> # Lucene index
>>>      >>>>> text:TextIndexLucene rdfs:subClassOf   text:TextIndex .
>>>      >>>>>
>>>      >>>>> ## ---------------------------------------------------------------
>>>      >>>>>
>>>      >>>>>
>>>      >>>>> :text_dataset rdf:type text:TextDataset ;
>>>      >>>>> #    text:dataset   <#dataset> ;
>>>      >>>>>     text:dataset :tdb_dataset_readwrite ;
>>>      >>>>> #    text:index <#indexLucene> ;
>>>      >>>>>     text:index :My_Lucene_index ;
>>>      >>>>>     .
>>>      >>>>>
>>>      >>>>> # A TDB dataset used for RDF storage ------------------------------
>>>      >>>>> :tdb_dataset_readwrite
>>>      >>>>>         a             tdb:DatasetTDB ;
>>>      >>>>>         tdb:location  "$_BnF_text" ;
>>>      >>>>> .
>>>      >>>>>
>>>      >>>>> # Text index description ------------------------------------------
>>>      >>>>> #<#indexLucene> a text:TextIndexLucene ;
>>>      >>>>> :My_Lucene_index a text:TextIndexLucene ;
>>>      >>>>>     text:directory <file:$_Lucene> ;
>>>      >>>>>     text:entityMap <#entMap> ;
>>>      >>>>>     .
>>>      >>>>>
>>>      >>>>> # Mapping in the index ---------------------------------------------
>>>      >>>>> # URI stored in field "uri"
>>>      >>>>> <#entMap> a text:EntityMap ;
>>>      >>>>>     text:entityField      "uri" ;
>>>      >>>>>     text:defaultField     "givenName" ;
>>>      >>>>>     text:map (
>>>      >>>>>          [ text:field "familyName" ; text:predicate foaf:familyName ;
>>>      >>>>>            text:analyzer [ a text:ConfigurableAnalyzer ;
>>>      >>>>>                text:tokenizer text:KeywordTokenizer ;
>>>      >>>>>                text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>>>      >>>>>            ] ]
>>>      >>>>>          [ text:field "givenName" ; text:predicate foaf:givenName ;
>>>      >>>>>            text:analyzer [ a text:ConfigurableAnalyzer ;
>>>      >>>>>                text:tokenizer text:KeywordTokenizer ;
>>>      >>>>>                text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>>>      >>>>>            ] ]
>>>      >>>>>          ) .
>>>      >>>>>
>>>      >>>>> :service_tdb_all a                   fuseki:Service ;
>>>      >>>>>         rdfs:label                    "TDB BnF_text" ;
>>>      >>>>>         fuseki:dataset                :text_dataset ; ### works for the text index
>>>      >>>>>         fuseki:name                   "BnF_text" ;
>>>      >>>>>         fuseki:serviceQuery           "query" , "sparql" ;
>>>      >>>>>         fuseki:serviceReadGraphStore  "get" ;
>>>      >>>>>         fuseki:serviceReadWriteGraphStore "data" ;
>>>      >>>>>         fuseki:serviceUpdate          "update" ;
>>>      >>>>>         fuseki:serviceUpload          "upload" .


Re: fuseki text:query : strange results + Lucene configuration

Posted by Osma Suominen <os...@helsinki.fi>.
Hi Vincent!

Jena-text with the Lucene backend indexes each triple as a separate 
Lucene document. This means that you cannot combine givenName and 
familyName in the same query - from the Lucene perspective, the 
givenName appears in one document where familyName appears in another 
document, and querying for both (using AND) will just give you an empty 
result. So what you are doing is the correct way. The problem is just 
that some of the query patterns, such as "*J*", will return a very large 
number of results. This pushes the limits of jena-text, as you've 
discovered: either you get fewer results than expected, or the query 
takes a long time, or both.
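
To make the cut-off concrete, here is a minimal sketch (not tested against 
your data) of the same query with an explicit result limit passed as the 
last item of the text:query list. The value 1000000 is only an 
illustration taken from the tests quoted below; when the limit is left 
out, it defaults to 10000.

```sparql
PREFIX text: <http://jena.apache.org/text#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?uriBnF ?nom ?prenom WHERE {
  # Broad pattern: raise the per-search limit so that matching
  # documents are not cut off before the join happens in Jena.
  ?uriBnF text:query ( foaf:givenName "*J*" 1000000 ) .
  # Restrictive pattern: the default limit is normally enough here.
  ?uriBnF text:query ( foaf:familyName "roussea*" ) .
  ?uriBnF foaf:familyName ?nom ;
          foaf:givenName  ?prenom .
}
```

A larger limit trades truncation for longer query times, so it is 
probably worth raising it only on the broad pattern.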

It might make sense to use only one text:query for the more restrictive 
part (familyName "roussea*" in this case) and then use a FILTER with 
some string matching (STRSTARTS or CONTAINS or REGEX) to further limit 
the results.
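
As a rough sketch of that approach (untested; the variable names are the 
ones from your query, and lower-casing with LCASE is an assumption about 
the matching you want, since CONTAINS on its own is case-sensitive):

```sparql
PREFIX text: <http://jena.apache.org/text#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?uriBnF ?nom ?prenom WHERE {
  # Only the restrictive pattern goes through the Lucene index.
  ?uriBnF text:query ( foaf:familyName "roussea*" ) .
  ?uriBnF foaf:familyName ?nom ;
          foaf:givenName  ?prenom .
  # The broad "contains J" condition is evaluated in SPARQL instead,
  # on the few hundred rows left after the text search.
  FILTER ( CONTAINS(LCASE(?prenom), "j") )
}
```

The filter only ever sees the subjects returned by the single text 
search, so the result limit of a second text search no longer comes 
into play.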

The Elasticsearch backend of jena-text is different though. It will 
combine different indexed properties of the same subject within the same 
Elasticsearch/Lucene document. So an AND query with both givenName and 
familyName is possible when using that backend.

-Osma

Vincent Ventresque kirjoitti 12.09.2018 klo 15:06:
> Hello Rob
> 
> 
> Thank you for all these elements.
> 
>  > there is a limit on the results returned from each text search so 
> when these are *separately executed and joined together* you may only 
> get a subset of the full results
> 
> Could you please explain what would be a 'non-separate' query? Do you 
> mean :
> 
> ?s text:query ( "givenName:\"*J*\" AND familyName:\"Roussea\"" ) ?
> 
> I made 2 separate triples (1st = givenName + 2nd = familyName) because I 
> had read that "when a query is to involve two or more properties then it 
> is expressed at the SPARQL level, as it were, versus in Lucene's query 
> language" 
> (https://jena.apache.org/documentation/query/text-query.html#queries-across-multiple-fields). 
> 
> 
> Vincent
> 
> 
> 
> 
> Le 12/09/2018 à 11:52, Rob Vesse a écrit :
>> Well the order of triple patterns shouldn't matter too much when you 
>> have a pure BGP (albeit the optimiser might pick a bad order in some 
>> cases)
>>
>> But we aren't talking about pure BGPs here, having the text:query 
>> triples results in the BGP being broken up into joins of several 
>> property functions with the regular triple patterns interspersed 
>> through those.  So if we take your query and run it through Jena's 
>> algebra compiler (you can do this online at 
>> http://sparql.org/validate/query) we get the following:
>>
>>    1 (base <http://example/base/>
>>    2   (prefix ((rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>)
>>    3            (owl: <http://www.w3.org/2002/07/owl#>)
>>    4            (apf: <http://jena.hpl.hp.com/ARQ/property#>)
>>    5            (xsd: <http://www.w3.org/2001/XMLSchema#>)
>>    6            (fn: <http://www.w3.org/2005/xpath-functions#>)
>>    7            (rdfs: <http://www.w3.org/2000/01/rdf-schema#>)
>>    8            (text: <http://jena.apache.org/text#>)
>>    9            (foaf: <http://xmlns.com/foaf/0.1/>)
>>   10            (dc: <http://purl.org/dc/elements/1.1/>))
>>   11     (sequence
>>   12       (propfunc text:query
>>   13         ?uriBnF (foaf:givenName "$MY_STRING")
>>   14         (propfunc text:query
>>   15           ?uriBnF (foaf:familyName "roussea*")
>>   16           (table unit)))
>>   17       (bgp
>>   18         (triple ?uriBnF foaf:familyName ?nom)
>>   19         (triple ?uriBnF foaf:givenName ?prenom)
>>   20       ))))
>>
>> So first its doing the text search on your parameter (lines 12-13), 
>> then joining that to text search on your surname (lines 14-15) via 
>> substituting binds from your first text search and then finally 
>> joining that with the plain BGP (lines 17-19).
>>
>> So in this case the ordering of your property functions in the query 
>> is going to make a difference to the evaluation.  As I think Osma 
>> already pointed out there is a limit on the results returned from each 
>> text search so when these are separately executed and joined together 
>> you may only get a subset of the full results that your text index holds.
>>
>> Rob
>>
>> On 12/09/2018, 09:55, "Vincent Ventresque" 
>> <vi...@ens-lyon.fr> wrote:
>>
>>      Hi Lorenz,
>>      Thanks for your reply.
>>      > for me it sounds more like you've found a bug
>>      I'm not able to tell, just beginning to use Fuseki + Lucene.
>>      > I'm just referring to "Order of triple patterns in a BGP" here
>>      Could you please give a raw text URL for "Order of triple 
>> patterns in a
>>      BGP" (seems that the 'here' in your mail had a formatted link but I
>>      didn't receive the url in my mailbox).
>>      > The order of triple patterns in a BGP shouldn't matter
>>      I thought that it was better (for performance/speed) to begin 
>> with 1)
>>      constants and 2) variables having few solutions in the dataset. I've
>>      read something about Sparql optimization and algebra, but can't 
>> remember
>>      where. But maybe you're talking about the logic itself (A+B = B+A)?
>>      N.B. I find these questions very interesting, but I'm no Sparql
>>      specialist (nor a logician).
>>      Cheers,
>>      Vincent
>>      Le 12/09/2018 à 10:32, Lorenz B. a écrit :
>>      > Hi "VV",
>>      >
>>      > well, for me it sounds more like you've found a bug and are now 
>> doing a
>>      > workaround. Or at least something is strange and I'm just 
>> referring to
>>      > "Order of triple patterns in a BGP" here.
>>      >
>>      > The order of triple patterns in a BGP shouldn't matter - as far 
>> as I
>>      > know it's always a good old join on the intermediate result of the
>>      > evaluation of the triple patterns.
>>      >
>>      > Indeed, the limit of the text index lookup matters as the internal
>>      > ordering by Lucene is based on some Information Retrieval 
>> measure (close
>>      > to TF-IDF probably with default settings).
>>      >
>>      > But I guess, Osma and Andy will give you a better and more 
>> correct answer.
>>      >
>>      >
>>      > Cheers,
>>      > Lorenz
>>      >
>>      >> Hello Osma,
>>      >>
>>      >>
>>      >> Thank you very much for your reply, you solved the problem! 
>> I've made
>>      >> a few tests, both the order and the limit are important (see 
>> below).
>>      >>
>>      >> Just one more question : I thought that the "Roussea*" being less
>>      >> numerous than the "*J*", it would be more efficient to begin 
>> with the
>>      >> "Roussea*". Can you explain why it's the contrary?
>>      >>
>>      >> Best,
>>      >>
>>      >> VV.
>>      >>
>>      >>
>>      >> 1) --------- changing only the order --------------------------
>>      >>
>>      >> ?uriBnF text:query ( foaf:givenName "*J*" ) .
>>      >> ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>>      >> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
>>      >>
>>      >>  => 3 "Jean-Marie Rousseau" ... (even if I add a limit = 100 
>> 000 or 2
>>      >> 000 000)
>>      >>
>>      >> 2) --------- changing order + limit = 100 000 
>> --------------------------
>>      >>
>>      >> ?uriBnF text:query ( foaf:givenName "*J*" 100000 ) .
>>      >>  ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>>      >>  ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
>>      >>
>>      >>  => 54 entries but not "Jean-Jacques" !
>>      >>
>>      >> 3) --------- changing order + limit = 1 000 000
>>      >> --------------------------
>>      >>
>>      >>  ?uriBnF text:query ( foaf:givenName "*J*" 1000000 ) .
>>      >>  ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>>      >>  ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
>>      >>
>>      >> => 135 entries, including the 4 "Jean-Jacques", in  1.7 second
>>      >>
>>      >> 4) --------- test using filters (strstarts + contains)
>>      >> --------------------------
>>      >>
>>      >> ?uriBnF foaf:familyName ?nom
>>      >> filter(strstarts(?nom, "Roussea"))
>>      >> ?uriBnF foaf:givenName ?prenom
>>      >> filter(contains(?prenom, "J"))
>>      >>
>>      >> => 129 entries, 27 seconds [fewer results than
>>      >> "text:query ( foaf:givenName "*J*" 1000000)" because contains is
>>      >> case-sensitive?]
>>      >>
>>      >> -----------------------------------------------------
>>      >>
>>      >> More infos about the dataset :
>>      >>
>>      >> # 3 fields are indexed ( foaf:name + foaf:givenName are in the 
>> same
>>      >> named graph )
>>      >>
>>      >> -- dcterms:title = +/- 9.45 M.
>>      >>
>>      >> -- foaf:givenName = +/- 1.71 M.
>>      >>
>>      >> -- foaf:familyName = +/- 1.78 M.
>>      >>
>>      >> # config file :
>>      >>
>>      >> ----------------
>>      >>
>>      >> text:storeValues true ;
>>      >>     text:queryParser text:AnalyzingQueryParser ;
>>      >>     text:map (
>>      >>         [ text:field "title" ; text:predicate dcterms:title ;
>>      >>           text:analyzer [ a text:ConfigurableAnalyzer ;
>>      >>               text:tokenizer text:KeywordTokenizer ;
>>      >>               text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>>      >>           ] ]
>>      >>         [ text:field "familyName" ; text:predicate foaf:familyName ;
>>      >>           text:analyzer [ a text:ConfigurableAnalyzer ;
>>      >>               text:tokenizer text:KeywordTokenizer ;
>>      >>               text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>>      >>           ] ]
>>      >>         [ text:field "givenName" ; text:predicate foaf:givenName ;
>>      >>           text:analyzer [ a text:ConfigurableAnalyzer ;
>>      >>               text:tokenizer text:KeywordTokenizer ;
>>      >>               text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>>      >>           ] ]
>>      >>
>>      >>          ) .
>>      >>
>>      >>
>>      >>
>>      >>
>>      >>
>>      >>
>>      >> Le 10/09/2018 à 18:58, Osma Suominen a écrit :
>>      >>> Hello Vincent,
>>      >>>
>>      >>> The results you get don't seem quite right. As you say, with a
>>      >>> shorter query one would expect more results.
>>      >>>
>>      >>> One thing to do would be to check what results you get if you 
>> run the
>>      >>> queries individually. I think combining the two separate 
>> jena-text
>>      >>> queries (for foaf:familyName and foaf:givenName) may be part 
>> of the
>>      >>> problem here... So if you execute only the "roussea*" part of 
>> the
>>      >>> query, do you get the expected number of results? What about 
>> if you
>>      >>> only execute one of the givenName queries with no restriction on
>>      >>> familyName?
>>      >>>
>>      >>> Does it make a difference if you change the order of the
>>      >>> familyName and givenName clauses?
>>      >>>
>>      >>> One thing to consider is that Lucene queries always have a 
>> limit on
>>      >>> the number of results. With jena-text you can specify it as an
>>      >>> additional parameter, but if you leave it out, it will 
>> default to
>>      >>> 10000. My guess is that the givenName queries may generate more
>>      >>> results than 10000, and the results will then be cut off. This may
>>      >>> mean that you get many Jeans and Jacques's and Johns etc. but many of
>>      >>> the J. Rousseaus get cut off from the list. Try adding a large limit
>>      >>> parameter (say 100000 or more) to the text:query functions to see if
>>      >>> it helps. Like this:
>>      >>>
>>      >>>     ?uriBnF text:query ( foaf:givenName "*J*" 100000 )
>>      >>>
>>      >>> jena-text is not very good at combining multiple criteria. 
>> You can do
>>      >>> it with separate queries as you've done, but internally the 
>> queries
>>      >>> will run separately and the results will only be combined in 
>> Jena,
>>      >>> outside Lucene.
>>      >>>
>>      >>> -Osma
>>      >>>
>>      >>>
>>      >>>
>>      >>> Vincent Ventresque kirjoitti 10.09.2018 klo 13:03:
>>      >>>> Hello,
>>      >>>>
>>      >>>>
>>      >>>> I've made new tests with a slightly different dataset and
>>      >>>> configuration, the problem is the same.
>>      >>>>
>>      >>>> --- Could you please tell me if these results are normal (I 
>> expected
>>      >>>> a bigger list with fewer letters)?
>>      >>>>
>>      >>>> ?uriBnF text:query ( foaf:givenName "*J*" ) => 3 entries
>>      >>>>
>>      >>>> ?uriBnF text:query ( foaf:givenName "*Ja*" ) => 1 entry
>>      >>>>
>>      >>>> ?uriBnF text:query ( foaf:givenName "*Je*" ) => 11 entries
>>      >>>>
>>      >>>> ?uriBnF text:query ( foaf:givenName "*-J*" ) => 11 entries
>>      >>>>
>>      >>>> ?uriBnF text:query ( foaf:givenName "*Jea*" ) => 12 entries
>>      >>>>
>>      >>>> ?uriBnF text:query ( foaf:givenName "*Jac*" ) => 13 entries
>>      >>>>
>>      >>>> Here is the complete query :
>>      >>>>
>>      >>>> SELECT * WHERE { ?uriBnF text:query ( foaf:familyName 
>> "roussea*" ) .
>>      >>>> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
>>      >>>>
>>      >>>> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName 
>> ?prenom }
>>      >>>>
>>      >>>> N.B. : the dataset is quite large : 1,78 M family names 
>> indexed, and
>>      >>>> 1,71 M given names. I have 4 distinct "Jean-Jacques 
>> Rousseau" in the
>>      >>>> data, 713 family names containing "roussea", including 224 
>> compound
>>      >>>> given names.
>>      >>>>
>>      >>>> --- Do you know where to find more documentation about Lucene
>>      >>>> configuration, especially about tokenizers? (I read the
>>      >>>> jena.apache.org page, and also found useful explanations on the
>>      >>>> Skosmos wiki: https://github.com/NatLibFi/Skosmos )
>>      >>>>
>>      >>>>
>>      >>>> Thanks in advance,
>>      >>>>
>>      >>>> VV
>>      >>>>
>>      >>>>
>>      >>>>
>>      >>>>
>>      >>>>
>>      >>>>
>>      >>>>
>>      >>>> Le 19/07/2018 à 14:07, Vincent Ventresque a écrit :
>>      >>>>> Hello,
>>      >>>>>
>>      >>>>> I've just subscribed to the users@jena.apache.org list, and I
>>      >>>>> apologize if this mail is not sent properly.
>>      >>>>>
>>      >>>>> I'm trying to use Fuseki text:query, and have encountered 
>> several
>>      >>>>> issues. Here are my questions
>>      >>>>>
>>      >>>>> 1) Does text:query require a minimum number of characters 
>> to be
>>      >>>>> efficient?
>>      >>>>>
>>      >>>>> 2) Is performance linked to the number of fields indexed?
>>      >>>>>
>>      >>>>> 3) In order to retrieve strings containing hyphens, should 
>> I use
>>      >>>>> KeywordTokenizer in config file?
>>      >>>>>
>>      >>>>> ~~~ 1) Does text:query require a minimum number of 
>> characters to be
>>      >>>>> efficient? ~~~~~~~~~~~~~
>>      >>>>>
>>      >>>>> I've noticed that a query on indexed predicates 
>> (foaf:familyName
>>      >>>>> and foaf:givenName) returns more results when there are more
>>      >>>>> characters in the string :
>>      >>>>>
>>      >>>>> SELECT * WHERE {
>>      >>>>>
>>      >>>>> ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>>      >>>>>
>>      >>>>> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
>>      >>>>>
>>      >>>>> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName 
>> ?prenom .
>>      >>>>>
>>      >>>>> optional {?uriBnF bio:birth ?dateNaissance }
>>      >>>>>
>>      >>>>> }
>>      >>>>>
>>      >>>>> I was expecting that "Rousseau" + "Jean-Jacques" would be 
>> in the
>>      >>>>> results.
>>      >>>>>
>>      >>>>> => if  $MY_STRING = "j*", I get 0 results
>>      >>>>>
>>      >>>>> => if  $MY_STRING = "je*", I get 17 results, including
>>      >>>>> "Jean-Claude" & "Jean-Baptiste" BUT not "Jean-Jacques"
>>      >>>>>
>>      >>>>> => if  $MY_STRING = "jea*", I get 27 results, including 
>> "Jean-Jacques"
>>      >>>>>
>>      >>>>> I don't know anything about Lucene, but it looks very 
>> strange to me
>>      >>>>> : I expected the contrary (fewer letters = bigger results 
>> list).
>>      >>>>>
>>      >>>>>
>>      >>>>> ~~~ 2) Is performance linked to the number of fields indexed?
>>      >>>>> ~~~~~~~~~~~~~~~~~~~~~~~
>>      >>>>>
>>      >>>>> If I change the configuration and index only 
>> foaf:givenName, and
>>      >>>>> provide a constant for foaf:familyName, the query returns more
>>      >>>>> results :
>>      >>>>>
>>      >>>>> SELECT * WHERE {
>>      >>>>>
>>      >>>>> ?uriBnF foaf:familyName "Rousseau" .
>>      >>>>>
>>      >>>>> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
>>      >>>>>
>>      >>>>> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName 
>> ?prenom .
>>      >>>>>
>>      >>>>> optional {?uriBnF bio:birth ?dateNaissance }
>>      >>>>>
>>      >>>>> }
>>      >>>>>
>>      >>>>> => if  $MY_STRING = "j*", I get 7 results, whereas the first query
>>      >>>>> returned 0 results.
>>      >>>>>
>>      >>>>>
>>      >>>>> ~~~ 3) In order to retrieve strings containing hyphens, should I use
>>      >>>>> KeywordTokenizer in config file? ~~~~~~~~~~~~~
>>      >>>>>
>>      >>>>> With the same query, if $MY_STRING = "jean-ja*" :
>>      >>>>>
>>      >>>>> a) with simple configuration (cf. below), I get 0 results
>>      >>>>>
>>      >>>>> b) with KeywordTokenizer config (cf. below), I get 
>> "Jean-Jacques"
>>      >>>>>
>>      >>>>> Is it the right way to get "Jean-Jacques"?
>>      >>>>>
>>      >>>>>
>>      >>>>> Thanks in advance
>>      >>>>>
>>      >>>>> VV
>>      >>>>>
>>      >>>>>
>>      >>>>>
>>      >>>>> =============== SIMPLE CONFIGURATION ===================
>>      >>>>>
>>      >>>>> @prefix :        <#> .
>>      >>>>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>      >>>>> @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
>>      >>>>> @prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
>>      >>>>> @prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>      >>>>> @prefix text:    <http://jena.apache.org/text#> .
>>      >>>>> @prefix fuseki:  <http://jena.apache.org/fuseki#> .
>>      >>>>> @prefix foaf: <http://xmlns.com/foaf/0.1/> .
>>      >>>>> @prefix dcterms: <http://purl.org/dc/terms/> .
>>      >>>>>
>>      >>>>>
>>      >>>>>
>>      >>>>> [] rdf:type fuseki:Server ;
>>      >>>>>    .
>>      >>>>>
>>      >>>>>
>>      >>>>> ## Initialize TDB --------------------------------
>>      >>>>>
>>      >>>>> [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
>>      >>>>> tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
>>      >>>>> tdb:GraphTDB    rdfs:subClassOf  ja:Model .
>>      >>>>>
>>      >>>>> ## Initialize text query -------------------------------------
>>      >>>>> [] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
>>      >>>>> # A TextDataset is a regular dataset with a text index.
>>      >>>>> text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
>>      >>>>> # Lucene index
>>      >>>>> text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
>>      >>>>>
>>      >>>>> ## ---------------------------------------------------------------
>>      >>>>> ## This URI must be fixed - it's used to assemble the text dataset.
>>      >>>>>
>>      >>>>> :text_dataset rdf:type     text:TextDataset ;
>>      >>>>> #    text:dataset   <#dataset> ;
>>      >>>>>     text:dataset :tdb_dataset_readwrite ;
>>      >>>>> #    text:index     <#indexLucene> ;
>>      >>>>>     text:index :My_Lucene_index ;
>>      >>>>>     .
>>      >>>>>
>>      >>>>> # A TDB dataset used for RDF storage ------------------------------
>>      >>>>> :tdb_dataset_readwrite
>>      >>>>>         a             tdb:DatasetTDB ;
>>      >>>>>         tdb:location  "$_BnF_text" ;
>>      >>>>> .
>>      >>>>>
>>      >>>>> # Text index description ------------------------------------------
>>      >>>>> #<#indexLucene> a text:TextIndexLucene ;
>>      >>>>> :My_Lucene_index a text:TextIndexLucene ;
>>      >>>>>     text:directory <file:$_Lucene> ;
>>      >>>>>     text:entityMap <#entMap> ;
>>      >>>>>     .
>>      >>>>>
>>      >>>>> # Mapping in the index ---------------------------------------------
>>      >>>>> # URI stored in field "uri"
>>      >>>>> <#entMap> a text:EntityMap ;
>>      >>>>>     text:entityField      "uri" ;
>>      >>>>>     text:defaultField     "familyName" ;
>>      >>>>>     text:map (
>>      >>>>>          [ text:field "familyName" ; text:predicate foaf:familyName ]
>>      >>>>>          [ text:field "givenName" ; text:predicate foaf:givenName ]
>>      >>>>>          ) .
>>      >>>>>
>>      >>>>> :service_tdb_all  a                   fuseki:Service ;
>>      >>>>>         rdfs:label                    "TDB BnF_text" ;
>>      >>>>>         fuseki:dataset               :text_dataset ;
>>      >>>>>         fuseki:name                   "BnF_text" ;
>>      >>>>>         fuseki:serviceQuery           "query" , "sparql" ;
>>      >>>>>         fuseki:serviceReadGraphStore  "get" ;
>>      >>>>>         fuseki:serviceReadWriteGraphStore "data" ;
>>      >>>>>         fuseki:serviceUpdate          "update" ;
>>      >>>>>         fuseki:serviceUpload          "upload" .
>>      >>>>>
>>      >>>>>
>>      >>>>> =========== KEYWORD TOKENIZER CONFIGURATION ================
>>      >>>>>
>>      >>>>> @prefix :        <#> .
>>      >>>>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>      >>>>> @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
>>      >>>>> @prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
>>      >>>>> @prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>      >>>>> @prefix text:    <http://jena.apache.org/text#> .
>>      >>>>> @prefix fuseki:  <http://jena.apache.org/fuseki#> .
>>      >>>>> @prefix foaf: <http://xmlns.com/foaf/0.1/> .
>>      >>>>> @prefix dcterms: <http://purl.org/dc/terms/> .
>>      >>>>>
>>      >>>>>
>>      >>>>>
>>      >>>>> [] rdf:type fuseki:Server ;
>>      >>>>>
>>      >>>>>    .
>>      >>>>>
>>      >>>>>
>>      >>>>> ## Initialize TDB --------------------------------
>>      >>>>>
>>      >>>>> [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
>>      >>>>> tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
>>      >>>>> tdb:GraphTDB    rdfs:subClassOf  ja:Model .
>>      >>>>>
>>      >>>>> ## Initialize text query -------------------------------------
>>      >>>>> [] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
>>      >>>>> # A TextDataset is a regular dataset with a text index.
>>      >>>>> text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
>>      >>>>> # Lucene index
>>      >>>>> text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
>>      >>>>>
>>      >>>>> ## ---------------------------------------------------------------
>>      >>>>>
>>      >>>>>
>>      >>>>> :text_dataset rdf:type     text:TextDataset ;
>>      >>>>> #    text:dataset   <#dataset> ;
>>      >>>>>     text:dataset :tdb_dataset_readwrite ;
>>      >>>>> #    text:index     <#indexLucene> ;
>>      >>>>>     text:index :My_Lucene_index ;
>>      >>>>>     .
>>      >>>>>
>>      >>>>> # A TDB dataset used for RDF storage ------------------------------
>>      >>>>> :tdb_dataset_readwrite
>>      >>>>>         a             tdb:DatasetTDB ;
>>      >>>>>         tdb:location  "$_BnF_text" ;
>>      >>>>> .
>>      >>>>>
>>      >>>>> # Text index description ------------------------------------------
>>      >>>>> #<#indexLucene> a text:TextIndexLucene ;
>>      >>>>> :My_Lucene_index a text:TextIndexLucene ;
>>      >>>>>     text:directory <file:$_Lucene> ;
>>      >>>>>     text:entityMap <#entMap> ;
>>      >>>>>     .
>>      >>>>>
>>      >>>>> # Mapping in the index ---------------------------------------------
>>      >>>>> # URI stored in field "uri"
>>      >>>>> <#entMap> a text:EntityMap ;
>>      >>>>>     text:entityField      "uri" ;
>>      >>>>>     text:defaultField     "givenName" ;
>>      >>>>>     text:map (
>>      >>>>>          [ text:field "familyName" ; text:predicate foaf:familyName ;
>>      >>>>>            text:analyzer [ a text:ConfigurableAnalyzer ;
>>      >>>>>                text:tokenizer text:KeywordTokenizer ;
>>      >>>>>                text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>>      >>>>>            ] ]
>>      >>>>>          [ text:field "givenName" ; text:predicate foaf:givenName ;
>>      >>>>>            text:analyzer [ a text:ConfigurableAnalyzer ;
>>      >>>>>                text:tokenizer text:KeywordTokenizer ;
>>      >>>>>                text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>>      >>>>>            ] ]
>>      >>>>>          ) .
>>      >>>>>
>>      >>>>> :service_tdb_all  a                   fuseki:Service ;
>>      >>>>>         rdfs:label                    "TDB BnF_text" ;
>>      >>>>>         fuseki:dataset               :text_dataset ; ### works for the text index
>>      >>>>>         fuseki:name                   "BnF_text" ;
>>      >>>>>         fuseki:serviceQuery           "query" , "sparql" ;
>>      >>>>>         fuseki:serviceReadGraphStore  "get" ;
>>      >>>>>         fuseki:serviceReadWriteGraphStore "data" ;
>>      >>>>>         fuseki:serviceUpdate          "update" ;
>>      >>>>>         fuseki:serviceUpload          "upload" .

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: fuseki text:query : strange results + Lucene configuration

Posted by Vincent Ventresque <vi...@ens-lyon.fr>.
Hello Rob


Thank you for all these elements.

 > there is a limit on the results returned from each text search so 
when these are *separately executed and joined together* you may only 
get a subset of the full results

Could you please explain what a 'non-separate' query would be? Do you mean:

?s text:query ( "givenName:\"*J*\" AND familyName:\"Roussea\"" ) ?

I wrote two separate text:query patterns (1st = givenName + 2nd = familyName)
because I had read that "when a query is to involve two or more properties
then it is expressed at the SPARQL level, as it were, versus in Lucene's query
language"
(https://jena.apache.org/documentation/query/text-query.html#queries-across-multiple-fields).
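
For illustration only, the single-query form asked about above can be assembled as one Lucene query string. This is a minimal Python sketch: the helper names combined_lucene_query and as_sparql_literal are hypothetical, the field names are the ones from the config quoted in this thread, and whether a given jena-text version and its configured query parser accept field-qualified clauses like this is an assumption to check against the jena-text documentation.

```python
from typing import Dict


def combined_lucene_query(clauses: Dict[str, str]) -> str:
    """Join field:term clauses with AND (hypothetical helper, not a Jena API)."""
    return " AND ".join(f"{field}:{term}" for field, term in clauses.items())


def as_sparql_literal(text: str) -> str:
    """Wrap text in double quotes, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


# Field names taken from the config in this thread; terms are lowercased
# because the analyzer shown there applies LowerCaseFilter.
query = combined_lucene_query({"givenName": "*j*", "familyName": "roussea*"})

# One text:query pattern carrying both criteria, so a single Lucene search
# would apply them together (assumption: the parser accepts this syntax).
pattern = f"?s text:query ( {as_sparql_literal(query)} ) ."
print(pattern)
```

The intent of this shape is that both criteria reach Lucene in one search, instead of two searches whose results are joined afterwards.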

Vincent


  

On 12/09/2018 at 11:52, Rob Vesse wrote:
> Well the order of triple patterns shouldn't matter too much when you have a pure BGP (albeit the optimiser might pick a bad order in some cases)
>
> But we aren't talking about pure BGPs here: having the text:query triples results in the BGP being broken up into joins of several property functions, with the regular triple patterns interspersed among them.  So if we take your query and run it through Jena's algebra compiler (you can do this online at http://sparql.org/validate/query), we get the following:
>
>    1 (base <http://example/base/>
>    2   (prefix ((rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>)
>    3            (owl: <http://www.w3.org/2002/07/owl#>)
>    4            (apf: <http://jena.hpl.hp.com/ARQ/property#>)
>    5            (xsd: <http://www.w3.org/2001/XMLSchema#>)
>    6            (fn: <http://www.w3.org/2005/xpath-functions#>)
>    7            (rdfs: <http://www.w3.org/2000/01/rdf-schema#>)
>    8            (text: <http://jena.apache.org/text#>)
>    9            (foaf: <http://xmlns.com/foaf/0.1/>)
>   10            (dc: <http://purl.org/dc/elements/1.1/>))
>   11     (sequence
>   12       (propfunc text:query
>   13         ?uriBnF (foaf:givenName "$MY_STRING")
>   14         (propfunc text:query
>   15           ?uriBnF (foaf:familyName "roussea*")
>   16           (table unit)))
>   17       (bgp
>   18         (triple ?uriBnF foaf:familyName ?nom)
>   19         (triple ?uriBnF foaf:givenName ?prenom)
>   20       ))))
>
> So first it's doing the text search on your parameter (lines 12-13), then joining that to the text search on your surname (lines 14-15) by substituting the bindings from your first text search, and then finally joining that with the plain BGP (lines 17-19).
>
> So in this case the ordering of your property functions in the query is going to make a difference to the evaluation.  As I think Osma already pointed out, there is a limit on the results returned from each text search, so when these are executed separately and joined together you may only get a subset of the full results that your text index holds.
>
> Rob


Re: fuseki text:query : strange results + Lucene configuration

Posted by Rob Vesse <rv...@dotnetrdf.org>.
Well, the order of triple patterns shouldn't matter too much when you have a pure BGP (although the optimiser might pick a bad order in some cases).

But we aren't talking about pure BGPs here: having the text:query triples results in the BGP being broken up into joins of several property functions, with the regular triple patterns interspersed among them.  So if we take your query and run it through Jena's algebra compiler (you can do this online at http://sparql.org/validate/query), we get the following:

  1 (base <http://example/base/>
  2   (prefix ((rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>)
  3            (owl: <http://www.w3.org/2002/07/owl#>)
  4            (apf: <http://jena.hpl.hp.com/ARQ/property#>)
  5            (xsd: <http://www.w3.org/2001/XMLSchema#>)
  6            (fn: <http://www.w3.org/2005/xpath-functions#>)
  7            (rdfs: <http://www.w3.org/2000/01/rdf-schema#>)
  8            (text: <http://jena.apache.org/text#>)
  9            (foaf: <http://xmlns.com/foaf/0.1/>)
 10            (dc: <http://purl.org/dc/elements/1.1/>))
 11     (sequence
 12       (propfunc text:query
 13         ?uriBnF (foaf:givenName "$MY_STRING")
 14         (propfunc text:query
 15           ?uriBnF (foaf:familyName "roussea*")
 16           (table unit)))
 17       (bgp
 18         (triple ?uriBnF foaf:familyName ?nom)
 19         (triple ?uriBnF foaf:givenName ?prenom)
 20       ))))

So first it's doing the text search on your parameter (lines 12-13), then joining that to the text search on your surname (lines 14-15) by substituting the bindings from your first text search, and then finally joining that with the plain BGP (lines 17-19).

So in this case the ordering of your property functions in the query is going to make a difference to the evaluation.  As I think Osma already pointed out, there is a limit on the results returned from each text search, so when these are executed separately and joined together you may only get a subset of the full results that your text index holds.
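
The effect of a per-search limit on a join can be reproduced with a toy model. The following Python sketch is not jena-text code: text_search and joined_search are hypothetical helpers, the two dictionaries stand in for the Lucene fields, and hits are cut off in insertion order, whereas Lucene ranks by score. It only shows how capping each search before joining can hide matches that really exist.

```python
from typing import Callable, Dict, List


def text_search(index: Dict[str, str],
                matches: Callable[[str], bool],
                limit: int) -> List[str]:
    """Return at most `limit` subjects whose indexed value matches."""
    hits = [subject for subject, value in index.items() if matches(value)]
    return hits[:limit]  # everything past the limit is silently dropped


def joined_search(given_index: Dict[str, str],
                  family_index: Dict[str, str],
                  limit: int) -> List[str]:
    """Run both capped searches separately, then join on the subject."""
    given_hits = text_search(given_index, lambda v: "j" in v, limit)
    family_hits = set(
        text_search(family_index, lambda v: v.startswith("roussea"), limit)
    )
    return [subject for subject in given_hits if subject in family_hits]


# Toy data: 100 people whose given name contains "j";
# only the last five are called "rousseau".
given_index = {f"p{i}": "jean" for i in range(100)}
family_index = {
    f"p{i}": ("rousseau" if i >= 95 else "martin") for i in range(100)
}

small = joined_search(given_index, family_index, limit=10)
large = joined_search(given_index, family_index, limit=1000)
print(len(small), len(large))
```

With the small limit the given-name search is cut off before it reaches any of the Rousseaus, so the join is empty. With a limit larger than the number of matches, all five come back, which mirrors the behaviour reported in this thread when the limit was raised.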

Rob

On 12/09/2018, 09:55, "Vincent Ventresque" <vi...@ens-lyon.fr> wrote:

    Hi Lorenz,
    
    
    Thanks for your reply.
    
    > for me it sounds more like you've found a bug
    
    I'm not able to tell, just beginning to use Fuseki + Lucene.
    
    > I'm just referring to "Order of triple patterns in a BGP" here
    
    Could you please give a raw text URL for "Order of triple patterns in a
    BGP" (seems that the 'here' in your mail had a formatted link but I
    didn't receive the url in my mailbox).
    
    > The order of triple patterns in a BGP shouldn't matter
    
    I thought that it was better (for performance/speed) to begin with 1)
    constants and 2) variables having few solutions in the dataset. I've
    read something about SPARQL optimization and algebra, but can't remember
    where. But maybe you're talking about the logic itself (A+B = B+A)?
    N.B. I find these questions very interesting, but I'm no SPARQL
    specialist (nor a logician).
    
    Cheers,
    
    Vincent
    
    
    
    
    On 12/09/2018 at 10:32, Lorenz B. wrote:
    > Hi "VV",
    >
    > well, for me it sounds more like you've found a bug and are now doing a
    > workaround. Or at least something is strange and I'm just referring to
    > "Order of triple patterns in a BGP" here.
    >
    > The order of triple patterns in a BGP shouldn't matter - as far as I
    > know it's always a good old join on the intermediate result of the
    > evaluation of the triple patterns.
    >
    > Indeed, the limit of the text index lookup matters as the internal
    > ordering by Lucene is based on some Information Retrieval measure (close
    > to TF-IDF probably with default settings).
    >
    > But I guess, Osma and Andy will give you a better and more correct answer.
    >
    >
    > Cheers,
    > Lorenz
    >
    >> Hello Osma,
    >>
    >>
    >> Thank you very much for your reply, you solved the problem! I've made
    >> a few tests, both the order and the limit are important (see below).
    >>
    >> Just one more question: I thought that, the "Roussea*" matches being
    >> less numerous than the "*J*" matches, it would be more efficient to
    >> begin with "Roussea*". Can you explain why it's the opposite?
    >>
    >> Best,
    >>
    >> VV.
    >>
    >>
    >> 1) --------- changing only the order --------------------------
    >>
    >> ?uriBnF text:query ( foaf:givenName "*J*" ) .
    >> ?uriBnF text:query ( foaf:familyName "roussea*" ) .
    >> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
    >>
    >>  => 3 "Jean-Marie Rousseau" ... (even if I add a limit = 100 000 or
    >> 2 000 000)
    >>
    >> 2) --------- changing order + limit = 100 000 --------------------------
    >>
    >> ?uriBnF text:query ( foaf:givenName "*J*" 100000 ) .
    >>  ?uriBnF text:query ( foaf:familyName "roussea*" ) .
    >>  ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
    >>
    >>  => 54 entries but not "Jean-Jacques" !
    >>
    >> 3) --------- changing order + limit = 1 000 000
    >> --------------------------
    >>
    >>  ?uriBnF text:query ( foaf:givenName "*J*" 1000000 ) .
    >>  ?uriBnF text:query ( foaf:familyName "roussea*" ) .
    >>  ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
    >>
    >> => 135 entries, including the 4 "Jean-Jacques", in 1.7 seconds
    >>
    >> 4) --------- test using filters (strstarts + contains)
    >> --------------------------
    >>
    >> ?uriBnF foaf:familyName ?nom
    >> filter(strstarts(?nom, "Roussea"))
    >> ?uriBnF foaf:givenName ?prenom
    >> filter(contains(?prenom, "J"))
    >>
    >> => 129 entries, 27 seconds [fewer results than
    >> "text:query ( foaf:givenName "*J*" 1000000)" because contains is
    >> case-sensitive?]
    >>
    >> -----------------------------------------------------
    >>
    >> More information about the dataset:
    >>
    >> # 3 fields are indexed ( foaf:name + foaf:givenName are in the same
    >> named graph )
    >>
    >> -- dcterms:title = +/- 9.45 M.
    >>
    >> -- foaf:givenName = +/- 1.71 M.
    >>
    >> -- foaf:familyName = +/- 1.78 M.
    >>
    >> # config file :
    >>
    >> ----------------
    >>
    >> text:storeValues true ;
    >>     text:queryParser text:AnalyzingQueryParser ;
    >>     text:map (
    >>         [ text:field "title" ; text:predicate dcterms:title ;
    >>         text:analyzer [ a text:ConfigurableAnalyzer ;
    >>          text:tokenizer text:KeywordTokenizer ;
    >>          text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
    >>          ] ]
    >>          [ text:field "familyName" ; text:predicate foaf:familyName ;
    >>         text:analyzer [ a text:ConfigurableAnalyzer ;
    >>          text:tokenizer text:KeywordTokenizer ;
    >>          text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
    >>          ] ]
    >>          [ text:field "givenName" ; text:predicate foaf:givenName ;
    >>         text:analyzer [ a text:ConfigurableAnalyzer ;
    >>          text:tokenizer text:KeywordTokenizer ;
    >>          text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
    >>          ] ]
    >>
    >>          ) .
    >>
    >>
    >>
    >>
    >>  
    >>
    >> On 10/09/2018 at 18:58, Osma Suominen wrote:
    >>> Hello Vincent,
    >>>
    >>> The results you get don't seem quite right. As you say, with a
    >>> shorter query one would expect more results.
    >>>
    >>> One thing to do would be to check what results you get if you run the
    >>> queries individually. I think combining the two separate jena-text
    >>> queries (for foaf:familyName and foaf:givenName) may be part of the
    >>> problem here... So if you execute only the "roussea*" part of the
    >>> query, do you get the expected number of results? What about if you
    >>> only execute one of the givenName queries with no restriction on
    >>> familyName?
    >>>
    >>> Does it make a difference if you change the order of the familyName
    >>> and givenName clauses?
    >>>
    >>> One thing to consider is that Lucene queries always have a limit on
    >>> the number of results. With jena-text you can specify it as an
    >>> additional parameter, but if you leave it out, it will default to
    >>> 10000. My guess is that the givenName queries may generate more
    >>> results than 10000, and the results will then be cut off. This may
    >>> mean that you get many Jeans and Jacques's and Johns etc., but many of
    >>> the J. Rousseaus get cut off from the list. Try adding a large limit
    >>> parameter (say 100000 or more) to the text:query functions to see if
    >>> it helps. Like this:
    >>>
    >>>     ?uriBnF text:query ( foaf:givenName "*J*" 100000 )
    >>>
    >>> jena-text is not very good at combining multiple criteria. You can do
    >>> it with separate queries as you've done, but internally the queries
    >>> will run separately and the results will only be combined in Jena,
    >>> outside Lucene.
    >>>
    >>> -Osma
    >>>
    >>>
    >>>
    >>> Vincent Ventresque wrote on 10.09.2018 at 13:03:
    >>>> Hello,
    >>>>
    >>>>
    >>>> I've made new tests with a slightly different dataset and
    >>>> configuration, the problem is the same.
    >>>>
    >>>> --- Could you please tell me if these results are normal (I expected
    >>>> a bigger list with fewer letters)?
    >>>>
    >>>> ?uriBnF text:query ( foaf:givenName "*J*" ) => 3 entries
    >>>>
    >>>> ?uriBnF text:query ( foaf:givenName "*Ja*" ) => 1 entry
    >>>>
    >>>> ?uriBnF text:query ( foaf:givenName "*Je*" ) => 11 entries
    >>>>
    >>>> ?uriBnF text:query ( foaf:givenName "*-J*" ) => 11 entries
    >>>>
    >>>> ?uriBnF text:query ( foaf:givenName "*Jea*" ) => 12 entries
    >>>>
    >>>> ?uriBnF text:query ( foaf:givenName "*Jac*" ) => 13 entries
    >>>>
    >>>> Here is the complete query :
    >>>>
    >>>> SELECT * WHERE { ?uriBnF text:query ( foaf:familyName "roussea*" ) .
    >>>> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
    >>>>
    >>>> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom }
    >>>>
    >>>> N.B. : the dataset is quite large : 1,78 M family names indexed, and
    >>>> 1,71 M given names. I have 4 distinct "Jean-Jacques Rousseau" in the
    >>>> data, 713 family names containing "roussea", including 224 compound
    >>>> given names.
    >>>>
    >>>> --- Do you know where to find more documentation about Lucene
    >>>> configuration (I read jena.apache.org page + , and also found useful
    >>>> explanations on Skosmos wiki https://github.com/NatLibFi/Skosmos ),
    >>>> especially about tokenizers  ?
    >>>>
    >>>>
    >>>> Thanks in advance,
    >>>>
    >>>> VV
    >>>>
    >>>>
    >>>>
    >>>>
    >>>>
    >>>>
    >>>>
    >>>> On 19/07/2018 at 14:07, Vincent Ventresque wrote:
    >>>>> Hello,
    >>>>>
    >>>>> I've just subscribed to the users@jena.apache.org list, and I
    >>>>> apologize if this mail is not sent properly.
    >>>>>
    >>>>> I'm trying to use Fuseki text:query, and have encountered several
    >>>>> issues. Here are my questions
    >>>>>
    >>>>> 1) Does text:query require a minimum number of characters to be
    >>>>> efficient?
    >>>>>
    >>>>> 2) Is performance linked to the number of fields indexed?
    >>>>>
    >>>>> 3) In order to retrieve strings containing hyphens, should I use
    >>>>> KeywordTokenizer in config file?
    >>>>>
    >>>>> ~~~ 1) Does text:query require a minimum number of characters to be
    >>>>> efficient? ~~~~~~~~~~~~~
    >>>>>
    >>>>> I've noticed that a query on indexed predicates (foaf:familyName
    >>>>> and foaf:givenName) returns more results when there are more
    >>>>> characters in the string :
    >>>>>
    >>>>> SELECT * WHERE {
    >>>>>
    >>>>> ?uriBnF text:query ( foaf:familyName "roussea*" ) .
    >>>>>
    >>>>> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
    >>>>>
    >>>>> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom .
    >>>>>
    >>>>> optional {?uriBnF bio:birth ?dateNaissance }
    >>>>>
    >>>>> }
    >>>>>
    >>>>> I was expecting that "Rousseau" + "Jean-Jacques" would be in the
    >>>>> results.
    >>>>>
    >>>>> => if  $MY_STRING = "j*", I get 0 results
    >>>>>
    >>>>> => if  $MY_STRING = "je*", I get 17 results, including
    >>>>> "Jean-Claude" & "Jean-Baptiste" BUT not "Jean-Jacques"
    >>>>>
    >>>>> => if  $MY_STRING = "jea*", I get 27 results, including "Jean-Jacques"
    >>>>>
    >>>>> I don't know anything about Lucene, but it looks very strange to me
    >>>>> : I expected the contrary (fewer letters = bigger results list).
    >>>>>
    >>>>>
    >>>>> ~~~ 2) Is performance linked to the number of fields indexed?
    >>>>> ~~~~~~~~~~~~~~~~~~~~~~~
    >>>>>
    >>>>> If I change the configuration and index only foaf:givenName, and
    >>>>> provide a constant for foaf:familyName, the query returns more
    >>>>> results :
    >>>>>
    >>>>> SELECT * WHERE {
    >>>>>
    >>>>> ?uriBnF foaf:familyName "Rousseau" .
    >>>>>
    >>>>> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
    >>>>>
    >>>>> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom .
    >>>>>
    >>>>> optional {?uriBnF bio:birth ?dateNaissance }
    >>>>>
    >>>>> }
    >>>>>
    >>>>> => if  $MY_STRING = "j*", I get 7 results, whereas the first query
    >>>>> returned 0 results.
    >>>>>
    >>>>>
    >>>>> ~~~ 3) In order to retrieve strings containing hyphens, should I use
    >>>>> KeywordTokenizer in config file? ~~~~~~~~~~~~~
    >>>>>
    >>>>> With the same query, if $MY_STRING = "jean-ja*" :
    >>>>>
    >>>>> a) with simple configuration (cf. below), I get 0 results
    >>>>>
    >>>>> b) with KeywordTokenizer config (cf. below), I get "Jean-Jacques"
    >>>>>
    >>>>> Is it the right way to get "Jean-Jacques"?
    >>>>>
    >>>>>
    >>>>> Thanks in advance
    >>>>>
    >>>>> VV
    >>>>>
    >>>>>
    >>>>>
    >>>>> =============== SIMPLE CONFIGURATION ===================
    >>>>>
    >>>>> @prefix :        <#> .
    >>>>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    >>>>> @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
    >>>>> @prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
    >>>>> @prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
    >>>>> @prefix text:    <http://jena.apache.org/text#> .
    >>>>> @prefix fuseki:  <http://jena.apache.org/fuseki#> .
    >>>>> @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    >>>>> @prefix dcterms: <http://purl.org/dc/terms/> .
    >>>>>
    >>>>>
    >>>>>
    >>>>> [] rdf:type fuseki:Server ;
    >>>>>    .
    >>>>>
    >>>>>
    >>>>> ## Initialize TDB --------------------------------
    >>>>>
    >>>>> [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
    >>>>> tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
    >>>>> tdb:GraphTDB    rdfs:subClassOf  ja:Model .
    >>>>>
    >>>>> ## Initialize text query -------------------------------------
    >>>>> [] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
    >>>>> # A TextDataset is a regular dataset with a text index.
    >>>>> text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
    >>>>> # Lucene index
    >>>>> text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
    >>>>>
    >>>>> ## ---------------------------------------------------------------
    >>>>> ## This URI must be fixed - it's used to assemble the text dataset.
    >>>>>
    >>>>> :text_dataset rdf:type     text:TextDataset ;
    >>>>> #    text:dataset   <#dataset> ;
    >>>>>     text:dataset :tdb_dataset_readwrite ;
    >>>>> #    text:index     <#indexLucene> ;
    >>>>>     text:index :My_Lucene_index ;
    >>>>>     .
    >>>>>
    >>>>> # A TDB dataset used for RDF storage ------------------------------
    >>>>> :tdb_dataset_readwrite
    >>>>>         a             tdb:DatasetTDB ;
    >>>>>         tdb:location  "$_BnF_text" ;
    >>>>> .
    >>>>>
    >>>>> # Text index description ------------------------------------------
    >>>>> #<#indexLucene> a text:TextIndexLucene ;
    >>>>> :My_Lucene_index a text:TextIndexLucene ;
    >>>>>     text:directory <file:$_Lucene> ;
    >>>>>     text:entityMap <#entMap> ;
    >>>>>     .
    >>>>>
    >>>>> # Mapping in the index ---------------------------------------------
    >>>>> # URI stored in field "uri"
    >>>>> <#entMap> a text:EntityMap ;
    >>>>>     text:entityField      "uri" ;
    >>>>>     text:defaultField     "familyName" ;
    >>>>>     text:map (
    >>>>>          [ text:field "familyName" ; text:predicate foaf:familyName ]
    >>>>>          [ text:field "givenName" ; text:predicate foaf:givenName ]
    >>>>>          ) .
    >>>>>
    >>>>> :service_tdb_all  a                   fuseki:Service ;
    >>>>>         rdfs:label                    "TDB BnF_text" ;
    >>>>>         fuseki:dataset               :text_dataset ;
    >>>>>         fuseki:name                   "BnF_text" ;
    >>>>>         fuseki:serviceQuery           "query" , "sparql" ;
    >>>>>         fuseki:serviceReadGraphStore  "get" ;
    >>>>>         fuseki:serviceReadWriteGraphStore "data" ;
    >>>>>         fuseki:serviceUpdate          "update" ;
    >>>>>         fuseki:serviceUpload          "upload" .
    >>>>>
    >>>>>
    >>>>> =========== KEYWORD TOKENIZER CONFIGURATION ================
    >>>>>
    >>>>> @prefix :        <#> .
    >>>>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    >>>>> @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
    >>>>> @prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
    >>>>> @prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
    >>>>> @prefix text:    <http://jena.apache.org/text#> .
    >>>>> @prefix fuseki:  <http://jena.apache.org/fuseki#> .
    >>>>> @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    >>>>> @prefix dcterms: <http://purl.org/dc/terms/> .
    >>>>>
    >>>>>
    >>>>>
    >>>>> [] rdf:type fuseki:Server ;
    >>>>>
    >>>>>    .
    >>>>>
    >>>>>
    >>>>> ## Initialize TDB --------------------------------
    >>>>>
    >>>>> [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
    >>>>> tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
    >>>>> tdb:GraphTDB    rdfs:subClassOf  ja:Model .
    >>>>>
    >>>>> ## Initialize text query -------------------------------------
    >>>>> [] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
    >>>>> # A TextDataset is a regular dataset with a text index.
    >>>>> text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
    >>>>> # Lucene index
    >>>>> text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
    >>>>>
    >>>>> ## ---------------------------------------------------------------
    >>>>>
    >>>>>
    >>>>> :text_dataset rdf:type     text:TextDataset ;
    >>>>> #    text:dataset   <#dataset> ;
    >>>>>     text:dataset :tdb_dataset_readwrite ;
    >>>>> #    text:index     <#indexLucene> ;
    >>>>>     text:index :My_Lucene_index ;
    >>>>>     .
    >>>>>
    >>>>> # A TDB dataset used for RDF storage ------------------------------
    >>>>> :tdb_dataset_readwrite
    >>>>>         a             tdb:DatasetTDB ;
    >>>>>         tdb:location  "$_BnF_text" ;
    >>>>> .
    >>>>>
    >>>>> # Text index description ------------------------------------------
    >>>>> #<#indexLucene> a text:TextIndexLucene ;
    >>>>> :My_Lucene_index a text:TextIndexLucene ;
    >>>>>     text:directory <file:$_Lucene> ;
    >>>>>     text:entityMap <#entMap> ;
    >>>>>     .
    >>>>>
    >>>>> # Mapping in the index ---------------------------------------------
    >>>>> # URI stored in field "uri"
    >>>>> <#entMap> a text:EntityMap ;
    >>>>>     text:entityField      "uri" ;
    >>>>>     text:defaultField     "givenName" ;
    >>>>>     text:map (
    >>>>>
    >>>>>          [ text:field "familyName" ; text:predicate foaf:familyName ;
    >>>>>          text:analyzer [ a text:ConfigurableAnalyzer ;
    >>>>>                text:tokenizer text:KeywordTokenizer ;
    >>>>>                text:filters (text:ASCIIFoldingFilter
    >>>>> text:LowerCaseFilter)
    >>>>>              ] ]
    >>>>>          [ text:field "givenName" ; text:predicate foaf:givenName ;
    >>>>>         text:analyzer [ a text:ConfigurableAnalyzer ;
    >>>>>          text:tokenizer text:KeywordTokenizer ;
    >>>>>          text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
    >>>>>          ] ]
    >>>>>          ) .
    >>>>>
    >>>>> :service_tdb_all  a                   fuseki:Service ;
    >>>>>         rdfs:label                    "TDB BnF_text" ;
    >>>>>         fuseki:dataset               :text_dataset ; ### works for the text index
    >>>>>         fuseki:name                   "BnF_text" ;
    >>>>>         fuseki:serviceQuery           "query" , "sparql" ;
    >>>>>         fuseki:serviceReadGraphStore  "get" ;
    >>>>>         fuseki:serviceReadWriteGraphStore "data" ;
    >>>>>         fuseki:serviceUpdate          "update" ;
    >>>>>         fuseki:serviceUpload          "upload" .
    >>>>>
    
    





Re: fuseki text:query : strange results + Lucene configuration

Posted by Vincent Ventresque <vi...@ens-lyon.fr>.
Hi Lorenz,


Thanks for your reply.

> for me it sounds more like you've found a bug

I can't tell; I'm only just beginning to use Fuseki and Lucene.

> I'm just referring to "Order of triple patterns in a BGP" here

Could you please give the plain-text URL for "Order of triple patterns in
a BGP"? It seems that the 'here' in your mail was a formatted link, but
the URL didn't come through in my mailbox.

> The order of triple patterns in a BGP shouldn't matter

I thought it was better (for performance) to begin with 1) constants
and 2) variables that have few solutions in the dataset. I've read
something about SPARQL optimization and algebra, but can't remember
where. But maybe you're talking about the logic itself (A+B = B+A)?
N.B. I find these questions very interesting, but I'm neither a SPARQL
specialist nor a logician.
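
To make the "A+B = B+A" reading concrete, here is a minimal sketch in
Python (not Jena code). It assumes each pattern yields a list of
solution mappings and that a BGP is their join on shared variables; the
names natural_join and as_set and the toy data are hypothetical. It only
shows that the logical result does not depend on the order, and says
nothing about speed.

```python
def natural_join(left, right):
    """Join two lists of solution mappings (dicts) on their shared variables."""
    joined = []
    for a in left:
        for b in right:
            shared = a.keys() & b.keys()
            if all(a[k] == b[k] for k in shared):
                joined.append({**a, **b})
    return joined


def as_set(solutions):
    """Order-insensitive view of a list of solution mappings."""
    return {frozenset(s.items()) for s in solutions}


# Hypothetical bindings for the two patterns (toy data, not from the dataset).
family = [{"s": "p1", "nom": "Rousseau"}, {"s": "p2", "nom": "Rousseau"}]
given = [{"s": "p1", "prenom": "Jean-Jacques"}, {"s": "p3", "prenom": "Jean"}]

ab = natural_join(family, given)  # familyName pattern first
ba = natural_join(given, family)  # givenName pattern first

print(as_set(ab) == as_set(ba))  # True
```

Both orders give the same single solution for p1, so a difference in
results between the two orders would point to something other than the
join itself.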

Cheers,

Vincent




On 12/09/2018 at 10:32, Lorenz B. wrote:
> Hi "VV",
>
> well, for me it sounds more like you've found a bug and are now doing a
> workaround. Or at least something is strange and I'm just referring to
> "Order of triple patterns in a BGP" here.
>
> The order of triple patterns in a BGP shouldn't matter - as far as I
> know it's always a good old join on the intermediate result of the
> evaluation of the triple patterns.
>
> Indeed, the limit of the text index lookup matters as the internal
> ordering by Lucene is based on some Information Retrieval measure (close
> to TF-IDF probably with default settings).
>
> But I guess, Osma and Andy will give you a better and more correct answer.
>
>
> Cheers,
> Lorenz
>
>> Hello Osma,
>>
>>
>> Thank you very much for your reply, you solved the problem! I've made
>> a few tests, both the order and the limit are important (see below).
>>
>> Just one more question: I thought that, since the "Roussea*" matches
>> are fewer than the "*J*" matches, it would be more efficient to begin
>> with "Roussea*". Can you explain why it's the opposite?
>>
>> Best,
>>
>> VV.
>>
>>
>> 1) --------- changing only the order --------------------------
>>
>> ?uriBnF text:query ( foaf:givenName "*J*" ) .
>> ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
>>
>>  => 3 "Jean-Marie Rousseau" ... (even if I add a limit of 100 000 or
>> 2 000 000)
>>
>> 2) --------- changing order + limit = 100 000 --------------------------
>>
>> ?uriBnF text:query ( foaf:givenName "*J*" 100000 ) .
>>  ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>>  ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
>>
>>  => 54 entries but not "Jean-Jacques" !
>>
>> 3) --------- changing order + limit = 1 000 000
>> --------------------------
>>
>>  ?uriBnF text:query ( foaf:givenName "*J*" 1000000 ) .
>>  ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>>  ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
>>
>> => 135 entries, including the 4 "Jean-Jacques", in 1.7 seconds
>>
>> 4) --------- test using filters (strstarts + contains)
>> --------------------------
>>
>> ?uriBnF foaf:familyName ?nom
>> filter(strstarts(?nom, "Roussea"))
>> ?uriBnF foaf:givenName ?prenom
>> filter(contains(?prenom, "J"))
>>
>> => 129 entries, 27 seconds [fewer results than
>> "text:query ( foaf:givenName "*J*" 1000000)" because contains is
>> case-sensitive?]
>>
>> -----------------------------------------------------
>>
>> More information about the dataset:
>>
>> # 3 fields are indexed ( foaf:name + foaf:givenName are in the same
>> named graph )
>>
>> -- dcterms:title = +/- 9.45 M.
>>
>> -- foaf:givenName = +/- 1.71 M.
>>
>> -- foaf:familyName = +/- 1.78 M.
>>
>> # config file :
>>
>> ----------------
>>
>> text:storeValues true ;
>>     text:queryParser text:AnalyzingQueryParser ;
>>     text:map (
>>         [ text:field "title" ; text:predicate dcterms:title ;
>>         text:analyzer [ a text:ConfigurableAnalyzer ;
>>          text:tokenizer text:KeywordTokenizer ;
>>          text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>>          ] ]
>>          [ text:field "familyName" ; text:predicate foaf:familyName ;
>>         text:analyzer [ a text:ConfigurableAnalyzer ;
>>          text:tokenizer text:KeywordTokenizer ;
>>          text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>>          ] ]
>>          [ text:field "givenName" ; text:predicate foaf:givenName ;
>>         text:analyzer [ a text:ConfigurableAnalyzer ;
>>          text:tokenizer text:KeywordTokenizer ;
>>          text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>>          ] ]
>>
>>          ) .
>>
>>
>>
>>
>>  
>>
>> On 10/09/2018 at 18:58, Osma Suominen wrote:
>>> Hello Vincent,
>>>
>>> The results you get don't seem quite right. As you say, with a
>>> shorter query one would expect more results.
>>>
>>> One thing to do would be to check what results you get if you run the
>>> queries individually. I think combining the two separate jena-text
>>> queries (for foaf:familyName and foaf:givenName) may be part of the
>>> problem here... So if you execute only the "roussea*" part of the
>>> query, do you get the expected number of results? What about if you
>>> only execute one of the givenName queries with no restriction on
>>> familyName?
>>>
>>> Does it make a difference if you change the order of the familyName
>>> and givenName clauses?
>>>
>>> One thing to consider is that Lucene queries always have a limit on
>>> the number of results. With jena-text you can specify it as an
>>> additional parameter, but if you leave it out, it will default to
>>> 10000. My guess is that the givenName queries may generate more
>>> results than 10000, and the results will then be cut off. This may
>>> mean that you get many Jeans and Jacques's and Johns etc., but many of
>>> the J. Rousseaus get cut off from the list. Try adding a large limit
>>> parameter (say 100000 or more) to the text:query functions to see if
>>> it helps. Like this:
>>>
>>>     ?uriBnF text:query ( foaf:givenName "*J*" 100000 )
>>>
>>> jena-text is not very good at combining multiple criteria. You can do
>>> it with separate queries as you've done, but internally the queries
>>> will run separately and the results will only be combined in Jena,
>>> outside Lucene.
>>>
>>> -Osma
>>>
>>>
>>>
>>> Vincent Ventresque wrote on 10.09.2018 at 13:03:
>>>> Hello,
>>>>
>>>>
>>>> I've made new tests with a slightly different dataset and
>>>> configuration, the problem is the same.
>>>>
>>>> --- Could you please tell me if these results are normal (I expected
>>>> a bigger list with fewer letters)?
>>>>
>>>> ?uriBnF text:query ( foaf:givenName "*J*" ) => 3 entries
>>>>
>>>> ?uriBnF text:query ( foaf:givenName "*Ja*" ) => 1 entry
>>>>
>>>> ?uriBnF text:query ( foaf:givenName "*Je*" ) => 11 entries
>>>>
>>>> ?uriBnF text:query ( foaf:givenName "*-J*" ) => 11 entries
>>>>
>>>> ?uriBnF text:query ( foaf:givenName "*Jea*" ) => 12 entries
>>>>
>>>> ?uriBnF text:query ( foaf:givenName "*Jac*" ) => 13 entries
>>>>
>>>> Here is the complete query :
>>>>
>>>> SELECT * WHERE { ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>>>> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
>>>>
>>>> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom }
>>>>
>>>> N.B. : the dataset is quite large : 1,78 M family names indexed, and
>>>> 1,71 M given names. I have 4 distinct "Jean-Jacques Rousseau" in the
>>>> data, 713 family names containing "roussea", including 224 compound
>>>> given names.
>>>>
>>>> --- Do you know where to find more documentation about Lucene
>>>> configuration (I read jena.apache.org page + , and also found useful
>>>> explanations on Skosmos wiki https://github.com/NatLibFi/Skosmos ),
>>>> especially about tokenizers  ?
>>>>
>>>>
>>>> Thanks in advance,
>>>>
>>>> VV
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 19/07/2018 at 14:07, Vincent Ventresque wrote:
>>>>> Hello,
>>>>>
>>>>> I've just subscribed to the users@jena.apache.org list, and I
>>>>> apologize if this mail is not sent properly.
>>>>>
>>>>> I'm trying to use Fuseki text:query, and have encountered several
>>>>> issues. Here are my questions
>>>>>
>>>>> 1) Does text:query require a minimum number of characters to be
>>>>> efficient?
>>>>>
>>>>> 2) Is performance linked to the number of fields indexed?
>>>>>
>>>>> 3) In order to retrieve strings containing hyphens, should I use
>>>>> KeywordTokenizer in config file?
>>>>>
>>>>> ~~~ 1) Does text:query require a minimum number of characters to be
>>>>> efficient? ~~~~~~~~~~~~~
>>>>>
>>>>> I've noticed that a query on indexed predicates (foaf:familyName
>>>>> and foaf:givenName) returns more results when there are more
>>>>> characters in the string :
>>>>>
>>>>> SELECT * WHERE {
>>>>>
>>>>> ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>>>>>
>>>>> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
>>>>>
>>>>> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom .
>>>>>
>>>>> optional {?uriBnF bio:birth ?dateNaissance }
>>>>>
>>>>> }
>>>>>
>>>>> I was expecting that "Rousseau" + "Jean-Jacques" would be in the
>>>>> results.
>>>>>
>>>>> => if  $MY_STRING = "j*", I get 0 results
>>>>>
>>>>> => if  $MY_STRING = "je*", I get 17 results, including
>>>>> "Jean-Claude" & "Jean-Baptiste" BUT not "Jean-Jacques"
>>>>>
>>>>> => if  $MY_STRING = "jea*", I get 27 results, including "Jean-Jacques"
>>>>>
>>>>> I don't know anything about Lucene, but it looks very strange to me
>>>>> : I expected the contrary (fewer letters = bigger results list).
>>>>>
>>>>>
>>>>> ~~~ 2) Is performance linked to the number of fields indexed?
>>>>> ~~~~~~~~~~~~~~~~~~~~~~~
>>>>>
>>>>> If I change the configuration and index only foaf:givenName, and
>>>>> provide a constant for foaf:familyName, the query returns more
>>>>> results :
>>>>>
>>>>> SELECT * WHERE {
>>>>>
>>>>> ?uriBnF foaf:familyName "Rousseau" .
>>>>>
>>>>> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
>>>>>
>>>>> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom .
>>>>>
>>>>> optional {?uriBnF bio:birth ?dateNaissance }
>>>>>
>>>>> }
>>>>>
>>>>> => if  $MY_STRING = "j*", I get 7 results, whereas the first query
>>>>> returned 0 results.
>>>>>
>>>>>
>>>>> ~~~ 3) In order to retrieve strings containing hyphens, should I use
>>>>> KeywordTokenizer in config file? ~~~~~~~~~~~~~
>>>>>
>>>>> With the same query, if $MY_STRING = "jean-ja*" :
>>>>>
>>>>> a) with simple configuration (cf. below), I get 0 results
>>>>>
>>>>> b) with KeywordTokenizer config (cf. below), I get "Jean-Jacques"
>>>>>
>>>>> Is it the right way to get "Jean-Jacques"?
>>>>>
>>>>>
>>>>> Thanks in advance
>>>>>
>>>>> VV
>>>>>
>>>>>
>>>>>
>>>>> =============== SIMPLE CONFIGURATION ===================
>>>>>
>>>>> @prefix :        <#> .
>>>>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>>>> @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
>>>>> @prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
>>>>> @prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>>>> @prefix text:    <http://jena.apache.org/text#> .
>>>>> @prefix fuseki:  <http://jena.apache.org/fuseki#> .
>>>>> @prefix foaf: <http://xmlns.com/foaf/0.1/> .
>>>>> @prefix dcterms: <http://purl.org/dc/terms/> .
>>>>>
>>>>>
>>>>>
>>>>> [] rdf:type fuseki:Server ;
>>>>>    .
>>>>>
>>>>>
>>>>> ## Initialize TDB --------------------------------
>>>>>
>>>>> [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
>>>>> tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
>>>>> tdb:GraphTDB    rdfs:subClassOf  ja:Model .
>>>>>
>>>>> ## Initialize text query -------------------------------------
>>>>> [] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
>>>>> # A TextDataset is a regular dataset with a text index.
>>>>> text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
>>>>> # Lucene index
>>>>> text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
>>>>>
>>>>> ## ---------------------------------------------------------------
>>>>> ## This URI must be fixed - it's used to assemble the text dataset.
>>>>>
>>>>> :text_dataset rdf:type     text:TextDataset ;
>>>>> #    text:dataset   <#dataset> ;
>>>>>     text:dataset :tdb_dataset_readwrite ;
>>>>> #    text:index     <#indexLucene> ;
>>>>>     text:index :My_Lucene_index ;
>>>>>     .
>>>>>
>>>>> # A TDB dataset used for RDF storage ------------------------------
>>>>> :tdb_dataset_readwrite
>>>>>         a             tdb:DatasetTDB ;
>>>>>         tdb:location  "$_BnF_text" ;
>>>>> .
>>>>>
>>>>> # Text index description ------------------------------------------
>>>>> #<#indexLucene> a text:TextIndexLucene ;
>>>>> :My_Lucene_index a text:TextIndexLucene ;
>>>>>     text:directory <file:$_Lucene> ;
>>>>>     text:entityMap <#entMap> ;
>>>>>     .
>>>>>
>>>>> # Mapping in the index ---------------------------------------------
>>>>> # URI stored in field "uri"
>>>>> <#entMap> a text:EntityMap ;
>>>>>     text:entityField      "uri" ;
>>>>>     text:defaultField     "familyName" ;
>>>>>     text:map (
>>>>>          [ text:field "familyName" ; text:predicate foaf:familyName ]
>>>>>          [ text:field "givenName" ; text:predicate foaf:givenName ]
>>>>>          ) .
>>>>>
>>>>> :service_tdb_all  a                   fuseki:Service ;
>>>>>         rdfs:label                    "TDB BnF_text" ;
>>>>>         fuseki:dataset               :text_dataset ;
>>>>>         fuseki:name                   "BnF_text" ;
>>>>>         fuseki:serviceQuery           "query" , "sparql" ;
>>>>>         fuseki:serviceReadGraphStore  "get" ;
>>>>>         fuseki:serviceReadWriteGraphStore "data" ;
>>>>>         fuseki:serviceUpdate          "update" ;
>>>>>         fuseki:serviceUpload          "upload" .
>>>>>
>>>>>
>>>>> =========== KEYWORD TOKENIZER CONFIGURATION ================
>>>>>
>>>>> @prefix :        <#> .
>>>>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>>>> @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
>>>>> @prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
>>>>> @prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>>>> @prefix text:    <http://jena.apache.org/text#> .
>>>>> @prefix fuseki:  <http://jena.apache.org/fuseki#> .
>>>>> @prefix foaf: <http://xmlns.com/foaf/0.1/> .
>>>>> @prefix dcterms: <http://purl.org/dc/terms/> .
>>>>>
>>>>>
>>>>>
>>>>> [] rdf:type fuseki:Server ;
>>>>>
>>>>>    .
>>>>>
>>>>>
>>>>> ## Initialize TDB --------------------------------
>>>>>
>>>>> [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
>>>>> tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
>>>>> tdb:GraphTDB    rdfs:subClassOf  ja:Model .
>>>>>
>>>>> ## Initialize text query -------------------------------------
>>>>> [] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
>>>>> # A TextDataset is a regular dataset with a text index.
>>>>> text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
>>>>> # Lucene index
>>>>> text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
>>>>>
>>>>> ## ---------------------------------------------------------------
>>>>>
>>>>>
>>>>> :text_dataset rdf:type     text:TextDataset ;
>>>>> #    text:dataset   <#dataset> ;
>>>>>     text:dataset :tdb_dataset_readwrite ;
>>>>> #    text:index     <#indexLucene> ;
>>>>>     text:index :My_Lucene_index ;
>>>>>     .
>>>>>
>>>>> # A TDB dataset used for RDF storage ------------------------------
>>>>> :tdb_dataset_readwrite
>>>>>         a             tdb:DatasetTDB ;
>>>>>         tdb:location  "$_BnF_text" ;
>>>>> .
>>>>>
>>>>> # Text index description ------------------------------------------
>>>>> #<#indexLucene> a text:TextIndexLucene ;
>>>>> :My_Lucene_index a text:TextIndexLucene ;
>>>>>     text:directory <file:$_Lucene> ;
>>>>>     text:entityMap <#entMap> ;
>>>>>     .
>>>>>
>>>>> # Mapping in the index ---------------------------------------------
>>>>> # URI stored in field "uri"
>>>>> <#entMap> a text:EntityMap ;
>>>>>     text:entityField      "uri" ;
>>>>>     text:defaultField     "givenName" ;
>>>>>     text:map (
>>>>>
>>>>>          [ text:field "familyName" ; text:predicate foaf:familyName ;
>>>>>          text:analyzer [ a text:ConfigurableAnalyzer ;
>>>>>                text:tokenizer text:KeywordTokenizer ;
>>>>>                text:filters (text:ASCIIFoldingFilter
>>>>> text:LowerCaseFilter)
>>>>>              ] ]
>>>>>          [ text:field "givenName" ; text:predicate foaf:givenName ;
>>>>>         text:analyzer [ a text:ConfigurableAnalyzer ;
>>>>>          text:tokenizer text:KeywordTokenizer ;
>>>>>          text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>>>>>          ] ]
>>>>>          ) .
>>>>>
>>>>> :service_tdb_all  a                   fuseki:Service ;
>>>>>         rdfs:label                    "TDB BnF_text" ;
>>>>>         fuseki:dataset               :text_dataset ; ### works for the text index
>>>>>         fuseki:name                   "BnF_text" ;
>>>>>         fuseki:serviceQuery           "query" , "sparql" ;
>>>>>         fuseki:serviceReadGraphStore  "get" ;
>>>>>         fuseki:serviceReadWriteGraphStore "data" ;
>>>>>         fuseki:serviceUpdate          "update" ;
>>>>>         fuseki:serviceUpload          "upload" .
>>>>>


Re: fuseki text:query : strange results + Lucene configuration

Posted by "Lorenz B." <bu...@informatik.uni-leipzig.de>.
Hi "VV",

Well, to me it sounds more like you've found a bug and are now working
around it. Or at least something is strange, and I'm just referring to
"Order of triple patterns in a BGP" here.

The order of triple patterns in a BGP shouldn't matter - as far as I
know it's always a good old join on the intermediate result of the
evaluation of the triple patterns.

Indeed, the limit of the text index lookup matters as the internal
ordering by Lucene is based on some Information Retrieval measure (close
to TF-IDF probably with default settings).
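
A minimal sketch of why the limit can matter, assuming (as Osma
describes further down in this thread) that each text:query lookup is
capped independently and the results are only joined afterwards. This is
a plain Python simulation, not Lucene or Jena code; top_n, join and the
toy hit lists are hypothetical names and data.

```python
def top_n(hits, limit):
    """Keep only the first `limit` hits, as a capped index lookup would."""
    return hits[:limit]


def join(left, right):
    """Keep the subjects of `left` that also appear in `right`."""
    right_set = set(right)
    return [uri for uri in left if uri in right_set]


# Toy index: 20 subjects match the broad given-name pattern, listed in
# whatever order the index ranks them; only two of them also match the
# narrow family-name pattern.
given_hits = [f"person{i}" for i in range(20)]
family_hits = ["person3", "person17"]

# Cap of 10 on the broad lookup: person17 is cut off before the join.
capped = join(top_n(given_hits, 10), top_n(family_hits, 10))

# Cap large enough to cover every hit: both matches survive.
uncapped = join(top_n(given_hits, 100), top_n(family_hits, 10))

print(capped)    # ['person3']
print(uncapped)  # ['person3', 'person17']
```

In this toy model a match is lost whenever it ranks below the cap in the
broad lookup, whatever its rank in the narrow one.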

But I guess Osma and Andy will give you a better and more correct answer.


Cheers,
Lorenz

> Hello Osma,
>
>
> Thank you very much for your reply, you solved the problem! I've made
> a few tests, both the order and the limit are important (see below).
>
> Just one more question: I thought that, since the "Roussea*" matches
> are fewer than the "*J*" matches, it would be more efficient to begin
> with "Roussea*". Can you explain why it's the opposite?
>
> Best,
>
> VV.
>
>
> 1) --------- changing only the order --------------------------
>
> ?uriBnF text:query ( foaf:givenName "*J*" ) .
> ?uriBnF text:query ( foaf:familyName "roussea*" ) .
> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
>
>  => 3 "Jean-Marie Rousseau" ... (even if I add a limit of 100 000 or
> 2 000 000)
>
> 2) --------- changing order + limit = 100 000 --------------------------
>
> ?uriBnF text:query ( foaf:givenName "*J*" 100000 ) .
>  ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>  ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
>
>  => 54 entries but not "Jean-Jacques" !
>
> 3) --------- changing order + limit = 1 000 000
> --------------------------
>
>  ?uriBnF text:query ( foaf:givenName "*J*" 1000000 ) .
>  ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>  ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom
>
> => 135 entries, including the 4 "Jean-Jacques", in  1.7 second
>
> 4) --------- test using filters (strstarts + contains)
> --------------------------
>
> ?uriBnF foaf:familyName ?nom
> filter(strstarts(?nom, "Roussea"))
> ?uriBnF foaf:givenName ?prenom
> filter(contains(?prenom, "J"))
>
> => 129 entries, 27 seconds [less results than
> "text:query ( foaf:givenName "*J*" 1000000)" because contains = case
> sensible ?]
>
> -----------------------------------------------------
>
> More infos about the dataset :
>
> # 3 fields are indexed ( foaf:name + foaf:givenName are in the same
> named graph )
>
> -- dcterms:title = +/- 9.45 M.
>
> -- foaf:givenName = +/- 1.71 M.
>
> -- foaf:familyName = +/- 1.78 M.
>
> # config file :
>
> ----------------
>
> text:storeValues true ;
>     text:queryParser text:AnalyzingQueryParser ;
>     text:map (
>         [ text:field "title" ; text:predicate dcterms:title ;
>         text:analyzer [ a text:ConfigurableAnalyzer ;
>          text:tokenizer text:KeywordTokenizer ;
>          text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>          ] ]
>          [ text:field "familyName" ; text:predicate foaf:familyName ;
>         text:analyzer [ a text:ConfigurableAnalyzer ;
>          text:tokenizer text:KeywordTokenizer ;
>          text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>          ] ]
>          [ text:field "givenName" ; text:predicate foaf:givenName ;
>         text:analyzer [ a text:ConfigurableAnalyzer ;
>          text:tokenizer text:KeywordTokenizer ;
>          text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>          ] ]
>
>          ) .
>
>
>
>
>  
>
> Le 10/09/2018 à 18:58, Osma Suominen a écrit :
>> Hello Vincent,
>>
>> The results you get don't seem quite right. As you say, with a
>> shorter query one would expect more results.
>>
>> One thing to do would be to check what results you get if you run the
>> queries individually. I think combining the two separate jena-text
>> queries (for foaf:familyName and foaf:givenName) may be part of the
>> problem here... So if you execute only the "roussea*" part of the
>> query, do you get the expected number of results? What about if you
>> only execute one of the givenName queries with no restriction on
>> familyName?
>>
>> Does it make a difference if you change the order of the firstName
>> and givenName clauses?
>>
>> One thing to consider is that Lucene queries always have a limit on
>> the number of results. With jena-text you can specify it as an
>> additional parameter, but if you leave it out, it will default to
>> 10000. My guess is that the givenName queries may generate more
>> results than 10000, and the results will then be cut off. This may
>> mean that you get many Jeans and Jacques's and Johns etc. but many
>> the J. Rousseaus get cut off from the list. Try adding a large limit
>> parameter (say 100000 or more) to the text:query functions to see if
>> it helps. Like this:
>>
>>     ?uriBnF text:query ( foaf:givenName "*J*" 100000 )
>>
>> jena-text is not very good at combining multiple criteria. You can do
>> it with separate queries as you've done, but internally the queries
>> will run separately and the results will only be combined in Jena,
>> outside Lucene.
>>
>> -Osma
>>
>>
>>
>> Vincent Ventresque kirjoitti 10.09.2018 klo 13:03:
>>> Hello,
>>>
>>>
>>> I've made new tests with a slightly different dataset and
>>> configuration, the problem is the same.
>>>
>>> --- Could you please tell me if these results are normal (I expected
>>> a bigger list with fewer letters)?
>>>
>>> ?uriBnF text:query ( foaf:givenName "*J*" ) => 3 entries
>>>
>>> ?uriBnF text:query ( foaf:givenName "*Ja*" ) => 1 entries
>>>
>>> ?uriBnF text:query ( foaf:givenName "*Je*" ) => 11 entries
>>>
>>> ?uriBnF text:query ( foaf:givenName "*-J*" ) => 11 entries
>>>
>>> ?uriBnF text:query ( foaf:givenName "*Jea*" ) => 12 entries
>>>
>>> ?uriBnF text:query ( foaf:givenName "*Jac*" ) => 13 entries
>>>
>>> Here is the complete query :
>>>
>>> SELECT * WHERE { ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>>> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
>>>
>>> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom }
>>>
>>> N.B. : the dataset is quite large : 1,78 M family names indexed, and
>>> 1,71 M given names. I have 4 distinct "Jean-Jacques Rousseau" in the
>>> data, 713 family names containing "roussea", including 224 compound
>>> given names.
>>>
>>> --- Do you know where to find more documentation about Lucene
>>> configuration (I read jena.apache.org page + , and also found useful
>>> explanations on Skosmos wiki https://github.com/NatLibFi/Skosmos ),
>>> especially about tokenizers  ?
>>>
>>>
>>> Thanks in advance,
>>>
>>> VV
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Le 19/07/2018 à 14:07, Vincent Ventresque a écrit :
>>>> Hello,
>>>>
>>>> I've just subscribed to the users@jena.apache.org list, and I
>>>> apologize if this mail is not sent properly.
>>>>
>>>> I'm trying to use Fuseki text:query, and have encountered several
>>>> issues. Here are my questions
>>>>
>>>> 1) Does text:query require a minimum number of characters to be
>>>> efficient?
>>>>
>>>> 2) Is performance linked to the number of fields indexed?
>>>>
>>>> 3) In order to retrieve strings containing hyphens, should I use
>>>> KeywordTokenizer in config file?
>>>>
>>>> ~~~ 1) Does text:query require a minimum number of characters to be
>>>> efficient? ~~~~~~~~~~~~~
>>>>
>>>> I've noticed that a query on indexed predicates (foaf:familyName
>>>> and foaf:givenName) returns more results when there are more
>>>> characters in the string :
>>>>
>>>> SELECT * WHERE {
>>>>
>>>> ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>>>>
>>>> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
>>>>
>>>> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom .
>>>>
>>>> optional {?uriBnF bio:birth ?dateNaissance }
>>>>
>>>> }
>>>>
>>>> I was expecting that "Rousseau" + "Jean-Jacques" would be in the
>>>> results.
>>>>
>>>> => if  $MY_STRING = "j*", I get  0 result
>>>>
>>>> => if  $MY_STRING = "je*", I get 17 results, including
>>>> "Jean-Claude" & "Jean-Baptiste" BUT not "Jean-Jacques"
>>>>
>>>> => if  $MY_STRING = "jea*", I get 27 results, including "Jean-Jacques"
>>>>
>>>> I don't know anything about Lucene, but it looks very strange to me
>>>> : I expected the contrary (fewer letters = bigger results list).
>>>>
>>>>
>>>> ~~~ 2) Is performance linked to the number of fields indexed?
>>>> ~~~~~~~~~~~~~~~~~~~~~~~
>>>>
>>>> If I change the configuration and index only foaf:givenName, and
>>>> provide a constant for foaf:familyName, the query returns more
>>>> results :
>>>>
>>>> SELECT * WHERE {
>>>>
>>>> ?uriBnF foaf:familyName "Rousseau" .
>>>>
>>>> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
>>>>
>>>> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom .
>>>>
>>>> optional {?uriBnF bio:birth ?dateNaissance }
>>>>
>>>> }
>>>>
>>>> => if  $MY_STRING = "j*", I get  7 results, whereas the first query
>>>> returned 0 result.
>>>>
>>>>
>>>> ~~~ 3) In order to retrieve containing hyphens, should I use
>>>> KeywordTokenizer in config file? ~~~~~~~~~~~~~
>>>>
>>>> With the same query, if $MY_STRING = "jean-ja*" :
>>>>
>>>> a) with simple configuration (cf. below), I get 0 result
>>>>
>>>> b) with KeywordTokenizer config (cf. below), I get "Jean-Jacques"
>>>>
>>>> Is it the right way to get "Jean-Jacques"?
>>>>
>>>>
>>>> Thanks in advance
>>>>
>>>> VV
>>>>
>>>>
>>>>
>>>> =============== SIMPLE CONFIGURATION ===================
>>>>
>>>> @prefix :        <#> .
>>>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>>> @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
>>>> @prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
>>>> @prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>>> @prefix text:    <http://jena.apache.org/text#> .
>>>> @prefix fuseki:  <http://jena.apache.org/fuseki#> .
>>>> @prefix foaf: <http://xmlns.com/foaf/0.1/> .
>>>> @prefix dcterms: <http://purl.org/dc/terms/> .
>>>>
>>>>
>>>>
>>>> [] rdf:type fuseki:Server ;
>>>>    .
>>>>
>>>>
>>>> ## Initialize TDB --------------------------------
>>>>
>>>> [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
>>>> tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
>>>> tdb:GraphTDB    rdfs:subClassOf  ja:Model .
>>>>
>>>> ## Initialize text query -------------------------------------
>>>> [] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
>>>> # A TextDataset is a regular dataset with a text index.
>>>> text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
>>>> # Lucene index
>>>> text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
>>>>
>>>> ## ---------------------------------------------------------------
>>>> ## This URI must be fixed - it's used to assemble the text dataset.
>>>>
>>>> :text_dataset rdf:type     text:TextDataset ;
>>>> #    text:dataset   <#dataset> ;
>>>>     text:dataset :tdb_dataset_readwrite ;
>>>> #    text:index     <#indexLucene> ;
>>>>     text:index :My_Lucene_index ;
>>>>     .
>>>>
>>>> # A TDB datset used for RDF storage ------------------------------
>>>> :tdb_dataset_readwrite
>>>>         a             tdb:DatasetTDB ;
>>>>         tdb:location  "$_BnF_text" ;
>>>> .
>>>>
>>>> # Text index description ------------------------------------------
>>>> #<#indexLucene> a text:TextIndexLucene ;
>>>> :My_Lucene_index a text:TextIndexLucene ;
>>>>     text:directory <file:$_Lucene> ;
>>>>     text:entityMap <#entMap> ;
>>>>     .
>>>>
>>>> # Mapping in the index ---------------------------------------------
>>>> # URI stored in field "uri"
>>>> <#entMap> a text:EntityMap ;
>>>>     text:entityField      "uri" ;
>>>>     text:defaultField     "familyName" ;
>>>>     text:map (
>>>>          [ text:field "familyName" ; text:predicate foaf:familyName ]
>>>>          [ text:field "givenName" ; text:predicate foaf:givenName ]
>>>>          ) .
>>>>
>>>> :service_tdb_all  a                   fuseki:Service ;
>>>>         rdfs:label                    "TDB BnF_text" ;
>>>>         fuseki:dataset               :text_dataset ;
>>>>         fuseki:name                   "BnF_text" ;
>>>>         fuseki:serviceQuery           "query" , "sparql" ;
>>>>         fuseki:serviceReadGraphStore  "get" ;
>>>>         fuseki:serviceReadWriteGraphStore "data" ;
>>>>         fuseki:serviceUpdate          "update" ;
>>>>         fuseki:serviceUpload          "upload" .
>>>>
>>>>
>>>> =========== KEYWORD TOKENIZER CONFIGURATION ================
>>>>
>>>> @prefix :        <#> .
>>>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>>> @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
>>>> @prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
>>>> @prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>>> @prefix text:    <http://jena.apache.org/text#> .
>>>> @prefix fuseki:  <http://jena.apache.org/fuseki#> .
>>>> @prefix foaf: <http://xmlns.com/foaf/0.1/> .
>>>> @prefix dcterms: <http://purl.org/dc/terms/> .
>>>>
>>>>
>>>>
>>>> [] rdf:type fuseki:Server ;
>>>>
>>>>    .
>>>>
>>>>
>>>> ## Initialize TDB --------------------------------
>>>>
>>>> [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
>>>> tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
>>>> tdb:GraphTDB    rdfs:subClassOf  ja:Model .
>>>>
>>>> ## Initialize text query -------------------------------------
>>>> [] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
>>>> # A TextDataset is a regular dataset with a text index.
>>>> text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
>>>> # Lucene index
>>>> text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
>>>>
>>>> ## ---------------------------------------------------------------
>>>>
>>>>
>>>> :text_dataset rdf:type     text:TextDataset ;
>>>> #    text:dataset   <#dataset> ;
>>>>     text:dataset :tdb_dataset_readwrite ;
>>>> #    text:index     <#indexLucene> ;
>>>>     text:index :My_Lucene_index ;
>>>>     .
>>>>
>>>> # A TDB datset used for RDF storage ------------------------------
>>>> :tdb_dataset_readwrite
>>>>         a             tdb:DatasetTDB ;
>>>>         tdb:location  "$_BnF_text" ;
>>>> .
>>>>
>>>> # Text index description ------------------------------------------
>>>> #<#indexLucene> a text:TextIndexLucene ;
>>>> :My_Lucene_index a text:TextIndexLucene ;
>>>>     text:directory <file:$_Lucene> ;
>>>>     text:entityMap <#entMap> ;
>>>>     .
>>>>
>>>> # Mapping in the index ---------------------------------------------
>>>> # URI stored in field "uri"
>>>> <#entMap> a text:EntityMap ;
>>>>     text:entityField      "uri" ;
>>>>     text:defaultField     "givenName" ;
>>>>     text:map (
>>>>
>>>>          [ text:field "familyName" ; text:predicate foaf:familyName ;
>>>>          text:analyzer [ a text:ConfigurableAnalyzer ;
>>>>                text:tokenizer text:KeywordTokenizer ;
>>>>                text:filters (text:ASCIIFoldingFilter
>>>> text:LowerCaseFilter)
>>>>              ] ]
>>>>          [ text:field "givenName" ; text:predicate foaf:givenName ;
>>>>         text:analyzer [ a text:ConfigurableAnalyzer ;
>>>>          text:tokenizer text:KeywordTokenizer ;
>>>>          text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>>>>          ] ]
>>>>          ) .
>>>>
>>>> :service_tdb_all  a                   fuseki:Service ;
>>>>         rdfs:label                    "TDB BnF_text" ;
>>>>         fuseki:dataset               :text_dataset ; ### marche pr
>>>> index texte
>>>>         fuseki:name                   "BnF_text" ;
>>>>         fuseki:serviceQuery           "query" , "sparql" ;
>>>>         fuseki:serviceReadGraphStore  "get" ;
>>>>         fuseki:serviceReadWriteGraphStore "data" ;
>>>>         fuseki:serviceUpdate          "update" ;
>>>>         fuseki:serviceUpload          "upload" .
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>
>
-- 
Lorenz Bühmann
AKSW group, University of Leipzig
Group: http://aksw.org - semantic web research center


Re: fuseki text:query : strange results + Lucene configuration

Posted by Vincent Ventresque <vi...@ens-lyon.fr>.
Hello Osma,


Thank you very much for your reply; you solved the problem! I ran a few 
tests, and both the order and the limit matter (see below).

Just one more question: since there are fewer "Roussea*" matches than 
"*J*" matches, I thought it would be more efficient to start with 
"Roussea*". Can you explain why the opposite is true?

Best,

VV.


1) --------- changing only the order --------------------------

?uriBnF text:query ( foaf:givenName "*J*" ) .
?uriBnF text:query ( foaf:familyName "roussea*" ) .
?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom

  => 3 "Jean-Marie Rousseau" ... (even if I add a limit of 100 000 or 2 000 000)

2) --------- changing order + limit = 100 000 --------------------------

?uriBnF text:query ( foaf:givenName "*J*" 100000 ) .
  ?uriBnF text:query ( foaf:familyName "roussea*" ) .
  ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom

  => 54 entries but not "Jean-Jacques" !

3) --------- changing order + limit = 1 000 000 --------------------------

  ?uriBnF text:query ( foaf:givenName "*J*" 1000000 ) .
  ?uriBnF text:query ( foaf:familyName "roussea*" ) .
  ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom

=> 135 entries, including the 4 "Jean-Jacques", in 1.7 seconds

4) --------- test using filters (strstarts + contains) --------------------------

?uriBnF foaf:familyName ?nom
filter(strstarts(?nom, "Roussea"))
?uriBnF foaf:givenName ?prenom
filter(contains(?prenom, "J"))

=> 129 entries, 27 seconds [fewer results than
"text:query ( foaf:givenName "*J*" 1000000 )", because contains is 
case-sensitive?]
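
A possible explanation for the gap between tests 3 and 4, sketched in
Python with invented sample names: SPARQL contains() compares characters
exactly, whereas the index here uses LowerCaseFilter. The sketch assumes
the query term is folded to lower case the same way as the index terms.

```python
# Invented sample data, not taken from the BnF dataset.
given_names = ["Jean-Jacques", "Jacques", "Benjamin", "Marie-José", "Pierre"]

# Exact comparison, as with contains(?prenom, "J"): upper-case "J" only.
case_sensitive = [n for n in given_names if "J" in n]

# Lower-case both sides, roughly what a lower-casing analyzer achieves.
case_folded = [n for n in given_names if "j" in n.lower()]

print(case_sensitive)  # ['Jean-Jacques', 'Jacques', 'Marie-José']
print(case_folded)     # ['Jean-Jacques', 'Jacques', 'Benjamin', 'Marie-José']
```

Names such as "Benjamin" only match in the folded comparison. In SPARQL
itself, filter(contains(lcase(?prenom), "j")) would be the
case-insensitive counterpart of the filter in test 4.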

-----------------------------------------------------

More information about the dataset:

# 3 fields are indexed (foaf:name + foaf:givenName are in the same 
named graph)

-- dcterms:title = +/- 9.45 M.

-- foaf:givenName = +/- 1.71 M.

-- foaf:familyName = +/- 1.78 M.

# config file :

----------------

text:storeValues true ;
text:queryParser text:AnalyzingQueryParser ;
text:map (
    [ text:field "title" ; text:predicate dcterms:title ;
      text:analyzer [ a text:ConfigurableAnalyzer ;
          text:tokenizer text:KeywordTokenizer ;
          text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
      ] ]
    [ text:field "familyName" ; text:predicate foaf:familyName ;
      text:analyzer [ a text:ConfigurableAnalyzer ;
          text:tokenizer text:KeywordTokenizer ;
          text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
      ] ]
    [ text:field "givenName" ; text:predicate foaf:givenName ;
      text:analyzer [ a text:ConfigurableAnalyzer ;
          text:tokenizer text:KeywordTokenizer ;
          text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
      ] ]
) .




  



Re: fuseki text:query : strange results + Lucene configuration

Posted by Osma Suominen <os...@helsinki.fi>.
Hello Vincent,

The results you get don't seem quite right. As you say, with a shorter 
query one would expect more results.

One thing to do would be to check what results you get if you run the 
queries individually. I think combining the two separate jena-text 
queries (for foaf:familyName and foaf:givenName) may be part of the 
problem here... So if you execute only the "roussea*" part of the query, 
do you get the expected number of results? What about if you only 
execute one of the givenName queries with no restriction on familyName?

Does it make a difference if you change the order of the familyName and 
givenName clauses?

One thing to consider is that Lucene queries always have a limit on the 
number of results. With jena-text you can specify it as an additional 
parameter, but if you leave it out, it defaults to 10000. My guess 
is that the givenName queries generate more than 10000 results, which 
are then cut off. This may mean that you get many Jeans, Jacques and 
Johns etc., but many of the J. Rousseaus get cut off from the list. Try 
adding a large limit parameter (say 100000 or more) to the text:query 
calls to see if it helps. Like this:

     ?uriBnF text:query ( foaf:givenName "*J*" 100000 )

jena-text is not very good at combining multiple criteria. You can do it 
with separate queries as you've done, but internally the queries will 
run separately and the results will only be combined in Jena, outside 
Lucene.
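
To make the cut-off effect concrete, here is a toy model in Python. It is
only a sketch: plain lists stand in for the Lucene index, the data is
invented, hits are kept in insertion order rather than ranked by score,
and `lookup` and `joined` are hypothetical helpers, not jena-text APIs.

```python
def lookup(index, predicate, limit=10000):
    """Return at most `limit` matching URIs, like a capped text lookup."""
    hits = [uri for uri, value in index if predicate(value)]
    return hits[:limit]

# 50 000 hypothetical people: every 500th is a Rousseau, all are "jean".
family = [(f"person{i}", "rousseau" if i % 500 == 0 else "dupont")
          for i in range(50000)]
given = [(f"person{i}", "jean") for i in range(50000)]

def joined(limit):
    """Run both lookups separately, then intersect them outside the index."""
    by_family = set(lookup(family, lambda v: v.startswith("roussea"), limit))
    by_given = set(lookup(given, lambda v: "j" in v, limit))
    return by_family & by_given

print(len(joined(10000)))   # 20
print(len(joined(100000)))  # 100
```

With the smaller limit, the given-name lookup is truncated before the
join, so most of the 100 matching people are lost even though the
family-name lookup found all of them.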

-Osma



Vincent Ventresque wrote on 10.09.2018 at 13:03:
> Hello,
> 
> 
> I've made new tests with a slightly different dataset and configuration, 
> the problem is the same.
> 
> --- Could you please tell me if these results are normal (I expected a 
> bigger list with fewer letters)?
> 
> ?uriBnF text:query ( foaf:givenName "*J*" ) => 3 entries
> 
> ?uriBnF text:query ( foaf:givenName "*Ja*" ) => 1 entry
> 
> ?uriBnF text:query ( foaf:givenName "*Je*" ) => 11 entries
> 
> ?uriBnF text:query ( foaf:givenName "*-J*" ) => 11 entries
> 
> ?uriBnF text:query ( foaf:givenName "*Jea*" ) => 12 entries
> 
> ?uriBnF text:query ( foaf:givenName "*Jac*" ) => 13 entries
> 
> Here is the complete query :
> 
> SELECT * WHERE { ?uriBnF text:query ( foaf:familyName "roussea*" ) . 
> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
> 
> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom }
> 
> N.B.: the dataset is quite large: 1.78 M family names indexed, and 
> 1.71 M given names. I have 4 distinct "Jean-Jacques Rousseau" in the 
> data, and 713 family names containing "roussea", including 224 with 
> compound given names.
> 
> --- Do you know where to find more documentation about Lucene 
> configuration (I read jena.apache.org page + , and also found useful 
> explanations on Skosmos wiki https://github.com/NatLibFi/Skosmos ), 
> especially about tokenizers  ?
> 
> 
> Thanks in advance,
> 
> VV
> 
> 
> 
> 
> 
> 
> 
> On 19/07/2018 at 14:07, Vincent Ventresque wrote:
>> Hello,
>>
>> I've just subscribed to the users@jena.apache.org list, and I 
>> apologize if this mail is not sent properly.
>>
>> I'm trying to use Fuseki text:query, and have encountered several 
>> issues. Here are my questions
>>
>> 1) Does text:query require a minimum number of characters to be 
>> efficient?
>>
>> 2) Is performance linked to the number of fields indexed?
>>
>> 3) In order to retrieve strings containing hyphens, should I use 
>> KeywordTokenizer in config file?
>>
>> ~~~ 1) Does text:query require a minimum number of characters to be 
>> efficient? ~~~~~~~~~~~~~
>>
>> I've noticed that a query on indexed predicates (foaf:familyName and 
>> foaf:givenName) returns more results when there are more characters in 
>> the string :
>>
>> SELECT * WHERE {
>>
>> ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>>
>> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
>>
>> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom .
>>
>> optional {?uriBnF bio:birth ?dateNaissance }
>>
>> }
>>
>> I was expecting that "Rousseau" + "Jean-Jacques" would be in the results.
>>
>> => if  $MY_STRING = "j*", I get  0 result
>>
>> => if  $MY_STRING = "je*", I get 17 results, including "Jean-Claude" & 
>> "Jean-Baptiste" BUT not "Jean-Jacques"
>>
>> => if  $MY_STRING = "jea*", I get 27 results, including "Jean-Jacques"
>>
>> I don't know anything about Lucene, but it looks very strange to me : 
>> I expected the contrary (fewer letters = bigger results list).
>>
>>
>> ~~~ 2) Is performance linked to the number of fields indexed? 
>> ~~~~~~~~~~~~~~~~~~~~~~~
>>
>> If I change the configuration and index only foaf:givenName, and 
>> provide a constant for foaf:familyName, the query returns more results :
>>
>> SELECT * WHERE {
>>
>> ?uriBnF foaf:familyName "Rousseau" .
>>
>> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
>>
>> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom .
>>
>> optional {?uriBnF bio:birth ?dateNaissance }
>>
>> }
>>
>> => if  $MY_STRING = "j*", I get  7 results, whereas the first query 
>> returned 0 result.
>>
>>
>> ~~~ 3) In order to retrieve strings containing hyphens, should I use 
>> KeywordTokenizer in config file? ~~~~~~~~~~~~~
>>
>> With the same query, if $MY_STRING = "jean-ja*" :
>>
>> a) with simple configuration (cf. below), I get 0 result
>>
>> b) with KeywordTokenizer config (cf. below), I get "Jean-Jacques"
>>
>> Is it the right way to get "Jean-Jacques"?
>>
>>
>> Thanks in advance
>>
>> VV
>>
>>
>>
>> =============== SIMPLE CONFIGURATION ===================
>>
>> @prefix :        <#> .
>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>> @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
>> @prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
>> @prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
>> @prefix text:    <http://jena.apache.org/text#> .
>> @prefix fuseki:  <http://jena.apache.org/fuseki#> .
>> @prefix foaf: <http://xmlns.com/foaf/0.1/> .
>> @prefix dcterms: <http://purl.org/dc/terms/> .
>>
>>
>>
>> [] rdf:type fuseki:Server ;
>>    .
>>
>>
>> ## Initialize TDB --------------------------------
>>
>> [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
>> tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
>> tdb:GraphTDB    rdfs:subClassOf  ja:Model .
>>
>> ## Initialize text query -------------------------------------
>> [] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
>> # A TextDataset is a regular dataset with a text index.
>> text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
>> # Lucene index
>> text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
>>
>> ## ---------------------------------------------------------------
>> ## This URI must be fixed - it's used to assemble the text dataset.
>>
>> :text_dataset rdf:type     text:TextDataset ;
>> #    text:dataset   <#dataset> ;
>>     text:dataset :tdb_dataset_readwrite ;
>> #    text:index     <#indexLucene> ;
>>     text:index :My_Lucene_index ;
>>     .
>>
>> # A TDB dataset used for RDF storage ------------------------------
>> :tdb_dataset_readwrite
>>         a             tdb:DatasetTDB ;
>>         tdb:location  "$_BnF_text" ;
>> .
>>
>> # Text index description ------------------------------------------
>> #<#indexLucene> a text:TextIndexLucene ;
>> :My_Lucene_index a text:TextIndexLucene ;
>>     text:directory <file:$_Lucene> ;
>>     text:entityMap <#entMap> ;
>>     .
>>
>> # Mapping in the index ---------------------------------------------
>> # URI stored in field "uri"
>> <#entMap> a text:EntityMap ;
>>     text:entityField      "uri" ;
>>     text:defaultField     "familyName" ;
>>     text:map (
>>          [ text:field "familyName" ; text:predicate foaf:familyName ]
>>          [ text:field "givenName" ; text:predicate foaf:givenName ]
>>          ) .
>>
>> :service_tdb_all  a                   fuseki:Service ;
>>         rdfs:label                    "TDB BnF_text" ;
>>         fuseki:dataset               :text_dataset ;
>>         fuseki:name                   "BnF_text" ;
>>         fuseki:serviceQuery           "query" , "sparql" ;
>>         fuseki:serviceReadGraphStore  "get" ;
>>         fuseki:serviceReadWriteGraphStore "data" ;
>>         fuseki:serviceUpdate          "update" ;
>>         fuseki:serviceUpload          "upload" .
>>
>>
>> =========== KEYWORD TOKENIZER CONFIGURATION ================
>>
>> @prefix :        <#> .
>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>> @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
>> @prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
>> @prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
>> @prefix text:    <http://jena.apache.org/text#> .
>> @prefix fuseki:  <http://jena.apache.org/fuseki#> .
>> @prefix foaf: <http://xmlns.com/foaf/0.1/> .
>> @prefix dcterms: <http://purl.org/dc/terms/> .
>>
>>
>>
>> [] rdf:type fuseki:Server ;
>>
>>    .
>>
>>
>> ## Initialize TDB --------------------------------
>>
>> [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
>> tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
>> tdb:GraphTDB    rdfs:subClassOf  ja:Model .
>>
>> ## Initialize text query -------------------------------------
>> [] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
>> # A TextDataset is a regular dataset with a text index.
>> text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
>> # Lucene index
>> text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
>>
>> ## ---------------------------------------------------------------
>>
>>
>> :text_dataset rdf:type     text:TextDataset ;
>> #    text:dataset   <#dataset> ;
>>     text:dataset :tdb_dataset_readwrite ;
>> #    text:index     <#indexLucene> ;
>>     text:index :My_Lucene_index ;
>>     .
>>
>> # A TDB dataset used for RDF storage ------------------------------
>> :tdb_dataset_readwrite
>>         a             tdb:DatasetTDB ;
>>         tdb:location  "$_BnF_text" ;
>> .
>>
>> # Text index description ------------------------------------------
>> #<#indexLucene> a text:TextIndexLucene ;
>> :My_Lucene_index a text:TextIndexLucene ;
>>     text:directory <file:$_Lucene> ;
>>     text:entityMap <#entMap> ;
>>     .
>>
>> # Mapping in the index ---------------------------------------------
>> # URI stored in field "uri"
>> <#entMap> a text:EntityMap ;
>>     text:entityField      "uri" ;
>>     text:defaultField     "givenName" ;
>>     text:map (
>>
>>          [ text:field "familyName" ; text:predicate foaf:familyName ;
>>          text:analyzer [ a text:ConfigurableAnalyzer ;
>>                text:tokenizer text:KeywordTokenizer ;
>>                text:filters (text:ASCIIFoldingFilter 
>> text:LowerCaseFilter)
>>              ] ]
>>          [ text:field "givenName" ; text:predicate foaf:givenName ;
>>         text:analyzer [ a text:ConfigurableAnalyzer ;
>>          text:tokenizer text:KeywordTokenizer ;
>>          text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>>          ] ]
>>          ) .
>>
>> :service_tdb_all  a                   fuseki:Service ;
>>         rdfs:label                    "TDB BnF_text" ;
>>         fuseki:dataset               :text_dataset ; ### works for text index
>>         fuseki:name                   "BnF_text" ;
>>         fuseki:serviceQuery           "query" , "sparql" ;
>>         fuseki:serviceReadGraphStore  "get" ;
>>         fuseki:serviceReadWriteGraphStore "data" ;
>>         fuseki:serviceUpdate          "update" ;
>>         fuseki:serviceUpload          "upload" .
>>
>>
> 
> 

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: fuseki text:query : strange results + Lucene configuration

Posted by Vincent Ventresque <vi...@ens-lyon.fr>.
Hello,


I've made new tests with a slightly different dataset and configuration, 
the problem is the same.

--- Could you please tell me if these results are normal (I expected a 
bigger list with fewer letters)?

?uriBnF text:query ( foaf:givenName "*J*" ) => 3 entries

?uriBnF text:query ( foaf:givenName "*Ja*" ) => 1 entry

?uriBnF text:query ( foaf:givenName "*Je*" ) => 11 entries

?uriBnF text:query ( foaf:givenName "*-J*" ) => 11 entries

?uriBnF text:query ( foaf:givenName "*Jea*" ) => 12 entries

?uriBnF text:query ( foaf:givenName "*Jac*" ) => 13 entries

Here is the complete query :

SELECT * WHERE { ?uriBnF text:query ( foaf:familyName "roussea*" ) . 
?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .

?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom }

N.B.: the dataset is quite large: 1.78 M family names indexed, and 
1.71 M given names. I have 4 distinct "Jean-Jacques Rousseau" in the 
data, 713 family names containing "roussea", including 224 compound 
given names.

--- Do you know where to find more documentation about Lucene 
configuration (I read jena.apache.org page + , and also found useful 
explanations on Skosmos wiki https://github.com/NatLibFi/Skosmos ), 
especially about tokenizers  ?


Thanks in advance,

VV





  

On 19/07/2018 at 14:07, Vincent Ventresque wrote:
> Hello,
>
> I've just subscribed to the users@jena.apache.org list, and I 
> apologize if this mail is not sent properly.
>
> I'm trying to use Fuseki text:query, and have encountered several 
> issues. Here are my questions
>
> 1) Does text:query require a minimum number of characters to be 
> efficient?
>
> 2) Is performance linked to the number of fields indexed?
>
> 3) In order to retrieve strings containing hyphens, should I use 
> KeywordTokenizer in config file?
>
> ~~~ 1) Does text:query require a minimum number of characters to be 
> efficient? ~~~~~~~~~~~~~
>
> I've noticed that a query on indexed predicates (foaf:familyName and 
> foaf:givenName) returns more results when there are more characters in 
> the string :
>
> SELECT * WHERE {
>
> ?uriBnF text:query ( foaf:familyName "roussea*" ) .
>
> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
>
> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom .
>
> optional {?uriBnF bio:birth ?dateNaissance }
>
> }
>
> I was expecting that "Rousseau" + "Jean-Jacques" would be in the results.
>
> => if  $MY_STRING = "j*", I get  0 result
>
> => if  $MY_STRING = "je*", I get 17 results, including "Jean-Claude" & 
> "Jean-Baptiste" BUT not "Jean-Jacques"
>
> => if  $MY_STRING = "jea*", I get 27 results, including "Jean-Jacques"
>
> I don't know anything about Lucene, but it looks very strange to me : 
> I expected the contrary (fewer letters = bigger results list).
>
>
> ~~~ 2) Is performance linked to the number of fields indexed? 
> ~~~~~~~~~~~~~~~~~~~~~~~
>
> If I change the configuration and index only foaf:givenName, and 
> provide a constant for foaf:familyName, the query returns more results :
>
> SELECT * WHERE {
>
> ?uriBnF foaf:familyName "Rousseau" .
>
> ?uriBnF text:query ( foaf:givenName "$MY_STRING" ) .
>
> ?uriBnF foaf:familyName ?nom .  ?uriBnF foaf:givenName ?prenom .
>
> optional {?uriBnF bio:birth ?dateNaissance }
>
> }
>
> => if  $MY_STRING = "j*", I get  7 results, whereas the first query 
> returned 0 result.
>
>
> ~~~ 3) In order to retrieve strings containing hyphens, should I use 
> KeywordTokenizer in config file? ~~~~~~~~~~~~~
>
> With the same query, if $MY_STRING = "jean-ja*" :
>
> a) with simple configuration (cf. below), I get 0 result
>
> b) with KeywordTokenizer config (cf. below), I get "Jean-Jacques"
>
> Is it the right way to get "Jean-Jacques"?
>
>
> Thanks in advance
>
> VV
>
>
>
> =============== SIMPLE CONFIGURATION ===================
>
> @prefix :        <#> .
> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
> @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
> @prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
> @prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
> @prefix text:    <http://jena.apache.org/text#> .
> @prefix fuseki:  <http://jena.apache.org/fuseki#> .
> @prefix foaf: <http://xmlns.com/foaf/0.1/> .
> @prefix dcterms: <http://purl.org/dc/terms/> .
>
>
>
> [] rdf:type fuseki:Server ;
>    .
>
>
> ## Initialize TDB --------------------------------
>
> [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
> tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
> tdb:GraphTDB    rdfs:subClassOf  ja:Model .
>
> ## Initialize text query -------------------------------------
> [] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
> # A TextDataset is a regular dataset with a text index.
> text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
> # Lucene index
> text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
>
> ## ---------------------------------------------------------------
> ## This URI must be fixed - it's used to assemble the text dataset.
>
> :text_dataset rdf:type     text:TextDataset ;
> #    text:dataset   <#dataset> ;
>     text:dataset :tdb_dataset_readwrite ;
> #    text:index     <#indexLucene> ;
>     text:index :My_Lucene_index ;
>     .
>
> # A TDB dataset used for RDF storage ------------------------------
> :tdb_dataset_readwrite
>         a             tdb:DatasetTDB ;
>         tdb:location  "$_BnF_text" ;
> .
>
> # Text index description ------------------------------------------
> #<#indexLucene> a text:TextIndexLucene ;
> :My_Lucene_index a text:TextIndexLucene ;
>     text:directory <file:$_Lucene> ;
>     text:entityMap <#entMap> ;
>     .
>
> # Mapping in the index ---------------------------------------------
> # URI stored in field "uri"
> <#entMap> a text:EntityMap ;
>     text:entityField      "uri" ;
>     text:defaultField     "familyName" ;
>     text:map (
>          [ text:field "familyName" ; text:predicate foaf:familyName ]
>          [ text:field "givenName" ; text:predicate foaf:givenName ]
>          ) .
>
> :service_tdb_all  a                   fuseki:Service ;
>         rdfs:label                    "TDB BnF_text" ;
>         fuseki:dataset               :text_dataset ;
>         fuseki:name                   "BnF_text" ;
>         fuseki:serviceQuery           "query" , "sparql" ;
>         fuseki:serviceReadGraphStore  "get" ;
>         fuseki:serviceReadWriteGraphStore "data" ;
>         fuseki:serviceUpdate          "update" ;
>         fuseki:serviceUpload          "upload" .
>
>
> =========== KEYWORD TOKENIZER CONFIGURATION ================
>
> @prefix :        <#> .
> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
> @prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
> @prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
> @prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .
> @prefix text:    <http://jena.apache.org/text#> .
> @prefix fuseki:  <http://jena.apache.org/fuseki#> .
> @prefix foaf: <http://xmlns.com/foaf/0.1/> .
> @prefix dcterms: <http://purl.org/dc/terms/> .
>
>
>
> [] rdf:type fuseki:Server ;
>
>    .
>
>
> ## Initialize TDB --------------------------------
>
> [] ja:loadClass "com.hp.hpl.jena.tdb.TDB" .
> tdb:DatasetTDB  rdfs:subClassOf  ja:RDFDataset .
> tdb:GraphTDB    rdfs:subClassOf  ja:Model .
>
> ## Initialize text query -------------------------------------
> [] ja:loadClass       "org.apache.jena.query.text.TextQuery" .
> # A TextDataset is a regular dataset with a text index.
> text:TextDataset      rdfs:subClassOf   ja:RDFDataset .
> # Lucene index
> text:TextIndexLucene  rdfs:subClassOf   text:TextIndex .
>
> ## ---------------------------------------------------------------
>
>
> :text_dataset rdf:type     text:TextDataset ;
> #    text:dataset   <#dataset> ;
>     text:dataset :tdb_dataset_readwrite ;
> #    text:index     <#indexLucene> ;
>     text:index :My_Lucene_index ;
>     .
>
> # A TDB dataset used for RDF storage ------------------------------
> :tdb_dataset_readwrite
>         a             tdb:DatasetTDB ;
>         tdb:location  "$_BnF_text" ;
> .
>
> # Text index description ------------------------------------------
> #<#indexLucene> a text:TextIndexLucene ;
> :My_Lucene_index a text:TextIndexLucene ;
>     text:directory <file:$_Lucene> ;
>     text:entityMap <#entMap> ;
>     .
>
> # Mapping in the index ---------------------------------------------
> # URI stored in field "uri"
> <#entMap> a text:EntityMap ;
>     text:entityField      "uri" ;
>     text:defaultField     "givenName" ;
>     text:map (
>
>          [ text:field "familyName" ; text:predicate foaf:familyName ;
>          text:analyzer [ a text:ConfigurableAnalyzer ;
>                text:tokenizer text:KeywordTokenizer ;
>                text:filters (text:ASCIIFoldingFilter 
> text:LowerCaseFilter)
>              ] ]
>          [ text:field "givenName" ; text:predicate foaf:givenName ;
>         text:analyzer [ a text:ConfigurableAnalyzer ;
>          text:tokenizer text:KeywordTokenizer ;
>          text:filters (text:ASCIIFoldingFilter text:LowerCaseFilter)
>          ] ]
>          ) .
>
> :service_tdb_all  a                   fuseki:Service ;
>         rdfs:label                    "TDB BnF_text" ;
>         fuseki:dataset               :text_dataset ; ### works for text index
>         fuseki:name                   "BnF_text" ;
>         fuseki:serviceQuery           "query" , "sparql" ;
>         fuseki:serviceReadGraphStore  "get" ;
>         fuseki:serviceReadWriteGraphStore "data" ;
>         fuseki:serviceUpdate          "update" ;
>         fuseki:serviceUpload          "upload" .
>