Posted to users@jena.apache.org by Andy Seaborne <an...@apache.org> on 2020/04/01 11:04:12 UTC

Re: Apache Jena Fuseki with text indexing


On 26/03/2020 18:26, Zhenya Antić wrote:
> Andy,
> 
> I think I figured out what the issue is. It seems that I have two datasets with the same name, and one was started with the config file I sent (and has no data in it - and hence it is not indexed), and the other was started without a config file (like this: fuseki-server --port 3030 --loc="db" /biology), and it has the data.
> 
> How do I transfer the data from one to other?

The safest way is to reload.
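
A rough sketch of what a reload could look like, not a definitive recipe. It assumes the two servers are run one after the other on localhost:3030, that all the data sits in the default graph, and that the endpoint names match the config quoted below ("get" for reading, "data" for writing, "ds" for the text-indexed service); a server started from the command line may expose different names. The helper names and the file name dump.ttl are made up for illustration.

```python
"""Sketch: dump the default graph from one Fuseki service, reload it into another."""
import sys
import urllib.request


def gsp_url(server, dataset, endpoint):
    """Build a Graph Store Protocol URL that addresses the default graph."""
    return f"{server.rstrip('/')}/{dataset.strip('/')}/{endpoint}?default"


def download_request(url):
    """GET request asking for the graph serialized as Turtle."""
    return urllib.request.Request(url, headers={"Accept": "text/turtle"}, method="GET")


def upload_request(url, payload):
    """POST request that adds Turtle data to the addressed graph."""
    return urllib.request.Request(
        url, data=payload, headers={"Content-Type": "text/turtle"}, method="POST"
    )


def dump(url, path):
    """Save the graph at `url` to a local file."""
    with urllib.request.urlopen(download_request(url)) as resp, open(path, "wb") as out:
        out.write(resp.read())


def load(url, path):
    """Send a local Turtle file to `url`; returns the HTTP status code."""
    with open(path, "rb") as src:
        payload = src.read()
    with urllib.request.urlopen(upload_request(url, payload)) as resp:
        return resp.status


if __name__ == "__main__":
    server = "http://localhost:3030"  # assumed host and port
    step = sys.argv[1] if len(sys.argv) > 1 else ""
    if step == "dump":    # run while the server that holds the data is up
        dump(gsp_url(server, "biology", "get"), "dump.ttl")
    elif step == "load":  # run while the server with the text index is up
        print(load(gsp_url(server, "ds", "data"), "dump.ttl"))
    else:
        print("usage: reload.py dump|load")
```

Because the upload goes through the text-indexed service, the data is indexed as it is added. Data held in named graphs would need a quads dump instead, which this sketch does not cover.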

You can copy files around when the server is not running (for both TDB 
and Lucene), but that is error-prone and only works if the 
target is empty.
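
If you do go the copy route, a small guard can at least enforce the "target is empty" condition. This is only a sketch: the function name and the paths are hypothetical, it needs Python 3.8 or later, and it cannot check that the server is stopped, so that part is still up to you.

```python
import shutil
from pathlib import Path


def copy_into_empty(source, target):
    """Copy a database or index directory, refusing to touch a non-empty target."""
    source, target = Path(source), Path(target)
    if not source.is_dir():
        raise FileNotFoundError(f"no such directory: {source}")
    # Only proceed when the target is missing or is an empty directory.
    if target.exists() and any(target.iterdir()):
        raise FileExistsError(f"target is not empty: {target}")
    shutil.copytree(source, target, dirs_exist_ok=True)
    return target
```

For example, copy_into_empty("old/db", "new/db") would copy a TDB directory, and the same call with the Lucene directory would copy the index; both paths here are placeholders.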

     Andy

> 
> Thanks,
> Zhenya
> 
> 
> On Thu, Mar 26, 2020, at 12:22 PM, Chris Tomlinson wrote:
>> Zhenya,
>>
>> Do you see any content in the directory:
>>
>>> text:directory <file:data/luceneIndexing> ;
>>
>> like the following partial listing:
>>
>>> fuseki@foo :~/base/lucene-test$ ls -l
>>> total 3608108
>>> -rw-rw---- 1 fuseki fuseki 7772 Jan 29 21:15 _19a_5x.liv
>>> -rw-r----- 1 fuseki fuseki 299 Jan 21 15:53 _19a.cfe
>>> -rw-r----- 1 fuseki fuseki 36547721 Jan 21 15:53 _19a.cfs
>>> -rw-r----- 1 fuseki fuseki 443 Jan 21 15:53 _19a.si
>>> -rw-r----- 1 fuseki fuseki 23621 Jan 21 15:53 _24_17n.liv
>>> -rw-r----- 1 fuseki fuseki 22718569 Jan 21 15:53 _24.fdt
>>> -rw-r----- 1 fuseki fuseki 9184 Jan 21 15:53 _24.fdx
>>> -rw-r----- 1 fuseki fuseki 12975 Jan 21 15:53 _24.fnm
>>> -rw-r----- 1 fuseki fuseki 7009762 Jan 21 15:53 _24_Lucene50_0.doc
>>> -rw-r----- 1 fuseki fuseki 3804794 Jan 21 15:53 _24_Lucene50_0.pos
>>> -rw-r----- 1 fuseki fuseki 16186474 Jan 21 15:53 _24_Lucene50_0.tim
>>> -rw-r----- 1 fuseki fuseki 103945 Jan 21 15:53 _24_Lucene50_0.tip
>>> -rw-r----- 1 fuseki fuseki 667296 Jan 21 15:53 _24.nvd
>>> -rw-r----- 1 fuseki fuseki 4027 Jan 21 15:53 _24.nvm
>>> -rw-r----- 1 fuseki fuseki 540 Jan 21 15:53 _24.si
>>
>> Also, if you don’t have text:storeValues true, then queries like:
>>
>>   (?s ?score ?lit) text:query "ribosome"
>>
>> won’t bind anything to ?lit. text:storeValues is set like:
>>
>>> # Text index description
>>> :test_lucene_index a text:TextIndexLucene ;
>>> text:directory <file:/usr/local/fuseki/base/lucene-test> ;
>>> text:storeValues true ;
>>> text:entityMap :test_entmap ;
>>
>>
>> Also, you need to reload the data if you change the configuration, so that the data is indexed according to the new configuration.
>>
>> ciao,
>> Chris
>>
>>
>>> On Mar 26, 2020, at 10:33 AM, Zhenya Antić <zh...@fastmail.com> wrote:
>>>
>>> @prefix : <http://base/#> .
>>> @prefix tdb2: <http://jena.apache.org/2016/tdb#> .
>>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>> @prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>> @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
>>> @prefix fuseki: <http://jena.apache.org/fuseki#> .
>>> @prefix text: <http://jena.apache.org/text#> .
>>>
>>> <http://jena.apache.org/2016/tdb#DatasetTDB>
>>> rdfs:subClassOf ja:RDFDataset .
>>>
>>> ja:DatasetTxnMem rdfs:subClassOf ja:RDFDataset .
>>>
>>> tdb2:DatasetTDB2 rdfs:subClassOf ja:RDFDataset .
>>>
>>> tdb2:GraphTDB2 rdfs:subClassOf ja:Model .
>>>
>>> <http://jena.apache.org/2016/tdb#GraphTDB2>
>>> rdfs:subClassOf ja:Model .
>>>
>>> ja:MemoryDataset rdfs:subClassOf ja:RDFDataset .
>>>
>>> ja:RDFDatasetZero rdfs:subClassOf ja:RDFDataset .
>>>
>>> <http://jena.apache.org/text#TextDataset>
>>> rdfs:subClassOf ja:RDFDataset .
>>>
>>> :service_tdb_all a fuseki:Service ;
>>> rdfs:label "TDB biology" ;
>>> fuseki:dataset :tdb_dataset_readwrite ;
>>> fuseki:name "biology" ;
>>> fuseki:serviceQuery "query" , "" , "sparql" ;
>>> fuseki:serviceReadGraphStore "get" ;
>>> fuseki:serviceReadQuads "" ;
>>> fuseki:serviceReadWriteGraphStore
>>> "data" ;
>>> fuseki:serviceReadWriteQuads "" ;
>>> fuseki:serviceUpdate "" , "update" ;
>>> fuseki:serviceUpload "upload" .
>>>
>>> :tdb_dataset_readwrite
>>> a tdb2:DatasetTDB2 ;
>>> tdb2:location "db" .
>>>
>>> <http://jena.apache.org/2016/tdb#GraphTDB>
>>> rdfs:subClassOf ja:Model .
>>>
>>> ja:RDFDatasetOne rdfs:subClassOf ja:RDFDataset .
>>>
>>> ja:RDFDatasetSink rdfs:subClassOf ja:RDFDataset .
>>>
>>> <http://jena.apache.org/2016/tdb#DatasetTDB2>
>>> rdfs:subClassOf ja:RDFDataset .
>>>
>>> <#dataset> rdf:type tdb2:DatasetTDB2 ;
>>> tdb2:location "db" ; #path to TDB;
>>> .
>>>
>>> # Text index description
>>> :text_dataset rdf:type text:TextDataset ;
>>> text:dataset <#dataset> ; # <-- replace `:my_dataset` with the desired URI
>>> text:index <#indexLucene> ;
>>> .
>>>
>>> <#indexLucene> a text:TextIndexLucene ;
>>> text:directory <file:data/luceneIndexing> ;
>>> text:entityMap <#entMap> ;
>>> .
>>>
>>> <#entMap> a text:EntityMap ;
>>> text:defaultField "text" ;
>>> text:entityField "uri" ;
>>> text:map (
>>> #RDF label abstracts
>>> [ text:field "text" ;
>>> text:predicate <http://www.w3.org/1999/02/22-rdf-syntax-ns#label> ;
>>> text:analyzer [
>>> a text:StandardAnalyzer
>>> ]
>>> ]
>>> [ text:field "text" ;
>>> text:predicate <http://www.w3.org/1999/02/22-rdf-syntax-ns#synonym> ;
>>> text:analyzer [
>>> a text:StandardAnalyzer
>>> ]
>>> ]
>>> ) .
>>>
>>>
>>>
>>> <#service_text_tdb> rdf:type fuseki:Service ;
>>> fuseki:name "ds" ;
>>> fuseki:serviceQuery "query" ;
>>> fuseki:serviceQuery "sparql" ;
>>> fuseki:serviceUpdate "update" ;
>>> fuseki:serviceUpload "upload" ;
>>> fuseki:serviceReadGraphStore "get" ;
>>> fuseki:serviceReadWriteGraphStore "data" ;
>>> fuseki:dataset :text_dataset ;
>>> .
>>>
>>>
>>>
>>> On Thu, Mar 26, 2020, at 11:31 AM, Zhenya Antić wrote:
>>>> Hi Andy,
>>>>
>>>> Thanks. So I think I have all the lines you listed in the .ttl file (attached). I also checked, the data file contains the relevant data. But I have 0 properties indexed.
>>>>
>>>> Thanks,
>>>> Zhenya
>>>>
>>>>
>>>>
>>>> On Wed, Mar 25, 2020, at 4:41 AM, Andy Seaborne wrote:
>>>>>
>>>>>
>>>>> On 24/03/2020 15:11, Zhenya Antić wrote:
>>>>>> Hi Andy,
>>>>>>
>>>>>>> Did you load the data before attaching the text index?
>>>>>>
>>>>>> How do I do it (or not do it, wasn't sure from your post)?
>>>>>
>>>>> Set up the Fuseki system, with the text index as the Fuseki service dataset:
>>>>>
>>>>> fuseki:name "biology" ;
>>>>> fuseki:dataset :text_dataset ;
>>>>> ...
>>>>>
>>>>> :text_dataset rdf:type text:TextDataset ;
>>>>> text:dataset <#dataset> ;
>>>>>
>>>>>
>>>>>
>>>>> <#dataset> rdf:type tdb2:DatasetTDB2 ;
>>>>> tdb2:location "db" ; #path to TDB;
>>>>> .
>>>>>
>>>>> then send the data to /biology/data (which is the SPARQL GSP write
>>>>> endpoint) or however you want to push the data to the server (SPARQL
>>>>> Update, or the UI).
>>>>>
>>>>> For very large data:
>>>>>
>>>>> Load the TDB2 dataset offline
>>>>> Then run the "jena.textindexer" utility
>>>>>
>>>>> https://jena.apache.org/documentation/query/text-query.html#configuration
>>>>>
>>>>> The first way is easier.
>>>>>
>>>>> Andy
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Zhenya
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, Mar 22, 2020, at 9:18 AM, Andy Seaborne wrote:
>>>>>>> Just checking one point:
>>>>>>>
>>>>>>> Did you load the data before attaching the text index?
>>>>>>>
>>>>>>> The text index is calculated as data is added so if you first load the
>>>>>>> dataset then setup a text index, it will miss indexing the data.
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>> On 21/03/2020 07:55, Lorenz Buehmann wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> welcome to Semantic Web and Apache Jena.
>>>>>>>>
>>>>>>>> Comments inline:
>>>>>>>>
>>>>>>>> On 20.03.20 15:36, Zhenya Antić wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I am a beginner with Fuseki, knowledge graphs and SPARQL, so please forgive me if the questions seem obvious; the learning curve for this turned out to be quite steep.
>>>>>>>> No problem, nothing is simple in the beginning.
>>>>>>>>>
>>>>>>>>> I am trying to get text indexing to work with my Fuseki knowledge graph.
>>>>>>>> Which DBpedia dataset did you load? I mean, which files?
>>>>>>>>>
>>>>>>>>> For starters, I tried using a regular expression, but that didn't work:
>>>>>>>>>
>>>>>>>>> Just a plain query like this:
>>>>>>>>> SELECT DISTINCT * WHERE {
>>>>>>>>> ?s ?p ?o
>>>>>>>>> }
>>>>>>>>> gives 98 results such as:
>>>>>>>>>
>>>>>>>>> 1
>>>>>>>>> <http://dbpedia.org/ontology/wikiPageID:9127632>
>>>>>>>>> <http://www.w3.org/1999/02/22-rdf-syntax-ns#label>
>>>>>>>>> <http://dbpedia.org/resource/Biology>
>>>>>>>>> 2
>>>>>>>>> <http://dbpedia.org/ontology/wikiPageID:9127632>
>>>>>>>>> <http://www.w3.org/1999/02/22-rdf-syntax-ns#label>
>>>>>>>>> <http://dbpedia.org/resource/Biology#Branches>
>>>>>>>>> 3
>>>>>>>>> <http://dbpedia.org/ontology/wikiPageID:9127632>
>>>>>>>>> <http://www.w3.org/1999/02/22-rdf-syntax-ns#synonym>
>>>>>>>>> <http://www.w3.org/1999/02/22-rdf-syntax-ns#branches_of_biology>
>>>>>>>>> 4
>>>>>>>>> <http://dbpedia.org/ontology/wikiPageID:18393>
>>>>>>>>> <http://www.w3.org/1999/02/22-rdf-syntax-ns#label>
>>>>>>>>> <http://dbpedia.org/resource/Life>
>>>>>>>> That can't be the correct output of this query. rdfs:label should return
>>>>>>>> literals as object (?o) - or you loaded some really weird data
>>>>>>>>>
>>>>>>>>> But a query with a regular expression:
>>>>>>>>> SELECT DISTINCT * WHERE {
>>>>>>>>> ?s ?p ?o
>>>>>>>>> FILTER regex(?o, "Biol", "i")
>>>>>>>>> }
>>>>>>>>
>>>>>>>> 1. you should help the query engine and use rdfs:label as property
>>>>>>>>
>>>>>>>> 2. you should use str() function on the ?o values:
>>>>>>>>
>>>>>>>> SELECT DISTINCT * WHERE {
>>>>>>>> ?s rdfs:label ?o
>>>>>>>> FILTER regex(str(?o), "Biol", "i")
>>>>>>>> }
>>>>>>>>
>>>>>>>>> gives 0 results, although there are clearly results that contain "Biol".
>>>>>>>>
>>>>>>>>
>>>>>>>> I'll have to try your config, or maybe others will spot the issue in the meantime.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I also tried setting up indexing with a .ttl file, however the result was "INFO 0 (0 per second) properties indexed". .ttl file below:
>>>>>>>>>
>>>>>>>>> @prefix : <http://base/#> .
>>>>>>>>> @prefix tdb2: <http://jena.apache.org/2016/tdb#> .
>>>>>>>>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>>>>>>>> @prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .
>>>>>>>>> @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
>>>>>>>>> @prefix fuseki: <http://jena.apache.org/fuseki#> .
>>>>>>>>> @prefix text: <http://jena.apache.org/text#> .
>>>>>>>>>
>>>>>>>>> <http://jena.apache.org/2016/tdb#DatasetTDB>
>>>>>>>>> rdfs:subClassOf ja:RDFDataset .
>>>>>>>>>
>>>>>>>>> ja:DatasetTxnMem rdfs:subClassOf ja:RDFDataset .
>>>>>>>>>
>>>>>>>>> tdb2:DatasetTDB2 rdfs:subClassOf ja:RDFDataset .
>>>>>>>>>
>>>>>>>>> tdb2:GraphTDB2 rdfs:subClassOf ja:Model .
>>>>>>>>>
>>>>>>>>> <http://jena.apache.org/2016/tdb#GraphTDB2>
>>>>>>>>> rdfs:subClassOf ja:Model .
>>>>>>>>>
>>>>>>>>> ja:MemoryDataset rdfs:subClassOf ja:RDFDataset .
>>>>>>>>>
>>>>>>>>> ja:RDFDatasetZero rdfs:subClassOf ja:RDFDataset .
>>>>>>>
>>>>>>> The rdfs:subClassOf should not be necessary (recent versions of Fuseki).
>>>>>>>
>>>>>>> If any are, let us know so it can be fixed.
>>>>>>>
>>>>>>>>>
>>>>>>>>> <http://jena.apache.org/text#TextDataset>
>>>>>>>>> rdfs:subClassOf ja:RDFDataset .
>>>>>>>>>
>>>>>>>>> :service_tdb_all a fuseki:Service ;
>>>>>>>>> rdfs:label "TDB biology" ;
>>>>>>>>> fuseki:dataset :tdb_dataset_readwrite ;
>>>>>>>>> fuseki:name "biology" ;
>>>>>>>>> fuseki:serviceQuery "query" , "" , "sparql" ;
>>>>>>>>> fuseki:serviceReadGraphStore "get" ;
>>>>>>>>> fuseki:serviceReadQuads "" ;
>>>>>>>>> fuseki:serviceReadWriteGraphStore
>>>>>>>>> "data" ;
>>>>>>>>> fuseki:serviceReadWriteQuads "" ;
>>>>>>>>> fuseki:serviceUpdate "" , "update" ;
>>>>>>>>> fuseki:serviceUpload "upload" .
>>>>>>>>>
>>>>>>>>> :tdb_dataset_readwrite
>>>>>>>>> a tdb2:DatasetTDB2 ;
>>>>>>>>> tdb2:location "db" .
>>>>>>>>>
>>>>>>>>> <http://jena.apache.org/2016/tdb#GraphTDB>
>>>>>>>>> rdfs:subClassOf ja:Model .
>>>>>>>>>
>>>>>>>>> ja:RDFDatasetOne rdfs:subClassOf ja:RDFDataset .
>>>>>>>>>
>>>>>>>>> ja:RDFDatasetSink rdfs:subClassOf ja:RDFDataset .
>>>>>>>>>
>>>>>>>>> <http://jena.apache.org/2016/tdb#DatasetTDB2>
>>>>>>>>> rdfs:subClassOf ja:RDFDataset .
>>>>>>>>>
>>>>>>>>> <#dataset> rdf:type tdb2:DatasetTDB2 ;
>>>>>>>>> tdb2:location "db" ; #path to TDB;
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>> # Text index description
>>>>>>>>> :text_dataset rdf:type text:TextDataset ;
>>>>>>>>> text:dataset <#dataset> ; # <-- replace `:my_dataset` with the desired URI
>>>>>>>>> text:index <#indexLucene> ;
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>> <#indexLucene> a text:TextIndexLucene ;
>>>>>>>>> text:directory <file:data/luceneIndexing> ;
>>>>>>>>> text:entityMap <#entMap> ;
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>> <#entMap> a text:EntityMap ;
>>>>>>>>> text:defaultField "text" ;
>>>>>>>>> text:entityField "uri" ;
>>>>>>>>> text:map (
>>>>>>>>> #RDF label abstracts
>>>>>>>>> [ text:field "text" ;
>>>>>>>>> text:predicate <http://www.w3.org/1999/02/22-rdf-syntax-ns#label> ;
>>>>>>>>> text:analyzer [
>>>>>>>>> a text:StandardAnalyzer
>>>>>>>>> ]
>>>>>>>>> ]
>>>>>>>>> [ text:field "text" ;
>>>>>>>>> text:predicate <http://www.w3.org/1999/02/22-rdf-syntax-ns#synonym> ;
>>>>>>>>> text:analyzer [
>>>>>>>>> a text:StandardAnalyzer
>>>>>>>>> ]
>>>>>>>>> ]
>>>>>>>>> ) .
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> <#service_text_tdb> rdf:type fuseki:Service ;
>>>>>>>>> fuseki:name "ds" ;
>>>>>>>>> fuseki:serviceQuery "query" ;
>>>>>>>>> fuseki:serviceQuery "sparql" ;
>>>>>>>>> fuseki:serviceUpdate "update" ;
>>>>>>>>> fuseki:serviceUpload "upload" ;
>>>>>>>>> fuseki:serviceReadGraphStore "get" ;
>>>>>>>>> fuseki:serviceReadWriteGraphStore "data" ;
>>>>>>>>> fuseki:dataset :text_dataset ;
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>> Thank you so much in advance,
>>>>>>>>>
>>>>>>>>> __________________________
>>>>>>>>> Zhenya Antić, PhD
>>>>>>>>> Natural Language Processing
>>>>>>>>> https://www.linkedin.com/in/zhenya-antic/
>>>>>>>>>
>>>>>>>>> Practical Linguistics Inc
>>>>>>>>> http://www.practicallinguistics.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>
>>
>