
Posted to users@jena.apache.org by Jean-Marc Vanel <je...@gmail.com> on 2020/06/04 09:25:03 UTC

slow loading in TDB with Lucene

Hi

It took hours to load a TTL document with text indexing (in TDB 3.15.0).
The TTL document is Taxrefld_taxonomy_classes.ttl (size: 2,676,428 triples),
found in the archive taxref12-core.zip:
<https://github.com/frmichel/taxref-ld/blob/master/dataset/12.0/taxref12-core.zip>

This method in DatasetGraph is called:
    public void add(Node g, Node s, Node p, Node o) ;

With logging at debug level, it appeared that most of the elapsed time is
spent removing the graph, one entity at a time.
I explicitly call removeGraph() before loading, because the data is
stored in provenance-specific graphs in this database.

Is there a way to speed this up?
I wondered whether wrapping the removeGraph() operation in a transaction is
mandatory or merely useful. At runtime Jena does not complain when there is
none.

A typical block in the data:
<http://taxref.mnhn.fr/lod/taxon/629656/12.0>
        a                            owl:Class ;
        rdfs:isDefinedBy             <http://taxref.mnhn.fr/lod/taxref-ld/12.0> ;
*        rdfs:label                   "Eranthemum pulchellum" ;*
        rdfs:subClassOf              <http://taxref.mnhn.fr/lod/taxon/452421/12.0> ;
        schema:mainEntityOfPage      <https://inpn.mnhn.fr/espece/cd_nom/629656?lg=en> ;
        taxrefprop:habitat           taxrefhab:FreshWater , taxrefhab:Terrestrial ;
        taxrefprop:hasRank           taxrefrk:Species ;
        taxrefprop:hasReferenceName  <http://taxref.mnhn.fr/lod/name/629656> ;
        taxrefprop:hasSynonym        <http://taxref.mnhn.fr/lod/name/633029> ,
                                     <http://taxref.mnhn.fr/lod/name/637984> ,
                                     <http://taxref.mnhn.fr/lod/name/634312> ;
        foaf:homepage                <https://inpn.mnhn.fr/espece/cd_nom/629656?lg=en> .

Jean-Marc Vanel
<http://semantic-forms.cc:9112/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me>
+33 (0)6 89 16 29 52
Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui
 Chroniques jardin
<http://semantic-forms.cc:1952/history?uri=http%3A%2F%2Fdbpedia.org%2Fresource%2FChronicle>

Re: slow loading in TDB with Lucene

Posted by Andy Seaborne <an...@apache.org>.


On 17/06/2020 17:11, Jean-Marc Vanel wrote:
> Sorry for the late answer.
> I'm aware of the downside of autocommit, which I never use.
> I did wrap the call to removeGraph in a transaction.
> I'll make the measurements you asked for, to determine the respective CPU
> and elapsed times for loading the RDF and indexing the text.
> 
> But for the time being, I had to solve my issue of loading data without
> stopping my SPARQL + HTML server.
> So I wrote a client RDF uploader that talks the SPARQL Graph Store
> Protocol:
> https://www.w3.org/TR/sparql11-http-rdf-update/
> It splits the given RDF file into chunks of 10000 triples for sending:
> https://github.com/jmvanel/semantic_forms/blob/master/scala/clients/src/main/scala/deductions/runtime/clients/RDFuploader.scala#L66
> For the first time I used the RIOT parser with a callback
> (org.apache.jena.riot.system.StreamRDFBase), which I'll also test for
> performance. It is understandable that it can be slow, since the input was
> a Turtle file, not N-Triples.
> 
> On the server side, I modularized my code, so that several TDB(1) instances
> are now created on the same directory, which is not a problem for TDB.
> But apparently it is a problem for Lucene: there is a
> LockObtainFailedException: "Lock held by this virtual machine:
> ../LUCENE/write.lock" when creating the second TDB instance connected to
> Lucene.

Re: One Lucene index shared across multiple databases.

The code isn't written to be used in this way. The locking issue could 
be made to work - I don't think there is a fundamental reason why text 
indexes can't be shared read-only across databases in the same JVM.

But update adds a complication. Having one index in multiple transaction 
controllers is not going to work.

DatasetGraphText does special things for TDB1 and TDB2.

TDB1 transaction management only works with one database and special 
TransactionLifecycle listeners.

TDB2 transaction management can, in theory, work across databases and 
extra TransactionalComponents but the code to build the compound 
transaction domain does not exist.

     Andy

> So I'll ensure that only one TDB database is instantiated.
> Or maybe I am misusing the API (the setup is configured through the API,
> not through an RDF configuration).
> 
> NOTES
> 
>     - I'm not sure whether LUCENE/write.lock is deleted in all cases when
>     closing the TDB, although closing was requested at text index creation:
>                 TextDatasetFactory.create(... closeIndexOnDSGClose = true)
>     - The Luke GUI in lucene-8.5.2 is useful for inspecting the Lucene index.
> 
> 
> Jean-Marc Vanel
> <http://semantic-forms.cc:9112/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me>
> +33
> (0)6 89 16 29 52
> 
> 

Re: slow loading in TDB with Lucene

Posted by Jean-Marc Vanel <je...@gmail.com>.

Sorry for the late answer.
I'm aware of the downside of autocommit, which I never use.
I did wrap the call to removeGraph in a transaction.
I'll make the measurements you asked for, to determine the respective CPU
and elapsed times for loading the RDF and indexing the text.

But for the time being, I had to solve my issue of loading data without
stopping my SPARQL + HTML server.
So I wrote a client RDF uploader that talks the SPARQL Graph Store
Protocol:
https://www.w3.org/TR/sparql11-http-rdf-update/
It splits the given RDF file into chunks of 10000 triples for sending:
https://github.com/jmvanel/semantic_forms/blob/master/scala/clients/src/main/scala/deductions/runtime/clients/RDFuploader.scala#L66
For the first time I used the RIOT parser with a callback
(org.apache.jena.riot.system.StreamRDFBase), which I'll also test for
performance. It is understandable that it can be slow, since the input was
a Turtle file, not N-Triples.
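
The chunking step described above can be sketched roughly as follows. This is a minimal, self-contained Java illustration and not the code of the linked RDFuploader.scala: the class BatchingSink and its methods are hypothetical names, and plain strings stand in for triples. In a real uploader, accept() would presumably be called from the parser callback, and the sender would POST each chunk to the Graph Store Protocol endpoint.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Collects items from a streaming callback and hands them on in fixed-size chunks. */
final class BatchingSink<T> {
    private final int chunkSize;
    private final Consumer<List<T>> sender;
    private List<T> current = new ArrayList<>();

    BatchingSink(int chunkSize, Consumer<List<T>> sender) {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be positive");
        this.chunkSize = chunkSize;
        this.sender = sender;
    }

    /** Called once per parsed item, for example from a parser callback. */
    void accept(T item) {
        current.add(item);
        if (current.size() == chunkSize) flush();
    }

    /** Called at end of input so that the last, partial chunk is not lost. */
    void finish() {
        if (!current.isEmpty()) flush();
    }

    private void flush() {
        sender.accept(current);
        current = new ArrayList<>(); // start a fresh chunk; the sent one is left untouched
    }

    public static void main(String[] args) {
        List<Integer> sizes = new ArrayList<>();
        // The sender here only records chunk sizes; a real one would send the chunk.
        BatchingSink<String> sink = new BatchingSink<>(10_000, chunk -> sizes.add(chunk.size()));
        for (int i = 0; i < 25_000; i++) sink.accept("triple-" + i);
        sink.finish();
        System.out.println(sizes); // [10000, 10000, 5000]
    }
}
```

Keeping the chunk size a constructor parameter makes it easy to experiment with values other than 10000 when measuring upload performance.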

On the server side, I modularized my code, so that several TDB(1) instances
are now created on the same directory, which is not a problem for TDB.
But apparently it is a problem for Lucene: there is a
LockObtainFailedException: "Lock held by this virtual machine:
../LUCENE/write.lock" when creating the second TDB instance connected to
Lucene.
So I'll ensure that only one TDB database is instantiated.
Or maybe I am misusing the API (the setup is configured through the API,
not through an RDF configuration).
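
One way to make sure that only one instance is created per directory, even when several modules ask for it, is a small registry keyed by the normalised path. The sketch below is an assumption about how this could be done, not Jena API: PerDirectoryRegistry and the opener function are hypothetical, and Object stands in for the real dataset type.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hands out one shared instance per directory within this JVM. */
final class PerDirectoryRegistry<D> {
    private final ConcurrentMap<Path, D> open = new ConcurrentHashMap<>();
    private final Function<Path, D> opener;

    PerDirectoryRegistry(Function<Path, D> opener) {
        this.opener = opener;
    }

    /** Returns the instance for this directory, opening it on first use only. */
    D get(String directory) {
        // Normalise so that "TDB" and "./TDB" map to the same key.
        Path key = Paths.get(directory).toAbsolutePath().normalize();
        return open.computeIfAbsent(key, opener);
    }

    public static void main(String[] args) {
        // Object stands in for the real dataset type; the opener is hypothetical.
        PerDirectoryRegistry<Object> registry = new PerDirectoryRegistry<>(path -> new Object());
        Object a = registry.get("TDB");
        Object b = registry.get("./TDB");
        System.out.println(a == b); // true: the second caller reuses the first instance
    }
}
```

With the text-indexed dataset created inside the opener, the index writer would be opened once per directory, which is the situation the lock error above points to.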

NOTES

   - I'm not sure whether LUCENE/write.lock is deleted in all cases when
   closing the TDB, although closing was requested at text index creation:
               TextDatasetFactory.create(... closeIndexOnDSGClose = true)
   - The Luke GUI in lucene-8.5.2 is useful for inspecting the Lucene index.


Jean-Marc Vanel
<http://semantic-forms.cc:9112/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me>
+33
(0)6 89 16 29 52



Re: slow loading in TDB with Lucene

Posted by Andy Seaborne <an...@apache.org>.


On 04/06/2020 10:25, Jean-Marc Vanel wrote:
> Hi
> 
> It took hours to load a TTL document with text indexing (in TDB 3.15.0).
> The TTL document is Taxrefld_taxonomy_classes.ttl (size: 2,676,428 triples),
> found in the archive taxref12-core.zip:
> <https://github.com/frmichel/taxref-ld/blob/master/dataset/12.0/taxref12-core.zip>

Have you tried with and without the text index, to get information 
about where the time is going?

This is a combination setup so it is harder to say where time is going 
without an experiment.
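
A rough way to run that experiment is to time the same load twice, once against a plain dataset and once against the text-indexed one, recording both elapsed and CPU time. The harness below is only a sketch under stated assumptions: LoadTimer and busyWork are hypothetical names, and the two workloads are placeholders to be replaced by the real load. CPU time is measured for the calling thread only, so work done on other threads is not counted.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

/** Measures wall-clock and CPU time of one task on the calling thread. */
final class LoadTimer {
    /** Elapsed and CPU time, both in nanoseconds (cpuNanos is -1 if unavailable). */
    static final class Timing {
        final long elapsedNanos;
        final long cpuNanos;

        Timing(long elapsedNanos, long cpuNanos) {
            this.elapsedNanos = elapsedNanos;
            this.cpuNanos = cpuNanos;
        }

        @Override
        public String toString() {
            return String.format("elapsed=%d ms, cpu=%d ms",
                    elapsedNanos / 1_000_000, cpuNanos / 1_000_000);
        }
    }

    static Timing time(Runnable task) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        boolean cpuSupported = threads.isCurrentThreadCpuTimeSupported();
        long cpuStart = cpuSupported ? threads.getCurrentThreadCpuTime() : 0L;
        long wallStart = System.nanoTime();
        task.run();
        long elapsed = System.nanoTime() - wallStart;
        long cpu = cpuSupported ? threads.getCurrentThreadCpuTime() - cpuStart : -1L;
        return new Timing(elapsed, cpu);
    }

    // Stand-in for real work so that the sketch runs on its own.
    static long busyWork(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) sum += i % 7;
        return sum;
    }

    public static void main(String[] args) {
        // Placeholders: replace with the real load, without and with the text index.
        Runnable loadWithoutTextIndex = () -> busyWork(200_000);
        Runnable loadWithTextIndex = () -> busyWork(400_000);
        System.out.println("without text index: " + time(loadWithoutTextIndex));
        System.out.println("with text index:    " + time(loadWithTextIndex));
    }
}
```

A large gap between elapsed and CPU time would suggest that the load is waiting on I/O rather than computing, which helps narrow down which part of the combined setup is slow.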

> 
> This method in DatasetGraph is called:
>      public void add(Node g, Node s, Node p, Node o) ;
> 
> With logging at debug level, it appeared that most of the elapsed time is
> spent removing the graph, one entity at a time.
> I explicitly call removeGraph() before loading, because the data is
> stored in provenance-specific graphs in this database.

The text index has to be updated as well, and I think there is nothing 
special about removeGraph for a text index, so it undoes all the indexing.

Also - Lucene indexing may be slower than the TDB part.

> 
> Is there a way to speed this up?
> I wondered whether wrapping the removeGraph() operation in a transaction is
> mandatory or merely useful.

useful - If you don't have a transaction, TDB1 is going to be less safe 
for your data.

> At runtime Jena does not complain when there is none.

TDB1 does not ... but it is better to use a transaction, and it's 
mandatory for TDB2.

Adding an autocommit mode is not as good as it may seem. Like in SQL, 
autocommit is nothing more than an automatic transaction around each 
step and very easily becomes extremely slow.
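
The cost difference can be illustrated with a toy model that only counts commits. This is not the Jena or TDB API: CountingStore and its methods are made up for the illustration, and the assumption is simply that each commit is the expensive step.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy store that counts commits; it is NOT the Jena or TDB API. */
final class CountingStore {
    private final List<String> committed = new ArrayList<>();
    private final List<String> pending = new ArrayList<>();
    private int commits = 0;

    void add(String item) { pending.add(item); }

    /** Makes pending items durable; in a real store this is the expensive step. */
    void commit() {
        committed.addAll(pending);
        pending.clear();
        commits++;
    }

    int commitCount() { return commits; }

    int size() { return committed.size(); }

    /** Autocommit: every step is wrapped in its own transaction. */
    static CountingStore loadWithAutocommit(List<String> items) {
        CountingStore store = new CountingStore();
        for (String item : items) {
            store.add(item);
            store.commit();
        }
        return store;
    }

    /** Explicit transaction: all steps share one commit. */
    static CountingStore loadInOneTransaction(List<String> items) {
        CountingStore store = new CountingStore();
        for (String item : items) {
            store.add(item);
        }
        store.commit();
        return store;
    }

    public static void main(String[] args) {
        List<String> items = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) items.add("triple-" + i);
        System.out.println("autocommit: " + loadWithAutocommit(items).commitCount() + " commits");
        System.out.println("one transaction: " + loadInOneTransaction(items).commitCount() + " commit");
    }
}
```

Both loads end with the same data; only the number of commits differs, which is why one transaction around the whole batch is usually preferable.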

     Andy
