Posted to users@jena.apache.org by "Neubert, Joachim" <J....@zbw.eu> on 2022/02/18 08:59:30 UTC

Text indexing Wikidata

Text indexing the truthy Wikidata dump took 13:10 h for 1.5 billion labels (partly using text:LowerCaseKeywordAnalyzer) on the massively parallel machine.

I observed a CPU usage of 100-250 % and wonder whether I could do something to speed it up. My command line was simply

java -cp /opt/fuseki/fuseki-server.jar jena.textindexer --debug --desc=/tmp/temp.ttl

(apache-jena-fuseki-4.5.0-SNAPSHOT)

Cheers, Joachim

--
Joachim Neubert

ZBW - Leibniz Information Centre for Economics
Neuer Jungfernstieg 21
20354 Hamburg
Phone +49-40-42834-462

Re: Text indexing Wikidata

Posted by Andy Seaborne <an...@apache.org>.


On 19/02/2022 08:00, Lorenz Buehmann wrote:
> Hi,
> 
> so far you can't do anything else: as far as I know, the whole indexing
> pipeline is single-threaded. It simply iterates over all properties
> declared for indexing and fetches the RDF triple values for each. Lucene
> indexing itself is thread-safe, so the easiest change would be to use one
> writer thread per property. That clearly would not help here, where
> rdfs:label is the only property. We would therefore also have to split
> the dataset somehow for the given property, and could then hand each
> split to a separate writer thread.
> 
> The main loop is here and makes it rather easy to understand where we 
> could introduce parallelism: 
> https://github.com/apache/jena/blob/main/jena-text/src/main/java/org/apache/jena/query/text/cmd/textindexer.java#L125-L143 
> 
> 
> Reading from a dataset with multiple threads is trivial; we just have to
> get appropriate splits. I'm not sure how easy that is: maybe a
> cursor/iterator over the subjects with different offsets, or something
> similar?

Read on a single thread,
split the triples:
collect blocks of triples (1000? 100000?) and send each block to a separate thread;
the other threads do the Lucene indexing.
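
The steps above could be sketched roughly as follows. This is only an illustration using the JDK's standard concurrency classes, not code from jena-text: the class BlockIndexingSketch, the method indexInBlocks and the indexBlock callback are hypothetical names, and plain strings stand in for triples. In a real version the callback would turn each triple into a Lucene document and add it through a shared IndexWriter, which Lucene documents as safe to use from several threads.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;
import java.util.stream.IntStream;

class BlockIndexingSketch {

    /**
     * Reads items on the calling thread, groups them into blocks of blockSize
     * and hands each block to a pool of worker threads.
     * Returns the number of items read.
     */
    static <T> long indexInBlocks(Iterator<T> source, int blockSize, int workers,
                                  Consumer<List<T>> indexBlock) throws InterruptedException {
        // Bounded queue plus CallerRunsPolicy: if the workers fall behind, the
        // reader thread processes a block itself instead of buffering without limit.
        ExecutorService pool = new ThreadPoolExecutor(
                workers, workers, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(workers * 2),
                new ThreadPoolExecutor.CallerRunsPolicy());
        long count = 0;
        List<T> block = new ArrayList<>(blockSize);
        try {
            while (source.hasNext()) {
                block.add(source.next());
                count++;
                if (block.size() == blockSize) {
                    final List<T> full = block;
                    pool.execute(() -> indexBlock.accept(full));
                    block = new ArrayList<>(blockSize);
                }
            }
            if (!block.isEmpty()) {
                // Hand over the final, partially filled block.
                final List<T> last = block;
                pool.execute(() -> indexBlock.accept(last));
            }
        } finally {
            pool.shutdown();
        }
        if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
            throw new IllegalStateException("workers did not finish in time");
        }
        return count;
    }

    public static void main(String[] args) throws InterruptedException {
        // Stand-in for the single-threaded read of label triples.
        Iterator<String> labels =
                IntStream.range(0, 2500).mapToObj(i -> "label " + i).iterator();
        AtomicLong indexed = new AtomicLong();
        // Stand-in for the Lucene indexing step: just count what arrives.
        long read = indexInBlocks(labels, 1000, 4, b -> indexed.addAndGet(b.size()));
        System.out.println("read " + read + ", indexed " + indexed.get());
    }
}
```

The block size and the number of workers are the tuning knobs here; the sketch does nothing about errors thrown inside a worker, which a real implementation would have to collect and report.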

> 
> @Andy what do you think?

Good idea.

     Andy

Re: Text indexing Wikidata

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.

Hi,

So far you can't do anything else: as far as I know, the whole indexing
pipeline is single-threaded. It simply iterates over all properties
declared for indexing and fetches the RDF triple values for each. Lucene
indexing itself is thread-safe, so the easiest change would be to use one
writer thread per property. That clearly would not help here, where
rdfs:label is the only property. We would therefore also have to split
the dataset somehow for the given property, and could then hand each
split to a separate writer thread.

The main loop is here and makes it rather easy to understand where we 
could introduce parallelism: 
https://github.com/apache/jena/blob/main/jena-text/src/main/java/org/apache/jena/query/text/cmd/textindexer.java#L125-L143

Reading from a dataset with multiple threads is trivial; we just have to
get appropriate splits. I'm not sure how easy that is: maybe a
cursor/iterator over the subjects with different offsets, or something
similar?
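
One possible way to get such splits, sketched here only as an assumption and not as anything jena-text does, is to assign each subject to a split by hashing it, so that every writer thread handles a disjoint share of the subjects for the same property. The class SubjectSplitSketch and its methods are hypothetical, subjects are plain strings, and the lists are built in memory purely for illustration; a real version would filter an iterator instead.

```java
import java.util.ArrayList;
import java.util.List;

class SubjectSplitSketch {

    /** Maps a subject to one of n splits; floorMod keeps the result in 0..n-1. */
    static int splitOf(String subject, int n) {
        return Math.floorMod(subject.hashCode(), n);
    }

    /** Distributes subjects over n lists, one per writer thread. */
    static List<List<String>> split(List<String> subjects, int n) {
        List<List<String>> splits = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            splits.add(new ArrayList<>());
        }
        for (String s : subjects) {
            splits.get(splitOf(s, n)).add(s);
        }
        return splits;
    }
}
```

Hashing gives disjoint splits without any coordination between threads, but each thread still has to see every subject in order to filter it, so this trades extra reading for simpler writers.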

@Andy what do you think?