Posted to users@jena.apache.org by Wolfgang Fahl <wf...@bitplan.com> on 2020/05/20 05:56:49 UTC

Apache Jena tdbloader performance and limits

Dear Apache Jena users,

Some two years ago Laura Morales and Dick Murray had an exchange on this
list about how to influence the performance of tdbloader. The issue is
of interest to me again in the context of trying to load some 15 billion
triples from a copy of Wikidata. At
http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData I have
documented what I am trying to accomplish, and a few days ago I posted a
question on Stack Overflow,
https://stackoverflow.com/questions/61813248/jena-tdbloader2-performance-and-limits
with the following three questions:

*What is proven to speed up the import without investing in extra
hardware?*

e.g. splitting the files, changing VM arguments, running multiple
processes ... (a small sketch of the VM-argument option follows after
the three questions)

*What explains the decreasing speed at higher numbers of triples and how
can this be avoided?*

*What successful multi-billion triple imports for Jena do you know of and
what are the circumstances for these?*
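
As a rough sketch of the VM-argument option: the Jena command-line
scripts read the JVM_ARGS environment variable, so the bulk loader can
be given a larger heap, for example (the value below is a placeholder,
not a recommendation):

  # sketch only: adjust the heap to the machine
  export JVM_ARGS="-Xmx8G"
  tdbloader --loc /data/tdb latest-truthy.nt.gz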

The question has had some 50 views and a few comments so far, but there
is no real hint yet on what could improve things.

The Java VM crashes that happened with different Java environments on
the Mac OS X machine are especially disappointing: even at a slow speed
the import would have finished after a while, but with a crash it is a
never-ending story.

I am curious to learn what your experience and advice are.

Yours

  Wolfgang

-- 


Wolfgang Fahl
Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
Tel. +49 2154 811-480, Fax +49 2154 811-481
Web: http://www.bitplan.de


Re: Apache Jena tdbloader performance and limits

Posted by Dick Murray <da...@gmail.com>.
Laura had a very specific requirement to load the whole of Wikidata,
which I believe is ~100GB in bz2 format.

The split isn't too complex: the uncompressed file was run through
sort, then uniq, and then split. split was run with both -b and -l,
because some lines are very long!
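
A sketch of that preprocessing, assuming GNU coreutils, an N-Triples
dump, and a scratch directory with enough room for sort's temporary
files; the chunk size is arbitrary:

  export LC_ALL=C                      # plain byte-order sort is much faster
  sort -S 4G -T /scratch dump.nt | uniq > dump-uniq.nt
  split -l 100000000 -a 3 -d --additional-suffix=.nt dump-uniq.nt chunk-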

These files were then loaded into separate TDBs. I have a script
somewhere which will download the bz2 file, apply the above processing,
and then bulk load each output into a TDB.
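
A sketch of the per-chunk bulk load, using the TDB1 loader with one
dataset directory per chunk (the paths are placeholders):

  for f in chunk-*.nt; do
    tdbloader --loc "/data/tdb-${f%.nt}" "$f"
  done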

I no longer work for the company I created this solution for; they were
importing CAD drawings as RDF, which produced billions of triples. That
said, we regularly imported 25B triples...

I do, however, now have access to 128-core 2048GB servers, so I may
revisit what can be achieved when loading triples.


Re: Apache Jena tdbloader performance and limits

Posted by Dick Murray <da...@gmail.com>.
I've just finished downloading the Wikidata latest-truthy.nt.gz (39G)
and decompressing it (605G) in ~10 hours using Ubuntu 19.10 on a
Raspberry Pi 4 with a USB3 1TB HDD.

I'll update you on the sort and uniq (from memory there were not that
many duplicates).
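
The kind of check meant here could look like the following sketch,
assuming GNU sort and enough scratch space for a file of that size:

  export LC_ALL=C
  wc -l latest-truthy.nt     # total triples (lines)
  # count distinct lines that occur more than once
  sort -S 2G -T /mnt/scratch latest-truthy.nt | uniq -d | wc -l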

Dick



Re: Apache Jena tdbloader performance and limits

Posted by Wolfgang Fahl <wf...@bitplan.com>.
Thank you Dick for your response.

> Basically, you need hardware!
That option is very limited with my budget; my current setup of 64
GByte servers with up to 12 cores, 4 TB 7200 rpm disks, and SSDs of up
to 512 GByte seems reasonable to me. I'd rather wait a bit longer than
pay for hardware, especially with the risk of things crashing anyway.

The splitting option you mention seems to be a lot of extra hassle, and
I assume it is based on the approach of "import all of Wikidata".
Currently I see that the hurdles for such a "full import" are very
high. For my use case I might be able to make do with some 3-5% of
Wikidata, since I am basically interested in what
https://www.wikidata.org/wiki/Wikidata:Scholia offers for the ConfIDent
project (https://projects.tib.eu/confident/).

What kind of tuning besides the hardware was effective for you?

Does anybody have experience with partial dumps created by
https://tools.wmflabs.org/wdumps/?

Cheers

  Wolfgang

-- 

BITPlan - smart solutions
Wolfgang Fahl
Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
Tel. +49 2154 811-480, Fax +49 2154 811-481
Web: http://www.bitplan.de
BITPlan GmbH, Willich - HRB 6820 Krefeld, Steuer-Nr.: 10258040548, Geschäftsführer: Wolfgang Fahl 



Re: Apache Jena tdbloader performance and limits

Posted by Dick Murray <da...@gmail.com>.
That's a blast from the past!

Not all of the details from that exchange are on the Jena list because
Laura and I took the conversation offline...

The short story is that I imported Wikidata in 3 days using an IBM
server with 24 cores, 512GB RAM and 16 1TB SSDs. Swap was configured as
a stripe across the 1TB SSDs. Any thrashing was absorbed by the 24
cores, i.e. there were plenty of cycles for the OS to do housekeeping,
and there was a lot of housekeeping!
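
As a sketch of what striped swap can look like on Linux: giving every
swap device the same priority makes the kernel spread swap pages across
them round-robin (the device names below are placeholders):

  for d in /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
    mkswap "$d"
    swapon -p 10 "$d"    # equal priority => pages striped across devices
  done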

Basically, you need hardware!

I managed to reduce this time to a day by performing 4 imports in parallel.
This was only possible because my server could absorb this amount of
throughput.
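
A sketch of that parallel variant, assuming the input has already been
split into four de-duplicated chunks and each TDB directory sits on its
own disk:

  for i in 1 2 3 4; do
    tdbloader --loc "/disk$i/tdb" "part$i.nt" > "load$i.log" 2>&1 &
  done
  wait    # block until all four loaders have finished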

Importing in parallel resulted in 4 TDBs, which were queried using a
beta Jena extension (known internally as Mosaic). This has its own
issues, such as the requirement to de-duplicate 4 streams of quads to
answer COUNT(...) actions, using Java streams. This led to further work
whereby preprocessing guaranteed that each quad was unique across the 4
TDBs, which meant the .distinct() could be skipped in the stream
processing.

About a year ago I performed the same test on a Ryzen 2950X-based
system, using the same disks plus 3 M.2 drives, and got similar results.

You also need to consider what bzip2 compression level was used.
Wikimedia uses bzip2 because of its aggressive compression, i.e. they
want to make the compressed file as small as possible.
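
One possible mitigation on the decompression side, assuming lbzip2 is
available: unlike plain bzip2 -d it can, as far as I know, decompress a
standard .bz2 file on several cores (the dump filename is just an
example):

  lbzip2 -d -n 8 latest-all.nt.bz2    # -n: number of worker threads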

