Posted to users@jena.apache.org by "Hoffart, Johannes" <Jo...@gs.com> on 2020/06/08 15:54:58 UTC

Resource requirements and configuration for loading a Wikidata dump

Hi,

I want to load the full Wikidata dump, available at https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 to use in Jena.

I tried it using tdb2.tdbloader with $JVM_ARGS set to -Xmx120G. Initially, the progress (measured by dataset size) is quick. It slows down very much after a couple of hundred GB have been written, and finally, at around 500GB, progress almost halts.
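
For reference, the invocation was essentially of this shape (a sketch; the
database location is a placeholder, and it is shown against the decompressed
file, since reading the .bz2 directly depends on the setup):

    export JVM_ARGS="-Xmx120G"
    tdb2.tdbloader --loc /data/wikidata-tdb2 latest-all.ttl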

Did anyone ingest Wikidata into Jena before? What are the system requirements? Is there a specific tdb2.tdbloader configuration that would speed things up? For example building an index after data ingest?

Thanks
Johannes

Johannes Hoffart, Executive Director, Technology Division
Goldman Sachs Bank Europe SE | Marienturm | Taunusanlage 9-10 | D-60329 Frankfurt am Main
Email: johannes.hoffart@gs.com<ma...@gs.com> | Tel: +49 (0)69 7532 3558
Vorstand: Dr. Wolfgang Fink (Vorsitzender) | Thomas Degn-Petersen | Dr. Matthias Bock
Vorsitzender des Aufsichtsrats: Dermot McDonogh
Sitz: Frankfurt am Main | Amtsgericht Frankfurt am Main HRB 114190


________________________________

Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>

Re: Resource requirements and configuration for loading a Wikidata dump

Posted by Marco Neumann <ma...@gmail.com>.
Wolfgang, here is another link (which I did not find in your link list
yet), this time on setting up Wikidata with Blazegraph in the Google Cloud
(GCE):

https://addshore.com/2019/10/your-own-wikidata-query-service-with-no-limits-part-1/


On Thu, Jun 11, 2020 at 7:14 AM Wolfgang Fahl <wf...@bitplan.com> wrote:

>
> Am 10.06.20 um 17:46 schrieb Marco Neumann:
> > Wolfang, I hear you and I've added a dataset today with 1 billion triples
> > and will continue to try to add larger datasets over time.
> > http://www.lotico.com/index.php/JENA_Loader_Benchmarks
> >
> > If you are only specifically interested in the wikidata dump loading
> > process for this thread there is some data available on the wikidata
> > mailing list as well (no data for Jena yet though). It took some users
> 10.2
> > days to load the full Wikidata RDF dump (wikidata-20190513-all-BETA.ttl,
> > 379G) with Blazegraph 2.1.5. and apparently 43 hours with a dev version
> of
> > Virtuoso.
> > https://lists.wikimedia.org/pipermail/wikidata/2019-June/013201.html
>
> Marco - thank you - i have added a section "Performance Reports" now to
> the wiki page "Get your own copy of WikiData"
>
> http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData#Performance_Reports
>
> I'd appreciate to get more reports and pointers to reports for
> successful WikiData dump imports.
>
> Wolfgang
>
> --
>
> BITPlan - smart solutions
> Wolfgang Fahl
> Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
> Tel. +49 2154 811-480, Fax +49 2154 811-481
> Web: http://www.bitplan.de
> BITPlan GmbH, Willich - HRB 6820 Krefeld, Steuer-Nr.: 10258040548,
> Geschäftsführer: Wolfgang Fahl
>
>
>

-- 


---
Marco Neumann
KONA

Re: Resource requirements and configuration for loading a Wikidata dump

Posted by Wolfgang Fahl <wf...@bitplan.com>.
Am 10.06.20 um 17:46 schrieb Marco Neumann:
> Wolfang, I hear you and I've added a dataset today with 1 billion triples
> and will continue to try to add larger datasets over time.
> http://www.lotico.com/index.php/JENA_Loader_Benchmarks
>
> If you are only specifically interested in the wikidata dump loading
> process for this thread there is some data available on the wikidata
> mailing list as well (no data for Jena yet though). It took some users 10.2
> days to load the full Wikidata RDF dump (wikidata-20190513-all-BETA.ttl,
> 379G) with Blazegraph 2.1.5. and apparently 43 hours with a dev version of
> Virtuoso.
> https://lists.wikimedia.org/pipermail/wikidata/2019-June/013201.html

Marco - thank you - I have now added a section "Performance Reports" to
the wiki page "Get your own copy of WikiData":
http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData#Performance_Reports

I'd appreciate more reports, and pointers to reports, of successful
WikiData dump imports.

Wolfgang

-- 

BITPlan - smart solutions
Wolfgang Fahl
Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
Tel. +49 2154 811-480, Fax +49 2154 811-481
Web: http://www.bitplan.de
BITPlan GmbH, Willich - HRB 6820 Krefeld, Steuer-Nr.: 10258040548, Geschäftsführer: Wolfgang Fahl 



Re: Resource requirements and configuration for loading a Wikidata dump

Posted by Marco Neumann <ma...@gmail.com>.
Exactly, Andy - thank you for the additional context. And as a matter of
fact, we already query and manipulate 150bn+ triples in a LOD cloud as
distributed sets every day.

But of course we frequently see practitioners in the community who look
at the Semantic Web, and Jena specifically, primarily as a database
technology, while not paying that much attention to the Web and to RDF /
SPARQL federation aspects.

That said, a lot of what we do here on the list with Jena is indeed geared
towards performance, optimization and features, and hence I will continue
to collect sample data for the lotico benchmarks page. The dataset we have
used so far in the benchmarking process simply hits a sweet spot in terms
of hardware requirements and the time it takes to run quick tests. The
tests have already given me valuable hints on how to scale out clusters
for other non-public data sets. BTW, if anyone has access to more powerful
hardware configurations, I'd be more than happy to test larger datasets
for benchmarking purposes and would include the results in the page :-).
And, as mentioned by Martynas, a page on the Jena project site might be a
good idea as well.

Wolfgang, I hear you, and I've added a dataset today with 1 billion
triples, and will continue to try to add larger datasets over time.
http://www.lotico.com/index.php/JENA_Loader_Benchmarks

If you are specifically interested in the Wikidata dump loading process
for this thread, there is some data available on the wikidata mailing
list as well (no data for Jena yet, though). It took some users 10.2 days
to load the full Wikidata RDF dump (wikidata-20190513-all-BETA.ttl, 379G)
with Blazegraph 2.1.5, and apparently 43 hours with a dev version of
Virtuoso.
https://lists.wikimedia.org/pipermail/wikidata/2019-June/013201.html

Marco



On Wed, Jun 10, 2020 at 9:39 AM Andy Seaborne <an...@apache.org> wrote:

>
>
> On 09/06/2020 12:18, Wolfgang Fahl wrote:
> > Marco
> >
> > thank you for sharing your results. Could you please try to make the
> > sample size 10 and 100 times bigger for the discussion we currently have
> > at hand. Getting to a billion triples has not been a problem for the
> > WikiData import. From 1-10 billion triples it gets tougher and
> > for >10 billion triples there is no success story yet that I know of.
> >
> > This brings us to the general question - what will we do in a few years
> > from now when we'd like to work with 100 billion triples or more and the
> > upcoming decades where we might see a rise in data size that stays
> > exponential?
>
> At several levels, the world is going parallel, both one "machine" (a
> computer is a distributed system) and datacenter wide.
>
> Scale comes from multiple machines. There is still mileage in larger
> single machine architectures and better software, but not long term.
>
> At another level - why have all the data in the same place? Convenience.
>
> Search engines are not a feature of WWW architecture. They are an
> emergent effect because it is convenient (simpler, easier) to have one
> place to find things - and that also makes it a winner-takes-all market.
>
> Convenience has limits. Search engine style does not work for all tasks,
> e.g. search within the enterprise for example, or indeed for data. And
> it has consequences in the clandestine data analysis and data abuse.
>
>      Andy
>
> > Wolfgang
> >
> >
> > Am 09.06.20 um 12:17 schrieb Marco Neumann:
> >> same here, I get the best performance on single iron with SSD and fast
> >> DDRAM. The datacenters in the cloud tend to be very selective and you
> can
> >> only get the fast dedicated hardware in a few locations in the cloud.
> >>
> >> http://www.lotico.com/index.php/JENA_Loader_Benchmarks
> >>
> >> In addition keep in mind these are not query benchmarks.
> >>
> >>   Marco
> >>
> >> On Tue, Jun 9, 2020 at 10:27 AM Andy Seaborne <an...@apache.org> wrote:
> >>
> >>> It maybe that SSD is the important factor.
> >>>
> >>> 1/ From a while ago, on truthy:
> >>>
> >>>
> >>>
> https://lists.apache.org/thread.html/70dde8e3d99ce3d69de613b5013c3f4c583d96161dec494ece49a412%40%3Cusers.jena.apache.org%3E
> >>>
> >>> before tdb2.tdbloader was a thing.
> >>>
> >>> 2/ I did some (not open) testing on a mere 800M and tdb2.tdbloader with
> >>> a Dell XPS laptop (2015 model, 16G RAM, 1T M.2 SSD) and a big AWS
> server
> >>> (local NVMe, but virtualized, SSD).
> >>>
> >>> The laptop was nearly as fast as a big AWS server.
> >>>
> >>> My assumption was that as the database grew, RAM caching become less
> >>> significant and the speed of I/O was dominant.
> >>>
> >>> FYI When "tdb2.tdbloader --loader=parallel" gets going it will saturate
> >>> the I/O.
> >>>
> >>> ----
> >>>
> >>> I don't have access to hardware (or ad hoc AWS machines) at the moment
> >>> otherwise I'd give this a try.
> >>>
> >>> Previously, downloading the data to AWS is much faster and much more
> >>> reliable than to my local setup. That said, I think
> dumps.wikimedia.org
> >>> does some rate limiting of downloads as well or my route to the site
> >>> ends up on a virtual T3 - I get the magic number of 5MBytes/s sustained
> >>> download speed a lot out of working hours.
> >>>
> >>>       Andy
> >>>
> >>> On 09/06/2020 08:04, Wolfgang Fahl wrote:
> >>>> Hi Johannes,
> >>>>
> >>>> thank you for bringing the issue to this mailinglist again.
> >>>>
> >>>> At
> >>>>
> >>>
> https://stackoverflow.com/questions/61813248/jena-tdbloader-performance-and-limits
> >>>> there is a question describing the issue and at
> >>>>
> >>>
> http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData#Test_with_Apache_Jena
> >>>> a documentation of my own attempts. There has been some feedback by a
> >>>> few people in the mean time but i have no report of a success yet.
> Also
> >>>> the only hints to achieve better performance are currently related to
> >>>> RAM and disk so using lots of RAM (up to 2 Terrrabyte) and SSDs (also
> >>>> some 2 Terrabyte) was mentioned. I asked at my local IT center and the
> >>>> machine with such RAM is around 30-60 thousand EUR and definitely out
> of
> >>>> my budget. I might invest in a 200 EUR 2 Terrabyte SSD if i could be
> >>>> sure that this would solve the problem. At this time i doubt it since
> >>>> the software keeps crashing on me and there seem to be bugs in
> Operating
> >>>> System, Java Virtual Machine and Jena itself that prevent the success
> as
> >>>> well as the severe degradation in performance for multi-billion triple
> >>>> imports that make it almost impossible to test given a estimated time
> of
> >>>> finish of half a year on (old but sophisticated) hardware that i am
> >>>> using daily.
> >>>>
> >>>> Cheers
> >>>>     Wolfgang
> >>>>
> >>>> Am 08.06.20 um 17:54 schrieb Hoffart, Johannes:
> >>>>> Hi,
> >>>>>
> >>>>> I want to load the full Wikidata dump, available at
> >>> https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2
> to
> >>> use in Jena.
> >>>>> I tried it using the tdb2.tdbloader with $JVM_ARGS set to -Xmx120G.
> >>> Initially, the progress (measured by dataset size) is quick. It slows
> down
> >>> very much after a couple of 100GB written, and finally, at around
> 500GB,
> >>> the progress is almost halted.
> >>>>> Did anyone ingest Wikidata into Jena before? What are the system
> >>> requirements? Is there a specific tdb2.tdbloader configuration that
> would
> >>> speed things up? For example building an index after data ingest?
> >>>>> Thanks
> >>>>> Johannes
> >>>>>
> >>>>> Johannes Hoffart, Executive Director, Technology Division
> >>>>> Goldman Sachs Bank Europe SE | Marienturm | Taunusanlage 9-10 |
> D-60329
> >>> Frankfurt am Main
> >>>>> Email: johannes.hoffart@gs.com<ma...@gs.com> |
> Tel:
> >>> +49 (0)69 7532 3558
> >>>>> Vorstand: Dr. Wolfgang Fink (Vorsitzender) | Thomas Degn-Petersen |
> Dr.
> >>> Matthias Bock
> >>>>> Vorsitzender des Aufsichtsrats: Dermot McDonogh
> >>>>> Sitz: Frankfurt am Main | Amtsgericht Frankfurt am Main HRB 114190
> >>>>>
> >>>>>
> >>>>> ________________________________
> >>>>>
> >>>>> Your Personal Data: We may collect and process information about you
> >>> that may be subject to data protection laws. For more information
> about how
> >>> we use and disclose your personal data, how we protect your
> information,
> >>> our legal basis to use your information, your rights and who you can
> >>> contact, please refer to: www.gs.com/privacy-notices<
> >>> http://www.gs.com/privacy-notices>
> >>
>


-- 


---
Marco Neumann
KONA

Re: Resource requirements and configuration for loading a Wikidata dump

Posted by Andy Seaborne <an...@apache.org>.

On 09/06/2020 12:18, Wolfgang Fahl wrote:
> Marco
> 
> thank you for sharing your results. Could you please try to make the
> sample size 10 and 100 times bigger for the discussion we currently have
> at hand. Getting to a billion triples has not been a problem for the
> WikiData import. From 1-10 billion triples it gets tougher and
> for >10 billion triples there is no success story yet that I know of.
> 
> This brings us to the general question - what will we do in a few years
> from now when we'd like to work with 100 billion triples or more and the
> upcoming decades where we might see a rise in data size that stays
> exponential?

At several levels, the world is going parallel, both within one "machine"
(a computer is a distributed system) and datacenter-wide.

Scale comes from multiple machines. There is still mileage in larger 
single machine architectures and better software, but not long term.

At another level - why have all the data in the same place? Convenience.

Search engines are not a feature of WWW architecture. They are an 
emergent effect because it is convenient (simpler, easier) to have one 
place to find things - and that also makes it a winner-takes-all market.

Convenience has limits. The search-engine style does not work for all
tasks - search within the enterprise, for example, or indeed for data. And
it has consequences in clandestine data analysis and data abuse.

     Andy

> Wolfgang
> 
> 
> Am 09.06.20 um 12:17 schrieb Marco Neumann:
>> same here, I get the best performance on single iron with SSD and fast
>> DDRAM. The datacenters in the cloud tend to be very selective and you can
>> only get the fast dedicated hardware in a few locations in the cloud.
>>
>> http://www.lotico.com/index.php/JENA_Loader_Benchmarks
>>
>> In addition keep in mind these are not query benchmarks.
>>
>>   Marco
>>
>> On Tue, Jun 9, 2020 at 10:27 AM Andy Seaborne <an...@apache.org> wrote:
>>
>>> It maybe that SSD is the important factor.
>>>
>>> 1/ From a while ago, on truthy:
>>>
>>>
>>> https://lists.apache.org/thread.html/70dde8e3d99ce3d69de613b5013c3f4c583d96161dec494ece49a412%40%3Cusers.jena.apache.org%3E
>>>
>>> before tdb2.tdbloader was a thing.
>>>
>>> 2/ I did some (not open) testing on a mere 800M and tdb2.tdbloader with
>>> a Dell XPS laptop (2015 model, 16G RAM, 1T M.2 SSD) and a big AWS server
>>> (local NVMe, but virtualized, SSD).
>>>
>>> The laptop was nearly as fast as a big AWS server.
>>>
>>> My assumption was that as the database grew, RAM caching become less
>>> significant and the speed of I/O was dominant.
>>>
>>> FYI When "tdb2.tdbloader --loader=parallel" gets going it will saturate
>>> the I/O.
>>>
>>> ----
>>>
>>> I don't have access to hardware (or ad hoc AWS machines) at the moment
>>> otherwise I'd give this a try.
>>>
>>> Previously, downloading the data to AWS is much faster and much more
>>> reliable than to my local setup. That said, I think dumps.wikimedia.org
>>> does some rate limiting of downloads as well or my route to the site
>>> ends up on a virtual T3 - I get the magic number of 5MBytes/s sustained
>>> download speed a lot out of working hours.
>>>
>>>       Andy
>>>
>>> On 09/06/2020 08:04, Wolfgang Fahl wrote:
>>>> Hi Johannes,
>>>>
>>>> thank you for bringing the issue to this mailinglist again.
>>>>
>>>> At
>>>>
>>> https://stackoverflow.com/questions/61813248/jena-tdbloader-performance-and-limits
>>>> there is a question describing the issue and at
>>>>
>>> http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData#Test_with_Apache_Jena
>>>> a documentation of my own attempts. There has been some feedback by a
>>>> few people in the mean time but i have no report of a success yet. Also
>>>> the only hints to achieve better performance are currently related to
>>>> RAM and disk so using lots of RAM (up to 2 Terrrabyte) and SSDs (also
>>>> some 2 Terrabyte) was mentioned. I asked at my local IT center and the
>>>> machine with such RAM is around 30-60 thousand EUR and definitely out of
>>>> my budget. I might invest in a 200 EUR 2 Terrabyte SSD if i could be
>>>> sure that this would solve the problem. At this time i doubt it since
>>>> the software keeps crashing on me and there seem to be bugs in Operating
>>>> System, Java Virtual Machine and Jena itself that prevent the success as
>>>> well as the severe degradation in performance for multi-billion triple
>>>> imports that make it almost impossible to test given a estimated time of
>>>> finish of half a year on (old but sophisticated) hardware that i am
>>>> using daily.
>>>>
>>>> Cheers
>>>>     Wolfgang
>>>>
>>>> Am 08.06.20 um 17:54 schrieb Hoffart, Johannes:
>>>>> Hi,
>>>>>
>>>>> I want to load the full Wikidata dump, available at
>>> https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 to
>>> use in Jena.
>>>>> I tried it using the tdb2.tdbloader with $JVM_ARGS set to -Xmx120G.
>>> Initially, the progress (measured by dataset size) is quick. It slows down
>>> very much after a couple of 100GB written, and finally, at around 500GB,
>>> the progress is almost halted.
>>>>> Did anyone ingest Wikidata into Jena before? What are the system
>>> requirements? Is there a specific tdb2.tdbloader configuration that would
>>> speed things up? For example building an index after data ingest?
>>>>> Thanks
>>>>> Johannes
>>>>>
>>>>> Johannes Hoffart, Executive Director, Technology Division
>>>>> Goldman Sachs Bank Europe SE | Marienturm | Taunusanlage 9-10 | D-60329
>>> Frankfurt am Main
>>>>> Email: johannes.hoffart@gs.com<ma...@gs.com> | Tel:
>>> +49 (0)69 7532 3558
>>>>> Vorstand: Dr. Wolfgang Fink (Vorsitzender) | Thomas Degn-Petersen | Dr.
>>> Matthias Bock
>>>>> Vorsitzender des Aufsichtsrats: Dermot McDonogh
>>>>> Sitz: Frankfurt am Main | Amtsgericht Frankfurt am Main HRB 114190
>>>>>
>>>>>
>>>>> ________________________________
>>>>>
>>>>> Your Personal Data: We may collect and process information about you
>>> that may be subject to data protection laws. For more information about how
>>> we use and disclose your personal data, how we protect your information,
>>> our legal basis to use your information, your rights and who you can
>>> contact, please refer to: www.gs.com/privacy-notices<
>>> http://www.gs.com/privacy-notices>
>>

Re: Resource requirements and configuration for loading a Wikidata dump

Posted by Wolfgang Fahl <wf...@bitplan.com>.
Marco

thank you for sharing your results. Could you please try to make the
sample size 10 and 100 times bigger for the discussion we currently have
at hand? Getting to a billion triples has not been a problem for the
WikiData import. From 1-10 billion triples it gets tougher, and
for >10 billion triples there is no success story yet that I know of.

This brings us to the general question - what will we do a few years from
now, when we'd like to work with 100 billion triples or more, and in the
upcoming decades, where we might see data sizes that keep growing
exponentially?

Wolfgang


Am 09.06.20 um 12:17 schrieb Marco Neumann:
> same here, I get the best performance on single iron with SSD and fast
> DDRAM. The datacenters in the cloud tend to be very selective and you can
> only get the fast dedicated hardware in a few locations in the cloud.
>
> http://www.lotico.com/index.php/JENA_Loader_Benchmarks
>
> In addition keep in mind these are not query benchmarks.
>
>  Marco
>
> On Tue, Jun 9, 2020 at 10:27 AM Andy Seaborne <an...@apache.org> wrote:
>
>> It maybe that SSD is the important factor.
>>
>> 1/ From a while ago, on truthy:
>>
>>
>> https://lists.apache.org/thread.html/70dde8e3d99ce3d69de613b5013c3f4c583d96161dec494ece49a412%40%3Cusers.jena.apache.org%3E
>>
>> before tdb2.tdbloader was a thing.
>>
>> 2/ I did some (not open) testing on a mere 800M and tdb2.tdbloader with
>> a Dell XPS laptop (2015 model, 16G RAM, 1T M.2 SSD) and a big AWS server
>> (local NVMe, but virtualized, SSD).
>>
>> The laptop was nearly as fast as a big AWS server.
>>
>> My assumption was that as the database grew, RAM caching become less
>> significant and the speed of I/O was dominant.
>>
>> FYI When "tdb2.tdbloader --loader=parallel" gets going it will saturate
>> the I/O.
>>
>> ----
>>
>> I don't have access to hardware (or ad hoc AWS machines) at the moment
>> otherwise I'd give this a try.
>>
>> Previously, downloading the data to AWS is much faster and much more
>> reliable than to my local setup. That said, I think dumps.wikimedia.org
>> does some rate limiting of downloads as well or my route to the site
>> ends up on a virtual T3 - I get the magic number of 5MBytes/s sustained
>> download speed a lot out of working hours.
>>
>>      Andy
>>
>> On 09/06/2020 08:04, Wolfgang Fahl wrote:
>>> Hi Johannes,
>>>
>>> thank you for bringing the issue to this mailinglist again.
>>>
>>> At
>>>
>> https://stackoverflow.com/questions/61813248/jena-tdbloader-performance-and-limits
>>> there is a question describing the issue and at
>>>
>> http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData#Test_with_Apache_Jena
>>> a documentation of my own attempts. There has been some feedback by a
>>> few people in the mean time but i have no report of a success yet. Also
>>> the only hints to achieve better performance are currently related to
>>> RAM and disk so using lots of RAM (up to 2 Terrrabyte) and SSDs (also
>>> some 2 Terrabyte) was mentioned. I asked at my local IT center and the
>>> machine with such RAM is around 30-60 thousand EUR and definitely out of
>>> my budget. I might invest in a 200 EUR 2 Terrabyte SSD if i could be
>>> sure that this would solve the problem. At this time i doubt it since
>>> the software keeps crashing on me and there seem to be bugs in Operating
>>> System, Java Virtual Machine and Jena itself that prevent the success as
>>> well as the severe degradation in performance for multi-billion triple
>>> imports that make it almost impossible to test given a estimated time of
>>> finish of half a year on (old but sophisticated) hardware that i am
>>> using daily.
>>>
>>> Cheers
>>>    Wolfgang
>>>
>>> Am 08.06.20 um 17:54 schrieb Hoffart, Johannes:
>>>> Hi,
>>>>
>>>> I want to load the full Wikidata dump, available at
>> https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 to
>> use in Jena.
>>>> I tried it using the tdb2.tdbloader with $JVM_ARGS set to -Xmx120G.
>> Initially, the progress (measured by dataset size) is quick. It slows down
>> very much after a couple of 100GB written, and finally, at around 500GB,
>> the progress is almost halted.
>>>> Did anyone ingest Wikidata into Jena before? What are the system
>> requirements? Is there a specific tdb2.tdbloader configuration that would
>> speed things up? For example building an index after data ingest?
>>>> Thanks
>>>> Johannes
>>>>
>>>> Johannes Hoffart, Executive Director, Technology Division
>>>> Goldman Sachs Bank Europe SE | Marienturm | Taunusanlage 9-10 | D-60329
>> Frankfurt am Main
>>>> Email: johannes.hoffart@gs.com<ma...@gs.com> | Tel:
>> +49 (0)69 7532 3558
>>>> Vorstand: Dr. Wolfgang Fink (Vorsitzender) | Thomas Degn-Petersen | Dr.
>> Matthias Bock
>>>> Vorsitzender des Aufsichtsrats: Dermot McDonogh
>>>> Sitz: Frankfurt am Main | Amtsgericht Frankfurt am Main HRB 114190
>>>>
>>>>
>>>> ________________________________
>>>>
>>>> Your Personal Data: We may collect and process information about you
>> that may be subject to data protection laws. For more information about how
>> we use and disclose your personal data, how we protect your information,
>> our legal basis to use your information, your rights and who you can
>> contact, please refer to: www.gs.com/privacy-notices<
>> http://www.gs.com/privacy-notices>
>
-- 

BITPlan - smart solutions
Wolfgang Fahl
Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
Tel. +49 2154 811-480, Fax +49 2154 811-481
Web: http://www.bitplan.de
BITPlan GmbH, Willich - HRB 6820 Krefeld, Steuer-Nr.: 10258040548, Geschäftsführer: Wolfgang Fahl 



Re: Resource requirements and configuration for loading a Wikidata dump

Posted by Marco Neumann <ma...@gmail.com>.
Same here, I get the best performance on single iron with SSD and fast
DDR RAM. The datacenters in the cloud tend to be very selective, and you
can only get the fast dedicated hardware in a few locations in the cloud.

http://www.lotico.com/index.php/JENA_Loader_Benchmarks

In addition, keep in mind that these are not query benchmarks.

 Marco

On Tue, Jun 9, 2020 at 10:27 AM Andy Seaborne <an...@apache.org> wrote:

> It maybe that SSD is the important factor.
>
> 1/ From a while ago, on truthy:
>
>
> https://lists.apache.org/thread.html/70dde8e3d99ce3d69de613b5013c3f4c583d96161dec494ece49a412%40%3Cusers.jena.apache.org%3E
>
> before tdb2.tdbloader was a thing.
>
> 2/ I did some (not open) testing on a mere 800M and tdb2.tdbloader with
> a Dell XPS laptop (2015 model, 16G RAM, 1T M.2 SSD) and a big AWS server
> (local NVMe, but virtualized, SSD).
>
> The laptop was nearly as fast as a big AWS server.
>
> My assumption was that as the database grew, RAM caching become less
> significant and the speed of I/O was dominant.
>
> FYI When "tdb2.tdbloader --loader=parallel" gets going it will saturate
> the I/O.
>
> ----
>
> I don't have access to hardware (or ad hoc AWS machines) at the moment
> otherwise I'd give this a try.
>
> Previously, downloading the data to AWS is much faster and much more
> reliable than to my local setup. That said, I think dumps.wikimedia.org
> does some rate limiting of downloads as well or my route to the site
> ends up on a virtual T3 - I get the magic number of 5MBytes/s sustained
> download speed a lot out of working hours.
>
>      Andy
>
> On 09/06/2020 08:04, Wolfgang Fahl wrote:
> > Hi Johannes,
> >
> > thank you for bringing the issue to this mailinglist again.
> >
> > At
> >
> https://stackoverflow.com/questions/61813248/jena-tdbloader-performance-and-limits
> > there is a question describing the issue and at
> >
> http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData#Test_with_Apache_Jena
> > a documentation of my own attempts. There has been some feedback by a
> > few people in the mean time but i have no report of a success yet. Also
> > the only hints to achieve better performance are currently related to
> > RAM and disk so using lots of RAM (up to 2 Terrrabyte) and SSDs (also
> > some 2 Terrabyte) was mentioned. I asked at my local IT center and the
> > machine with such RAM is around 30-60 thousand EUR and definitely out of
> > my budget. I might invest in a 200 EUR 2 Terrabyte SSD if i could be
> > sure that this would solve the problem. At this time i doubt it since
> > the software keeps crashing on me and there seem to be bugs in Operating
> > System, Java Virtual Machine and Jena itself that prevent the success as
> > well as the severe degradation in performance for multi-billion triple
> > imports that make it almost impossible to test given a estimated time of
> > finish of half a year on (old but sophisticated) hardware that i am
> > using daily.
> >
> > Cheers
> >    Wolfgang
> >
> > Am 08.06.20 um 17:54 schrieb Hoffart, Johannes:
> >> Hi,
> >>
> >> I want to load the full Wikidata dump, available at
> https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 to
> use in Jena.
> >>
> >> I tried it using the tdb2.tdbloader with $JVM_ARGS set to -Xmx120G.
> Initially, the progress (measured by dataset size) is quick. It slows down
> very much after a couple of 100GB written, and finally, at around 500GB,
> the progress is almost halted.
> >>
> >> Did anyone ingest Wikidata into Jena before? What are the system
> requirements? Is there a specific tdb2.tdbloader configuration that would
> speed things up? For example building an index after data ingest?
> >>
> >> Thanks
> >> Johannes
> >>
> >> Johannes Hoffart, Executive Director, Technology Division
> >> Goldman Sachs Bank Europe SE | Marienturm | Taunusanlage 9-10 | D-60329
> Frankfurt am Main
> >> Email: johannes.hoffart@gs.com<ma...@gs.com> | Tel:
> +49 (0)69 7532 3558
> >> Vorstand: Dr. Wolfgang Fink (Vorsitzender) | Thomas Degn-Petersen | Dr.
> Matthias Bock
> >> Vorsitzender des Aufsichtsrats: Dermot McDonogh
> >> Sitz: Frankfurt am Main | Amtsgericht Frankfurt am Main HRB 114190
> >>
> >>
> >> ________________________________
> >>
> >> Your Personal Data: We may collect and process information about you
> that may be subject to data protection laws. For more information about how
> we use and disclose your personal data, how we protect your information,
> our legal basis to use your information, your rights and who you can
> contact, please refer to: www.gs.com/privacy-notices<
> http://www.gs.com/privacy-notices>
> >>
>


-- 


---
Marco Neumann
KONA

Re: Resource requirements and configuration for loading a Wikidata dump

Posted by Andy Seaborne <an...@apache.org>.
It may be that SSD is the important factor.

1/ From a while ago, on truthy:

https://lists.apache.org/thread.html/70dde8e3d99ce3d69de613b5013c3f4c583d96161dec494ece49a412%40%3Cusers.jena.apache.org%3E

before tdb2.tdbloader was a thing.

2/ I did some (not open) testing on a mere 800M triples with
tdb2.tdbloader, on a Dell XPS laptop (2015 model, 16G RAM, 1T M.2 SSD) and
on a big AWS server (local NVMe, but virtualized, SSD).

The laptop was nearly as fast as a big AWS server.

My assumption was that as the database grew, RAM caching became less
significant and the speed of I/O became dominant.

FYI When "tdb2.tdbloader --loader=parallel" gets going it will saturate 
the I/O.
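
One way to see that while the load runs (a sketch; assumes the sysstat
tools are installed):

    iostat -xm 5     # per-device utilisation and MB/s, refreshed every 5s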

----

I don't have access to hardware (or ad hoc AWS machines) at the moment 
otherwise I'd give this a try.

Previously, downloading the data to AWS was much faster and much more
reliable than to my local setup. That said, I think dumps.wikimedia.org
does some rate limiting of downloads as well, or my route to the site
ends up on a virtual T3 - I get the magic number of 5 MBytes/s sustained
download speed a lot of the time outside working hours.
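
If a download keeps breaking, a resumable fetch is one option (a sketch):

    wget -c https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2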

     Andy

On 09/06/2020 08:04, Wolfgang Fahl wrote:
> Hi Johannes,
> 
> thank you for bringing the issue to this mailinglist again.
> 
> At
> https://stackoverflow.com/questions/61813248/jena-tdbloader-performance-and-limits
> there is a question describing the issue and at
> http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData#Test_with_Apache_Jena
> a documentation of my own attempts. There has been some feedback by a
> few people in the mean time but i have no report of a success yet. Also
> the only hints to achieve better performance are currently related to
> RAM and disk so using lots of RAM (up to 2 Terrrabyte) and SSDs (also
> some 2 Terrabyte) was mentioned. I asked at my local IT center and the
> machine with such RAM is around 30-60 thousand EUR and definitely out of
> my budget. I might invest in a 200 EUR 2 Terrabyte SSD if i could be
> sure that this would solve the problem. At this time i doubt it since
> the software keeps crashing on me and there seem to be bugs in Operating
> System, Java Virtual Machine and Jena itself that prevent the success as
> well as the severe degradation in performance for multi-billion triple
> imports that make it almost impossible to test given a estimated time of
> finish of half a year on (old but sophisticated) hardware that i am
> using daily.
> 
> Cheers
>    Wolfgang
> 
> Am 08.06.20 um 17:54 schrieb Hoffart, Johannes:
>> Hi,
>>
>> I want to load the full Wikidata dump, available at https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 to use in Jena.
>>
>> I tried it using the tdb2.tdbloader with $JVM_ARGS set to -Xmx120G. Initially, the progress (measured by dataset size) is quick. It slows down very much after a couple of 100GB written, and finally, at around 500GB, the progress is almost halted.
>>
>> Did anyone ingest Wikidata into Jena before? What are the system requirements? Is there a specific tdb2.tdbloader configuration that would speed things up? For example building an index after data ingest?
>>
>> Thanks
>> Johannes
>>
>> Johannes Hoffart, Executive Director, Technology Division
>> Goldman Sachs Bank Europe SE | Marienturm | Taunusanlage 9-10 | D-60329 Frankfurt am Main
>> Email: johannes.hoffart@gs.com<ma...@gs.com> | Tel: +49 (0)69 7532 3558
>> Vorstand: Dr. Wolfgang Fink (Vorsitzender) | Thomas Degn-Petersen | Dr. Matthias Bock
>> Vorsitzender des Aufsichtsrats: Dermot McDonogh
>> Sitz: Frankfurt am Main | Amtsgericht Frankfurt am Main HRB 114190
>>
>>
>> ________________________________
>>
>> Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>
>>

Re: Resource requirements and configuration for loading a Wikidata dump

Posted by Wolfgang Fahl <wf...@bitplan.com>.
Hi Johannes,

thank you for bringing the issue to this mailing list again.

At
https://stackoverflow.com/questions/61813248/jena-tdbloader-performance-and-limits
there is a question describing the issue, and at
http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData#Test_with_Apache_Jena
there is documentation of my own attempts. There has been some feedback
from a few people in the meantime, but I have no report of a success yet.
Also, the only hints for achieving better performance are currently
related to RAM and disk, so using lots of RAM (up to 2 terabytes) and SSDs
(also some 2 terabytes) was mentioned. I asked at my local IT center, and
a machine with that much RAM costs around 30-60 thousand EUR and is
definitely out of my budget. I might invest in a 200 EUR 2 terabyte SSD if
I could be sure that this would solve the problem. At this time I doubt
it, since the software keeps crashing on me, and there seem to be bugs in
the operating system, the Java Virtual Machine and Jena itself that
prevent success, as well as severe degradation in performance for
multi-billion-triple imports, which makes it almost impossible to test
given an estimated time to finish of half a year on the (old but
sophisticated) hardware that I am using daily.

Cheers
  Wolfgang

Am 08.06.20 um 17:54 schrieb Hoffart, Johannes:
> Hi,
>
> I want to load the full Wikidata dump, available at https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 to use in Jena.
>
> I tried it using the tdb2.tdbloader with $JVM_ARGS set to -Xmx120G. Initially, the progress (measured by dataset size) is quick. It slows down very much after a couple of 100GB written, and finally, at around 500GB, the progress is almost halted.
>
> Did anyone ingest Wikidata into Jena before? What are the system requirements? Is there a specific tdb2.tdbloader configuration that would speed things up? For example building an index after data ingest?
>
> Thanks
> Johannes
>
> Johannes Hoffart, Executive Director, Technology Division
> Goldman Sachs Bank Europe SE | Marienturm | Taunusanlage 9-10 | D-60329 Frankfurt am Main
> Email: johannes.hoffart@gs.com<ma...@gs.com> | Tel: +49 (0)69 7532 3558
> Vorstand: Dr. Wolfgang Fink (Vorsitzender) | Thomas Degn-Petersen | Dr. Matthias Bock
> Vorsitzender des Aufsichtsrats: Dermot McDonogh
> Sitz: Frankfurt am Main | Amtsgericht Frankfurt am Main HRB 114190
>
>
> ________________________________
>
> Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>
>
-- 

BITPlan - smart solutions
Wolfgang Fahl
Pater-Delp-Str. 1, D-47877 Willich Schiefbahn
Tel. +49 2154 811-480, Fax +49 2154 811-481
Web: http://www.bitplan.de
BITPlan GmbH, Willich - HRB 6820 Krefeld, Steuer-Nr.: 10258040548, Geschäftsführer: Wolfgang Fahl 



Re: Resource requirements and configuration for loading a Wikidata dump

Posted by Ahmed El-Roby <ah...@gmail.com>.
Thanks, Johannes, for starting this thread. I am facing the exact same
problem with TDB2. For any significantly large file, for that matter, it
takes forever to load. I hope this problem has a solution.
Thank you.
-Ahmed

On Mon, Jun 8, 2020 at 11:55 AM Hoffart, Johannes <Jo...@gs.com>
wrote:

> Hi,
>
> I want to load the full Wikidata dump, available at
> https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 to
> use in Jena.
>
> I tried it using the tdb2.tdbloader with $JVM_ARGS set to -Xmx120G.
> Initially, the progress (measured by dataset size) is quick. It slows down
> very much after a couple of 100GB written, and finally, at around 500GB,
> the progress is almost halted.
>
> Did anyone ingest Wikidata into Jena before? What are the system
> requirements? Is there a specific tdb2.tdbloader configuration that would
> speed things up? For example building an index after data ingest?
>
> Thanks
> Johannes
>
> Johannes Hoffart, Executive Director, Technology Division
> Goldman Sachs Bank Europe SE | Marienturm | Taunusanlage 9-10 | D-60329
> Frankfurt am Main
> Email: johannes.hoffart@gs.com<ma...@gs.com> | Tel: +49
> (0)69 7532 3558
> Vorstand: Dr. Wolfgang Fink (Vorsitzender) | Thomas Degn-Petersen | Dr.
> Matthias Bock
> Vorsitzender des Aufsichtsrats: Dermot McDonogh
> Sitz: Frankfurt am Main | Amtsgericht Frankfurt am Main HRB 114190
>
>
> ________________________________
>
> Your Personal Data: We may collect and process information about you that
> may be subject to data protection laws. For more information about how we
> use and disclose your personal data, how we protect your information, our
> legal basis to use your information, your rights and who you can contact,
> please refer to: www.gs.com/privacy-notices<
> http://www.gs.com/privacy-notices>
>

Re: Resource requirements and configuration for loading a Wikidata dump

Posted by Martynas Jusevičius <ma...@atomgraph.com>.
Wouldn't it be a good idea to have a page in the Fuseki/TDB2
documentation with benchmark results and/or user-reported loading
statistics, including hardware specs?

It would also be useful to map such specs to the AWS instance types:
https://aws.amazon.com/ec2/instance-types/

On Mon, Jun 8, 2020 at 11:43 PM Andy Seaborne <an...@apache.org> wrote:
>
> Hi Johannes,
>
> On 08/06/2020 16:54, Hoffart, Johannes wrote:
> > Hi,
> >
> > I want to load the full Wikidata dump, available at https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 to use in Jena.
> >
> > I tried it using the tdb2.tdbloader with $JVM_ARGS set to -Xmx120G. Initially, the progress (measured by dataset size) is quick. It slows down very much after a couple of 100GB written, and finally, at around 500GB, the progress is almost halted.
>
> Loading performance is sensitive to the hardware used.  Large RAM, high
> performance SSD.
>
> Setting the heap size larger actually slows the process down. The
> database indexes are cached outside the heap in the main OS filesystem
> case so a cache size of 120G is taking space away from that space.
> A heap size of ~8G should be more than enough.
>
> The other factor is the storage. A large SSD, and best of an M.2
> connected local SSD, is significantly faster.
>
> It can be worthwhile to build the database on a machine spec'ed for
> loading and move it elsewhere for query use. The database, once built,
> can be file-copied.
>
> It will take many hours to load under optimal conditions - it has been
> reported it takes over an hour just to count the lines in the
> latest-all.ttl.bz2 file using the standard unix tools (no java in
> sight!). I'm trying to just parse the file and the parser is taking
> hours. There are ea lot of warnings (you can ignore them - they are just
> warnings, not errors).
>
> latest-truthy is a significantly smaller. Getting the process working
> (it's only in NT format but you can just load the prefixes taken from
> the TTL version separately)
>
> And check the download of any of these large files - I have had it
> truncate in one attempt I made.
>
>      Andy
>
> > Did anyone ingest Wikidata into Jena before? What are the system requirements? Is there a specific tdb2.tdbloader configuration that would speed things up? For example building an index after data ingest?
>
> tdb2.tdbloader has options for loader algorithm. --loader=parallel is
> probably fastest if you have the SSD space.
>
> >
> > Thanks
> > Johannes
> >
> > Johannes Hoffart, Executive Director, Technology Division
> > Goldman Sachs Bank Europe SE | Marienturm | Taunusanlage 9-10 | D-60329 Frankfurt am Main
> > Email: johannes.hoffart@gs.com<ma...@gs.com> | Tel: +49 (0)69 7532 3558
> > Vorstand: Dr. Wolfgang Fink (Vorsitzender) | Thomas Degn-Petersen | Dr. Matthias Bock
> > Vorsitzender des Aufsichtsrats: Dermot McDonogh
> > Sitz: Frankfurt am Main | Amtsgericht Frankfurt am Main HRB 114190
> >
> >
> > ________________________________
> >
> > Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>
> >

Re: Resource requirements and configuration for loading a Wikidata dump

Posted by Andy Seaborne <an...@apache.org>.
Hi Johannes,

On 08/06/2020 16:54, Hoffart, Johannes wrote:
> Hi,
> 
> I want to load the full Wikidata dump, available at https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.bz2 to use in Jena.
> 
> I tried it using the tdb2.tdbloader with $JVM_ARGS set to -Xmx120G. Initially, the progress (measured by dataset size) is quick. It slows down very much after a couple of 100GB written, and finally, at around 500GB, the progress is almost halted.

Loading performance is sensitive to the hardware used: large RAM and a
high-performance SSD.

Setting the heap size larger actually slows the process down. The
database indexes are cached outside the heap, in the OS filesystem cache,
so a 120G heap takes space away from that cache.
A heap size of ~8G should be more than enough.
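
A sketch of that setting, picked up by the Jena command-line scripts:

    export JVM_ARGS="-Xmx8G"   # leave the remaining RAM to the OS filesystem cache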

The other factor is the storage. A large SSD, and best of all an
M.2-connected local SSD, is significantly faster.

It can be worthwhile to build the database on a machine spec'ed for 
loading and move it elsewhere for query use. The database, once built, 
can be file-copied.
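
A sketch of moving a finished database (paths and host are placeholders;
copy only while nothing has the database open):

    rsync -a /data/wikidata-tdb2/ queryhost:/data/wikidata-tdb2/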

It will take many hours to load under optimal conditions - it has been
reported that it takes over an hour just to count the lines in the
latest-all.ttl.bz2 file using the standard unix tools (no Java in
sight!). I'm trying to just parse the file, and the parser is taking
hours. There are a lot of warnings (you can ignore them - they are just
warnings, not errors).
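
For the line count and a parse-only check, something along these lines (a
sketch; riot reading .bz2 directly is an assumption about the setup -
otherwise decompress first):

    bzcat latest-all.ttl.bz2 | wc -l        # line count, standard unix tools
    riot --validate latest-all.ttl.bz2      # parse only; warnings go to stderr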

latest-truthy is significantly smaller and a good way of getting the
process working (it's only in NT format, but you can just load the
prefixes, taken from the TTL version, separately).
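
A sketch of that approach (prefixes.ttl is a hypothetical small file
holding just the @prefix lines copied from the TTL dump):

    tdb2.tdbloader --loc /data/wikidata-tdb2 prefixes.ttl latest-truthy.nt.gz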

And check the download of any of these large files - I have had it 
truncate in one attempt I made.
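
A quick integrity check on the compressed file (a sketch; a truncated
download fails the test):

    bzip2 -tv latest-all.ttl.bz2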

     Andy

> Did anyone ingest Wikidata into Jena before? What are the system requirements? Is there a specific tdb2.tdbloader configuration that would speed things up? For example building an index after data ingest?

tdb2.tdbloader has options for the loader algorithm; --loader=parallel is
probably the fastest if you have the SSD space.
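
Putting the pieces together, a sketch of the kind of invocation this
suggests (paths are placeholders):

    export JVM_ARGS="-Xmx8G"
    tdb2.tdbloader --loader=parallel --loc /ssd/wikidata-tdb2 latest-all.ttl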

> 
> Thanks
> Johannes
> 
> Johannes Hoffart, Executive Director, Technology Division
> Goldman Sachs Bank Europe SE | Marienturm | Taunusanlage 9-10 | D-60329 Frankfurt am Main
> Email: johannes.hoffart@gs.com<ma...@gs.com> | Tel: +49 (0)69 7532 3558
> Vorstand: Dr. Wolfgang Fink (Vorsitzender) | Thomas Degn-Petersen | Dr. Matthias Bock
> Vorsitzender des Aufsichtsrats: Dermot McDonogh
> Sitz: Frankfurt am Main | Amtsgericht Frankfurt am Main HRB 114190
> 
> 
> ________________________________
> 
> Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>
> 

RE: Resource requirements and configuration for loading a Wikidata dump

Posted by "Hoffart, Johannes" <Jo...@gs.com>.
Hi Andy,

Thanks for the helpful pointers from you and others.

I will change the heap settings to see if this at least allows the process to finish. For reference, the machine has 128GB of main memory and a regular HDD attached.

I also changed the logging settings to see the progress (it would be nice to have this enabled by default).
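
For anyone looking for the same: a sketch, assuming a Jena 3.x
command-line setup that uses log4j 1.x (newer releases use log4j2, so the
file format differs there); the file name is arbitrary:

    # log4j.properties - INFO level shows the loader's progress messages
    log4j.rootLogger=INFO, stdlog
    log4j.appender.stdlog=org.apache.log4j.ConsoleAppender
    log4j.appender.stdlog.layout=org.apache.log4j.PatternLayout
    log4j.appender.stdlog.layout.ConversionPattern=%d{HH:mm:ss} %-5p %c{1} :: %m%n

    export JVM_ARGS="-Xmx8G -Dlog4j.configuration=file:log4j.properties"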

Thanks
Johannes

-----Original Message-----
From: Andy Seaborne <an...@apache.org>
Sent: Monday, June 8, 2020 11:43 PM
To: users@jena.apache.org
Subject: Re: Resource requirements and configuration for loading a Wikidata dump

Hi Johannes,

On 08/06/2020 16:54, Hoffart, Johannes wrote:
> Hi,
>
> I want to load the full Wikidata dump, available at https://urldefense.proofpoint.com/v2/url?u=https-3A__dumps.wikimedia.org_wikidatawiki_entities_latest-2Dall.ttl.bz2&d=DwIC-g&c=7563p3e2zaQw0AB1wrFVgyagb2IE5rTZOYPxLxfZlX4&r=xf6--uwdcCl8ABKwQSkT2uFj8PgnlEqThex0udypM28&m=GvGO8rPB3XdHz-_iF_4fClXgvmdy_32YrUUTUvRRxQQ&s=x5WddEwbXWPtCCFiaZ2ytRIxJIRL_kIvxtOIkOcsNzg&e=  to use in Jena.
>
> I tried it using the tdb2.tdbloader with $JVM_ARGS set to -Xmx120G. Initially, the progress (measured by dataset size) is quick. It slows down very much after a couple of 100GB written, and finally, at around 500GB, the progress is almost halted.

Loading performance is sensitive to the hardware used.  Large RAM, high performance SSD.

Setting the heap size larger actually slows the process down. The database indexes are cached outside the heap in the main OS filesystem case so a cache size of 120G is taking space away from that space.
A heap size of ~8G should be more than enough.

The other factor is the storage. A large SSD, and best of an M.2 connected local SSD, is significantly faster.

It can be worthwhile to build the database on a machine spec'ed for loading and move it elsewhere for query use. The database, once built, can be file-copied.

It will take many hours to load under optimal conditions - it has been reported it takes over an hour just to count the lines in the
latest-all.ttl.bz2 file using the standard unix tools (no java in sight!). I'm trying to just parse the file and the parser is taking hours. There are ea lot of warnings (you can ignore them - they are just warnings, not errors).

latest-truthy is a significantly smaller. Getting the process working (it's only in NT format but you can just load the prefixes taken from the TTL version separately)

And check the download of any of these large files - I have had it truncate in one attempt I made.

     Andy

> Did anyone ingest Wikidata into Jena before? What are the system requirements? Is there a specific tdb2.tdbloader configuration that would speed things up? For example building an index after data ingest?

tdb2.tdbloader has options for loader algorithm. --loader=parallel is probably fastest if you have the SSD space.

>
> Thanks
> Johannes
>
> Johannes Hoffart, Executive Director, Technology Division Goldman
> Sachs Bank Europe SE | Marienturm | Taunusanlage 9-10 | D-60329
> Frankfurt am Main
> Email: johannes.hoffart@gs.com<ma...@gs.com> | Tel:
> +49 (0)69 7532 3558
> Vorstand: Dr. Wolfgang Fink (Vorsitzender) | Thomas Degn-Petersen |
> Dr. Matthias Bock Vorsitzender des Aufsichtsrats: Dermot McDonogh
> Sitz: Frankfurt am Main | Amtsgericht Frankfurt am Main HRB 114190
>
>
> ________________________________
>
> Your Personal Data: We may collect and process information about you
> that may be subject to data protection laws. For more information
> about how we use and disclose your personal data, how we protect your
> information, our legal basis to use your information, your rights and
> who you can contact, please refer to:
> http://www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>
>

________________________________

Your Personal Data: We may collect and process information about you that may be subject to data protection laws. For more information about how we use and disclose your personal data, how we protect your information, our legal basis to use your information, your rights and who you can contact, please refer to: www.gs.com/privacy-notices<http://www.gs.com/privacy-notices>
