Posted to dev@stanbol.apache.org by Alessandro Adamou <ad...@cs.unibo.it> on 2013/07/29 13:49:52 UTC

Indexing DBPedia 3.8 : hardware requirements?

Hi,

I've been trying to build a new custom Solr index of DBPedia 3.8, 
because it seems the one uploaded at

http://dev.iks-project.eu/downloads/stanbol-indices/dbpedia-3.8/

does not index the DBPedia ontology properties contained in the 
en/mappingbased_properties_en.nt dump.

So I:
* initialized the EntityHub indexing tool
* Downloaded 1.2 GB worth of bzipped dumps to indexing/resources/rdfdata
* Added http://dbpedia.org/ontology/* to indexing/config/mappings.txt (see sketch below)
* Executed the indexing tool
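
Concretely, that mappings.txt change is just one extra line next to the 
existing field mappings, roughly like this (the other lines are only 
illustrative defaults, and the # lines are annotations for this mail, 
not necessarily part of the file):

    rdf:type
    rdfs:label
    rdfs:comment
    # added for this index: all DBpedia ontology properties
    http://dbpedia.org/ontology/*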

It has been a very intensive process so far, and I had to restart it 
four times due to resource issues.

On an 8 GiB RAM Mac running off a hard disk it was taking about ten 
minutes per 10k indexed items, i.e. after 36 hours it still had quite a 
way to go. Plus, the system started thrashing from page faults even 
with a 6 GiB Java heap.

I tried again on an external SSD with a 7 GiB heap. It went through all 
the triples in about 8 hours, but hit several OutOfMemoryErrors in 
org.apache.lucene.index.IndexWriter.forceMerge.

So I'm asking: who has managed to build an entire DBpedia index so far, 
and on what hardware specs (especially heap size)?

Thanks

Alessandro


-- 
Alessandro Adamou, Ph.D.

Knowledge Media Institute
The Open University
Walton Hall, Milton Keynes MK7 6AA
United Kingdom


"I will give you everything, just don't demand anything."
(Ettore Petrolini, 1917)

Not sent from my iSnobTechDevice


Re: Indexing DBPedia 3.8 : hardware requirements?

Posted by Alessandro Adamou <ad...@cs.unibo.it>.
Hi Rupert, thanks for following up!

On 01/08/2013 14:25, Rupert Westenthaler wrote:
> Only the final optimization of the Solr index will overflow the
> physical RAM, meaning that it will be very slow. But even so, this
> step should not take longer than 1-2 hours.

Indeed, the finalization was the longest and most taxing step: nearly 
5h of those 6h30m were spent on it.

I should have mentioned that in the previous count:
- the TDB had already been populated (~14M triples)
- I used the genericrdf indexer with a custom configuration, but no 
custom entity scores or field boosts.

Now I am trying again with the dbpedia indexer and the 
incoming_links.txt too, as well as a slightly larger dataset (I added 
the Yago stuff now). I am also redoing the TDB from scratch.

> For really big datasets it is good practice to kill the indexing tool
> after all RDF triples have been imported into Jena TDB. This is
> because Jena TDB uses memory-mapped files: after the import of the
> RDF data, most of the available memory will be occupied by Jena,
> which slows down the indexing process. Killing the tool and
> restarting it can therefore improve the indexing performance.

Thanks for the tip; I did indeed restart it as soon as the TDB was 
fully populated - the resident memory footprint was about 15 GB by then!

I will post here whatever figures I can get

Thanks again!

Alessandro


-- 
Alessandro Adamou, Ph.D.

Knowledge Media Institute
The Open University
Walton Hall, Milton Keynes MK7 6AA
United Kingdom


"I will give you everything, just don't demand anything."
(Ettore Petrolini, 1917)

Not sent from my iSnobTechDevice


Re: Indexing DBPedia 3.8 : hardware requirements?

Posted by Rupert Westenthaler <ru...@gmail.com>.
Hi Alessandro,

Solr 4 requires a lot of heap for optimizing the index (basically the
final merge of the index segments). For the Freebase index this needs
about 25 GByte of RAM. If I remember correctly, DBpedia needed about
18 GByte.

You can safely configure -Xmx values higher than the physical RAM
available on the machine. During indexing the heap will not grow to
values higher than ~10 times the size of an entity. So unless you use
some complex LDPath program that fetches a lot of relations (e.g. all
incoming wiki links), memory consumption should stay below about
1 GByte, and during indexing there should be no paging at the OS level.
Only the final optimization of the Solr index will overflow the
physical RAM, meaning that it will be very slow. But even so, this
step should not take longer than 1-2 hours.
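
As an illustration, assuming you run the indexing tool as the usual
runnable jar (adjust the jar name to the tool and version you built),
that just means starting it with a large heap, e.g.:

    java -Xmx18g -jar org.apache.stanbol.entityhub.indexing.dbpedia-*-jar-with-dependencies.jar index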

For really big datasets it is good practice to kill the indexing tool
after all RDF triples have been imported into Jena TDB. This is
because Jena TDB uses memory-mapped files: after the import of the
RDF data, most of the available memory will be occupied by Jena,
which slows down the indexing process. Killing the tool and
restarting it can therefore improve the indexing performance.
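
In shell terms the restart trick is simply (a sketch, same jar
placeholder as above):

    # 1st run: let the tool import all RDF dumps into Jena TDB, then
    # kill it (Ctrl+C) once the import phase has finished
    java -Xmx8g -jar <indexing-tool-jar> index
    # 2nd run: the already populated Jena TDB is reused, so the actual
    # indexing starts in a fresh JVM with memory to spare
    java -Xmx18g -jar <indexing-tool-jar> index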

best
Rupert


-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen

Re: Indexing DBPedia 3.8 : hardware requirements?

Posted by Alessandro Adamou <ad...@cs.unibo.it>.
Just a follow-up. My success story with this task, if you can call it 
that, came on my sixth attempt, with the following setup:

CPU : Intel Xeon E5-2640 @ 2.50GHz
RAM : 24 GiB
JVM heap size : 18g (an additional 4 GB of swap was used in the process)
No info on the disk, but likely an SSD

Time : 6 hours 30 mins

Once again, I used the same datasets as in 
http://dev.iks-project.eu/downloads/stanbol-indices/dbpedia-3.8/ plus 
en/mappingbased_properties_en.nt

In mappings.txt I also added all dbpedia-owl properties and removed the 
mappings from dc:title and foaf:name to rdfs:label.
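
In case anyone wants to reproduce this, the mappings.txt changes looked
roughly like the sketch below (from memory, not the exact file; the
# lines are annotations for this mail, and I am assuming the usual
"source > target" copy-mapping syntax of the Entityhub field mappings):

    # added: index all properties in the DBpedia ontology namespace
    http://dbpedia.org/ontology/*
    # removed these two copy mappings to rdfs:label:
    dc:title > rdfs:label
    foaf:name > rdfs:label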

Cheers

Alessandro


-- 
Alessandro Adamou, Ph.D.

Knowledge Media Institute
The Open University
Walton Hall, Milton Keynes MK7 6AA
United Kingdom


"I will give you everything, just don't demand anything."
(Ettore Petrolini, 1917)

Not sent from my iSnobTechDevice