Posted to users@jena.apache.org by Jean-Marc Vanel <je...@gmail.com> on 2011/12/07 17:07:35 UTC

tdbloader OutOfMemoryException with musicbrainz nt dump

Hi

I'm trying to load into TDB the MusicBrainz N-Triples dump from:
http://linkedbrainz.c4dmpresents.org/content/rdf-dump

The MusicBrainz dump in N-Triples is BIG!
 % ls -l  musicbrainz_ngs_dump.rdf.ttl
-rw-r--r-- 1 jmv jmv 25719386678 16 juin  14:58 musicbrainz_ngs_dump.rdf.ttl

The wc command needs 13 minutes just to traverse it!
 % time wc musicbrainz_ngs_dump.rdf.ttl
  178995221   829703178 25719386678 musicbrainz_ngs_dump.rdf.ttl
wc musicbrainz_ngs_dump.rdf.ttl  710,46s user 20,37s system 94% cpu
12:50,43 total

That means 179 million triples!

I got this stack trace:

Add: 36 000 000 triples (Batch: 204 / Avg: 4 740)
  Elapsed: 7 594,09 seconds [2011/12/05 23:16:31 CET]
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr.overlay(BPTreeNodeMgr.java:194)
        at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr.access$100(BPTreeNodeMgr.java:22)
        at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr$Block2BPTreeNode.fromByteBuffer(BPTreeNodeMgr.java:141)
        at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr.get(BPTreeNodeMgr.java:84)
        at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.get(BPTreeNode.java:127)
        at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:379)
        at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:399)
        at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.insert(BPTreeNode.java:167)
        at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.addAndReturnOld(BPlusTree.java:297)
        at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.add(BPlusTree.java:289)
        at com.hp.hpl.jena.tdb.nodetable.NodeTableNative.accessIndex(NodeTableNative.java:133)
        at com.hp.hpl.jena.tdb.nodetable.NodeTableNative._idForNode(NodeTableNative.java:98)
        at com.hp.hpl.jena.tdb.nodetable.NodeTableNative.getAllocateNodeId(NodeTableNative.java:67)
        at com.hp.hpl.jena.tdb.nodetable.NodeTableCache._idForNode(NodeTableCache.java:108)
        at com.hp.hpl.jena.tdb.nodetable.NodeTableCache.getAllocateNodeId(NodeTableCache.java:67)
        at com.hp.hpl.jena.tdb.nodetable.NodeTableWrapper.getAllocateNodeId(NodeTableWrapper.java:32)
        at com.hp.hpl.jena.tdb.nodetable.NodeTableInline.getAllocateNodeId(NodeTableInline.java:39)
        at com.hp.hpl.jena.tdb.nodetable.NodeTupleTableConcrete.addRow(NodeTupleTableConcrete.java:72)
        at com.hp.hpl.jena.tdb.store.bulkloader.LoaderNodeTupleTable.load(LoaderNodeTupleTable.java:112)
        at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$1.send(BulkLoader.java:203)
        at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$1.send(BulkLoader.java:186)
        at org.openjena.riot.lang.LangTurtle.emit(LangTurtle.java:52)
        at org.openjena.riot.lang.LangTurtleBase.checkEmitTriple(LangTurtleBase.java:475)
        at org.openjena.riot.lang.LangTurtleBase.objectList(LangTurtleBase.java:341)
        at org.openjena.riot.lang.LangTurtleBase.predicateObjectItem(LangTurtleBase.java:273)
        at org.openjena.riot.lang.LangTurtleBase.predicateObjectList(LangTurtleBase.java:254)
        at org.openjena.riot.lang.LangTurtleBase.triples(LangTurtleBase.java:245)
        at org.openjena.riot.lang.LangTurtleBase.triplesSameSubject(LangTurtleBase.java:206)
        at org.openjena.riot.lang.LangTurtle.oneTopLevelElement(LangTurtle.java:34)
        at org.openjena.riot.lang.LangTurtleBase.runParser(LangTurtleBase.java:132)
        at org.openjena.riot.lang.LangBase.parse(LangBase.java:71)
        at org.openjena.riot.RiotReader.parseTriples(RiotReader.java:85)
tdbloader --loc ~/tdb_data musicbrainz_ngs_dump.rdf.ttl  4380,35s user
111,88s system 47% cpu 2:39:01,08 total

I think the database already held about 9 million triples from DBpedia
(not sure) before loading the MusicBrainz data.
I kept the script's original heap size of 1.2 GB.

This was with TDB-0.8.10.

% java -version
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
% uname -a
Linux oem-laptop 2.6.32-5-amd64 #1 SMP Mon Oct 3 03:59:20 UTC 2011
x86_64 GNU/Linux
( Debian )

Is the current state of the database corrupted?

Of course I can reload with more memory, but I want to understand
better what TDB does while loading.
Apparently it populates a B+Tree in memory while loading.
Does that also happen in normal operation, i.e. for querying?
For loading this dataset, is it just a matter of splitting the file and
loading it in several steps?
If so, the tool should do that itself.

Is there any hope of this dataset working on TDB?
Have I reached the limit?

PS
There is an unfinished sentence at http://openjena.org/wiki/TDB/Architecture :
"The default storage of each indexes"

-- 
Jean-Marc Vanel
Déductions SARL - Consulting, services, training,
Rule-based programming, Semantic Web
http://jmvanel.free.fr/ - EulerGUI, a turntable GUI for Semantic Web +
rules, XML, UML, eCore, Java bytecode
+33 (0)6 89 16 29 52
chat :  irc://irc.freenode.net#eulergui

Re: tdbloader OutOfMemoryException with musicbrainz nt dump

Posted by Andy Seaborne <an...@apache.org>.
On 07/12/11 20:37, Andy Seaborne wrote:
> On 07/12/11 16:07, Jean-Marc Vanel wrote:
>> Hi
>
> Hi there,
>
>>
>> I 'm trying to load in TDB the musicbrainz n-triples dump from :
>> http://linkedbrainz.c4dmpresents.org/content/rdf-dump
>>
>> musicbrainz dump in N-Triples is BIG !!!
>> % ls -l musicbrainz_ngs_dump.rdf.ttl
>> -rw-r--r-- 1 jmv jmv 25719386678 16 juin 14:58
>> musicbrainz_ngs_dump.rdf.ttl
>>
>> The wc command needs 13mn to traverse it !
>> % time wc musicbrainz_ngs_dump.rdf.ttl
>> 178995221 829703178 25719386678 musicbrainz_ngs_dump.rdf.ttl
>> wc musicbrainz_ngs_dump.rdf.ttl 710,46s user 20,37s system 94% cpu
>> 12:50,43 total
>>
>> Which means 179 millions of triples !
>
> Large but not that large.
>
>>
>> I had this stack :
>>
>> Add: 36 000 000 triples (Batch: 204 / Avg: 4 740)
>> Elapsed: 7 594,09 seconds [2011/12/05 23:16:31 CET]
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>
> A possible cause is many large literals in the data.
>
> (I don't know the data set)

It looks like it's because the data (which isn't Turtle or, as claimed,
N3, but is already N-Triples) has a huge number of bNodes in it.

The parser keeps a map of the bNode labels so that when a label is reused
in the file it yields the same bNode, but that requires state to be kept,
and this file seems to have a massive number of bNodes.

Set the heap large and hope.

I tried 4G and the parser ran.
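
For example, if your copy of the tdbloader script picks up a JVM_ARGS
environment variable (that's an assumption - some versions hard-code the
-Xmx value, so check the script), a run with a bigger heap looks like:

 # assumes the script reads JVM_ARGS; otherwise edit -Xmx in the script itself
 % JVM_ARGS="-Xmx4G" tdbloader --loc ~/tdb_data musicbrainz_ngs_dump.rdf.ttl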

Internally, the parser has different bNode label policies, but these
aren't exposed in a convenient way currently.

Another way is to convert the bNodes to URIs - the data is already full
of synthetic URIs, so why it uses bNodes I don't know.
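
As a rough, untested sketch over the N-Triples, a text substitution on the
bNode labels would do it (urn:bnode: is just a made-up scheme; use whatever
URI pattern suits you):

 # urn:bnode: is an arbitrary, invented scheme - pick your own
 % sed -r 's/_:([A-Za-z0-9]+)/<urn:bnode:\1>/g' \
       musicbrainz_ngs_dump.rdf.ttl > musicbrainz_uris.nt

That assumes no literal in the data happens to contain something matching
_:label, so it's worth spot-checking the output before loading.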

	Andy

Re: tdbloader OutOfMemoryException with musicbrainz nt dump

Posted by Andy Seaborne <an...@apache.org>.
On 07/12/11 16:07, Jean-Marc Vanel wrote:
> Hi

Hi there,

>
> I 'm trying to load in TDB the musicbrainz n-triples dump from :
> http://linkedbrainz.c4dmpresents.org/content/rdf-dump
>
> musicbrainz dump in N-Triples is BIG !!!
>   % ls -l  musicbrainz_ngs_dump.rdf.ttl
> -rw-r--r-- 1 jmv jmv 25719386678 16 juin  14:58 musicbrainz_ngs_dump.rdf.ttl
>
> The wc command needs 13mn to traverse it !
>   % time wc musicbrainz_ngs_dump.rdf.ttl
>    178995221   829703178 25719386678 musicbrainz_ngs_dump.rdf.ttl
> wc musicbrainz_ngs_dump.rdf.ttl  710,46s user 20,37s system 94% cpu
> 12:50,43 total
>
> Which means 179 millions of triples !

Large but not that large.

>
> I had this stack :
>
> Add: 36 000 000 triples (Batch: 204 / Avg: 4 740)
>    Elapsed: 7 594,09 seconds [2011/12/05 23:16:31 CET]
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

A possible cause is many large literals in the data.

(I don't know the data set)

You're on a 64-bit machine, so you can set the heap size larger.  The
default in the script is sized for a 32-bit machine (where Java is limited
to ~1.5G of heap).
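
For instance, raise the heap setting inside the tdbloader script (the exact
variable name below is an assumption and differs between versions, so treat
this as the idea rather than the literal line):

  # in the tdbloader script - variable name is a guess, check your copy;
  # the default is around -Xmx1200M
  JVM_ARGS="-Xmx4G"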

...

> tdbloader --loc ~/tdb_data musicbrainz_ngs_dump.rdf.ttl  4380,35s user
> 111,88s system 47% cpu 2:39:01,08 total
>
> I think I had already 9 million of triples from dbPedia in the
> database (not sure) before loading mbz .

The loader is faster on an empty database.

> I kept the original size 1.2Gb of the script.
>
> This was with TDB-0.8.10 .
>
> % java -version
> java version "1.7.0"
> Java(TM) SE Runtime Environment (build 1.7.0-b147)
> Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
> % uname -a
> Linux oem-laptop 2.6.32-5-amd64 #1 SMP Mon Oct 3 03:59:20 UTC 2011
> x86_64 GNU/Linux
> ( Debian )

Try tdbloader2 (which currently only works on Linux and only on empty 
databases).
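
Roughly (here ~/tdb_data_mbz is just an example name for a fresh, empty
directory to hold the new database):

 % tdbloader2 --loc ~/tdb_data_mbz musicbrainz_ngs_dump.rdf.ttl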

Laptops are slower than servers.
SSDs are faster than mag disks.

>
> Is the current state of the data base corrupted ?

Most likely.

>
> Of course I can reload with more memory, but I need to understand
> better what TDB does while loading.
> Apparently it populates a bplustree in memory while loading .

It just happened to hit the limit at that point.  The point where the heap
limit is hit isn't necessarily where most of the memory is being used.

> Does it also happen in normal functioning ? I mean for querying.
> For loading this dataset, is it just a matter of splitting before
> loading in several steps?
> Then the tool should do it itself .

No need to split the input.

I usually suggest parsing to N-Triples first, or running the data through
"riot --validate" to check it, because you don't want to get part way
through a load and then find it has bad data in it.  If you do check, keep
the N-Triples, as loading is faster from N-Triples.
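
That is, something like:

 % riot --validate musicbrainz_ngs_dump.rdf.ttl

Since this dump is already N-Triples, the check is all that's needed here;
there's no separate conversion step.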

>
> Is there any hope that this dataset works on TDB ?

Yes.

> Have I reached the limit ?

The preset limits are of necessity a guess.

>
> PS
> There is an unfinished sentence in http://openjena.org/wiki/TDB/Architecture :
> The default storage of each indexes

thanks