Posted to users@jena.apache.org by Jean-Marc Vanel <je...@gmail.com> on 2011/12/07 17:07:35 UTC
tdbloader OutOfMemoryException with musicbrainz nt dump
Hi
I'm trying to load the MusicBrainz N-Triples dump into TDB, from:
http://linkedbrainz.c4dmpresents.org/content/rdf-dump
The MusicBrainz dump in N-Triples is BIG!
% ls -l musicbrainz_ngs_dump.rdf.ttl
-rw-r--r-- 1 jmv jmv 25719386678 16 juin 14:58 musicbrainz_ngs_dump.rdf.ttl
The wc command needs 13 min to traverse it!
% time wc musicbrainz_ngs_dump.rdf.ttl
178995221 829703178 25719386678 musicbrainz_ngs_dump.rdf.ttl
wc musicbrainz_ngs_dump.rdf.ttl 710,46s user 20,37s system 94% cpu
12:50,43 total
That means 179 million triples!
I got this stack trace:
Add: 36 000 000 triples (Batch: 204 / Avg: 4 740)
Elapsed: 7 594,09 seconds [2011/12/05 23:16:31 CET]
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr.overlay(BPTreeNodeMgr.java:194)
at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr.access$100(BPTreeNodeMgr.java:22)
at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr$Block2BPTreeNode.fromByteBuffer(BPTreeNodeMgr.java:141)
at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNodeMgr.get(BPTreeNodeMgr.java:84)
at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.get(BPTreeNode.java:127)
at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:379)
at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.internalInsert(BPTreeNode.java:399)
at com.hp.hpl.jena.tdb.index.bplustree.BPTreeNode.insert(BPTreeNode.java:167)
at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.addAndReturnOld(BPlusTree.java:297)
at com.hp.hpl.jena.tdb.index.bplustree.BPlusTree.add(BPlusTree.java:289)
at com.hp.hpl.jena.tdb.nodetable.NodeTableNative.accessIndex(NodeTableNative.java:133)
at com.hp.hpl.jena.tdb.nodetable.NodeTableNative._idForNode(NodeTableNative.java:98)
at com.hp.hpl.jena.tdb.nodetable.NodeTableNative.getAllocateNodeId(NodeTableNative.java:67)
at com.hp.hpl.jena.tdb.nodetable.NodeTableCache._idForNode(NodeTableCache.java:108)
at com.hp.hpl.jena.tdb.nodetable.NodeTableCache.getAllocateNodeId(NodeTableCache.java:67)
at com.hp.hpl.jena.tdb.nodetable.NodeTableWrapper.getAllocateNodeId(NodeTableWrapper.java:32)
at com.hp.hpl.jena.tdb.nodetable.NodeTableInline.getAllocateNodeId(NodeTableInline.java:39)
at com.hp.hpl.jena.tdb.nodetable.NodeTupleTableConcrete.addRow(NodeTupleTableConcrete.java:72)
at com.hp.hpl.jena.tdb.store.bulkloader.LoaderNodeTupleTable.load(LoaderNodeTupleTable.java:112)
at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$1.send(BulkLoader.java:203)
at com.hp.hpl.jena.tdb.store.bulkloader.BulkLoader$1.send(BulkLoader.java:186)
at org.openjena.riot.lang.LangTurtle.emit(LangTurtle.java:52)
at org.openjena.riot.lang.LangTurtleBase.checkEmitTriple(LangTurtleBase.java:475)
at org.openjena.riot.lang.LangTurtleBase.objectList(LangTurtleBase.java:341)
at org.openjena.riot.lang.LangTurtleBase.predicateObjectItem(LangTurtleBase.java:273)
at org.openjena.riot.lang.LangTurtleBase.predicateObjectList(LangTurtleBase.java:254)
at org.openjena.riot.lang.LangTurtleBase.triples(LangTurtleBase.java:245)
at org.openjena.riot.lang.LangTurtleBase.triplesSameSubject(LangTurtleBase.java:206)
at org.openjena.riot.lang.LangTurtle.oneTopLevelElement(LangTurtle.java:34)
at org.openjena.riot.lang.LangTurtleBase.runParser(LangTurtleBase.java:132)
at org.openjena.riot.lang.LangBase.parse(LangBase.java:71)
at org.openjena.riot.RiotReader.parseTriples(RiotReader.java:85)
tdbloader --loc ~/tdb_data musicbrainz_ngs_dump.rdf.ttl 4380,35s user
111,88s system 47% cpu 2:39:01,08 total
I think I already had 9 million triples from DBpedia in the
database (not sure) before loading the MusicBrainz data.
I kept the script's original heap size of 1.2 GB.
This was with TDB 0.8.10.
% java -version
java version "1.7.0"
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
% uname -a
Linux oem-laptop 2.6.32-5-amd64 #1 SMP Mon Oct 3 03:59:20 UTC 2011
x86_64 GNU/Linux
( Debian )
Is the current state of the database corrupted?
Of course I can reload with more memory, but I need to understand
better what TDB does while loading.
Apparently it populates a B+ tree in memory while loading.
Does this also happen in normal operation, i.e. for querying?
For loading this dataset, is it just a matter of splitting the input
and loading it in several steps?
If so, the tool should do that itself.
Is there any hope that this dataset works on TDB?
Have I reached the limit?
PS
There is an unfinished sentence in http://openjena.org/wiki/TDB/Architecture :
The default storage of each indexes
--
Jean-Marc Vanel
Déductions SARL - Consulting, services, training,
Rule-based programming, Semantic Web
http://jmvanel.free.fr/ - EulerGUI, a turntable GUI for Semantic Web +
rules, XML, UML, eCore, Java bytecode
+33 (0)6 89 16 29 52
chat : irc://irc.freenode.net#eulergui
Re: tdbloader OutOfMemoryException with musicbrainz nt dump
Posted by Andy Seaborne <an...@apache.org>.
On 07/12/11 20:37, Andy Seaborne wrote:
> On 07/12/11 16:07, Jean-Marc Vanel wrote:
>> Hi
>
> Hi there,
>
>>
>> I'm trying to load the MusicBrainz N-Triples dump into TDB, from:
>> http://linkedbrainz.c4dmpresents.org/content/rdf-dump
>>
>> musicbrainz dump in N-Triples is BIG !!!
>> % ls -l musicbrainz_ngs_dump.rdf.ttl
>> -rw-r--r-- 1 jmv jmv 25719386678 16 juin 14:58
>> musicbrainz_ngs_dump.rdf.ttl
>>
>> The wc command needs 13 min to traverse it!
>> % time wc musicbrainz_ngs_dump.rdf.ttl
>> 178995221 829703178 25719386678 musicbrainz_ngs_dump.rdf.ttl
>> wc musicbrainz_ngs_dump.rdf.ttl 710,46s user 20,37s system 94% cpu
>> 12:50,43 total
>>
>> That means 179 million triples!
>
> Large but not that large.
>
>>
>> I had this stack :
>>
>> Add: 36 000 000 triples (Batch: 204 / Avg: 4 740)
>> Elapsed: 7 594,09 seconds [2011/12/05 23:16:31 CET]
>> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>
> A possible cause is many large literals in the data.
>
> (I don't know the data set)
It looks like it's because the data (which isn't Turtle, or N3 as
claimed, but already N-Triples) has a huge number of bNodes in it.
The parser keeps a map of the bNode labels so that when a label is
reused in a file, it yields the same bNode, but that requires state to
be kept, and this file seems to have a massive number of bNodes.
Set the heap large and hope.
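With the standard wrapper scripts, a larger heap can usually be set through the JVM_ARGS environment variable; that variable name is an assumption about your setup, so if your local tdbloader script doesn't read it, edit the -Xmx value in the script itself:

```shell
# Sketch: give the loader a 4 GB heap for this run.
# JVM_ARGS is the variable the Jena/TDB wrapper scripts typically honour;
# adjust if your local script uses a different mechanism.
JVM_ARGS="-Xmx4G" tdbloader --loc ~/tdb_data musicbrainz_ngs_dump.rdf.ttl
```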
I tried 4G and the parser ran.
Internally, the parser has different bNode label policies, but these
aren't exposed in a convenient way currently.
Another way is to convert the bNodes to URIs - the data is already full
of synthetic URIs, so why it uses bNodes I don't know.
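A rough way to do that conversion on an N-Triples file is a one-line rewrite of the labels. The urn:bnode: scheme below is made up for illustration, and a naive regex like this would also rewrite any "_:" that happens to appear inside a literal, so spot-check the output:

```shell
# Sketch: replace each blank node label _:label with a synthetic URI.
# Caveat: this does not parse the line, so "_:" occurring inside a
# literal would be rewritten too - check a sample of the result.
sed -E 's|_:([A-Za-z0-9]+)|<urn:bnode:\1>|g' musicbrainz_ngs_dump.rdf.ttl > musicbrainz_uris.nt
```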
Andy
Re: tdbloader OutOfMemoryException with musicbrainz nt dump
Posted by Andy Seaborne <an...@apache.org>.
On 07/12/11 16:07, Jean-Marc Vanel wrote:
> Hi
Hi there,
>
> I'm trying to load the MusicBrainz N-Triples dump into TDB, from:
> http://linkedbrainz.c4dmpresents.org/content/rdf-dump
>
> musicbrainz dump in N-Triples is BIG !!!
> % ls -l musicbrainz_ngs_dump.rdf.ttl
> -rw-r--r-- 1 jmv jmv 25719386678 16 juin 14:58 musicbrainz_ngs_dump.rdf.ttl
>
> The wc command needs 13 min to traverse it!
> % time wc musicbrainz_ngs_dump.rdf.ttl
> 178995221 829703178 25719386678 musicbrainz_ngs_dump.rdf.ttl
> wc musicbrainz_ngs_dump.rdf.ttl 710,46s user 20,37s system 94% cpu
> 12:50,43 total
>
> That means 179 million triples!
Large but not that large.
>
> I had this stack :
>
> Add: 36 000 000 triples (Batch: 204 / Avg: 4 740)
> Elapsed: 7 594,09 seconds [2011/12/05 23:16:31 CET]
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
A possible cause is many large literals in the data.
(I don't know the data set)
You're on a 64-bit machine, so you can set the heap size larger. The
default in the script is sized to work on a 32-bit machine (where Java
is limited to a heap of ~1.5G).
...
> tdbloader --loc ~/tdb_data musicbrainz_ngs_dump.rdf.ttl 4380,35s user
> 111,88s system 47% cpu 2:39:01,08 total
>
> I think I already had 9 million triples from DBpedia in the
> database (not sure) before loading the MusicBrainz data.
The loader is faster on an empty database.
> I kept the script's original heap size of 1.2 GB.
>
> This was with TDB-0.8.10 .
>
> % java -version
> java version "1.7.0"
> Java(TM) SE Runtime Environment (build 1.7.0-b147)
> Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
> % uname -a
> Linux oem-laptop 2.6.32-5-amd64 #1 SMP Mon Oct 3 03:59:20 UTC 2011
> x86_64 GNU/Linux
> ( Debian )
Try tdbloader2 (which currently only works on Linux and only on empty
databases).
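Assuming a fresh, empty location, the invocation has the same shape as tdbloader's (the directory name here is illustrative):

```shell
# Sketch: bulk load into a new, empty database directory.
tdbloader2 --loc ~/tdb_data_mbz musicbrainz_ngs_dump.rdf.ttl
```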
Laptops are slower than servers.
SSDs are faster than magnetic disks.
>
> Is the current state of the database corrupted?
Most likely.
>
> Of course I can reload with more memory, but I need to understand
> better what TDB does while loading.
> Apparently it populates a B+ tree in memory while loading.
It just happened to hit the heap limit at that point. The point where
it hits the limit isn't necessarily where most of the memory is being
used.
> Does it also happen in normal functioning ? I mean for querying.
> For loading this dataset, is it just a matter of splitting before
> loading in several steps?
> Then the tool should do it itself .
No need to split input.
I usually suggest parsing to N-Triples first, through "riot --validate",
to check the data, because you don't want to get partway through a load
and then find there's bad data in it. If you do check, keep the
N-Triples output, as loading is faster from N-Triples.
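That check-then-load sequence might look like the following; the --output flag name is from memory, so check riot --help on your version:

```shell
# 1. Parse and validate, writing the cleaned data out as N-Triples.
riot --output=N-TRIPLES musicbrainz_ngs_dump.rdf.ttl > checked.nt
# 2. Load the already-checked N-Triples (faster than re-parsing Turtle).
tdbloader --loc ~/tdb_data checked.nt
```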
>
> Is there any hope that this dataset works on TDB ?
Yes.
> Have I reached the limit ?
The preset limits are of necessity a guess.
>
> PS
> There is an unfinished sentence in http://openjena.org/wiki/TDB/Architecture :
> The default storage of each indexes
thanks