Posted to dev@jena.apache.org by Paolo Castagna <ca...@googlemail.com> on 2012/03/05 08:07:43 UTC

Re: Strategies for loading large (>500m triples) datasets

Paolo Castagna wrote:
> I have some code to convert Freebase dumps into RDF; it's ~600 million
> triples, and I'll use that to gather some numbers. Ideally, comparing
> tdbloader, tdbloader2, tdbloader3 and tdbloader4 (in terms of both
> time and cost).

FYI

Code to convert Freebase dumps into RDF is here:
https://github.com/castagna/freebase2rdf

Over the last couple of days I have been running a few experiments on
Amazon EC2 m1.xlarge instances (i.e. 15 GB memory).

tdbloader didn't complete; it just kept getting slower and slower...


With tdbloader2 I had a java.lang.OutOfMemoryError:

Mar  5 05:22:30 ip-10-53-58-155 build: Add: 618,450,000 Data (Batch: 6,547 / Avg: 21,206)
Mar  5 05:35:10 ip-10-53-58-155 build: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Mar  5 05:35:10 ip-10-53-58-155 build:     at java.util.HashMap.<init>(HashMap.java:209)
Mar  5 05:35:10 ip-10-53-58-155 build:     at org.apache.xerces.impl.validation.ValidationState.<init>(Unknown Source)
Mar  5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.datatypes.xsd.XSDDatatype.parse(XSDDatatype.java:270)
Mar  5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.datatypes.xsd.impl.XSDBaseNumericType.parse(XSDBaseNumericType.java:165)
Mar  5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.graph.impl.LiteralLabelImpl.setValue(LiteralLabelImpl.java:213)
Mar  5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.graph.impl.LiteralLabelImpl.setLiteralLabel_1(LiteralLabelImpl.java:107)
Mar  5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.graph.impl.LiteralLabelImpl.<init>(LiteralLabelImpl.java:96)
Mar  5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.graph.impl.LiteralLabelFactory.createLiteralLabel(LiteralLabelFactory.java:28)
Mar  5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.graph.Node.createLiteral(Node.java:103)
Mar  5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.sparql.util.NodeFactory.intToNode(NodeFactory.java:79)
Mar  5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.tdb.solver.stats.Stats.format(Stats.java:195)
Mar  5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.tdb.solver.stats.Stats.write(Stats.java:72)
Mar  5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.exec(CmdNodeTableBuilder.java:178)
Mar  5 05:35:10 ip-10-53-58-155 build:     at arq.cmdline.CmdMain.mainMethod(CmdMain.java:97)
Mar  5 05:35:10 ip-10-53-58-155 build:     at arq.cmdline.CmdMain.mainRun(CmdMain.java:59)
Mar  5 05:35:10 ip-10-53-58-155 build:     at arq.cmdline.CmdMain.mainRun(CmdMain.java:46)
Mar  5 05:35:10 ip-10-53-58-155 build:     at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.main(CmdNodeTableBuilder.java:79)

I'll try giving the JVM more RAM.


tdbloader3 ran out of disk space (because it writes temporary files
to /tmp and the available instance disk space is mounted on /mnt :-()
I'll see how to change/fix this and re-run.

Paolo

Re: Strategies for loading large (>500m triples) datasets

Posted by Paolo Castagna <ca...@googlemail.com>.
FYI

Paolo Castagna wrote:
> This time, UNIX sort filled /tmp...
> I'll try specifying --temporary-directory=DIR or, better, setting the $TMPDIR
> environment variable (this way there is no need to change the tdbloader2 script).

This time, I was able to load the Freebase data dump (converted into RDF) using tdbloader2 (which is included in TDB).

This is how I ran tdbloader2 on an EC2 m1.xlarge instance (i.e. 15 GB memory):

export JVM_ARGS="-Xmx4096m -server"
export TMPDIR=/mnt/data/tmp
tdbloader2 --loc /mnt/data/freebase /mnt/data/freebase2rdf/freebase-datadump-rdf.nt.gz

Total elapsed time to load 618,465,279 triples:
~12 hours (i.e. ~14,000 triples/s overall speed)

This is the log:
Mar  7 13:11:37 ip-10-54-167-166 build:  13:11:37 -- TDB Bulk Loader Start
Mar  7 13:11:37 ip-10-54-167-166 build:  13:11:37 Data phase
Mar  7 13:11:39 ip-10-54-167-166 build: Load: /mnt/data/freebase2rdf/freebase-datadump-rdf.nt.gz -- 2012/03/07 13:11:38 UTC
Mar  7 13:11:42 ip-10-54-167-166 build: Add: 50,000 Data (Batch: 16,550 / Avg: 16,550)
Mar  7 13:11:43 ip-10-54-167-166 build: Add: 100,000 Data (Batch: 39,184 / Avg: 23,272)
[...]
Mar  7 19:13:51 ip-10-54-167-166 build: Add: 618,450,000 Data (Batch: 53,078 / Avg: 28,457)
Mar  7 19:24:44 ip-10-54-167-166 build: Total: 618,465,279 tuples : 22,385.15 seconds : 27,628.37 tuples/sec [2012/03/07 19:24:44 UTC]
Mar  7 19:24:45 ip-10-54-167-166 build:  19:24:44 Index phase
Mar  7 19:24:45 ip-10-54-167-166 build:  19:24:45 Index SPO
Mar  7 21:03:18 ip-10-54-167-166 build:  21:03:18 Build SPO
Mar  7 21:14:24 ip-10-54-167-166 build:  21:14:24 Index POS
Mar  7 23:38:28 ip-10-54-167-166 build:  23:38:28 Build POS
Mar  7 23:49:03 ip-10-54-167-166 build:  23:49:03 Index OSP
Mar  8 00:56:13 ip-10-54-167-166 build:  00:56:13 Build OSP
Mar  8 01:08:17 ip-10-54-167-166 build:  01:08:17 Index phase end
Mar  8 01:08:59 ip-10-54-167-166 build:  01:08:59 -- TDB Bulk Loader Finish
Mar  8 01:08:59 ip-10-54-167-166 build:  01:08:59 -- 43000 seconds
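
The phase durations and the overall rate can be recomputed from the timestamps in the log above. The following is a small sketch of that arithmetic; the variable names are mine, and the timestamps are copied from the loader's own messages (UTC, 2012).

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the tdbloader2 log above.
start = datetime.strptime("2012-03-07 13:11:37", FMT)        # TDB Bulk Loader Start
index_start = datetime.strptime("2012-03-07 19:24:44", FMT)  # Index phase
index_end = datetime.strptime("2012-03-08 01:08:17", FMT)    # Index phase end
finish = datetime.strptime("2012-03-08 01:08:59", FMT)       # TDB Bulk Loader Finish

TRIPLES = 618_465_279  # "Total: 618,465,279 tuples" in the log


def seconds(a, b):
    """Whole seconds elapsed between two datetimes."""
    return int((b - a).total_seconds())


data_s = seconds(start, index_start)       # data phase
index_s = seconds(index_start, index_end)  # SPO + POS + OSP index phase
total_s = seconds(start, finish)           # whole run
overall_rate = TRIPLES / total_s           # triples per second, end to end

print(f"data phase : {data_s:,} s")
print(f"index phase: {index_s:,} s")
print(f"total      : {total_s:,} s")
print(f"overall    : {overall_rate:,.0f} triples/s")
```

The data phase and the index phase each take roughly half of the run, so the end-to-end rate is about half of the 27,628 tuples/sec reported for the data phase alone.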

Paolo

Re: Strategies for loading large (>500m triples) datasets

Posted by Paolo Castagna <ca...@googlemail.com>.
Paolo Castagna wrote:
>> With tdbloader2 I had a java.lang.OutOfMemoryError:

[...]

>> I'll try giving the JVM more RAM.
>
> I tried with -Xmx2048m, but I had the same problem.
> I'll try with -Xmx4096m.

This time, UNIX sort filled /tmp...
I'll try specifying --temporary-directory=DIR or, better, setting the $TMPDIR
environment variable (this way there is no need to change the tdbloader2 script).
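
Since both failures so far came from a temporary directory on a small filesystem, a quick check of the free space before starting a long load may help. This is only a sketch: the function names and the 100 GiB budget are made up for illustration, and the real scratch space needed depends on the data.

```python
import shutil
import tempfile


def free_bytes(path):
    """Return the free space, in bytes, on the filesystem holding `path`."""
    return shutil.disk_usage(path).free


def has_room(path, required_bytes):
    """True if the filesystem holding `path` has at least `required_bytes` free."""
    return free_bytes(path) >= required_bytes


# tempfile.gettempdir() consults $TMPDIR first, so this reports on the same
# directory that tools honouring $TMPDIR would use.
tmp_dir = tempfile.gettempdir()
required = 100 * 1024**3  # hypothetical budget: 100 GiB of scratch space

print(f"temporary directory: {tmp_dir}")
print(f"free space         : {free_bytes(tmp_dir) / 1024**3:.1f} GiB")
print(f"enough for the load: {has_room(tmp_dir, required)}")
```

Running it with TMPDIR=/mnt/data/tmp set in the environment would report on the /mnt filesystem rather than on /tmp.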

>> tdbloader3 ran out of disk space (because it writes temporary files
>> to /tmp and the available instance disk space is mounted on /mnt :-()
>> I'll see how to change/fix this and re-run.
>
> This ran almost to completion this time, but I was using the --spill-size-auto policy, which clearly needs improvement.
>

[...]

>
> I'll try with a fixed --spill-size 10000000.

This time, I was able to load the Freebase data dump (converted into
RDF) using tdbloader3.

This is how I ran tdbloader3 on an EC2 m1.xlarge instance (i.e. 15 GB
memory):

java -Djava.io.tmpdir=/mnt/data/tmp \
  -cp target/jena-tdbloader3-0.1-incubating-SNAPSHOT-jar-with-dependencies.jar \
  -server -d64 -Xmx12288M cmd.tdbloader3 --no-stats --compression \
  --spill-size 10000000 --loc /mnt/data/freebase \
  /mnt/data/freebase2rdf/freebase-datadump-rdf.nt.gz

Total elapsed time to load 618,465,279 triples:
Total: 618,465,279 tuples : 53,608.12 seconds : 11,536.78 tuples/sec

This is the log:
Mar  6 11:43:59 ip-10-53-130-32 build: INFO  Load: /mnt/data/freebase2rdf/freebase-datadump-rdf.nt.gz -- 2012/03/06 11:43:59 UTC
Mar  6 11:44:00 ip-10-53-130-32 build: INFO  Add: 50,000 tuples (Batch: 35,335 / Avg: 35,335)
Mar  6 11:44:01 ip-10-53-130-32 build: INFO  Add: 100,000 tuples (Batch: 68,212 / Avg: 46,554)
[...]
Mar  6 15:32:38 ip-10-53-130-32 build: INFO  Add: 618,450,000 tuples (Batch: 89,766 / Avg: 45,079)
Mar  6 15:32:38 ip-10-53-130-32 build: INFO  Node Table (1/3): building nodes.dat and sorting hash|id ...
Mar  6 17:24:46 ip-10-53-130-32 build: INFO  Add: 50,000 records for node table (1/3) phase (Batch: 7 / Avg: 7)
Mar  6 17:24:47 ip-10-53-130-32 build: INFO  Add: 100,000 records for node table (1/3) phase (Batch: 82,236 / Avg: 14)
[...]
Mar  6 21:23:09 ip-10-53-130-32 build: INFO  Add: 1,855,350,000 records for node table (1/3) phase (Batch: 216,450 / Avg: 88,220)
Mar  6 21:23:09 ip-10-53-130-32 build: INFO  Total: 1,855,395,837 tuples : 21,031.01 seconds : 88,221.91 tuples/sec [2012/03/06 21:23:09 UTC]
Mar  6 21:23:40 ip-10-53-130-32 build: INFO  Node Table (2/3): generating input data using node ids...
Mar  6 23:00:17 ip-10-53-130-32 build: INFO  Add: 50,000 records for node table (2/3) phase (Batch: 8 / Avg: 8)
Mar  6 23:00:17 ip-10-53-130-32 build: INFO  Add: 100,000 records for node table (2/3) phase (Batch: 96,899 / Avg: 17)
[...]
Mar  7 01:04:18 ip-10-53-130-32 build: INFO  Add: 618,450,000 records for node table (2/3) phase (Batch: 95,969 / Avg: 46,718)
Mar  7 01:04:18 ip-10-53-130-32 build: INFO  Total: 618,463,448 tuples : 13,237.97 seconds : 46,718.90 tuples/sec [2012/03/07 01:04:18 UTC]
Mar  7 01:04:23 ip-10-53-130-32 build: INFO  Node Table (3/3): building node table B+Tree index (i.e. node2id.dat and node2id.idn files)...
Mar  7 01:04:38 ip-10-53-130-32 build: INFO  Add: 50,000 records for node table (3/3) phase (Batch: 3,511 / Avg: 3,511)
Mar  7 01:04:38 ip-10-53-130-32 build: INFO  Add: 100,000 records for node table (3/3) phase (Batch: 375,939 / Avg: 6,958)
[...]
Mar  7 01:07:21 ip-10-53-130-32 build: INFO  Add: 149,050,000 records for node table (3/3) phase (Batch: 980,392 / Avg: 838,537)
Mar  7 01:07:24 ip-10-53-130-32 build: INFO  Total: 149,066,002 tuples : 180.42 seconds : 826,225.75 tuples/sec [2012/03/07 01:07:24 UTC]
Mar  7 01:07:27 ip-10-53-130-32 build: INFO  Index: creating SPO index...
Mar  7 01:08:14 ip-10-53-130-32 build: INFO  Add: 50,000 records to SPO (Batch: 1,065 / Avg: 1,065)
Mar  7 01:08:15 ip-10-53-130-32 build: INFO  Add: 100,000 records to SPO (Batch: 54,764 / Avg: 2,090)
[...]
Mar  7 01:18:47 ip-10-53-130-32 build: INFO  Add: 618,450,000 records to SPO (Batch: 1,020,408 / Avg: 908,977)
Mar  7 01:18:50 ip-10-53-130-32 build: INFO  Total: 618,463,449 tuples : 682.99 seconds : 905,528.69 tuples/sec [2012/03/07 01:18:50 UTC]
Mar  7 01:18:50 ip-10-53-130-32 build: INFO  Index: creating GSPO index...
Mar  7 01:18:50 ip-10-53-130-32 build: INFO  Total: 0 tuples : 0.12 seconds : 0.00 tuples/sec [2012/03/07 01:18:50 UTC]
Mar  7 01:18:56 ip-10-53-130-32 build: INFO  Index: sorting data for POS index...
Mar  7 01:18:57 ip-10-53-130-32 build: INFO  Add: 50,000 records to POS (Batch: 210,084 / Avg: 210,084)
Mar  7 01:18:57 ip-10-53-130-32 build: INFO  Add: 100,000 records to POS (Batch: 1,724,137 / Avg: 374,531)
[...]
Mar  7 01:47:03 ip-10-53-130-32 build: INFO  Add: 618,450,000 records to POS (Batch: 4,545,454 / Avg: 366,790)
Mar  7 01:47:03 ip-10-53-130-32 build: INFO  Total: 618,463,449 tuples : 1,686.18 seconds : 366,783.97 tuples/sec [2012/03/07 01:47:03 UTC]
Mar  7 01:47:03 ip-10-53-130-32 build: INFO  Index: creating POS index...
Mar  7 01:47:41 ip-10-53-130-32 build: INFO  Add: 50,000 records to POS (Batch: 1,321 / Avg: 1,321)
Mar  7 01:47:41 ip-10-53-130-32 build: INFO  Add: 100,000 records to POS (Batch: 1,086,956 / Avg: 2,639)
[...]
Mar  7 01:57:37 ip-10-53-130-32 build: INFO  Add: 618,450,000 records to POS (Batch: 1,162,790 / Avg: 974,417)
Mar  7 01:57:42 ip-10-53-130-32 build: INFO  Total: 618,463,449 tuples : 638.92 seconds : 967,976.50 tuples/sec [2012/03/07 01:57:42 UTC]
Mar  7 01:57:47 ip-10-53-130-32 build: INFO  Index: sorting data for OSP index...
Mar  7 01:57:47 ip-10-53-130-32 build: INFO  Add: 50,000 records to OSP (Batch: 373,134 / Avg: 373,134)
Mar  7 01:57:47 ip-10-53-130-32 build: INFO  Add: 100,000 records to OSP (Batch: 549,450 / Avg: 444,444)
[...]
Mar  7 02:26:23 ip-10-53-130-32 build: INFO  Add: 618,450,000 records to OSP (Batch: 4,166,666 / Avg: 360,257)
Mar  7 02:26:23 ip-10-53-130-32 build: INFO  Total: 618,463,449 tuples : 1,716.69 seconds : 360,264.44 tuples/sec [2012/03/07 02:26:23 UTC]
Mar  7 02:26:23 ip-10-53-130-32 build: INFO  Index: creating OSP index...
Mar  7 02:27:02 ip-10-53-130-32 build: INFO  Add: 50,000 records to OSP (Batch: 1,284 / Avg: 1,284)
Mar  7 02:27:03 ip-10-53-130-32 build: INFO  Add: 100,000 records to OSP (Batch: 364,963 / Avg: 2,560)
[...]
Mar  7 02:37:18 ip-10-53-130-32 build: INFO  Add: 618,450,000 records to OSP (Batch: 1,020,408 / Avg: 944,877)
Mar  7 02:37:22 ip-10-53-130-32 build: INFO  Total: 618,463,449 tuples : 658.94 seconds : 938,578.94 tuples/sec [2012/03/07 02:37:22 UTC]
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Index: sorting data for GPOS index...
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Total: 0 tuples : 0.03 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Index: creating GPOS index...
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Index: sorting data for GOSP index...
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Index: creating GOSP index...
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Index: sorting data for POSG index...
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Index: creating POSG index...
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Index: sorting data for OSPG index...
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Index: creating OSPG index...
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Index: sorting data for SPOG index...
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Index: creating SPOG index...
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Total: 618,465,279 tuples : 53,608.12 seconds : 11,536.78 tuples/sec [2012/03/07 02:37:27 UTC]
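
To compare phases, the "Total:" lines of a log like the one above can be pulled out mechanically. The sketch below is a hypothetical helper, not part of tdbloader3; it collapses whitespace first so that it also copes with log lines that a mail client has wrapped.

```python
import re

# Matches e.g. "Total: 618,465,279 tuples : 53,608.12 seconds : 11,536.78 tuples/sec"
TOTAL_RE = re.compile(
    r"Total: ([\d,]+) tuples : ([\d,.]+) seconds : ([\d,.]+) tuples/sec"
)


def parse_totals(log_text):
    """Return a list of (tuples, seconds, tuples_per_sec) for each Total line."""
    flat = " ".join(log_text.split())  # undo any line wrapping
    totals = []
    for match in TOTAL_RE.finditer(flat):
        tuples = int(match.group(1).replace(",", ""))
        secs = float(match.group(2).replace(",", ""))
        rate = float(match.group(3).replace(",", ""))
        totals.append((tuples, secs, rate))
    return totals


# Two lines copied from the log above, one of them wrapped.
SAMPLE = """\
Mar  7 01:18:50 ip-10-53-130-32 build: INFO  Total: 618,463,449 tuples
: 682.99 seconds : 905,528.69 tuples/sec [2012/03/07 01:18:50 UTC]
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Total: 618,465,279 tuples : 53,608.12 seconds : 11,536.78 tuples/sec [2012/03/07 02:37:27 UTC]
"""

for tuples, secs, rate in parse_totals(SAMPLE):
    print(f"{tuples:>13,} tuples  {secs:>10,.2f} s  {rate:>12,.2f} tuples/s")
```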

Paolo

Re: Strategies for loading large (>500m triples) datasets

Posted by Paolo Castagna <ca...@googlemail.com>.
Hi

Paolo Castagna wrote:
> Paolo Castagna wrote:
>> I have some code to convert Freebase dumps into RDF; it's ~600 million
>> triples, and I'll use that to gather some numbers. Ideally, comparing
>> tdbloader, tdbloader2, tdbloader3 and tdbloader4 (in terms of both
>> time and cost).
> 
> FYI
> 
> Code to convert Freebase dumps into RDF is here:
> https://github.com/castagna/freebase2rdf
> 
> Over the last couple of days I have been running a few experiments on
> Amazon EC2 m1.xlarge instances (i.e. 15 GB memory).
> 
> tdbloader didn't complete; it just kept getting slower and slower...
> 
> 
> With tdbloader2 I had a java.lang.OutOfMemoryError:
> 
> Mar  5 05:22:30 ip-10-53-58-155 build: Add: 618,450,000 Data (Batch: 6,547 / Avg: 21,206)
> Mar  5 05:35:10 ip-10-53-58-155 build: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> [...]
> 
> I'll try giving the JVM more RAM.

I tried with -Xmx2048m, but I had the same problem.
I'll try with -Xmx4096m.

> tdbloader3 ran out of disk space (because it writes temporary files
> to /tmp and the available instance disk space is mounted on /mnt :-()
> I'll see how to change/fix this and re-run.

This ran almost to completion this time, but I was using the --spill-size-auto policy, which clearly needs improvement.

...
Mar  6 04:28:11 ip-10-54-171-206 build: INFO  Add: 77,550,000 records to POS (Batch: 605 / Avg: 144,190)
Mar  6 04:29:15 ip-10-54-171-206 build: INFO  Add: 77,600,000 records to POS (Batch: 777 / Avg: 128,869)
Mar  6 04:30:20 ip-10-54-171-206 build: INFO  Add: 77,650,000 records to POS (Batch: 776 / Avg: 116,492)
Mar  6 04:47:11 ip-10-54-171-206 build: Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
Mar  6 04:47:11 ip-10-54-171-206 build:     at java.lang.Long.valueOf(Long.java:557)
Mar  6 04:47:11 ip-10-54-171-206 build:     at cmd.tdbloader3$2.convert(tdbloader3.java:367)
Mar  6 04:47:11 ip-10-54-171-206 build:     at cmd.tdbloader3$2.convert(tdbloader3.java:363)
Mar  6 04:47:11 ip-10-54-171-206 build:     at org.openjena.atlas.iterator.Iter$4.next(Iter.java:293)
Mar  6 04:47:11 ip-10-54-171-206 build:     at org.openjena.atlas.data.AbstractDataBag.addAll(AbstractDataBag.java:76)
Mar  6 04:47:11 ip-10-54-171-206 build:     at cmd.tdbloader3.createBPlusTreeIndex(tdbloader3.java:378)
Mar  6 04:47:11 ip-10-54-171-206 build:     at cmd.tdbloader3.exec(tdbloader3.java:252)
Mar  6 04:47:11 ip-10-54-171-206 build:     at arq.cmdline.CmdMain.mainMethod(CmdMain.java:97)
Mar  6 04:47:11 ip-10-54-171-206 build:     at arq.cmdline.CmdMain.mainRun(CmdMain.java:59)
Mar  6 04:47:11 ip-10-54-171-206 build:     at arq.cmdline.CmdMain.mainRun(CmdMain.java:46)
Mar  6 04:47:11 ip-10-54-171-206 build:     at cmd.tdbloader3.main(tdbloader3.java:129)

I'll try with a fixed --spill-size 10000000.
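
The stack trace above passes through a data bag's addAll, and the option name suggests a threshold after which buffered records are written to disk. How tdbloader3 actually implements this is not shown in this thread, so the following is only an illustrative sketch of fixed-threshold spilling with a final merge; the class name and its methods are hypothetical.

```python
import heapq
import os
import pickle
import tempfile


class SpillingSortedBag:
    """Buffers items in memory; once `spill_size` items are buffered,
    writes them to a temporary file as one sorted run."""

    def __init__(self, spill_size, tmp_dir=None):
        self.spill_size = spill_size
        self.tmp_dir = tmp_dir  # None: tempfile's default, which honours $TMPDIR
        self.buffer = []
        self.run_files = []

    def add(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.spill_size:
            self._spill()

    def _spill(self):
        # Sort the in-memory buffer and write it out as a single run.
        self.buffer.sort()
        fd, path = tempfile.mkstemp(suffix=".run", dir=self.tmp_dir)
        with os.fdopen(fd, "wb") as out:
            for item in self.buffer:
                pickle.dump(item, out)
        self.run_files.append(path)
        self.buffer = []

    @staticmethod
    def _read_run(path):
        # Stream items back from one run file, one at a time.
        with open(path, "rb") as src:
            while True:
                try:
                    yield pickle.load(src)
                except EOFError:
                    return

    def sorted_items(self):
        """Merge the sorted runs on disk with whatever is still in memory."""
        runs = [self._read_run(path) for path in self.run_files]
        runs.append(iter(sorted(self.buffer)))
        yield from heapq.merge(*runs)

    def close(self):
        """Delete the temporary run files."""
        for path in self.run_files:
            os.remove(path)
        self.run_files = []


bag = SpillingSortedBag(spill_size=3)
for value in [5, 1, 4, 2, 3, 9, 0]:
    bag.add(value)
print(list(bag.sorted_items()))
bag.close()
```

With a fixed threshold the memory held by the buffer is bounded by the number of items rather than by a heuristic, at the cost of more temporary files when the threshold is small.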

Paolo

> 
> Paolo

Re: Strategies for loading large (>500m triples) datasets

Posted by Paolo Castagna <ca...@googlemail.com>.
Paolo Castagna wrote:
> Trying to load a small test.nt file via tdbloader2 script:
> 
> Mar  5 09:44:34 ip-10-54-162-193 build: --2012-03-05 09:44:34--  http://www.w3.org/2000/10/rdf-tests/rdfcore/ntriples/test.nt
> Mar  5 09:44:34 ip-10-54-162-193 build: Resolving www.w3.org... 128.30.52.37
> Mar  5 09:44:34 ip-10-54-162-193 build: Connecting to www.w3.org|128.30.52.37|:80... connected.
> Mar  5 09:44:34 ip-10-54-162-193 build: HTTP request sent, awaiting response... 200 OK
> Mar  5 09:44:34 ip-10-54-162-193 build: Length: 4081 (4.0K) [text/plain]
> Mar  5 09:44:34 ip-10-54-162-193 build: Saving to: `test.nt'
> Mar  5 09:44:34 ip-10-54-162-193 build:
> Mar  5 09:44:34 ip-10-54-162-193 build:      0K ...                                                   100%  122M=0s
> Mar  5 09:44:34 ip-10-54-162-193 build:
> Mar  5 09:44:34 ip-10-54-162-193 build: 2012-03-05 09:44:34 (122 MB/s) - `test.nt' saved [4081/4081]
> Mar  5 09:44:34 ip-10-54-162-193 build:
> Mar  5 09:44:35 ip-10-54-162-193 build:  09:44:35 -- TDB Bulk Loader Start
> Mar  5 09:44:35 ip-10-54-162-193 build:  09:44:35 Data phase
> Mar  5 09:44:35 ip-10-54-162-193 build: Exception in thread "main" java.lang.NoClassDefFoundError: arq/cmdline/CmdGeneral
> Mar  5 09:44:35 ip-10-54-162-193 build:     at java.lang.ClassLoader.defineClass1(Native Method)
> Mar  5 09:44:35 ip-10-54-162-193 build:     at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
> Mar  5 09:44:35 ip-10-54-162-193 build:     at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
> Mar  5 09:44:35 ip-10-54-162-193 build:     at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
> Mar  5 09:44:35 ip-10-54-162-193 build:     at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
> Mar  5 09:44:35 ip-10-54-162-193 build:     at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
> Mar  5 09:44:35 ip-10-54-162-193 build:     at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
> Mar  5 09:44:35 ip-10-54-162-193 build:     at java.security.AccessController.doPrivileged(Native Method)
> Mar  5 09:44:35 ip-10-54-162-193 build:     at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> Mar  5 09:44:35 ip-10-54-162-193 build:     at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> Mar  5 09:44:35 ip-10-54-162-193 build:     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> Mar  5 09:44:35 ip-10-54-162-193 build:     at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> Mar  5 09:44:35 ip-10-54-162-193 build: Caused by: java.lang.ClassNotFoundException: arq.cmdline.CmdGeneral
> Mar  5 09:44:35 ip-10-54-162-193 build:     at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> Mar  5 09:44:35 ip-10-54-162-193 build:     at java.security.AccessController.doPrivileged(Native Method)
> Mar  5 09:44:35 ip-10-54-162-193 build:     at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> Mar  5 09:44:35 ip-10-54-162-193 build:     at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> Mar  5 09:44:35 ip-10-54-162-193 build:     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> Mar  5 09:44:35 ip-10-54-162-193 build:     at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> Mar  5 09:44:35 ip-10-54-162-193 build:     ... 12 more
> Mar  5 09:44:35 ip-10-54-162-193 build: Could not find the main class: com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.  Program will exit.
> 

Another try, this time with a "find" over the target directory before running the tdbloader2 script:

Mar  5 11:07:37 ip-10-250-190-91 build: ./classes/com/hp/hpl/jena/tdb/store/bulkloader2
Mar  5 11:07:37 ip-10-250-190-91 build: ./classes/com/hp/hpl/jena/tdb/store/bulkloader2/RecordLib.class
Mar  5 11:07:37 ip-10-250-190-91 build: ./classes/com/hp/hpl/jena/tdb/store/bulkloader2/CmdNodeTableBuilder.class
Mar  5 11:07:37 ip-10-250-190-91 build: ./classes/com/hp/hpl/jena/tdb/store/bulkloader2/CmdNodeTableBuilder$NodeTableBuilder.class
Mar  5 11:07:37 ip-10-250-190-91 build: ./classes/com/hp/hpl/jena/tdb/store/bulkloader2/WriteRows.class
Mar  5 11:07:37 ip-10-250-190-91 build: ./classes/com/hp/hpl/jena/tdb/store/bulkloader2/CmdIndexBuild.class
Mar  5 11:07:37 ip-10-250-190-91 build: ./classes/com/hp/hpl/jena/tdb/store/bulkloader2/RecordsFromInput.class
Mar  5 11:07:37 ip-10-250-190-91 build: ./classes/com/hp/hpl/jena/tdb/store/bulkloader2/IndexFactory.class
Mar  5 11:07:37 ip-10-250-190-91 build: ./classes/com/hp/hpl/jena/tdb/store/bulkloader2/CmdIndexCopy.class
Mar  5 11:07:37 ip-10-250-190-91 build: ./classes/com/hp/hpl/jena/tdb/store/bulkloader2/ProgressLogger.class

CmdNodeTableBuilder.class is definitely there.

Paolo

Re: Strategies for loading large (>500m triples) datasets

Posted by Paolo Castagna <ca...@googlemail.com>.

Paolo Castagna wrote:
> Paolo Castagna wrote:
>> If I run this locally, everything works. But, I have this error when I run on an EC2 instance:
>>
>>  08:15:34 -- TDB Bulk Loader Start
>>  08:15:34 Data phase
>> Exception in thread "main" java.lang.NoClassDefFoundError: arq/cmdline/CmdGeneral
>>     at java.lang.ClassLoader.defineClass1(Native Method)
>>     at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
>>     at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
>>     at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
>>     at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
>>     at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
>>     at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>> Caused by: java.lang.ClassNotFoundException: arq.cmdline.CmdGeneral
>>     at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
>>     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>>     at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
>>     ... 12 more
>> Could not find the main class: com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.  Program will exit.
>>
>>
>> What I don't understand is why: mvn package succeeds and the TDB classes are on the classpath, but ARQ does not seem to be on the classpath.
> 
> Or, is it com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder which is not on the classpath?

TDB builds correctly:

...
Mar  5 09:44:34 ip-10-54-162-193 build: [INFO] ------------------------------------------------------------------------
Mar  5 09:44:34 ip-10-54-162-193 build: [INFO] BUILD SUCCESS
Mar  5 09:44:34 ip-10-54-162-193 build: [INFO] ------------------------------------------------------------------------
Mar  5 09:44:34 ip-10-54-162-193 build: [INFO] Total time: 2:54.151s
Mar  5 09:44:34 ip-10-54-162-193 build: [INFO] Finished at: Mon Mar 05 09:44:34 UTC 2012
Mar  5 09:44:34 ip-10-54-162-193 build: [INFO] Final Memory: 32M/241M
Mar  5 09:44:34 ip-10-54-162-193 build: [INFO] ------------------------------------------------------------------------
Mar  5 09:44:34 ip-10-54-162-193 build: total 21100

./target/classes is there:

Mar  5 09:44:34 ip-10-54-162-193 build: drwxr-xr-x 14 root root     4096 Mar  5 09:44 .
Mar  5 09:44:34 ip-10-54-162-193 build: drwxr-xr-x 13 root root     4096 Mar  5 09:42 ..
Mar  5 09:44:34 ip-10-54-162-193 build: -rw-r--r--  1 root root       30 Mar  5 09:42 .plxarc
Mar  5 09:44:34 ip-10-54-162-193 build: drwxr-xr-x  2 root root     4096 Mar  5 09:42 antrun
Mar  5 09:44:34 ip-10-54-162-193 build: drwxr-xr-x  5 root root     4096 Mar  5 09:44 apidocs
Mar  5 09:44:34 ip-10-54-162-193 build: drwxr-xr-x  2 root root     4096 Mar  5 09:44 archive-tmp
Mar  5 09:44:34 ip-10-54-162-193 build: drwxr-xr-x  6 root root     4096 Mar  5 09:42 classes
Mar  5 09:44:34 ip-10-54-162-193 build: -rw-r--r--  1 root root       87 Mar  5 09:42 filter.properties
Mar  5 09:44:34 ip-10-54-162-193 build: drwxr-xr-x  4 root root     4096 Mar  5 09:42 generated-sources
Mar  5 09:44:34 ip-10-54-162-193 build: drwxr-xr-x  2 root root     4096 Mar  5 09:44 javadoc-bundle-options
Mar  5 09:44:34 ip-10-54-162-193 build: -rw-r--r--  1 root root 10127246 Mar  5 09:44 jena-tdb-0.9.1-incubating-SNAPSHOT-distribution.tar.gz
Mar  5 09:44:34 ip-10-54-162-193 build: -rw-r--r--  1 root root 10144969 Mar  5 09:44 jena-tdb-0.9.1-incubating-SNAPSHOT-distribution.zip
Mar  5 09:44:34 ip-10-54-162-193 build: -rw-r--r--  1 root root   168108 Mar  5 09:44 jena-tdb-0.9.1-incubating-SNAPSHOT-javadoc.jar
Mar  5 09:44:34 ip-10-54-162-193 build: -rw-r--r--  1 root root   461072 Mar  5 09:44 jena-tdb-0.9.1-incubating-SNAPSHOT-sources.jar
Mar  5 09:44:34 ip-10-54-162-193 build: -rw-r--r--  1 root root   581788 Mar  5 09:42 jena-tdb-0.9.1-incubating-SNAPSHOT.jar
Mar  5 09:44:34 ip-10-54-162-193 build: drwxr-xr-x  2 root root     4096 Mar  5 09:42 maven-archiver
Mar  5 09:44:34 ip-10-54-162-193 build: drwxr-xr-x  3 root root     4096 Mar  5 09:42 maven-shared-archive-resources
Mar  5 09:44:34 ip-10-54-162-193 build: drwxr-xr-x  2 root root     4096 Mar  5 09:44 surefire
Mar  5 09:44:34 ip-10-54-162-193 build: drwxr-xr-x  2 root root     4096 Mar  5 09:42 surefire-reports
Mar  5 09:44:34 ip-10-54-162-193 build: drwxr-xr-x  3 root root     4096 Mar  5 09:42 tdb-testing
Mar  5 09:44:34 ip-10-54-162-193 build: drwxr-xr-x  4 root root     4096 Mar  5 09:42 test-classes

The output of tdb_path is as expected:

Mar  5 09:44:34 ip-10-54-162-193 build:
/mnt/data/tdb/target/classes:/mnt/.m2/repository/org/apache/jena/jena-arq/2.9.0-incubating/jena-arq-2.9.0-incubating.jar:/mnt/.m2/repository/org/apache/jena/jena-arq/2.9.0-incubating/jena-arq-2.9.0-incubating-tests.jar:/mnt/.m2/repository/org/apache/jena/jena-iri/0.9.0-incubating/jena-iri-0.9.0-incubating.jar:/mnt/.m2/repository/org/apache/jena/jena-core/2.7.0-incubating/jena-core-2.7.0-incubating-tests.jar:/mnt/.m2/repository/org/apache/jena/jena-core/2.7.0-incubating/jena-core-2.7.0-incubating.jar:/mnt/.m2/repository/com/ibm/icu/icu4j/3.4.4/icu4j-3.4.4.jar:/mnt/.m2/repository/junit/junit/4.9/junit-4.9.jar:/mnt/.m2/repository/log4j/log4j/1.2.16/log4j-1.2.16.jar:/mnt/.m2/repository/org/slf4j/slf4j-api/1.6.4/slf4j-api-1.6.4.jar:/mnt/.m2/repository/org/slf4j/slf4j-log4j12/1.6.4/slf4j-log4j12-1.6.4.jar:/mnt/.m2/repository/xerces/xercesImpl/2.10.0/xercesImpl-2.10.0.jar:/mnt/.m2/repository/xml-apis/xml-apis/1.4.01/xml-apis-1.4.01.jar
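
One way to test the question raised above (which classpath entry, if any, provides a given class) is to scan each entry of a colon-separated classpath such as the tdb_path output. This is a diagnostic sketch with a hypothetical function name; it does not say what the cause of the error was, only where a class file can or cannot be found.

```python
import os
import zipfile


def entries_containing(classpath, class_name, sep=":"):
    """Return the classpath entries (directories or jars) that contain
    the .class file for the fully qualified `class_name`."""
    rel = class_name.replace(".", "/") + ".class"
    hits = []
    for entry in classpath.split(sep):
        if not entry:
            continue
        if os.path.isdir(entry):
            # Directory entry: look for the class file under it.
            if os.path.isfile(os.path.join(entry, *rel.split("/"))):
                hits.append(entry)
        elif zipfile.is_zipfile(entry):
            # Jar entry: jars are zip archives, so list their members.
            with zipfile.ZipFile(entry) as jar:
                if rel in jar.namelist():
                    hits.append(entry)
        # Entries that do not exist are skipped silently, as the JVM does.
    return hits


if __name__ == "__main__":
    import sys

    # Usage (hypothetical script name): python find_class.py "$CLASSPATH" arq.cmdline.CmdGeneral
    if len(sys.argv) == 3:
        for hit in entries_containing(sys.argv[1], sys.argv[2]):
            print(hit)
```

An empty result for arq.cmdline.CmdGeneral would point at a missing or unreadable jar; a non-empty one would suggest that the classpath actually passed to the JVM differs from the one printed.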

Trying to load a small test.nt file via tdbloader2 script:

Mar  5 09:44:34 ip-10-54-162-193 build: --2012-03-05 09:44:34--  http://www.w3.org/2000/10/rdf-tests/rdfcore/ntriples/test.nt
Mar  5 09:44:34 ip-10-54-162-193 build: Resolving www.w3.org... 128.30.52.37
Mar  5 09:44:34 ip-10-54-162-193 build: Connecting to www.w3.org|128.30.52.37|:80... connected.
Mar  5 09:44:34 ip-10-54-162-193 build: HTTP request sent, awaiting response... 200 OK
Mar  5 09:44:34 ip-10-54-162-193 build: Length: 4081 (4.0K) [text/plain]
Mar  5 09:44:34 ip-10-54-162-193 build: Saving to: `test.nt'
Mar  5 09:44:34 ip-10-54-162-193 build:
Mar  5 09:44:34 ip-10-54-162-193 build:      0K ...                                                   100%  122M=0s
Mar  5 09:44:34 ip-10-54-162-193 build:
Mar  5 09:44:34 ip-10-54-162-193 build: 2012-03-05 09:44:34 (122 MB/s) - `test.nt' saved [4081/4081]
Mar  5 09:44:34 ip-10-54-162-193 build:
Mar  5 09:44:35 ip-10-54-162-193 build:  09:44:35 -- TDB Bulk Loader Start
Mar  5 09:44:35 ip-10-54-162-193 build:  09:44:35 Data phase
Mar  5 09:44:35 ip-10-54-162-193 build: Exception in thread "main" java.lang.NoClassDefFoundError: arq/cmdline/CmdGeneral
Mar  5 09:44:35 ip-10-54-162-193 build:     at java.lang.ClassLoader.defineClass1(Native Method)
Mar  5 09:44:35 ip-10-54-162-193 build:     at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
Mar  5 09:44:35 ip-10-54-162-193 build:     at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
Mar  5 09:44:35 ip-10-54-162-193 build:     at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
Mar  5 09:44:35 ip-10-54-162-193 build:     at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
Mar  5 09:44:35 ip-10-54-162-193 build:     at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
Mar  5 09:44:35 ip-10-54-162-193 build:     at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
Mar  5 09:44:35 ip-10-54-162-193 build:     at java.security.AccessController.doPrivileged(Native Method)
Mar  5 09:44:35 ip-10-54-162-193 build:     at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
Mar  5 09:44:35 ip-10-54-162-193 build:     at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
Mar  5 09:44:35 ip-10-54-162-193 build:     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
Mar  5 09:44:35 ip-10-54-162-193 build:     at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Mar  5 09:44:35 ip-10-54-162-193 build: Caused by: java.lang.ClassNotFoundException: arq.cmdline.CmdGeneral
Mar  5 09:44:35 ip-10-54-162-193 build:     at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
Mar  5 09:44:35 ip-10-54-162-193 build:     at java.security.AccessController.doPrivileged(Native Method)
Mar  5 09:44:35 ip-10-54-162-193 build:     at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
Mar  5 09:44:35 ip-10-54-162-193 build:     at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
Mar  5 09:44:35 ip-10-54-162-193 build:     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
Mar  5 09:44:35 ip-10-54-162-193 build:     at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Mar  5 09:44:35 ip-10-54-162-193 build:     ... 12 more
Mar  5 09:44:35 ip-10-54-162-193 build: Could not find the main class: com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.  Program will exit.


Re: Strategies for loading large (>500m triples) datasets

Posted by Paolo Castagna <ca...@googlemail.com>.
Paolo Castagna wrote:
> If I run this locally, everything works. But I get this error when I run it on an EC2 instance:
> 
>  08:15:34 -- TDB Bulk Loader Start
>  08:15:34 Data phase
> Exception in thread "main" java.lang.NoClassDefFoundError: arq/cmdline/CmdGeneral
> #011at java.lang.ClassLoader.defineClass1(Native Method)
> #011at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
> #011at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
> #011at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
> #011at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
> #011at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
> #011at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
> #011at java.security.AccessController.doPrivileged(Native Method)
> #011at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> #011at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> #011at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> #011at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> Caused by: java.lang.ClassNotFoundException: arq.cmdline.CmdGeneral
> #011at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> #011at java.security.AccessController.doPrivileged(Native Method)
> #011at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> #011at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> #011at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> #011at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> #011... 12 more
> Could not find the main class: com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.  Program will exit.
> 
> 
> What I don't understand is this: mvn package succeeds and the TDB classes are on the classpath, but ARQ does not seem to be.

Or, is it com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder which is not on the classpath?
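One way to narrow this down is to look inside the jars. The frames right under the NoClassDefFoundError are in ClassLoader.defineClass, which suggests the JVM had already found the bytes of some class and failed while resolving arq.cmdline.CmdGeneral (presumably a superclass); if so, the "Could not find the main class" line would be a consequence rather than the cause. The sketch below is only an illustration: the function names class_to_entry and jars_containing are made up for this example, and it assumes unzip is installed. It only inspects jar files; for a directory entry such as target/classes, check for the .class file directly.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Jena or TDB.

# Convert a dotted class name to the entry name used inside a jar,
# e.g. arq.cmdline.CmdGeneral -> arq/cmdline/CmdGeneral.class
class_to_entry() {
    printf '%s.class\n' "$(printf '%s' "$1" | tr . /)"
}

# Print each of the given jars whose listing contains the class.
# Usage: jars_containing <dotted.class.Name> <jar>...
# Assumes unzip is available; unreadable jars are silently skipped.
jars_containing() {
    entry=$(class_to_entry "$1")
    shift
    for jar in "$@"; do
        if unzip -l "$jar" 2>/dev/null | grep -qF "$entry"; then
            printf '%s\n' "$jar"
        fi
    done
}
```

For example, jars_containing arq.cmdline.CmdGeneral followed by the jar paths from the classpath would list the jars that ship that class. No output means none of them does.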

Paolo

Re: Strategies for loading large (>500m triples) datasets

Posted by Paolo Castagna <ca...@googlemail.com>.
Paolo Castagna wrote:
> With tdbloader2 I had a java.lang.OutOfMemoryError:
> 
> Mar  5 05:22:30 ip-10-53-58-155 build: Add: 618,450,000 Data (Batch: 6,547 / Avg: 21,206)
> Mar  5 05:35:10 ip-10-53-58-155 build: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at java.util.HashMap.<init>(HashMap.java:209)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at org.apache.xerces.impl.validation.ValidationState.<init>(Unknown Source)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.datatypes.xsd.XSDDatatype.parse(XSDDatatype.java:270)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.datatypes.xsd.impl.XSDBaseNumericType.parse(XSDBaseNumericType.java:165)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.graph.impl.LiteralLabelImpl.setValue(LiteralLabelImpl.java:213)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.graph.impl.LiteralLabelImpl.setLiteralLabel_1(LiteralLabelImpl.java:107)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.graph.impl.LiteralLabelImpl.<init>(LiteralLabelImpl.java:96)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.graph.impl.LiteralLabelFactory.createLiteralLabel(LiteralLabelFactory.java:28)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.graph.Node.createLiteral(Node.java:103)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.sparql.util.NodeFactory.intToNode(NodeFactory.java:79)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.tdb.solver.stats.Stats.format(Stats.java:195)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.tdb.solver.stats.Stats.write(Stats.java:72)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.exec(CmdNodeTableBuilder.java:178)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at arq.cmdline.CmdMain.mainMethod(CmdMain.java:97)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at arq.cmdline.CmdMain.mainRun(CmdMain.java:59)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at arq.cmdline.CmdMain.mainRun(CmdMain.java:46)
> Mar  5 05:35:10 ip-10-53-58-155 build: #011at com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.main(CmdNodeTableBuilder.java:79)
> 
> I'll try giving the JVM more RAM.

The tdbloader2 script now allows users to specify JVM arguments via the JVM_ARGS environment variable (as the other TDB and ARQ scripts do).
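For readers unfamiliar with the idiom, here is a minimal sketch of how a wrapper script can honour JVM_ARGS. It shows the general pattern only and is not copied from the tdbloader2 script; the function name resolve_jvm_args and the -Xmx1200M fallback are assumptions made up for this example.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the real tdbloader2 script):
# print the JVM options a wrapper would pass to java.
# If JVM_ARGS is unset or empty, fall back to an assumed default.
resolve_jvm_args() {
    printf '%s\n' "${JVM_ARGS:--Xmx1200M}"
}

# A wrapper would then run something along the lines of:
#   java $(resolve_jvm_args) -cp "$CLASSPATH" <main class> "$@"
# leaving the expansion unquoted so that several options are split
# into separate arguments.
```

With JVM_ARGS exported as "-Xmx2048m -server", the helper prints exactly that string; with JVM_ARGS unset it prints the fallback.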

However, I am having a strange problem. This is what I do:

cd /mnt/data
svn co https://svn.apache.org/repos/asf/incubator/jena/Jena2/TDB/trunk/ tdb
cd tdb
mvn package
export TDBROOT=/mnt/data/tdb
export JVM_ARGS="-Xmx2048m -server"
wget http://www.w3.org/2000/10/rdf-tests/rdfcore/ntriples/test.nt
./bin/tdbloader2 --loc /tmp/test test.nt

If I run this locally, everything works. But I get this error when I run it on an EC2 instance:

 08:15:34 -- TDB Bulk Loader Start
 08:15:34 Data phase
Exception in thread "main" java.lang.NoClassDefFoundError: arq/cmdline/CmdGeneral
#011at java.lang.ClassLoader.defineClass1(Native Method)
#011at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
#011at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
#011at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
#011at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
#011at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
#011at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
#011at java.security.AccessController.doPrivileged(Native Method)
#011at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
#011at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
#011at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
#011at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Caused by: java.lang.ClassNotFoundException: arq.cmdline.CmdGeneral
#011at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
#011at java.security.AccessController.doPrivileged(Native Method)
#011at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
#011at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
#011at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
#011at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
#011... 12 more
Could not find the main class: com.hp.hpl.jena.tdb.store.bulkloader2.CmdNodeTableBuilder.  Program will exit.


What I don't understand is this: mvn package succeeds and the TDB classes are on the classpath, but ARQ does not seem to be.
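A quick check worth making here: the java launcher silently ignores classpath entries that do not exist, so a jar can be named on the classpath and still be absent at run time (for example if the local Maven repository lives somewhere else on that machine). The sketch below lists such entries. The function name missing_classpath_entries is made up for this example, and it assumes a colon-separated classpath string like the one the script passes to java.

```shell
#!/bin/sh
# Hypothetical helper, not part of Jena or TDB.
# Print every entry of a colon-separated classpath that does not exist
# on disk. The body runs in a subshell so that the IFS and globbing
# changes do not leak into the caller.
missing_classpath_entries() (
    IFS=:
    set -f    # do not expand wildcards such as lib/*
    for entry in $1; do
        if [ -n "$entry" ] && [ ! -e "$entry" ]; then
            printf '%s\n' "$entry"
        fi
    done
)
```

Calling missing_classpath_entries with the classpath string as its single argument prints one missing path per line, and prints nothing if every entry exists.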

Paolo