Posted to users@jena.apache.org by Paolo Castagna <ca...@googlemail.com> on 2012/03/08 14:37:19 UTC

Loading Freebase data (previously converted to RDF) into TDB using tdbloader2 and tdbloader3...

Hi,
I want to share a couple of experiments in which I loaded the Freebase RDF data
dump into TDB using tdbloader2 and tdbloader3.


tdbloader2
==========

This is how I ran tdbloader2 on an EC2 m1.xlarge instance (i.e. 15 GB memory):

export JVM_ARGS="-Xmx4096m -server"
export TMPDIR=/mnt/data/tmp
tdbloader2 --loc /mnt/data/freebase \
    /mnt/data/freebase2rdf/freebase-datadump-rdf.nt.gz

Total elapsed time to load 618,465,279 triples:
~12 hours (i.e. ~14,400 triples/s overall speed)

This is the log:
Mar  7 13:11:37 ip-10-54-167-166 build:  13:11:37 -- TDB Bulk Loader Start
Mar  7 13:11:37 ip-10-54-167-166 build:  13:11:37 Data phase
Mar  7 13:11:39 ip-10-54-167-166 build: Load:
/mnt/data/freebase2rdf/freebase-datadump-rdf.nt.gz -- 2012/03/07 13:11:38 UTC
Mar  7 13:11:42 ip-10-54-167-166 build: Add: 50,000 Data (Batch: 16,550 / Avg:
16,550)
Mar  7 13:11:43 ip-10-54-167-166 build: Add: 100,000 Data (Batch: 39,184 / Avg:
23,272)
[...]
Mar  7 19:13:51 ip-10-54-167-166 build: Add: 618,450,000 Data (Batch: 53,078 /
Avg: 28,457)
Mar  7 19:17:01 ip-10-54-167-166 CRON[7725]: pam_unix(cron:session): session
opened for user root by (uid=0)
Mar  7 19:17:01 ip-10-54-167-166 CRON[7727]: (root) CMD (   cd / && run-parts
--report /etc/cron.hourly)
Mar  7 19:17:01 ip-10-54-167-166 CRON[7725]: pam_unix(cron:session): session
closed for user root
Mar  7 19:24:44 ip-10-54-167-166 build: Total: 618,465,279 tuples : 22,385.15
seconds : 27,628.37 tuples/sec [2012/03/07 19:24:44 UTC]
Mar  7 19:24:45 ip-10-54-167-166 build:  19:24:44 Index phase
Mar  7 19:24:45 ip-10-54-167-166 build:  19:24:45 Index SPO
Mar  7 21:03:18 ip-10-54-167-166 build:  21:03:18 Build SPO
Mar  7 21:14:24 ip-10-54-167-166 build:  21:14:24 Index POS
Mar  7 23:38:28 ip-10-54-167-166 build:  23:38:28 Build POS
Mar  7 23:49:03 ip-10-54-167-166 build:  23:49:03 Index OSP
Mar  8 00:56:13 ip-10-54-167-166 build:  00:56:13 Build OSP
Mar  8 01:08:17 ip-10-54-167-166 build:  01:08:17 Index phase end
Mar  8 01:08:59 ip-10-54-167-166 build:  01:08:59 -- TDB Bulk Loader Finish
Mar  8 01:08:59 ip-10-54-167-166 build:  01:08:59 -- 43000 seconds
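
As a quick sanity check, the overall tdbloader2 rate can be recomputed from the
first and last timestamps in the log above. This is only a small Python sketch;
the variable names are mine, and the year comes from the UTC stamps inside the
log lines.

```python
from datetime import datetime

# Timestamps copied from the "TDB Bulk Loader Start" and "Finish" log lines.
start = datetime(2012, 3, 7, 13, 11, 37)
finish = datetime(2012, 3, 8, 1, 8, 59)

# Triple count and data-phase duration copied from the log's "Total:" line.
triples = 618_465_279
data_phase = 22_385.15

elapsed = (finish - start).total_seconds()
rate = triples / elapsed
data_share = data_phase / elapsed

print(f"elapsed: {elapsed:,.0f} s ({elapsed / 3600:.1f} h)")
print(f"overall: {rate:,.0f} triples/s")
print(f"data phase: {100 * data_share:.0f}% of the run, index phase: the rest")
```

The data phase and the index phase each take roughly half of the run.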


tdbloader3
==========

This is how I ran tdbloader3 on an EC2 m1.xlarge instance (i.e. 15 GB memory):

java -Djava.io.tmpdir=/mnt/data/tmp \
    -cp target/jena-tdbloader3-0.1-incubating-SNAPSHOT-jar-with-dependencies.jar \
    -server -d64 -Xmx12288M \
    cmd.tdbloader3 --no-stats --compression --spill-size 10000000 \
    --loc /mnt/data/freebase /mnt/data/freebase2rdf/freebase-datadump-rdf.nt.gz

Total elapsed time to load 618,465,279 triples:
~14.9 hours (53,608.12 seconds, i.e. ~11,500 triples/s overall speed)

This is the log:
Mar  6 11:43:59 ip-10-53-130-32 build: INFO  Load:
/mnt/data/freebase2rdf/freebase-datadump-rdf.nt.gz -- 2012/03/06 11:43:59 UTC
Mar  6 11:44:00 ip-10-53-130-32 build: INFO  Add: 50,000 tuples (Batch: 35,335 /
Avg: 35,335)
Mar  6 11:44:01 ip-10-53-130-32 build: INFO  Add: 100,000 tuples (Batch: 68,212
/ Avg: 46,554)
[...]
Mar  6 15:32:38 ip-10-53-130-32 build: INFO  Add: 618,450,000 tuples (Batch:
89,766 / Avg: 45,079)
Mar  6 15:32:38 ip-10-53-130-32 build: INFO  Node Table (1/3): building
nodes.dat and sorting hash|id ...
Mar  6 17:24:46 ip-10-53-130-32 build: INFO  Add: 50,000 records for node table
(1/3) phase (Batch: 7 / Avg: 7)
Mar  6 17:24:47 ip-10-53-130-32 build: INFO  Add: 100,000 records for node table
(1/3) phase (Batch: 82,236 / Avg: 14)
[...]
Mar  6 21:23:09 ip-10-53-130-32 build: INFO  Add: 1,855,350,000 records for node
table (1/3) phase (Batch: 216,450 / Avg: 88,220)
Mar  6 21:23:09 ip-10-53-130-32 build: INFO  Total: 1,855,395,837 tuples :
21,031.01 seconds : 88,221.91 tuples/sec [2012/03/06 21:23:09
UTC]
Mar  6 21:23:40 ip-10-53-130-32 build: INFO  Node Table (2/3): generating input
data using node ids...
Mar  6 23:00:17 ip-10-53-130-32 build: INFO  Add: 50,000 records for node table
(2/3) phase (Batch: 8 / Avg: 8)
Mar  6 23:00:17 ip-10-53-130-32 build: INFO  Add: 100,000 records for node table
(2/3) phase (Batch: 96,899 / Avg: 17)
[...]
Mar  7 01:04:18 ip-10-53-130-32 build: INFO  Add: 618,450,000 records for node
table (2/3) phase (Batch: 95,969 / Avg: 46,718)
Mar  7 01:04:18 ip-10-53-130-32 build: INFO  Total: 618,463,448 tuples :
13,237.97 seconds : 46,718.90 tuples/sec [2012/03/07 01:04:18 UTC]
Mar  7 01:04:23 ip-10-53-130-32 build: INFO  Node Table (3/3):
building node table B+Tree index (i.e. node2id.dat and node2id.idn
files)...
Mar  7 01:04:38 ip-10-53-130-32 build: INFO  Add: 50,000 records for node table
(3/3) phase (Batch: 3,511 / Avg: 3,511)
Mar  7 01:04:38 ip-10-53-130-32 build: INFO  Add: 100,000 records for node table
(3/3) phase (Batch: 375,939 / Avg: 6,958)
[...]
Mar  7 01:07:21 ip-10-53-130-32 build: INFO  Add: 149,050,000 records for node
table (3/3) phase (Batch: 980,392 / Avg: 838,537)
Mar  7 01:07:24 ip-10-53-130-32 build: INFO  Total: 149,066,002 tuples : 180.42
seconds : 826,225.75 tuples/sec [2012/03/07 01:07:24 UTC]
Mar  7 01:07:27 ip-10-53-130-32 build: INFO  Index: creating SPO index...
Mar  7 01:08:14 ip-10-53-130-32 build: INFO  Add: 50,000 records to SPO (Batch:
1,065 / Avg: 1,065)
Mar  7 01:08:15 ip-10-53-130-32 build: INFO  Add: 100,000 records to SPO (Batch:
54,764 / Avg: 2,090)
[...]
Mar  7 01:18:47 ip-10-53-130-32 build: INFO  Add: 618,450,000 records to SPO
(Batch: 1,020,408 / Avg: 908,977)
Mar  7 01:18:50 ip-10-53-130-32 build: INFO  Total: 618,463,449 tuples: 682.99
seconds : 905,528.69 tuples/sec [2012/03/07 01:18:50 UTC]
Mar  7 01:18:50 ip-10-53-130-32 build: INFO  Index: creating GSPO index...
Mar  7 01:18:50 ip-10-53-130-32 build: INFO  Total: 0 tuples : 0.12 seconds :
0.00 tuples/sec [2012/03/07 01:18:50 UTC]
Mar  7 01:18:56 ip-10-53-130-32 build: INFO  Index: sorting data for POS index...
Mar  7 01:18:57 ip-10-53-130-32 build: INFO  Add: 50,000 records to POS (Batch:
210,084 / Avg: 210,084)
Mar  7 01:18:57 ip-10-53-130-32 build: INFO  Add: 100,000 records to POS (Batch:
1,724,137 / Avg: 374,531)
[...]
Mar  7 01:47:03 ip-10-53-130-32 build: INFO  Add: 618,450,000 records to POS
(Batch: 4,545,454 / Avg: 366,790)
Mar  7 01:47:03 ip-10-53-130-32 build: INFO  Total: 618,463,449 tuples :
1,686.18 seconds : 366,783.97 tuples/sec [2012/03/07 01:47:03 UTC]
Mar  7 01:47:03 ip-10-53-130-32 build: INFO  Index: creating POS index...
Mar  7 01:47:41 ip-10-53-130-32 build: INFO  Add: 50,000 records to POS (Batch:
1,321 / Avg: 1,321)
Mar  7 01:47:41 ip-10-53-130-32 build: INFO  Add: 100,000 records to POS (Batch:
1,086,956 / Avg: 2,639)
[...]
Mar  7 01:57:37 ip-10-53-130-32 build: INFO  Add: 618,450,000 records to POS
(Batch: 1,162,790 / Avg: 974,417)
Mar  7 01:57:42 ip-10-53-130-32 build: INFO  Total: 618,463,449 tuples : 638.92
seconds : 967,976.50 tuples/sec [2012/03/07 01:57:42 UTC]
Mar  7 01:57:47 ip-10-53-130-32 build: INFO  Index: sorting data for OSP index...
Mar  7 01:57:47 ip-10-53-130-32 build: INFO  Add: 50,000 records to OSP (Batch:
373,134 / Avg: 373,134)
Mar  7 01:57:47 ip-10-53-130-32 build: INFO  Add: 100,000 records to OSP (Batch:
549,450 / Avg: 444,444)
[...]
Mar  7 02:26:23 ip-10-53-130-32 build: INFO  Add: 618,450,000 records to OSP
(Batch: 4,166,666 / Avg: 360,257)
Mar  7 02:26:23 ip-10-53-130-32 build: INFO  Total: 618,463,449 tuples :
1,716.69 seconds : 360,264.44 tuples/sec [2012/03/07 02:26:23 UTC]
Mar  7 02:26:23 ip-10-53-130-32 build: INFO  Index: creating OSP index...
Mar  7 02:27:02 ip-10-53-130-32 build: INFO  Add: 50,000 records to OSP (Batch:
1,284 / Avg: 1,284)
Mar  7 02:27:03 ip-10-53-130-32 build: INFO  Add: 100,000 records to OSP (Batch:
364,963 / Avg: 2,560)
[...]
Mar  7 02:37:18 ip-10-53-130-32 build: INFO  Add: 618,450,000 records to OSP
(Batch: 1,020,408 / Avg: 944,877)
Mar  7 02:37:22 ip-10-53-130-32 build: INFO  Total: 618,463,449 tuples : 658.94
seconds : 938,578.94 tuples/sec [2012/03/07 02:37:22 UTC]
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Index: sorting data for GPOS index...
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Total: 0 tuples : 0.03 seconds :
0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Index: creating GPOS index...
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Total: 0 tuples : 0.00 seconds :
0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Index: sorting data for GOSP index...
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Total: 0 tuples : 0.00 seconds :
0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Index: creating GOSP index...
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Total: 0 tuples : 0.00 seconds :
0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Index: sorting data for POSG index...
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Total: 0 tuples : 0.00 seconds :
0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Index: creating POSG index...
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Total: 0 tuples : 0.00 seconds :
0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Index: sorting data for OSPG index...
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Total: 0 tuples : 0.00 seconds :
0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Index: creating OSPG index...
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Total: 0 tuples : 0.00 seconds :
0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Index: sorting data for SPOG index...
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Total: 0 tuples : 0.00 seconds :
0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Index: creating SPOG index...
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Total: 0 tuples : 0.00 seconds :
0.00 tuples/sec [2012/03/07 02:37:27 UTC]
Mar  7 02:37:27 ip-10-53-130-32 build: INFO  Total: 618,465,279 tuples :
53,608.12 seconds : 11,536.78 tuples/sec [2012/03/07 02:37:27 UTC]
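
To see where the tdbloader3 time goes, the per-phase durations in the log above
can be tabulated. The sketch below is just arithmetic over numbers copied from
the "Total:" lines; the phase labels are my own shorthand, and the first
duration is derived from the timestamps (11:43:59 to 15:32:38) because that
phase has no "Total:" line of its own.

```python
# (label, seconds) pairs; labels are informal, seconds come from the log.
phases = [
    ("parse input (from timestamps)", 13_719.00),
    ("node table 1/3", 21_031.01),
    ("node table 2/3", 13_237.97),
    ("node table 3/3", 180.42),
    ("SPO create", 682.99),
    ("POS sort", 1_686.18),
    ("POS create", 638.92),
    ("OSP sort", 1_716.69),
    ("OSP create", 658.94),
]
total = 53_608.12  # final "Total:" line of the log

for name, secs in phases:
    print(f"{name:32s} {secs:12,.2f} s {100 * secs / total:5.1f}%")

accounted = sum(secs for _, secs in phases)
node_share = sum(secs for name, secs in phases if name.startswith("node table")) / total

print(f"accounted for: {accounted:,.2f} s of {total:,.2f} s")
print(f"node table phases: {100 * node_share:.0f}% of the run")
```

The three node table phases together account for roughly two thirds of the
run, while building the three triple indexes takes well under two hours.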

tdbloader3 reuses most of the code and tricks of tdbloader2 (it would not have
been possible without them, kudos to Andy), but it replaces the UNIX sort with
a pure Java implementation of an external sort (which can work with binary
files). This was contributed by Stephen, so kudos to Stephen. The algorithm is
the same as in the MapReduce implementation (a.k.a. tdbloader4), and the aim is
to have only good I/O going on (i.e. no disk seeks) while building the
indexes.
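
For readers unfamiliar with external sorting, here is a minimal Python sketch
of the general technique. It is not the tdbloader3 code (which is Java and
works on binary files): the function name, the text-based run files and the
integer records are all assumptions made for illustration. The spill_size
parameter plays a role that I assume is loosely similar to --spill-size above,
i.e. the number of records held in memory before a sorted run is written out.

```python
import heapq
import os
import tempfile
from itertools import islice


def external_sort(records, spill_size, tmpdir=None):
    """Yield the integers in `records` in ascending order.

    At most `spill_size` records are sorted in memory at a time; each sorted
    batch is written sequentially to a temporary run file, and the runs are
    then merged by reading every file front to back.
    """
    run_paths = []
    it = iter(records)
    try:
        # Phase 1: cut the input into sorted runs and spill them to disk.
        while True:
            batch = sorted(islice(it, spill_size))
            if not batch:
                break
            fd, path = tempfile.mkstemp(suffix=".run", dir=tmpdir)
            with os.fdopen(fd, "w") as f:
                f.writelines(f"{value}\n" for value in batch)
            run_paths.append(path)

        # Phase 2: k-way merge of the runs; each file is only read forwards.
        files = [open(p) for p in run_paths]
        try:
            runs = [(int(line) for line in f) for f in files]
            yield from heapq.merge(*runs)
        finally:
            for f in files:
                f.close()
    finally:
        # Remove the temporary run files once the merge is done.
        for p in run_paths:
            os.remove(p)


if __name__ == "__main__":
    print(list(external_sort([5, 3, 9, 1, 7, 2, 8], spill_size=3)))
```

Both phases only write or read files sequentially, which is the property that
matters here: sorted input can then be fed to the index builder without random
disk access.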

Paolo