Posted to dev@jena.apache.org by "Sarven Capadisli (Commented) (JIRA)" <ji...@apache.org> on 2012/03/02 17:07:59 UTC
[jira] [Commented] (JENA-117) A pure Java version of tdbloader2, a.k.a. tdbloader3
[ https://issues.apache.org/jira/browse/JENA-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221016#comment-13221016 ]
Sarven Capadisli commented on JENA-117:
---------------------------------------
I was wondering if you could dumb these options down for me. I don't understand exactly how they work:
--compression Use compression for intermediate files
I've tried this:
$ java -cp target/jena-tdbloader3-0.1-incubating-SNAPSHOT-jar-with-dependencies.jar -server -d64 -Xmx2000M cmd.tdbloader3 --no-stats --compression --spill-size 1500000 --loc /usr/lib/fuseki/DB/WorldBank /tmp/indicators.tar.gz
INFO Load: /tmp/indicators.tar.gz -- 2012/03/02 10:49:39 EST
ERROR [line: 1, col: 13] Unknown char: (0)
Exception in thread "main" org.openjena.riot.RiotException: [line: 1, col: 13] Unknown char: (0)
at org.openjena.riot.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:125)
at org.openjena.riot.lang.LangEngine.raiseException(LangEngine.java:169)
at org.openjena.riot.lang.LangEngine.nextToken(LangEngine.java:116)
at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:50)
at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:34)
at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:69)
at org.openjena.riot.lang.LangBase.parse(LangBase.java:43)
at cmd.tdbloader3.exec(tdbloader3.java:233)
at arq.cmdline.CmdMain.mainMethod(CmdMain.java:97)
at arq.cmdline.CmdMain.mainRun(CmdMain.java:59)
at arq.cmdline.CmdMain.mainRun(CmdMain.java:46)
at cmd.tdbloader3.main(tdbloader3.java:129)
/tmp/indicators.tar.gz contains multiple .nt files
--buffer-size The size of buffers for IO in bytes
What's the default for this? How would I determine the optimal value for what I'm trying to import (whether it is a compressed file or a directory with multiple N-Triples files)?
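(For reference, a plain java.io.BufferedInputStream defaults to an 8 KiB buffer; I don't know what tdbloader3 defaults to. A minimal sketch of what an IO buffer size controls, with a hypothetical copy helper -- a larger buffer just means fewer, larger read calls against the underlying stream, which mostly matters for sequential IO over large intermediate files:)

```java
import java.io.*;

public class BufferDemo {
    // Copy a stream using an explicit buffer size, the knob that
    // --buffer-size presumably exposes for tdbloader3's intermediate files.
    static long copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buf = new byte[bufferSize];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            total += n;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[1 << 20]; // 1 MiB of input
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // 8192 bytes mirrors the BufferedInputStream default
        long copied = copy(new ByteArrayInputStream(data), out, 8192);
        System.out.println(copied); // 1048576
    }
}
```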
--gzip-outside GZIP...(Buffered...())
No idea.
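(Judging only by the hint text, this seems to control the nesting order of the GZIP and buffering stream wrappers. I'm not sure which order tdbloader3 uses by default, but the two orderings look like this -- both yield a valid gzip stream; they differ in whether the buffer holds compressed or uncompressed bytes:)

```java
import java.io.*;
import java.util.zip.*;

public class GzipWrapDemo {
    // "GZIP outside": the compressor wraps the buffer, so compressed
    // bytes are produced immediately and buffered before hitting the file.
    static OutputStream gzipOutside(OutputStream raw, int bufSize) throws IOException {
        return new GZIPOutputStream(new BufferedOutputStream(raw, bufSize));
    }

    // "GZIP inside": the buffer wraps the compressor, so small writes are
    // coalesced before each deflate call (fewer, larger compression steps).
    static OutputStream gzipInside(OutputStream raw, int bufSize) throws IOException {
        return new BufferedOutputStream(new GZIPOutputStream(raw), bufSize);
    }

    static byte[] roundTrip(boolean outside, byte[] data) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        OutputStream out = outside ? gzipOutside(sink, 8192) : gzipInside(sink, 8192);
        out.write(data);
        out.close();
        // Either ordering decompresses back to the original bytes.
        GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(sink.toByteArray()));
        return in.readAllBytes();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "<s> <p> <o> .\n".repeat(1000).getBytes();
        System.out.println(new String(roundTrip(true, data)).equals(new String(data)));  // true
        System.out.println(new String(roundTrip(false, data)).equals(new String(data))); // true
    }
}
```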
--spill-size The size of spillable segments in tuples|records
--spill-size-auto Automatically set the size of spillable segments
No idea. Again, how can I determine optimal value?
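(My reading, going by the external-sort design described further down in this issue: the spill size is the number of tuples held in memory before a batch is sorted and written out as one sorted run. Bigger spills mean fewer runs to merge later but more heap pressure. A minimal, hypothetical sketch of count-threshold spilling over longs:)

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class SpillDemo {
    // Accumulate tuples in memory; once spillSize is reached, sort the
    // batch and write it out as one sorted run (a "spillable segment").
    static List<Path> sortWithSpills(Iterator<Long> input, int spillSize) throws IOException {
        List<Path> runs = new ArrayList<>();
        List<Long> batch = new ArrayList<>();
        while (input.hasNext()) {
            batch.add(input.next());
            if (batch.size() >= spillSize) {
                runs.add(spill(batch));
                batch.clear();
            }
        }
        if (!batch.isEmpty()) runs.add(spill(batch));
        return runs; // runs would then be k-way merged
    }

    static Path spill(List<Long> batch) throws IOException {
        Collections.sort(batch);
        Path run = Files.createTempFile("run-", ".tmp");
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(run)))) {
            for (long v : batch) out.writeLong(v);
        }
        return run;
    }

    public static void main(String[] args) throws IOException {
        List<Long> data = new ArrayList<>();
        for (long i = 10; i > 0; i--) data.add(i);
        // 10 tuples with a spill size of 4 -> runs of 4, 4 and 2 tuples
        System.out.println(sortWithSpills(data.iterator(), 4).size()); // 3
    }
}
```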
--no-stats Do not generate the stats file
How much does this affect performance?
--no-buffer Do not use Buffered{Input|Output}Stream
When should I?
--max-merge-files Specify the maximum number of files to merge at the same time (default: 100)
This is not clear to me.
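(As I understand it, this bounds the fan-in of the merge phase of the external sort: each open run costs a file handle plus a buffer, so only up to N runs are merged at once, and more than N runs forces extra merge passes. A toy sketch with in-memory lists standing in for run files:)

```java
import java.util.*;

public class MergeDemo {
    // Merge k sorted runs using a priority queue of (head value, run index).
    static List<Long> mergeK(List<Iterator<Long>> runs) {
        PriorityQueue<long[]> pq = new PriorityQueue<>(Comparator.comparingLong(a -> a[0]));
        for (int i = 0; i < runs.size(); i++)
            if (runs.get(i).hasNext()) pq.add(new long[]{runs.get(i).next(), i});
        List<Long> out = new ArrayList<>();
        while (!pq.isEmpty()) {
            long[] top = pq.poll();
            out.add(top[0]);
            Iterator<Long> run = runs.get((int) top[1]);
            if (run.hasNext()) pq.add(new long[]{run.next(), (int) top[1]});
        }
        return out;
    }

    // Merge all runs, at most maxMergeFiles at a time; surplus runs
    // mean additional passes over the data.
    static List<Long> mergeAll(List<List<Long>> runs, int maxMergeFiles) {
        while (runs.size() > 1) {
            List<List<Long>> next = new ArrayList<>();
            for (int i = 0; i < runs.size(); i += maxMergeFiles) {
                List<Iterator<Long>> group = new ArrayList<>();
                for (List<Long> r : runs.subList(i, Math.min(i + maxMergeFiles, runs.size())))
                    group.add(r.iterator());
                next.add(mergeK(group));
            }
            runs = next;
        }
        return runs.get(0);
    }

    public static void main(String[] args) {
        List<List<Long>> runs = List.of(
            List.of(1L, 5L, 9L), List.of(2L, 6L), List.of(3L, 7L), List.of(4L, 8L));
        // a fan-in of 2 forces two passes: 4 runs -> 2 runs -> 1 run
        System.out.println(mergeAll(new ArrayList<>(runs), 2));
        // prints [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```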
I've managed to get it going with this:
java -cp target/jena-tdbloader3-0.1-incubating-SNAPSHOT-jar-with-dependencies.jar -server -d64 -Xmx2000M cmd.tdbloader3 --spill-size 1500000 --loc /usr/lib/fuseki/DB/WorldBank /tmp/*.nt
INFO Load: /tmp/countries.nt -- 2012/03/02 10:55:34 EST
INFO Load: /tmp/incomeLevels.nt -- 2012/03/02 10:55:35 EST
INFO Load: /tmp/indicators.nt -- 2012/03/02 10:55:35 EST
INFO Add: 50,000 tuples (Batch: 29,940 / Avg: 29,940)
INFO Load: /tmp/lendingTypes.nt -- 2012/03/02 10:55:36 EST
INFO Load: /tmp/regions.nt -- 2012/03/02 10:55:36 EST
INFO Load: /tmp/sources.nt -- 2012/03/02 10:55:36 EST
INFO Load: /tmp/topics.nt -- 2012/03/02 10:55:36 EST
INFO Node Table (1/3): building nodes.dat and sorting hash|id ...
INFO Add: 50,000 records for node table (1/3) phase (Batch: 24,789 / Avg: 24,789)
INFO Add: 100,000 records for node table (1/3) phase (Batch: 204,081 / Avg: 44,208)
INFO Add: 150,000 records for node table (1/3) phase (Batch: 274,725 / Avg: 61,374)
INFO Total: 166,728 tuples : 2.50 seconds : 66,664.54 tuples/sec [2012/03/02 10:55:39 EST]
INFO Node Table (2/3): generating input data using node ids...
INFO Add: 50,000 records for node table (2/3) phase (Batch: 70,721 / Avg: 70,721)
INFO Total: 55,560 tuples : 0.74 seconds : 75,081.08 tuples/sec [2012/03/02 10:55:39 EST]
INFO Node Table (3/3): building node table B+Tree index (i.e. node2id.dat and node2id.idn files)...
INFO Total: 26,120 tuples : 0.20 seconds : 129,306.93 tuples/sec [2012/03/02 10:55:40 EST]
INFO Index: creating SPO index...
INFO Add: 50,000 records to SPO (Batch: 131,233 / Avg: 131,233)
INFO Total: 55,561 tuples : 0.53 seconds : 105,629.27 tuples/sec [2012/03/02 10:55:40 EST]
INFO Index: creating GSPO index...
INFO Total: 0 tuples : 0.08 seconds : 0.00 tuples/sec [2012/03/02 10:55:40 EST]
INFO Index: sorting data for POS index...
INFO Add: 50,000 records to POS (Batch: 684,931 / Avg: 684,931)
INFO Total: 55,561 tuples : 0.08 seconds : 731,065.81 tuples/sec [2012/03/02 10:55:40 EST]
INFO Index: creating POS index...
INFO Add: 50,000 records to POS (Batch: 200,000 / Avg: 200,000)
INFO Total: 55,561 tuples : 0.40 seconds : 139,600.50 tuples/sec [2012/03/02 10:55:41 EST]
INFO Index: sorting data for OSP index...
INFO Add: 50,000 records to OSP (Batch: 2,083,333 / Avg: 2,083,333)
INFO Total: 55,561 tuples : 0.03 seconds : 1,792,290.38 tuples/sec [2012/03/02 10:55:41 EST]
INFO Index: creating OSP index...
INFO Add: 50,000 records to OSP (Batch: 181,818 / Avg: 181,818)
INFO Total: 55,561 tuples : 0.43 seconds : 130,731.76 tuples/sec [2012/03/02 10:55:41 EST]
INFO Index: sorting data for GPOS index...
INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/02 10:55:41 EST]
INFO Index: creating GPOS index...
INFO Total: 0 tuples : 0.07 seconds : 0.00 tuples/sec [2012/03/02 10:55:41 EST]
INFO Index: sorting data for GOSP index...
INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/02 10:55:41 EST]
INFO Index: creating GOSP index...
INFO Total: 0 tuples : 0.08 seconds : 0.00 tuples/sec [2012/03/02 10:55:41 EST]
INFO Index: sorting data for POSG index...
INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/02 10:55:41 EST]
INFO Index: creating POSG index...
INFO Total: 0 tuples : 0.09 seconds : 0.00 tuples/sec [2012/03/02 10:55:41 EST]
INFO Index: sorting data for OSPG index...
INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/02 10:55:41 EST]
INFO Index: creating OSPG index...
INFO Total: 0 tuples : 0.08 seconds : 0.00 tuples/sec [2012/03/02 10:55:42 EST]
INFO Index: sorting data for SPOG index...
INFO Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/02 10:55:42 EST]
INFO Index: creating SPOG index...
INFO Total: 0 tuples : 0.08 seconds : 0.00 tuples/sec [2012/03/02 10:55:42 EST]
INFO Total: 55,576 tuples : 7.33 seconds : 7,580.96 tuples/sec [2012/03/02 10:55:42 EST]
However, I face this:
$ java tdb.tdbquery --desc=/usr/lib/fuseki/tdb2.worldbank.ttl 'SELECT * WHERE { ?s ?p ?o . } LIMIT 100'
10:56:30 WARN ModTDBDataset :: Unexpected: Not a TDB dataset for type DatasetTDB
-------------
| s | p | o |
=============
-------------
One final thing I'd like to know is how to assign graph names. --graph is not available as it was in tdbloader. This is fairly important for me because my dataset is close to 500m triples (I think).
I'd appreciate it if you could help me clarify these issues.
-Sarven
> A pure Java version of tdbloader2, a.k.a. tdbloader3
> ----------------------------------------------------
>
> Key: JENA-117
> URL: https://issues.apache.org/jira/browse/JENA-117
> Project: Apache Jena
> Issue Type: Improvement
> Components: TDB
> Reporter: Paolo Castagna
> Assignee: Paolo Castagna
> Priority: Minor
> Labels: performance, tdbloader2
> Attachments: TDB_JENA-117_r1171714.patch
>
>
> There is probably a significant performance improvement available for tdbloader2 in replacing the UNIX sort over text files with a pure Java external sorting implementation.
> Since JENA-99 we now have a SortedDataBag which does exactly that.
> ThresholdPolicyCount<Tuple<Long>> policy = new ThresholdPolicyCount<Tuple<Long>>(1000000);
> SerializationFactory<Tuple<Long>> serializerFactory = new TupleSerializationFactory();
> Comparator<Tuple<Long>> comparator = new TupleComparator();
> SortedDataBag<Tuple<Long>> sortedDataBag = new SortedDataBag<Tuple<Long>>(policy, serializerFactory, comparator);
> TupleSerializationFactory creates TupleInputStream|TupleOutputStream, which are wrappers around DataInputStream|DataOutputStream. TupleComparator is trivial.
> Preliminary results seem promising and show that the Java implementation can be faster than UNIX sort, since it uses smaller binary files (instead of text files) and compares long values rather than strings.
> An example, ExternalSort, which compares SortedDataBag vs. UNIX sort, is available here:
> https://github.com/castagna/tdbloader3/blob/hadoop-0.20.203.0/src/main/java/com/talis/labs/tdb/tdbloader3/dev/ExternalSort.java
> A further advantage of doing the sorting in Java rather than with UNIX sort is that we could stream results directly into the BPlusTreeRewriter, rather than writing them to disk and then reading them back from disk into the BPlusTreeRewriter.
> I've not done an experiment yet to see if this is actually a significant improvement.
> Using compression for intermediate files might help, but more experiments are necessary to establish if it is worthwhile or not.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira