Posted to dev@jena.apache.org by "Sarven Capadisli (Commented) (JIRA)" <ji...@apache.org> on 2012/03/02 17:07:59 UTC

[jira] [Commented] (JENA-117) A pure Java version of tdbloader2, a.k.a. tdbloader3

    [ https://issues.apache.org/jira/browse/JENA-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13221016#comment-13221016 ] 

Sarven Capadisli commented on JENA-117:
---------------------------------------

I was wondering if you could dumb these options down for me; I don't understand exactly how they work:

      --compression          Use compression for intermediate files

I've tried this:

$ java -cp target/jena-tdbloader3-0.1-incubating-SNAPSHOT-jar-with-dependencies.jar -server -d64 -Xmx2000M cmd.tdbloader3 --no-stats --compression --spill-size 1500000 --loc /usr/lib/fuseki/DB/WorldBank /tmp/indicators.tar.gz
INFO  Load: /tmp/indicators.tar.gz -- 2012/03/02 10:49:39 EST
ERROR [line: 1, col: 13] Unknown char: (0)
Exception in thread "main" org.openjena.riot.RiotException: [line: 1, col: 13] Unknown char: (0)
	at org.openjena.riot.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:125)
	at org.openjena.riot.lang.LangEngine.raiseException(LangEngine.java:169)
	at org.openjena.riot.lang.LangEngine.nextToken(LangEngine.java:116)
	at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:50)
	at org.openjena.riot.lang.LangNQuads.parseOne(LangNQuads.java:34)
	at org.openjena.riot.lang.LangNTuple.runParser(LangNTuple.java:69)
	at org.openjena.riot.lang.LangBase.parse(LangBase.java:43)
	at cmd.tdbloader3.exec(tdbloader3.java:233)
	at arq.cmdline.CmdMain.mainMethod(CmdMain.java:97)
	at arq.cmdline.CmdMain.mainRun(CmdMain.java:59)
	at arq.cmdline.CmdMain.mainRun(CmdMain.java:46)
	at cmd.tdbloader3.main(tdbloader3.java:129)

The /tmp/indicators.tar.gz archive contains multiple .nt files.
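
For what it's worth, judging from the stack trace (the input goes straight to LangNQuads.parseOne), my guess is that the raw gzip bytes are being handed to the N-Quads parser, i.e. the .tar.gz archive is not being unpacked first. Is that expected, and is --compression only about the loader's own intermediate files rather than the input? As a workaround I can extract the .nt files myself before loading. A minimal sketch of that, assuming Apache Commons Compress on the classpath (my choice of library, not something I know tdbloader3 to bundle):

    import java.io.*;
    import java.util.zip.GZIPInputStream;
    import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
    import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

    public class UnpackNT {
        public static void main(String[] args) throws IOException {
            TarArchiveInputStream tar = new TarArchiveInputStream(
                new GZIPInputStream(new FileInputStream("/tmp/indicators.tar.gz")));
            TarArchiveEntry entry;
            byte[] buf = new byte[8192];
            while ((entry = tar.getNextTarEntry()) != null) {
                if (entry.isDirectory() || !entry.getName().endsWith(".nt")) continue;
                // write each .nt entry out as a plain file next to the archive
                FileOutputStream out = new FileOutputStream(
                    "/tmp/" + new File(entry.getName()).getName());
                int n;
                while ((n = tar.read(buf)) != -1) out.write(buf, 0, n);
                out.close();
            }
            tar.close();
        }
    }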

      --buffer-size          The size of buffers for IO in bytes

What's the default for this? How would I determine the optimal value for what I'm trying to import (whether it is a compressed file or a directory with multiple N-Triples files)?
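
To make sure I understand it: is this simply the size passed to the Buffered{Input|Output}Stream constructors used for the intermediate files? Something like this sketch (the 8192 here is only the JDK's own default; I don't know what tdbloader3 defaults to):

    // my reading of --buffer-size (an assumption on my part)
    int bufferSize = 8192; // the JDK default when none is specified
    InputStream in = new BufferedInputStream(new FileInputStream("spill.dat"), bufferSize);
    OutputStream out = new BufferedOutputStream(new FileOutputStream("spill.out"), bufferSize);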

      --gzip-outside         GZIP...(Buffered...())

No idea.
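
If I had to guess from the "GZIP...(Buffered...())" hint, the flag controls how the two stream wrappers are nested, roughly:

    // with --gzip-outside (my guess): the GZIP stream wraps the buffer
    OutputStream gzipOutside = new GZIPOutputStream(
            new BufferedOutputStream(new FileOutputStream("run.gz")));

    // without it (my guess): the buffer wraps the GZIP stream
    OutputStream gzipInside = new BufferedOutputStream(
            new GZIPOutputStream(new FileOutputStream("run.gz")));

Is that right, and does the ordering make a measurable difference?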

      --spill-size           The size of spillable segments in tuples|records
      --spill-size-auto      Automatically set the size of spillable segments

No idea. Again, how can I determine the optimal value?
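
From the SortedDataBag snippet in the issue description below, my guess is that --spill-size is the count handed to ThresholdPolicyCount, i.e. how many tuples are held (and sorted) in memory before a run is spilled to disk:

    // my assumption, based on the snippet quoted below:
    // --spill-size 1500000 would correspond to
    ThresholdPolicyCount<Tuple<Long>> policy = new ThresholdPolicyCount<Tuple<Long>>(1500000);

If so, a larger value should mean more memory used and fewer (but bigger) intermediate files. Is that the right way to think about it?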

      --no-stats             Do not generate the stats file

How much does this affect performance?

      --no-buffer            Do not use Buffered{Input|Output}Stream

When should I use this option?

      --max-merge-files      Specify the maximum number of files to merge at the same time (default: 100)

This is not clear to me.
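
My general understanding of external sorts (an assumption about tdbloader3, I have not checked the code) is that the sorted spill files are merged k at a time through a priority queue keyed on each run's current head, so with more runs than --max-merge-files an extra merge pass is needed. A generic sketch of such a k-way merge:

    import java.util.*;

    // generic k-way merge over sorted runs (illustration only,
    // not tdbloader3's actual code)
    public class KWayMerge {
        public static <T> List<T> merge(List<Iterator<T>> runs, final Comparator<T> cmp) {
            PriorityQueue<Map.Entry<T, Iterator<T>>> heap =
                new PriorityQueue<Map.Entry<T, Iterator<T>>>(Math.max(1, runs.size()),
                    new Comparator<Map.Entry<T, Iterator<T>>>() {
                        public int compare(Map.Entry<T, Iterator<T>> a,
                                           Map.Entry<T, Iterator<T>> b) {
                            return cmp.compare(a.getKey(), b.getKey());
                        }
                    });
            for (Iterator<T> run : runs)
                if (run.hasNext())
                    heap.add(new AbstractMap.SimpleEntry<T, Iterator<T>>(run.next(), run));
            List<T> out = new ArrayList<T>();
            while (!heap.isEmpty()) {
                Map.Entry<T, Iterator<T>> e = heap.poll();
                out.add(e.getKey());            // emit the smallest head
                if (e.getValue().hasNext())     // refill from the same run
                    heap.add(new AbstractMap.SimpleEntry<T, Iterator<T>>(
                        e.getValue().next(), e.getValue()));
            }
            return out;
        }
    }

Given that, is 100 simply a trade-off between open file handles and the number of merge passes?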

I've managed to get it going with this:


java -cp target/jena-tdbloader3-0.1-incubating-SNAPSHOT-jar-with-dependencies.jar -server -d64 -Xmx2000M cmd.tdbloader3 --spill-size 1500000 --loc /usr/lib/fuseki/DB/WorldBank /tmp/*.nt
INFO  Load: /tmp/countries.nt -- 2012/03/02 10:55:34 EST
INFO  Load: /tmp/incomeLevels.nt -- 2012/03/02 10:55:35 EST
INFO  Load: /tmp/indicators.nt -- 2012/03/02 10:55:35 EST
INFO  Add: 50,000 tuples (Batch: 29,940 / Avg: 29,940)
INFO  Load: /tmp/lendingTypes.nt -- 2012/03/02 10:55:36 EST
INFO  Load: /tmp/regions.nt -- 2012/03/02 10:55:36 EST
INFO  Load: /tmp/sources.nt -- 2012/03/02 10:55:36 EST
INFO  Load: /tmp/topics.nt -- 2012/03/02 10:55:36 EST
INFO  Node Table (1/3): building nodes.dat and sorting hash|id ...
INFO  Add: 50,000 records for node table (1/3) phase (Batch: 24,789 / Avg: 24,789)
INFO  Add: 100,000 records for node table (1/3) phase (Batch: 204,081 / Avg: 44,208)
INFO  Add: 150,000 records for node table (1/3) phase (Batch: 274,725 / Avg: 61,374)
INFO  Total: 166,728 tuples : 2.50 seconds : 66,664.54 tuples/sec [2012/03/02 10:55:39 EST]
INFO  Node Table (2/3): generating input data using node ids...
INFO  Add: 50,000 records for node table (2/3) phase (Batch: 70,721 / Avg: 70,721)
INFO  Total: 55,560 tuples : 0.74 seconds : 75,081.08 tuples/sec [2012/03/02 10:55:39 EST]
INFO  Node Table (3/3): building node table B+Tree index (i.e. node2id.dat and node2id.idn files)...
INFO  Total: 26,120 tuples : 0.20 seconds : 129,306.93 tuples/sec [2012/03/02 10:55:40 EST]
INFO  Index: creating SPO index...
INFO  Add: 50,000 records to SPO (Batch: 131,233 / Avg: 131,233)
INFO  Total: 55,561 tuples : 0.53 seconds : 105,629.27 tuples/sec [2012/03/02 10:55:40 EST]
INFO  Index: creating GSPO index...
INFO  Total: 0 tuples : 0.08 seconds : 0.00 tuples/sec [2012/03/02 10:55:40 EST]
INFO  Index: sorting data for POS index...
INFO  Add: 50,000 records to POS (Batch: 684,931 / Avg: 684,931)
INFO  Total: 55,561 tuples : 0.08 seconds : 731,065.81 tuples/sec [2012/03/02 10:55:40 EST]
INFO  Index: creating POS index...
INFO  Add: 50,000 records to POS (Batch: 200,000 / Avg: 200,000)
INFO  Total: 55,561 tuples : 0.40 seconds : 139,600.50 tuples/sec [2012/03/02 10:55:41 EST]
INFO  Index: sorting data for OSP index...
INFO  Add: 50,000 records to OSP (Batch: 2,083,333 / Avg: 2,083,333)
INFO  Total: 55,561 tuples : 0.03 seconds : 1,792,290.38 tuples/sec [2012/03/02 10:55:41 EST]
INFO  Index: creating OSP index...
INFO  Add: 50,000 records to OSP (Batch: 181,818 / Avg: 181,818)
INFO  Total: 55,561 tuples : 0.43 seconds : 130,731.76 tuples/sec [2012/03/02 10:55:41 EST]
INFO  Index: sorting data for GPOS index...
INFO  Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/02 10:55:41 EST]
INFO  Index: creating GPOS index...
INFO  Total: 0 tuples : 0.07 seconds : 0.00 tuples/sec [2012/03/02 10:55:41 EST]
INFO  Index: sorting data for GOSP index...
INFO  Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/02 10:55:41 EST]
INFO  Index: creating GOSP index...
INFO  Total: 0 tuples : 0.08 seconds : 0.00 tuples/sec [2012/03/02 10:55:41 EST]
INFO  Index: sorting data for POSG index...
INFO  Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/02 10:55:41 EST]
INFO  Index: creating POSG index...
INFO  Total: 0 tuples : 0.09 seconds : 0.00 tuples/sec [2012/03/02 10:55:41 EST]
INFO  Index: sorting data for OSPG index...
INFO  Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/02 10:55:41 EST]
INFO  Index: creating OSPG index...
INFO  Total: 0 tuples : 0.08 seconds : 0.00 tuples/sec [2012/03/02 10:55:42 EST]
INFO  Index: sorting data for SPOG index...
INFO  Total: 0 tuples : 0.00 seconds : 0.00 tuples/sec [2012/03/02 10:55:42 EST]
INFO  Index: creating SPOG index...
INFO  Total: 0 tuples : 0.08 seconds : 0.00 tuples/sec [2012/03/02 10:55:42 EST]
INFO  Total: 55,576 tuples : 7.33 seconds : 7,580.96 tuples/sec [2012/03/02 10:55:42 EST]

However, when I try to query the loaded data, I face this:

$ java tdb.tdbquery --desc=/usr/lib/fuseki/tdb2.worldbank.ttl 'SELECT * WHERE { ?s ?p ?o . } LIMIT 100'
10:56:30 WARN  ModTDBDataset        :: Unexpected: Not a TDB dataset for type DatasetTDB
-------------
| s | p | o |
=============
-------------


One final thing I'd like to know is how to assign graph names. The --graph option is not available as it was in tdbloader. This is fairly important for me because my dataset is close to 500 million triples (I think).
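
In the meantime, since the loader evidently parses N-Quads (per the stack trace above), I assume I could work around this by rewriting my N-Triples into N-Quads with the graph name appended to each line. A naive sketch, assuming canonical N-Triples with one triple per line (the graph IRI here is just a placeholder):

    import java.io.*;

    public class AddGraph {
        public static void main(String[] args) throws IOException {
            String graph = "<http://example.org/graph/worldbank>"; // hypothetical name
            BufferedReader in = new BufferedReader(new FileReader("/tmp/countries.nt"));
            PrintWriter out = new PrintWriter(new FileWriter("/tmp/countries.nq"));
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) { out.println(line); continue; }
                // drop the trailing "." and re-append it after the graph name
                String triple = line.substring(0, line.lastIndexOf('.')).trim();
                out.println(triple + " " + graph + " .");
            }
            in.close();
            out.close();
        }
    }

Would loading .nq files produced this way be the recommended route until a --graph equivalent exists?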

I'd appreciate it if you could help me clarify these issues.


-Sarven
                
> A pure Java version of tdbloader2, a.k.a. tdbloader3
> ----------------------------------------------------
>
>                 Key: JENA-117
>                 URL: https://issues.apache.org/jira/browse/JENA-117
>             Project: Apache Jena
>          Issue Type: Improvement
>          Components: TDB
>            Reporter: Paolo Castagna
>            Assignee: Paolo Castagna
>            Priority: Minor
>              Labels: performance, tdbloader2
>         Attachments: TDB_JENA-117_r1171714.patch
>
>
> There is probably a significant performance improvement available for tdbloader2 in replacing the UNIX sort over text files with a pure Java external sorting implementation.
> Since JENA-99 we now have a SortedDataBag which does exactly that.
>     ThresholdPolicyCount<Tuple<Long>> policy = new ThresholdPolicyCount<Tuple<Long>>(1000000);
>     SerializationFactory<Tuple<Long>> serializerFactory = new TupleSerializationFactory();
>     Comparator<Tuple<Long>> comparator = new TupleComparator();
>     SortedDataBag<Tuple<Long>> sortedDataBag = new SortedDataBag<Tuple<Long>>(policy, serializerFactory, comparator);
> TupleSerializationFactory creates TupleInputStream|TupleOutputStream, which are wrappers around DataInputStream|DataOutputStream. TupleComparator is trivial.
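> For illustration, the element-by-element compare it needs to do is no more than (sketch):
>     Comparator<Tuple<Long>> comparator = new Comparator<Tuple<Long>>() {
>         public int compare(Tuple<Long> t1, Tuple<Long> t2) {
>             for (int i = 0; i < t1.size(); i++) {
>                 int c = t1.get(i).compareTo(t2.get(i));
>                 if (c != 0) return c;
>             }
>             return 0;
>         }
>     };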
> Preliminary results seem promising and show that the Java implementation can be faster than UNIX sort, since it uses smaller binary files (instead of text files) and compares long values rather than strings.
> An example of ExternalSort which compares SortedDataBag vs. UNIX sort is available here:
> https://github.com/castagna/tdbloader3/blob/hadoop-0.20.203.0/src/main/java/com/talis/labs/tdb/tdbloader3/dev/ExternalSort.java
> A further advantage of doing the sorting in Java rather than with UNIX sort is that we could stream results directly into the BPlusTreeRewriter, rather than writing them to disk and then reading them back from disk into the BPlusTreeRewriter.
> I've not done an experiment yet to see if this is actually a significant improvement.
> Using compression for intermediate files might help, but more experiments are necessary to establish if it is worthwhile or not.
