Posted to dev@jena.apache.org by Andy Seaborne <an...@apache.org> on 2012/03/08 12:53:09 UTC

tdbloader3 : time to incorporate in the codebase?

Paolo,

> Both tdbloader3 [1] and tdbloader4 [2] are (should be?) correct,
> I've been testing them with datasets in the 500-700 million triples
> range but I consider them (still) *experimental*.

Is now the right time to incorporate tdbloader3 into the main TDB 
codebase as "tdbloader3"?

It does not disturb anything else (does it?) and makes it more 
accessible to users to try out.

Or ... what does it take for it not to be "experimental"?

	Andy


Re: tdbloader3 : time to incorporate in the codebase?

Posted by Paolo Castagna <ca...@googlemail.com>.
Andy Seaborne wrote:
> It does not disturb anything else (does it?) and makes it more
> accessible to users to try out.

tdbloader3 is using Hex from commons-codec which is a dependency in ARQ.
This is not in the TDB's .classpath (which is edited manually).

So, I added (manually, :-() these:

+	<classpathentry excluding="**/*.java|**/.svn/" kind="src"
+    <classpathentry kind="var"
path="M2_REPO/commons-codec/commons-codec/1.4/commons-codec-1.4.jar"
sourcepath="M2_REPO/commons-codec/commons-codec/1.4/commons-codec-1.4-sources.jar"/>

mvn eclipse:eclipse for TDB does not work:

[INFO] Resource directory's path matches an existing source directory. Resources
will be merged with the source directory src/main/resources
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------

I usually go with mvn eclipse:eclipse, and my advice would be to do that so as
not to lose the big advantage of managing direct and transitive dependencies
with a dependency engine (in this case Maven).
Manually editing the .classpath file is prone to human error; for example, in
this case TDB's .classpath is missing transitive dependencies coming from ARQ.

However, I am going to go ahead and edit the .classpath manually as shown above
and forget about mvn eclipse:eclipse for TDB.

Paolo

Re: tdbloader3 : time to incorporate in the codebase?

Posted by Paolo Castagna <ca...@googlemail.com>.
Hi Andy

Andy Seaborne wrote:
> Paolo,
> 
>> Both tdbloader3 [1] and tdbloader4 [2] are (should be?) correct,
>> I've been testing them with datasets in the 500-700 million triples
>> range but I consider them (still) *experimental*.
> 
> Is now the right time to incorporate tdbloader3 into the main TDB
> codebase as "tdbloader3"?

Yes. I'll do it, soon after TDB is released and the [VOTE] closes.

It has no additional dependencies, other than TDB. :-)
The tests currently log at INFO level; I need to double check that
and make them silent. There are just 6 tests and they run in ~10 seconds.
I also want to check I am using all the new stuff to create TDB
stuff... but this, again, isn't necessarily something which needs
to be done before we incorporate it.

> It does not disturb anything else (does it?) and makes it more
> accessible to users to try out.

Correct, it does not disturb anything else and it will be easier
for others to try out (and, eventually, use).

The big advantage is that it should scale better on machines
with less RAM. The external sort is pure Java and
it's faster than UNIX sort because we can use binary files
instead of text files to sort our 64-bit node ids.
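
To illustrate the external sort over binary files described above, here is a
minimal sketch of the general technique. It is not the tdbloader3 code: the
class name ExternalLongSort, the helper names and the chunk size are all
hypothetical, and it sorts single 64-bit values in signed order rather than
whatever record layout tdbloader3 really uses.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

/** Sketch only (hypothetical names): external merge sort over a binary file of 64-bit values. */
class ExternalLongSort {

    /** Sorts the longs in input into output, holding at most chunkSize values in memory per run. */
    static void sort(Path input, Path output, int chunkSize) throws IOException {
        List<Path> runs = new ArrayList<>();
        try (DataInputStream in = open(input)) {
            long[] buf = new long[chunkSize];
            int n;
            // Phase 1: read a chunk, sort it in memory, spill it to a temporary run file.
            while ((n = fill(in, buf)) > 0) {
                Arrays.sort(buf, 0, n);
                Path run = Files.createTempFile("run", ".bin");
                try (DataOutputStream out = create(run)) {
                    for (int i = 0; i < n; i++) out.writeLong(buf[i]);
                }
                runs.add(run);
            }
        }
        // Phase 2: k-way merge of the sorted runs, then clean up.
        merge(runs, output);
        for (Path run : runs) Files.delete(run);
    }

    /** Reads up to buf.length longs; returns how many were read (0 at end of file). */
    static int fill(DataInputStream in, long[] buf) throws IOException {
        int n = 0;
        try {
            while (n < buf.length) {
                buf[n] = in.readLong();
                n++;
            }
        } catch (EOFException eof) {
            // End of input: return what was read so far.
        }
        return n;
    }

    /** Merges sorted run files into output using a min-heap of {value, run index} pairs. */
    static void merge(List<Path> runs, Path output) throws IOException {
        List<DataInputStream> ins = new ArrayList<>();
        PriorityQueue<long[]> heap = new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
        try (DataOutputStream out = create(output)) {
            for (int i = 0; i < runs.size(); i++) {
                ins.add(open(runs.get(i)));
                push(heap, ins.get(i), i);
            }
            while (!heap.isEmpty()) {
                long[] head = heap.poll();
                out.writeLong(head[0]);
                // Refill the heap from the run the smallest value came from.
                push(heap, ins.get((int) head[1]), (int) head[1]);
            }
        } finally {
            for (DataInputStream in : ins) in.close();
        }
    }

    /** Adds the next value of a run to the heap, or nothing if the run is exhausted. */
    static void push(PriorityQueue<long[]> heap, DataInputStream in, int run) throws IOException {
        try {
            heap.add(new long[] { in.readLong(), run });
        } catch (EOFException eof) {
            // Run exhausted.
        }
    }

    static DataInputStream open(Path p) throws IOException {
        return new DataInputStream(new BufferedInputStream(Files.newInputStream(p)));
    }

    static DataOutputStream create(Path p) throws IOException {
        return new DataOutputStream(new BufferedOutputStream(Files.newOutputStream(p)));
    }

    /** Writes values as fixed-width 8-byte big-endian longs. */
    static void write(Path p, long... values) throws IOException {
        try (DataOutputStream out = create(p)) {
            for (long v : values) out.writeLong(v);
        }
    }

    /** Reads a whole file of 8-byte longs into memory (for small demos only). */
    static long[] read(Path p) throws IOException {
        long[] all = new long[(int) (Files.size(p) / Long.BYTES)];
        try (DataInputStream in = open(p)) {
            for (int i = 0; i < all.length; i++) all[i] = in.readLong();
        }
        return all;
    }

    public static void main(String[] args) throws IOException {
        Path in = Files.createTempFile("ids", ".bin");
        Path out = Files.createTempFile("sorted", ".bin");
        write(in, 42L, -7L, 3L, 42L, 0L);
        sort(in, out, 2); // tiny chunk size to force several runs
        System.out.println(Arrays.toString(read(out)));
    }
}
```

Because every value is a fixed-width 8 bytes, the sketch never parses or
formats text, which is the kind of saving over a text-based sort that the
paragraph above refers to.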

The drawback is that the first phase, which builds the node table
and the corresponding index (i.e. nodes.data, node2id.idx and
node2id.dat), is done in multiple passes.

> Or ... what does it take for it not to be "experimental"?

I'd like to run a couple more tests with datasets of ~1 billion triples,
but this can happen after tdbloader3 has been incorporated into TDB.

... and, last but not least, similar tests for tdbloader4 (i.e. the
MapReduce implementation). :-)

Next? Anyone into jCUDA? We all have hundreds of cores in our GPUs
sitting idle most of the time. Maybe sorting stuff there is faster,
even if I don't believe it is going to make much of a difference for
the first phase.

I also want to continue looking into hash values as node ids...
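
As a rough illustration of what "hash values as node ids" could look like,
here is a minimal sketch. Everything in it is an assumption made for the
example: the class name HashNodeId, the choice of SHA-256, truncating the
digest to its first 8 bytes, and hashing a plain string form of the node.
It does not describe how TDB or tdbloader3 assign ids, and it does nothing
about collisions, which truncation to 64 bits makes possible.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Sketch only (hypothetical names): derive a 64-bit id from a hash of the node's string form. */
class HashNodeId {

    /** Returns the first 8 bytes of the SHA-256 digest of the node string as a long. */
    static long idFor(String node) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(node.getBytes(StandardCharsets.UTF_8));
            // ByteBuffer reads big-endian by default, so this takes digest[0..7].
            return ByteBuffer.wrap(digest).getLong();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        String node = "<http://example.org/s>"; // hypothetical node string
        // The same node always maps to the same id, with no table lookup.
        System.out.println(idFor(node) == idFor(node));
        System.out.println(Long.toHexString(idFor(node)));
    }
}
```

The appeal of the approach is that an id can be computed from the node alone,
without consulting an index; the cost is that collisions have to be ruled out
or handled somewhere else.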

Cheers,
Paolo

> 
>     Andy
> 

Re: tdbloader3 : time to incorporate in the codebase?

Posted by Paolo Castagna <ca...@googlemail.com>.
Andy Seaborne wrote:
> On 09/03/12 10:18, Paolo Castagna wrote:
>> Andy Seaborne wrote:
>>> Is now the right time to incorporate tdbloader3 into the main TDB
>>> codebase as "tdbloader3"?
>>
>> Package names:
>>
>>   - most of the code is in: org.apache.jena.tdb.store.bulkloader3.*
>>   - tests for the code above in:
>> org.apache.jena.tdb.store.bulkloader3.* (but
>> src/test/java/ instead of src/main/java)
>>   - command line is named tdbloader3 in the tdb.* package
>>   - test for the command line is in tdb.* package (but in src/test/java)
>>
>> Data for the tests is in src/test/resources/tdbloader3 folder
>>
>> I am ready, shout if any of the above is not what you would expect.
>>
>> Paolo
> 
> Fine - personally I'd put it in com.hp.hpl.jena.* so it's alongside the
> other TDB code.  It may be a while before we can convert classes in TDB
> (if nothing else, it takes time to process and work through the release
> cycles).  But it can move around once in the codebase if it becomes
> inconvenient under org.apache.jena.

Let's start with org.apache.jena.* and see if it is going to cause
any trouble to anyone. If that happens, I'll move it.

I do not foresee any issues and I would rather start using
org.apache.jena package names, at least for the new things which
do not have backward compatibility consequences.

(Little push back, but if you insist, I'll give up and go ahead with
com.hp.hpl.jena.*). :-)

Paolo

> 
>     Andy


Re: tdbloader3 : time to incorporate in the codebase?

Posted by Andy Seaborne <an...@apache.org>.
On 09/03/12 10:18, Paolo Castagna wrote:
> Andy Seaborne wrote:
>> Is now the right time to incorporate tdbloader3 into the main TDB
>> codebase as "tdbloader3"?
>
> Package names:
>
>   - most of the code is in: org.apache.jena.tdb.store.bulkloader3.*
>   - tests for the code above in: org.apache.jena.tdb.store.bulkloader3.* (but
> src/test/java/ instead of src/main/java)
>   - command line is named tdbloader3 in the tdb.* package
>   - test for the command line is in tdb.* package (but in src/test/java)
>
> Data for the tests is in src/test/resources/tdbloader3 folder
>
> I am ready, shout if any of the above is not what you would expect.
>
> Paolo

Fine - personally I'd put it in com.hp.hpl.jena.* so it's alongside the 
other TDB code.  It may be a while before we can convert classes in TDB 
(if nothing else, it takes time to process and work through the release 
cycles).  But it can move around once in the codebase if it becomes 
inconvenient under org.apache.jena.

	Andy

Re: tdbloader3 : time to incorporate in the codebase?

Posted by Paolo Castagna <ca...@googlemail.com>.
Andy Seaborne wrote:
> Is now the right time to incorporate tdbloader3 into the main TDB
> codebase as "tdbloader3"?

Package names:

 - most of the code is in: org.apache.jena.tdb.store.bulkloader3.*
 - tests for the code above in: org.apache.jena.tdb.store.bulkloader3.* (but
src/test/java/ instead of src/main/java)
 - command line is named tdbloader3 in the tdb.* package
 - test for the command line is in tdb.* package (but in src/test/java)

Data for the tests is in src/test/resources/tdbloader3 folder

I am ready, shout if any of the above is not what you would expect.

Paolo