Posted to users@jena.apache.org by Mariano Rodriguez <ro...@inf.unibz.it> on 2011/12/04 14:54:47 UTC

Fair Benchmarking of SDB, TDB and LUBM 100 > with inference support and limited memory

Hi all, 

We are now benchmarking several triple stores that support inference through forward chaining against a system that does a particular form of query rewriting.

The benchmark we are using is simple: an extended version of LUBM with big datasets
(LUBM 1000, 8000, 15000, 250000). From Jena we would like to benchmark loading time, inference time and query answering time, using both TDB and SDB. Inference should be done with limited amounts of memory, the less the better. However, we are having difficulties understanding what the fair way to do this is. Also, the system used for these benchmarks should be a simple one, not a cluster or a server with large resources. We would like to ask the community for help to approach this in the best way possible, hence this email :). Here are some questions and ideas.

Is it the case that the default inference engine of Jena requires all triples to be
in-memory? Is it not possible to do this on disk? If so, what would be the fair way to benchmark the system? Right now
we are thinking of a workflow as follows:

1. Start a TDB or SDB store.
2. Load 10 LUBM universities in memory, compute the closure using 

Reasoner reasoner = ReasonerRegistry.getOWLReasoner();
// monto is the LUBM ontology model, m the data model for this batch
InfModel inf = ModelFactory.createInfModel(reasoner, monto, m);

and store the result in SDB or TDB. When finished, 
3. Query the store directly.

Is this the most efficient way to do it? Are there important parameters (besides the number of universities used in the computation of the closure) that we should tune to guarantee a fair evaluation? Are there any documents that we could use to guide ourselves while tuning Jena?

Thank you very much in advance everybody,

Best regards,
Mariano



Mariano Rodriguez Muro                
http://www.inf.unibz.it/~rodriguez/   
KRDB Research Center                  
Faculty of Computer Science           
Free University of Bozen-Bolzano (FUB)
Piazza Domenicani 3,                  
I-39100 Bozen-Bolzano BZ, Italy       





Re: Fair Benchmarking of SDB, TDB and LUBM 100 > with inference support and limited memory

Posted by Mariano Rodriguez <ro...@inf.unibz.it>.
Hi Andy,

>> 
> >
> > Is it the case that the default inference engine of Jena requires all
> > triples to be in-memory? Is it not possible to do this on disk? If
> > this is so, what would be the fair way to benchmark the system?
> There are a couple of dimensions to think about:
> 
> 1/ Do you want to test LUBM or a more general data?
> 2/ What level of inference do you wish to test?
> 
> (1) => For LUBM, there is no inference across universities, so you can generate the data for one university, run the forward-chaining inference on it and move on to the next university, knowing that no triples will be generated later that affect the university you have just processed (and so you don't need to retain state for it).

At the moment we are going only for LUBM. In one month we will go for more complex benchmarks. However, we always have as a target limited expressivity, specifically RDFS and OWL 2 QL inference which don't require complex reasoning. We would like to be as efficient as possible for those. Ideally, we don't want the test case to have particular tricks at loading time, it should be a generic one-shot procedure (if possible).

> 
> (2) => Inference for LUBM only needs one data triple and access to the ontology to calculate the inferences.  Once a triple has been processed, you can emit the inferred triples and move on.  Again, no data-related state is needed.
> 
> The Jena rules-based reasoner, which is RETE-based, is more powerful than is needed for RDFS or LUBM, including rules based on multiple data triples and retraction, but the cost is that it stores internal state in-memory scaling with the size of the data.
> 
> There is also a stream-based forward chaining engine, riotcmd.infer, that keeps the RDFS schema in memory but not the state of the data, so it uses a fixed amount of space that does not increase with data size.
> 
> This is probably the best way to infer over LUBM at scale.
> 
> This is exploiting the features of LUBM (you only need one university).  I don't have figures, but I'd expect riotcmd.infer to be faster as it's less general.
> 
> The flow is:
> 
> infer --rdfs=VOCAB DATA | tdbloader2 --loc DB
> 
> on a 64bit system.  Linux is faster than Windows.
> 
> (tdbloader2 only runs on linux currently - Paolo has a pure java version on github)

This is great info! It sounds exactly like what we are looking for. We'll spend some time studying it and if there are any questions I'll get back here. We had no idea this existed.
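For reference, the stream-based scheme Andy describes (keep only the already-closed schema tables in memory, emit inferences per data triple, so memory use is independent of data size) can be sketched in plain Java. This is an illustration of the idea, not the riotcmd.infer code; all names are invented:

```java
import java.util.*;

// Sketch of stream-based RDFS forward chaining: the schema (already
// transitively closed) stays in memory; each data triple is processed
// once and its inferred triples are emitted immediately.
public class StreamInfer {
    static final String TYPE = "rdf:type";

    // superclass and superproperty tables, precomputed from the vocabulary
    final Map<String, Set<String>> superClasses = new HashMap<>();
    final Map<String, Set<String>> superProperties = new HashMap<>();

    List<String[]> infer(String s, String p, String o) {
        List<String[]> out = new ArrayList<>();
        if (TYPE.equals(p)) {
            // (s rdf:type C) => (s rdf:type D) for every superclass D of C
            for (String d : superClasses.getOrDefault(o, Set.of()))
                out.add(new String[] { s, TYPE, d });
        } else {
            // (s p o) => (s q o) for every superproperty q of p
            for (String q : superProperties.getOrDefault(p, Set.of()))
                out.add(new String[] { s, q, o });
        }
        return out;
    }
}
```

Because each triple is handled independently, the output can be piped straight into a bulk loader, which is exactly the `infer ... | tdbloader2 ...` flow above.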

Thank you very very much for the advice and the info Andy,

Best regards,
Mariano



Re: Fair Benchmarking of SDB, TDB and LUBM 100 > with inference support and limited memory

Posted by Paolo Castagna <ca...@googlemail.com>.
Andy Seaborne wrote:
> On 05/12/11 10:26, Paolo Castagna wrote:
>> Andy Seaborne wrote:
>>> The flow is:
>>>
>>> infer --rdfs=VOCAB DATA | tdbloader2 --loc DB
>>>
>>> on a 64bit system.  Linux is faster than Windows.
>>>
>>> (tdbloader2 only runs on linux currently - Paolo has a pure java version
>>> on github)
>>
>> tdbloader2 (pure Java version) is here (experimental):
>> http://svn.apache.org/repos/asf/incubator/jena/Scratch/PC/tdbloader2/trunk/
>>
> 
> Clarification:
> There is a tdbloader2 (not pure java) in the TDB distribution already.
> The java one will replace the distributed one sometime.

Yep.

Mariano, you can find the "official" tdbloader2 here:
http://svn.apache.org/repos/asf/incubator/jena/Jena2/TDB/trunk/bin/tdbloader2

As you can see, it's a bash script which uses UNIX sort and Java code from this
package:
http://svn.apache.org/repos/asf/incubator/jena/Jena2/TDB/trunk/src/main/java/com/hp/hpl/jena/tdb/store/bulkloader2/

Paolo

> 
>     Andy


Re: Fair Benchmarking of SDB, TDB and LUBM 100 > with inference support and limited memory

Posted by Andy Seaborne <an...@apache.org>.
On 05/12/11 10:26, Paolo Castagna wrote:
> Andy Seaborne wrote:
>> The flow is:
>>
>> infer --rdfs=VOCAB DATA | tdbloader2 --loc DB
>>
>> on a 64bit system.  Linux is faster than Windows.
>>
>> (tdbloader2 only runs on linux currently - Paolo has a pure java version
>> on github)
>
> tdbloader2 (pure Java version) is here (experimental):
> http://svn.apache.org/repos/asf/incubator/jena/Scratch/PC/tdbloader2/trunk/

Clarification:
There is a tdbloader2 (not pure java) in the TDB distribution already. 
The java one will replace the distributed one sometime.

	Andy

Re: Fuseki query performance

Posted by Andy Seaborne <an...@apache.org>.
On 07/12/11 09:01, Jérôme wrote:
> Le 06/12/11 20:54, Andy Seaborne a écrit :
>> On 6 December 2011 15:44, Jérôme<je...@unicaen.fr> wrote:
>>
>>> Thank you Andy,
>>>
>>> it was the cost of serializing and deserializing.
>>>
>>> My second problem (yes, i have another one ;-) ) is:
>>>
>> By the way - replying to unrelated threads and changing the subject risks
>> your email not being seen. I, for one, don't always check threads that I'm
>> not involved in.
> Yes, i am sorry. But when i wrote this e-mail, i thought the subject
> "fuseki query performance" was appropriate...

My email client (thunderbird) uses the "In-Reply-To:" field to organise 
threads.


>> try
>>
>> SELECT ?Response
>> WHERE
>> {
>> ?Response rdf:type<http://www.tei-c.org/ns/1.0#p> .
>> ?Objet_1 rdf:type<http://example.com#word> .
>> ?Objet_1 ram:contents ?Objet_1_content .
>> FILTER (regex(?Objet_1_content,"example")
>> && regex(?Objet_1_content,"work") )
>> ?Response ram:contains ?Objet_1 .
>> }
>>
> I think this query is not correct, because a word can't satisfy
> "example" and "work" regexps.
> Here is a very simplified (much information is missing) example of data:
> <paragraph>
> [...]
> </paragraph>
>
> <paragraph>
> <word>
> <contents>this</contents>
> </word>
> <word>
> <contents>work</contents>
> </word>
> <word>
> <contents>is</contents>
> </word>
> <word>
> <contents>an</contents>
> </word>
> <word>
> <contents>example</contents>
> </word>
> </paragraph>
>
> <paragraph>
> [...]
> </paragraph>
>
> That's why I have to use 2 different objects in my example query: a
> paragraph with the word "example" and with the word "work".
> Isn't it?

Not including the "rdf:type <http://example.com#word>" is going to help, 
possibly greatly, because the checking adds nothing.

Since it's the same ?Objet that's of interest (as I understand the 
problem), use it once in the query but get the content twice and test.

{
   ?Response ram:contains ?Objet .
   ?Objet ram:contents ?x .
   FILTER(regex(?x, "example"))
   ?Objet ram:contents ?y .
   FILTER(regex(?y, "help"))
}

A different approach would be to try LARQ and have a free-text index for 
your content.

	Andy

>
> Thank you.
> Jérôme

Re: Fuseki query performance

Posted by Jérôme <je...@unicaen.fr>.
Le 06/12/11 20:54, Andy Seaborne a écrit :
> On 6 December 2011 15:44, Jérôme<je...@unicaen.fr>  wrote:
>
>> Thank you Andy,
>>
>> it was the cost of serializing and deserializing.
>>
>> My second problem (yes, i have another one ;-) ) is:
>>
> By the way - replying to unrelated threads and changing the subject risks
> your email not being seen.  I, for one, don't always check threads that I'm
> not involved in.
Yes, I am sorry. But when I wrote this e-mail, I thought the subject 
"fuseki query performance" was appropriate...
>
>> The goal of my queries is to find "paragraphs" containing
>> "words" that match a regex.
>> My triplestore stores approximately 1.600.000 triples.
>> For example: find paragraphs (in my RDF model) containing the word
>> "example" - here the corresponding query:
>>
>> PREFIX ram:<...>
>> PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>> SELECT ?Response
>> WHERE
>> {
>> ?Response rdf:type <http://www.tei-c.org/ns/1.0#p> .
>> ?Objet_1 rdf:type <http://prodescartes.greyc.fr/annotations#word> .
>> ?Objet_1 ram:contents ?Objet_1_content .
>> FILTER regex(?Objet_1_content,"example") .
>> ?Response ram:contains ?Objet_1 .
>> }
>>
>> I get the result in 0.5 seconds
>>
>> Now, when I'm looking for paragraphs containing "example" and "help":
>>
>> SELECT ?Response
>> WHERE
>> {
>>
>> ?Response rdf:type <http://www.tei-c.org/ns/1.0#p> .
>>
>> ?Objet_1 rdf:type<http://example.com#word>  .
>> ?Objet_1 ram:contents ?Objet_1_content .
>> FILTER regex(?Objet_1_content,"example") .
>> ?Response ram:contains ?Objet_1 .
>>
>> ?Objet_2 rdf:type<http://example.com#word>  .
>> ?Objet_2 ram:contents ?Objet_2_content .
>> FILTER regex(?Objet_2_content,"help") .
>> ?Response ram:contains ?Objet_2 .
>>
>> }
>>
>> I get the result in...10 minutes. ResultSet is around 50 results.
>>
>> Why is it so long?
>>
> It's doing a cross-product of the results, but you're asking the question in a
> complicated way.
>
> try
>
> SELECT ?Response
> WHERE
> {
>    ?Response rdf:type<http://www.tei-c.org/ns/1.0#p>  .
>    ?Objet_1 rdf:type<http://example.com#word>  .
>    ?Objet_1 ram:contents ?Objet_1_content .
>    FILTER (regex(?Objet_1_content,"example")
>         &&  regex(?Objet_1_content,"work") )
>    ?Response ram:contains ?Objet_1 .
> }
>
I think this query is not correct, because a word can't satisfy both the 
"example" and "work" regexps.
Here is a very simplified (much information is missing) example of data:
<paragraph>
     [...]
</paragraph>

<paragraph>
<word>
<contents>this</contents>
</word>
<word>
<contents>work</contents>
</word>
<word>
<contents>is</contents>
</word>
<word>
<contents>an</contents>
</word>
<word>
<contents>example</contents>
</word>
</paragraph>

<paragraph>
     [...]
</paragraph>

That's why I have to use 2 different objects in my example query: a 
paragraph with the word "example" and with the word "work".
Isn't it?

Thank you.
Jérôme

>> The "funniest" is when i remove constraints on words:
>> I remove those 2 lines:
>> ?Objet_1 rdf:type<http://example.com#word>  .
>> ?Objet_2 rdf:type<http://example.com#word>  .
>>
>> Fuseki answers me faster...
>>
> Less work to do.
>
> With cross products in a query (two triple patterns not connected by sharing
> a variable) there can be a multiplication of additional work.  The
> optimizer should have chosen a different strategy, but it is better to write
> the query as above.
>
>
>> Thank you.
>> Jérôme
>>
> Andy
>


Re: Fuseki query performance

Posted by Andy Seaborne <an...@apache.org>.
On 6 December 2011 15:44, Jérôme <je...@unicaen.fr> wrote:

> Thank you Andy,
>
> it was the cost of serializing and deserializing.
>
> My second problem (yes, i have another one ;-) ) is:
>

By the way - replying to unrelated threads and changing the subject risks
your email not being seen.  I, for one, don't always check threads that I'm
not involved in.


>
> The goal of my queries is to find "paragraphs" containing
> "words" that match a regex.
> My triplestore stores approximately 1.600.000 triples.
> For example: find paragraphs (in my RDF model) containing the word
> "example" - here the corresponding query:
>
> PREFIX ram:<...>
> PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>
> SELECT ?Response
> WHERE
> {
> ?Response rdf:type <http://www.tei-c.org/ns/1.0#p> .
> ?Objet_1 rdf:type <http://prodescartes.greyc.fr/annotations#word> .
> ?Objet_1 ram:contents ?Objet_1_content .
> FILTER regex(?Objet_1_content,"example") .
> ?Response ram:contains ?Objet_1 .
> }
>
> I get the result in 0.5 seconds
>
> Now, when I'm looking for paragraphs containing "example" and "help":
>
> SELECT ?Response
> WHERE
> {
>
> ?Response rdf:type <http://www.tei-c.org/ns/1.0#p> .
>
> ?Objet_1 rdf:type <http://example.com#word> .
> ?Objet_1 ram:contents ?Objet_1_content .
> FILTER regex(?Objet_1_content,"example") .
> ?Response ram:contains ?Objet_1 .
>
> ?Objet_2 rdf:type <http://example.com#word> .
> ?Objet_2 ram:contents ?Objet_2_content .
> FILTER regex(?Objet_2_content,"help") .
> ?Response ram:contains ?Objet_2 .
>
> }
>
> I get the result in...10 minutes. ResultSet is around 50 results.
>
> Why is it so long?
>

It's doing a cross-product of the results, but you're asking the question in a
complicated way.

try

SELECT ?Response
WHERE
{
  ?Response rdf:type <http://www.tei-c.org/ns/1.0#p> .
  ?Objet_1 rdf:type <http://example.com#word> .
  ?Objet_1 ram:contents ?Objet_1_content .
  FILTER (regex(?Objet_1_content,"example")
       && regex(?Objet_1_content,"work") )
  ?Response ram:contains ?Objet_1 .
}


>
> The "funniest" is when i remove constraints on words:
> I remove those 2 lines:
> ?Objet_1 rdf:type <http://example.com#word> .
> ?Objet_2 rdf:type <http://example.com#word> .
>
> Fuseki answers me faster...
>

Less work to do.

With cross products in a query (two triple patterns not connected by sharing
a variable) there can be a multiplication of additional work.  The
optimizer should have chosen a different strategy, but it is better to write
the query as above.
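The multiplication is easy to quantify: with no shared variable the engine may enumerate one row per pairing, so 150,000 word matches against 150,000 word matches gives 22.5 billion intermediate rows before the FILTERs and the ram:contains joins cut them down. A toy sketch of the arithmetic (illustrative only, not how ARQ actually counts rows):

```java
// Illustrative row counts for combining two triple-pattern match sets.
// With no shared variable the engine may materialise the full cross
// product; with a shared variable the join is constrained by matching
// bindings (modelled here, optimistically, as the smaller input).
public class CrossProduct {
    static long rows(long matchesA, long matchesB, boolean shareVariable) {
        return shareVariable ? Math.min(matchesA, matchesB)
                             : matchesA * matchesB;
    }
}
```

This is why rewriting the query so the patterns share a variable (as in the suggestion above) changes the running time from minutes to fractions of a second.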


>
> Thank you.
> Jérôme
>

Andy

Re: Fuseki query performance

Posted by Jérôme <je...@unicaen.fr>.
Thank you Andy,

it was the cost of serializing and deserializing.

My second problem (yes, i have another one ;-) ) is:

The goal of my queries is to find "paragraphs" containing 
"words" that match a regex.
My triplestore stores approximately 1,600,000 triples.
For example: find paragraphs (in my RDF model) containing the word 
"example". Here is the corresponding query:

PREFIX ram:<...>
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?Response
WHERE
{
?Response rdf:type <http://www.tei-c.org/ns/1.0#p> .
?Objet_1 rdf:type <http://prodescartes.greyc.fr/annotations#word> .
?Objet_1 ram:contents ?Objet_1_content .
FILTER regex(?Objet_1_content,"example") .
?Response ram:contains ?Objet_1 .
}

I get the result in 0.5 seconds

Now, when I'm looking for paragraphs containing "example" and "help":

SELECT ?Response
WHERE
{

?Response rdf:type <http://www.tei-c.org/ns/1.0#p> .

?Objet_1 rdf:type <http://example.com#word> .
?Objet_1 ram:contents ?Objet_1_content .
FILTER regex(?Objet_1_content,"example") .
?Response ram:contains ?Objet_1 .

?Objet_2 rdf:type <http://example.com#word> .
?Objet_2 ram:contents ?Objet_2_content .
FILTER regex(?Objet_2_content,"help") .
?Response ram:contains ?Objet_2 .

}

I get the result in...10 minutes. ResultSet is around 50 results.

Why is it so long?

The "funniest" is when i remove constraints on words:
I remove those 2 lines:
?Objet_1 rdf:type <http://example.com#word> .
?Objet_2 rdf:type <http://example.com#word> .

Fuseki answers me faster...

Thank you.
Jérôme


Le 06/12/11 13:33, Andy Seaborne a écrit :
> Jérôme,
>
> There are 150K results?
>
> They are streamed back (unlike Joseki), but it will take a while.
>
> Which result format are you getting?  You might try one of the other 
> result formats which might be a bit faster.
>
> It looks like it is simply the cost of serializing and deserializing 
> the results.  Unlike the second "count(?r)" query, the first query has 
> to access the node table to get the URI/bnode labels for every 
> result.
>
>     Andy
>
> On 06/12/11 10:08, Jérôme wrote:
>> Hi,
>>
>> I'm trying to query my TDB store and I have some performance problems:
>> Here a simple query example:
>>
>> PREFIX test:<...>
>> PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>> SELECT ?r
>> {
>> ?r rdf:type test:word .
>> }
>>
>> I have to wait around 20 seconds to get a result - how can i optimize 
>> it?
>>
>> The "count" query
>> PREFIX test:<...>
>> PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>> SELECT count(?r)
>> {
>> ?r rdf:type test:word .
>> }
>>
>> returns 150 000.
>>
>> My fuseki server is running with a -Xmx1024m parameter.
>>
>> Thank you.
>> Jérôme.
>>
>> My config file:
>> <#service1> rdf:type fuseki:Service ;
>> fuseki:name "test" ; # http://host:port/ds
>> fuseki:serviceQuery "query" ; # SPARQL query service
>> fuseki:serviceQuery "sparql" ; # SPARQL query service
>> fuseki:serviceUpdate "update" ; # SPARQL update service
>> fuseki:serviceUpload "upload" ; # Non-SPARQL upload service
>> fuseki:serviceReadWriteGraphStore "data" ; # SPARQL Graph store protocol
>> (read and write)
>> # A separate read-only graph store endpoint:
>> fuseki:serviceReadGraphStore "get" ; # SPARQL Graph store protocol (read
>> only)
>> fuseki:dataset <#test> ;
>> .
>>
>>
>> <#test> rdf:type ja:RDFDataset ;
>> rdfs:label "Books" ;
>> ja:defaultGraph
>> [ rdfs:label "discours.rdf" ;
>> a ja:MemoryModel ;
>> ja:content [ja:externalContent <file:Data/discours.rdf> ] ;
>> ] ;
>> .
>>
>


Re: Fuseki query performance

Posted by Andy Seaborne <an...@apache.org>.
Jérôme,

There are 150K results?

They are streamed back (unlike Joseki), but it will take a while.

Which result format are you getting?  You might try one of the other 
result formats which might be a bit faster.

It looks like it is simply the cost of serializing and deserializing 
the results.  Unlike the second "count(?r)" query, the first query has 
to access the node table to get the URI/bnode labels for every result.

	Andy

On 06/12/11 10:08, Jérôme wrote:
> Hi,
>
> I'm trying to query my TDB store and I have some performance problems:
> Here a simple query example:
>
> PREFIX test:<...>
> PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> SELECT ?r
> {
> ?r rdf:type test:word .
> }
>
> I have to wait around 20 seconds to get a result - how can i optimize it?
>
> The "count" query
> PREFIX test:<...>
> PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> SELECT count(?r)
> {
> ?r rdf:type test:word .
> }
>
> returns 150 000.
>
> My fuseki server is running with a -Xmx1024m parameter.
>
> Thank you.
> Jérôme.
>
> My config file:
> <#service1> rdf:type fuseki:Service ;
> fuseki:name "test" ; # http://host:port/ds
> fuseki:serviceQuery "query" ; # SPARQL query service
> fuseki:serviceQuery "sparql" ; # SPARQL query service
> fuseki:serviceUpdate "update" ; # SPARQL update service
> fuseki:serviceUpload "upload" ; # Non-SPARQL upload service
> fuseki:serviceReadWriteGraphStore "data" ; # SPARQL Graph store protocol
> (read and write)
> # A separate read-only graph store endpoint:
> fuseki:serviceReadGraphStore "get" ; # SPARQL Graph store protocol (read
> only)
> fuseki:dataset <#test> ;
> .
>
>
> <#test> rdf:type ja:RDFDataset ;
> rdfs:label "Books" ;
> ja:defaultGraph
> [ rdfs:label "discours.rdf" ;
> a ja:MemoryModel ;
> ja:content [ja:externalContent <file:Data/discours.rdf> ] ;
> ] ;
> .
>


Fuseki query performance

Posted by Jérôme <je...@unicaen.fr>.
Hi,

I'm trying to query my TDB store and I have some performance problems:
Here a simple query example:

PREFIX test:<...>
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?r
{
     ?r rdf:type test:word .
}

I have to wait around 20 seconds to get a result - how can I optimize it?

The "count" query
PREFIX test:<...>
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT count(?r)
{
     ?r rdf:type test:word .
}

returns 150 000.

My fuseki server is running with a -Xmx1024m parameter.

Thank you.
Jérôme.

My config file:
<#service1> rdf:type fuseki:Service ;
     fuseki:name                     "test" ;       # http://host:port/ds
     fuseki:serviceQuery             "query" ;    # SPARQL query service
     fuseki:serviceQuery             "sparql" ;   # SPARQL query service
     fuseki:serviceUpdate            "update" ;   # SPARQL update service
     fuseki:serviceUpload            "upload" ;   # Non-SPARQL upload 
service
    fuseki:serviceReadWriteGraphStore      "data" ;     # SPARQL Graph 
store protocol (read and write)
     # A separate read-only graph store endpoint:
     fuseki:serviceReadGraphStore       "get" ;   # SPARQL Graph store 
protocol (read only)
     fuseki:dataset <#test> ;
     .


<#test>    rdf:type ja:RDFDataset ;
     rdfs:label "Books" ;
     ja:defaultGraph
       [ rdfs:label "discours.rdf" ;
         a ja:MemoryModel ;
         ja:content [ja:externalContent <file:Data/discours.rdf> ] ;
       ] ;
     .


Re: Fair Benchmarking of SDB, TDB and LUBM 100 > with inference support and limited memory

Posted by Mariano Rodriguez <ro...@inf.unibz.it>.
I came across the RIOT page for Jena [1] and saw that there are also some "loaders" defined
there; somehow in my head I made a connection between tdbloader and RIO, but it made no sense ;)

Thanks for the clarification :)

Btw, we are starting to test with tdbloader2 now; I'll report back as soon as possible



[1] http://openjena.org/wiki/RIOT


On Dec 6, 2011, at 9:12 AM, Paolo Castagna wrote:

> Mariano Rodriguez wrote:
>> We do want to test and move to the Hadoop map-reduce setting in the (mid-term) future, 
>> but first we want to have the simple setting as optimal as possible.
> 
> It makes sense. Get back in touch when you move onto MapReduce. :-)
> 
>> By the way Paolo, does tdbloader2 have anything to do with Sesame's RIO? 
> 
> tdbloader2 (both the "official" one [1] and the "experimental" (pure Java) one
> [2]) AFAIK have nothing to do with Sesame's RIO.
> 
> tdbloader2 (both of them) generates indexes (i.e. B+Tree) which are binary files
> for TDB only.
> 
> Why are you asking?
> 
> Paolo
> 
> [1] http://svn.apache.org/repos/asf/incubator/jena/Jena2/TDB/trunk/bin/tdbloader2
> [2] http://svn.apache.org/repos/asf/incubator/jena/Scratch/PC/tdbloader2/trunk/
> 
>> 
>> 
>>> Mariano, do you have an Hadoop cluster @ unibz.it?
>> 
>> 
>> That's another reason not to do the map-reduce part yet :) we also don't yet have a cluster in Bolzano :( 
>> 
>> 
>>> Cheers,
>>> Paolo
>>> 
>> 
> 


Re: SVN/GIT down?

Posted by Mariano Rodriguez <ro...@inf.unibz.it>.
> 
> Easier to download a build:
> 
> https://repository.apache.org/content/repositories/snapshots/org/apache/jena/jena-tdb/0.9.0-incubating-SNAPSHOT/
> 
> (we haven't quite got the first Apache release done yet)

Thanks, I got it now.

> 
>>>> 
>>>> [2]
>>>> http://svn.apache.org/repos/asf/incubator/jena/Scratch/PC/tdbloader2/trunk/
> 
> What problems did you encounter?

Sorry, I forgot to specify: it's just timeout errors, it doesn't manage to download anything. I just contacted the
tech support of our university to see if there is something wrong locally.

Thank you again,
M


Re: SVN/GIT down?

Posted by Andy Seaborne <an...@apache.org>.
On 06/12/11 09:40, Ian Dickinson wrote:
> Hi Mariano,
> On 06/12/11 09:31, Mariano Rodriguez wrote:
>> About checking out
>>
>>> [1]
>>> http://svn.apache.org/repos/asf/incubator/jena/Jena2/TDB/trunk/bin/tdbloader2

Easier to download a build:

https://repository.apache.org/content/repositories/snapshots/org/apache/jena/jena-tdb/0.9.0-incubating-SNAPSHOT/

(we haven't quite got the first Apache release done yet)

>>>
>>> [2]
>>> http://svn.apache.org/repos/asf/incubator/jena/Scratch/PC/tdbloader2/trunk/

What problems did you encounter?

>>>
>>
>> by any chance anybody has problem with the server at the moment? I
>> have been trying to
>> download since yesterday but there is no response from the server…
> I've just tried checking out a fresh copy of both of those repos, and
> both worked for me.
>
> Ian
>
>


Re: SVN/GIT down?

Posted by Ian Dickinson <ia...@epimorphics.com>.
Hi Mariano,
On 06/12/11 09:31, Mariano Rodriguez wrote:
> About checking out
>
>> [1] http://svn.apache.org/repos/asf/incubator/jena/Jena2/TDB/trunk/bin/tdbloader2
>> [2] http://svn.apache.org/repos/asf/incubator/jena/Scratch/PC/tdbloader2/trunk/
>
> by any chance anybody has problem with the server at the moment? I have been trying to
> download since yesterday but there is no response from the server…
I've just tried checking out a fresh copy of both of those repos, and 
both worked for me.

Ian


-- 
____________________________________________________________
Ian Dickinson                   Epimorphics Ltd, Bristol, UK
mailto:ian@epimorphics.com        http://www.epimorphics.com
cell: +44-7786-850536              landline: +44-1275-399069
------------------------------------------------------------
Epimorphics Ltd.  is a limited company registered in England
(no. 7016688). Registered address: Court Lodge, 105 High St,
               Portishead, Bristol BS20 6PT, UK


SVN/GIT down?

Posted by Mariano Rodriguez <ro...@inf.unibz.it>.
Hi again

About checking out

> [1] http://svn.apache.org/repos/asf/incubator/jena/Jena2/TDB/trunk/bin/tdbloader2
> [2] http://svn.apache.org/repos/asf/incubator/jena/Scratch/PC/tdbloader2/trunk/

by any chance anybody has problem with the server at the moment? I have been trying to
download since yesterday but there is no response from the server…

Re: Fair Benchmarking of SDB, TDB and LUBM 100 > with inference support and limited memory

Posted by Paolo Castagna <ca...@googlemail.com>.
Mariano Rodriguez wrote:
> We do want to test and move to the Hadoop map-reduce setting in the (mid-term) future, 
> but first we want to have the simple setting as optimal as possible.

It makes sense. Get back in touch when you move onto MapReduce. :-)

> By the way Paolo, does tdbloader2 have anything to do with Sesame's RIO? 

tdbloader2 (both the "official" one [1] and the "experimental" (pure Java) one
[2]) AFAIK have nothing to do with Sesame's RIO.

tdbloader2 (both of them) generates indexes (i.e. B+Tree) which are binary files
for TDB only.

Why are you asking?

Paolo

 [1] http://svn.apache.org/repos/asf/incubator/jena/Jena2/TDB/trunk/bin/tdbloader2
 [2] http://svn.apache.org/repos/asf/incubator/jena/Scratch/PC/tdbloader2/trunk/

> 
> 
>> Mariano, do you have an Hadoop cluster @ unibz.it?
> 
> 
> That's another reason not to do the map-reduce part yet :) we also don't yet have a cluster in Bolzano :( 
> 
> 
>> Cheers,
>> Paolo
>>
> 


Re: Fair Benchmarking of SDB, TDB and LUBM 100 > with inference support and limited memory

Posted by Andy Seaborne <an...@apache.org>.
On 05/12/11 13:58, Mariano Rodriguez wrote:
> In this first round of benchmarks we want to avoid any Hadoop or
> map-reduce approaches. The reason is that
> we want to have raw numbers of the core reasoning techniques, in this case forward chaining
> vs. backward chaining and our technique called semantic indexes which is a bit like backward
> chaining but with a tiny bit of extra work at loading time. We want to avoid evaluating
> benefits from the architecture of the system (map-reduce for example) because the technique that we are
> testing can also be extended with map-reduce and a parallel architecture.

In the past, I've experimented with forward-chaining the schema and 
doing one step of backward chaining in the query.

Merely forward chaining everything (even just the useful subclass, 
subproperty, domain and range as is done by riotcmd.infer) causes triple 
bloat and, at scale, the bloat can reduce the effectiveness of disk caching.

But pure backward chaining has a horrible access pattern on the data 
(walking arbitrary length paths):

?x rdf:type/rdfs:subClassOf* :type

?x ?p ?v . ?p rdfs:subPropertyOf* :property

(obviously you don't have to do it this way - this is just the naive way 
and it can be written in SPARQL 1.1 - it's even in the spec).

Assuming the schema is small compared to the data and fixed, 
preprocessing the schema to have a single table of (type, supertype) 
with the transitive closure turns it into two patterns:

?x rdf:type ?var . table(?var, :type)
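The (type, supertype) table is just the transitive closure of the schema's rdfs:subClassOf links, computable once up front since the schema is small and fixed. A plain-Java sketch of that precomputation (stdlib only; all names are illustrative, this is not Jena code):

```java
import java.util.*;

// Build the (class, superclass) closure table from direct subClassOf
// edges via breadth-first search from each class.
public class SchemaClosure {
    static Map<String, Set<String>> close(Map<String, Set<String>> direct) {
        Map<String, Set<String>> closed = new HashMap<>();
        for (String c : direct.keySet()) {
            Set<String> seen = new LinkedHashSet<>();
            Deque<String> todo = new ArrayDeque<>(direct.get(c));
            while (!todo.isEmpty()) {
                String d = todo.pop();
                if (seen.add(d))                       // newly reached superclass
                    todo.addAll(direct.getOrDefault(d, Set.of()));
            }
            closed.put(c, seen);
        }
        return closed;
    }
}
```

With the table in hand, the arbitrary-length path lookup collapses to a single table probe per triple pattern, which is the two-pattern form shown above.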

LUBM is unusual in several ways.  All systems I know of load faster on 
LUBM than on any other benchmark because it has a low node-to-triple ratio 
(i.e. it is very interconnected within each university).  RDFS-level 
inference increases this effect because inference can add triples but 
not create new RDF terms.  Loading new nodes means the bytes for the URI or 
literal need to be stored, which requires more work.

It would be easy to add this to TDB (the prototyping was for SDB where 
it's more important due to JDBC-isms) - doing it as part of the more 
general property tables would be interesting.

TDB scales much better than SDB (load and query).

	Andy

Re: Fair Benchmarking of SDB, TDB and LUBM 100 > with inference support and limited memory

Posted by Mariano Rodriguez <ro...@inf.unibz.it>.
Hi Paolo, 

On Dec 5, 2011, at 11:26 AM, Paolo Castagna wrote:

> Andy Seaborne wrote:
>> The flow is:
>> 
>> infer --rdfs=VOCAB DATA | tdbloader2 --loc DB
>> 
>> on a 64bit system.  Linux is faster than Windows.
>> 
>> (tdbloader2 only runs on linux currently - Paolo has a pure java version
>> on github)
> 
> tdbloader2 (pure Java version) is here (experimental):
> http://svn.apache.org/repos/asf/incubator/jena/Scratch/PC/tdbloader2/trunk/
> 
> If you want to discuss further or help, see JENA-117:
> https://issues.apache.org/jira/browse/JENA-117

Excellent, we'll start reading the docs asap

> 
> Inference a la RIOT infer command line can be done using MapReduce as well
> (a map only job), I doubt you can beat that if you have a medium to large
> Hadoop cluster. ;-)

In this first round of benchmarks we want to avoid any Hadoop or 
map-reduce approaches. The reason is that
we want raw numbers for the core reasoning techniques, in this case forward chaining
vs. backward chaining and our technique, called semantic indexes, which is a bit like backward
chaining but with a tiny bit of extra work at loading time. We want to avoid measuring
benefits that come from the architecture of the system (map-reduce, for example) because the technique we are
testing can also be extended with map-reduce and a parallel architecture. 

We do want to test and move to the Hadoop map-reduce setting in the (mid-term) future, but first
we want to have the simple setting as optimal as possible.

By the way Paolo, does tdbloader2 have anything to do with Sesame's RIO? 


> Mariano, do you have an Hadoop cluster @ unibz.it?


That's another reason not to do the map-reduce part yet :) we also don't have a cluster at Bolzano yet :( 


> 
> Cheers,
> Paolo
> 


Re: Fair Benchmarking of SDB, TDB and LUBM 100 > with inference support and limited memory

Posted by Paolo Castagna <ca...@googlemail.com>.
Andy Seaborne wrote:
> The flow is:
> 
> infer --rdfs=VOCAB DATA | tdbloader2 --loc DB
> 
> on a 64bit system.  Linux is faster than Windows.
> 
> (tdbloader2 only runs on linux currently - Paolo has a pure java version
> on github)

tdbloader2 (pure Java version) is here (experimental):
http://svn.apache.org/repos/asf/incubator/jena/Scratch/PC/tdbloader2/trunk/

If you want to discuss further or help, see JENA-117:
https://issues.apache.org/jira/browse/JENA-117

Inference a la RIOT infer command line can be done using MapReduce as well
(a map only job), I doubt you can beat that if you have a medium to large
Hadoop cluster. ;-)

See, for example (... another experimental thing):
https://github.com/castagna/tdbloader3/blob/master/src/main/java/org/apache/jena/tdbloader3/InferDriver.java
https://github.com/castagna/tdbloader3/blob/master/src/main/java/org/apache/jena/tdbloader3/InferMapper.java

Using MapReduce to generate TDB indexes is possible, but not 'easy'.
See, for example: https://github.com/castagna/tdbloader3/

I am planning to investigate the route of having hash node ids which
would simplify parallel generation of TDB indexes as well as merging
existing indexes.

Mariano, do you have an Hadoop cluster @ unibz.it?

Cheers,
Paolo


Re: Fair Benchmarking of SDB, TDB and LUBM 100 > with inference support and limited memory

Posted by Andy Seaborne <an...@apache.org>.
Hi Mariano,

On 04/12/11 13:54, Mariano Rodriguez wrote:
> Hi all,
>
> We are now benchmarking several triple stores that support inference
> through forward chaining against a system that does a particular form
> of query rewriting.
>
> The benchmark we are using is simple, an extended version of LUBM,
> using big datasets LUBM 1000, 8000, 15000, 250000. From Jena we would
> like to benchmark loading time, inference time and query answering
> time, using both TDB and SDB. Inferences should be done with limited
> amounts of memory, the less the better. However, we are having
> difficulties understanding what is the fair way to do this. Also, the
> system used for this benchmarks should be a simple system, not a
> cluster or a server with large resources. We would like to ask the
> community for help to approach this in the best way possible. Hence
> this email :). Here go some questions and ideas.
 >
 > Is it the case that the default inference engine of Jena requires all
 > triples to be in-memory? Is it not possible to do this on this? If
 > this is so, what would be the fair way to benchmark the system?
There are a couple of dimensions to think about:

1/ Do you want to test LUBM or a more general data?
2/ What level of inference do you wish to test?

(1) => For LUBM, there is no inference across universities, so you can 
generate the data for one university, run the forward-chaining inference on 
it and move on to the next university, knowing that no triples generated 
later will affect the university you have just processed (and 
so you don't need to retain state for it).
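
As a rough illustration of that per-university loop (the "closure" here is a one-rule stand-in, not a Jena reasoner, and all triple and class names are invented):

```java
import java.util.*;

public class PartitionedLoad {
    // Stand-in for forward chaining over one partition: every ub:Professor
    // instance also gets a ub:Person type triple.
    static List<String> closePartition(List<String> triples) {
        List<String> out = new ArrayList<>(triples);
        for (String t : triples)
            if (t.endsWith("ub:Professor"))
                out.add(t.replace("ub:Professor", "ub:Person"));
        return out;
    }

    public static void main(String[] args) {
        // Two "universities"; in the real benchmark each partition would
        // come from the LUBM data generator.
        List<List<String>> universities = List.of(
            List.of("ex:p1 rdf:type ub:Professor"),
            List.of("ex:p2 rdf:type ub:Professor", "ex:c1 rdf:type ub:Course"));
        int stored = 0;
        for (List<String> uni : universities) {
            List<String> closed = closePartition(uni); // close this partition
            stored += closed.size();                   // "write to TDB/SDB"
            // closed goes out of scope here: no cross-partition state kept
        }
        System.out.println(stored); // prints 5
    }
}
```

The point of the structure is that peak memory is bounded by the largest single university, not by the whole dataset.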

(2) => Inference for LUBM only needs one data triple and access to the 
ontology to calculate the inferences.  Once a triple has been processed, 
you can emit the inferred triples and move on.  Again, no data-related 
state is needed.
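
A toy sketch of this one-triple-at-a-time style of inference (the schema maps, class and property names are hand-coded stand-ins for a real RDFS vocabulary, not Jena API):

```java
import java.util.*;

public class StreamInfer {
    // Schema held in memory: a transitively-closed subClassOf map and
    // rdfs:domain declarations. All names here are made up for illustration.
    static Map<String, List<String>> superClasses = new HashMap<>();
    static Map<String, String> domain = new HashMap<>();

    // One data triple in, inferred triples out; no data-dependent state
    // survives between calls, so memory use is bounded by the schema size.
    static List<String[]> inferFrom(String s, String p, String o) {
        List<String[]> out = new ArrayList<>();
        if (p.equals("rdf:type")) {
            for (String sup : superClasses.getOrDefault(o, List.of()))
                out.add(new String[]{s, "rdf:type", sup});
        } else if (domain.containsKey(p)) {
            String d = domain.get(p);
            out.add(new String[]{s, "rdf:type", d});
            for (String sup : superClasses.getOrDefault(d, List.of()))
                out.add(new String[]{s, "rdf:type", sup});
        }
        return out;
    }

    public static void main(String[] args) {
        superClasses.put("ub:FullProfessor", List.of("ub:Professor", "ub:Faculty"));
        domain.put("ub:teacherOf", "ub:Faculty");
        for (String[] t : inferFrom("ex:prof1", "rdf:type", "ub:FullProfessor"))
            System.out.println(String.join(" ", t));
        // prints: ex:prof1 rdf:type ub:Professor
        //         ex:prof1 rdf:type ub:Faculty
    }
}
```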

The Jena rules-based reasoner, which is RETE-based, is more powerful 
than is needed for RDFS or LUBM, supporting rules based on multiple data 
triples and retraction, but the cost is that it stores internal state 
in-memory, scaling with the size of the data.

There is also a stream-based forward chaining engine, riotcmd.infer, 
that keeps the RDFS schema in memory but not the state of the data, so it 
uses a fixed amount of space that does not increase with data size.

This is probably the best way to infer over LUBM at scale.

> Right
> now we are thinking of a workflow as follows:
>
> 1. Start a TDB or SDB store.
 > 2. Load 10 LUBMS in memory, compute the
> closure using
>
> Reasoner reasoner = ReasonerRegistry.getOWLReasoner(); InfModel inf =
> ModelFactory.createInfModel(reasoner, monto, m);
>
> and storing the result in SDB or TDB. When finished,
 > 3. Query the store directly.
 >
> Is this the most efficient way to do it? Are there important
> parameters (besides the number of universities used in the
> computation of the closure) that we should tune to guarantee a fair
> evaluation? Are there any documents that we could use to guide
> ourselfs during tuning of Jena?

This is exploiting the features of LUBM (you only need one university). 
  I don't have figures, but I'd expect riotcmd.infer to be faster as it's 
less general.

The flow is:

infer --rdfs=VOCAB DATA | tdbloader2 --loc DB

on a 64bit system.  Linux is faster than Windows.

(tdbloader2 only runs on linux currently - Paolo has a pure java version 
on github)

> Thank you very much in advance everybody,
>
> Best regards, Mariano

	Good luck,
	Andy

>
>
>
> Mariano Rodriguez Muro http://www.inf.unibz.it/~rodriguez/ KRDB
> Research Center Faculty of Computer Science Free University of
> Bozen-Bolzano (FUB) Piazza Domenicani 3, I-39100 Bozen-Bolzano BZ,
> Italy
>
>
>
>