Posted to users@jena.apache.org by David Jordan <Da...@sas.com> on 2011/09/19 17:04:00 UTC

benchmarking

I have switched over from SDB to TDB to see if I can get better performance.
In the following, Database is a class of mine that insulates the code from knowing if it is SDB or TDB.

I do the following, which combines 2 models I have stored in TDB and then reads a third small model from a file that contains some classes I want to “test”. I then have some code that times how long it takes to get a particular class and list its instances.

Model model1 = Database.getICD9inferredModel();
Model model2 = Database.getPatientModel();
OntModel omodel = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM_MICRO_RULE_INF, model1);
omodel.add(model2);

InputStream in = FileManager.get().open(fileName);
omodel.read(in, baseName, "TURTLE");

OntClass oclass = omodel.getOntClass(line);   // access the class

On the first call to getOntClass, I have been seeing a VERY long wait (around an hour) before I get a response.
Then after that first call, subsequent calls are much faster.
But I started looking at the CPU utilization. After the call to getOntClass, CPU utilization is very close to 0.
Is this to be expected?
Is there any form of tracing/logging that can be turned on to determine what (if anything) is happening?

Is there something I am doing wrong in setting up my models?
For the ICD9 ontology I am using, I had read in the OWL data, created an OntModel with it, wrote this OntModel data out.
Then I store the data from the OntModel into TDB, so it supposedly does not have to do as much work at runtime.
Yet, performance is still awful.
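
For reference, the precompute step looks roughly like this (a simplified sketch rather than my exact code; the file names and the TDB directory are just placeholders):

import java.io.FileOutputStream;
import java.io.OutputStream;

import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.tdb.TDBFactory;
import com.hp.hpl.jena.util.FileManager;

public class PrecomputeICD9 {
    public static void main(String[] args) throws Exception {
        // 1. Load the raw ICD9 ontology into memory.
        Model raw = FileManager.get().loadModel("ICD9.owl");

        // 2. Run the OWL micro rule reasoner over it once, offline.
        OntModel inf = ModelFactory.createOntologyModel(
                OntModelSpec.OWL_MEM_MICRO_RULE_INF, raw);

        // 3. writeAll (not write), so the inferred statements are included.
        OutputStream out = new FileOutputStream("ICD9-inferred.ttl");
        inf.writeAll(out, "TURTLE", null);
        out.close();

        // 4. Store the materialised closure in TDB, to be used later as a
        //    plain model with no reasoner attached.
        Model tdb = TDBFactory.createModel("/data/tdb/icd9");
        tdb.add(inf);
        tdb.close();
    }
}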

David Jordan
Software Developer
SAS Institute Inc.
Health & Life Sciences, Research & Development
Bldg R ▪ Office 4467
600 Research Drive ▪ Cary, NC 27513
Tel: 919 531 1233 ▪ david.jordan@sas.com
www.sas.com
SAS® … THE POWER TO KNOW®




Re: benchmarking

Posted by Andy Seaborne <an...@apache.org>.
On 21/09/11 15:54, David Jordan wrote:
> Understood. When you say
>> We're thinking of using it for TDB direct mode
> does that mean it is planned for the future, versus implemented today?

Yes - direct mode of TDB is used on 32 bit machines.  It currently uses 
ByteBuffer.allocate(), i.e. in-heap byte buffers.  The direct byte buffer 
(a different use of the word "direct"!) should be faster, but we need to 
test whether it gives a measurable improvement and does not cause 
difficulties in some other way.

The direct byte buffer area is a fixed size - hit the limit and the JVM 
will die, so management of the buffers needs testing.

	Andy

>
> -----Original Message-----
> From: Andy Seaborne [mailto:andy.seaborne.apache@gmail.com] On Behalf Of Andy Seaborne
> Sent: Wednesday, September 21, 2011 10:06 AM
> To: jena-users@incubator.apache.org
> Subject: Re: benchmarking
>
> On 21/09/11 11:32, Dave Reynolds wrote:
>> On Tue, 2011-09-20 at 17:22 +0000, David Jordan wrote:
>>> I guess I had the wrong impression of how TDB is implemented. I thought on 64 bit environments that some of the files were memory-mapped directly into the virtual memory space of the application process.
>>
>> They are memory mapped but that doesn't mean they are all in memory
>> unless you have enough memory available, have a cooperative OS, and
>> have set appropriate options such as -XX:MaxDirectMemorySize depending
>> on your JVM.
>
> -XX:MaxDirectMemorySize affects a different area of memory allocation to memory mapped files.
>
> Direct memory is from ByteBuffer.allocateDirect(capacity).  It is really plain old malloc'ed space - it isn't in the heap, isn't a mapped memory segment, and does not move due to GC, but is freed when it becomes unreachable.  I've assumed that the size is fixed so the process can grow the heap via sbrk(2) and still have a contiguous heap.
>
> Such a direct memory ByteBuffer is in the same segment of memory as the rest of the JVM; it is faster to access than a heap ByteBuffer.  We're thinking of using it for TDB direct mode.
>
> Memory mapped files are also direct ByteBuffers but they aren't from allocated space.  There are two kinds of direct memory buffers in Java.
> The OS manages memory mapped files via segments of the address space - the Java process does not allocate them and they live in memory mapped segments of your process.
>
> 	Andy
>


RE: benchmarking

Posted by David Jordan <Da...@sas.com>.
Understood. When you say
> We're thinking of using it for TDB direct mode
does that mean it is planned for the future, versus implemented today?

-----Original Message-----
From: Andy Seaborne [mailto:andy.seaborne.apache@gmail.com] On Behalf Of Andy Seaborne
Sent: Wednesday, September 21, 2011 10:06 AM
To: jena-users@incubator.apache.org
Subject: Re: benchmarking

On 21/09/11 11:32, Dave Reynolds wrote:
> On Tue, 2011-09-20 at 17:22 +0000, David Jordan wrote:
>> I guess I had the wrong impression of how TDB is implemented. I thought on 64 bit environments that some of the files were memory-mapped directly into the virtual memory space of the application process.
>
> They are memory mapped but that doesn't mean they are all in memory 
> unless you have enough memory available, have a cooperative OS, and 
> have set appropriate options such as -XX:MaxDirectMemorySize depending 
> on your JVM.

-XX:MaxDirectMemorySize affects a different area of memory allocation to memory mapped files.

Direct memory is from ByteBuffer.allocateDirect(capacity).  It is really plain old malloc'ed space - it isn't in the heap, isn't a mapped memory segment, and does not move due to GC, but is freed when it becomes unreachable.  I've assumed that the size is fixed so the process can grow the heap via sbrk(2) and still have a contiguous heap.

Such a direct memory ByteBuffer is in the same segment of memory as the rest of the JVM; it is faster to access than a heap ByteBuffer.  We're thinking of using it for TDB direct mode.

Memory mapped files are also direct ByteBuffers but they aren't from allocated space.  There are two kinds of direct memory buffers in Java.
The OS manages memory mapped files via segments of the address space - the Java process does not allocate them and they live in memory mapped segments of your process.

	Andy


Re: benchmarking

Posted by Andy Seaborne <an...@apache.org>.
On 21/09/11 11:32, Dave Reynolds wrote:
> On Tue, 2011-09-20 at 17:22 +0000, David Jordan wrote:
>> I guess I had the wrong impression of how TDB is implemented. I thought on 64 bit environments that some of the files were memory-mapped directly into the virtual memory space of the application process.
>
> They are memory mapped but that doesn't mean they are all in memory
> unless you have enough memory available, have a cooperative OS, and have
> set appropriate options such as -XX:MaxDirectMemorySize depending on
> your JVM.

-XX:MaxDirectMemorySize affects a different area of memory allocation to 
memory mapped files.

Direct memory is from ByteBuffer.allocateDirect(capacity).  It is really 
plain old malloc'ed space - it isn't in the heap, isn't a mapped memory 
segment, and does not move due to GC, but is freed when it becomes 
unreachable.  I've assumed that the size is fixed so the process can 
grow the heap via sbrk(2) and still have a contiguous heap.

Such a direct memory ByteBuffer is in the same segment of memory as the 
rest of the JVM; it is faster to access than a heap ByteBuffer.  We're 
thinking of using it for TDB direct mode.

Memory mapped files are also direct ByteBuffers but they aren't from 
allocated space.  There are two kinds of direct memory buffers in Java. 
The OS manages memory mapped files via segments of the address space - 
the Java process does not allocate them and they live in memory mapped 
segments of your process.
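
In code the three kinds of buffer look roughly like this (just an 
illustrative sketch; the mapped file name is a placeholder):

import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class BufferKinds {
    public static void main(String[] args) throws Exception {
        // In-heap buffer: backed by a Java byte[], managed and moved by the GC.
        ByteBuffer heap = ByteBuffer.allocate(8 * 1024 * 1024);

        // Direct buffer: allocated outside the heap, capped in total by
        // -XX:MaxDirectMemorySize, faster to access than a heap buffer.
        ByteBuffer direct = ByteBuffer.allocateDirect(8 * 1024 * 1024);

        // Memory-mapped file: also reported as "direct", but the memory is a
        // mapped segment managed by the OS, not allocated by the Java process.
        RandomAccessFile raf = new RandomAccessFile("nodes.dat", "r");
        FileChannel channel = raf.getChannel();
        MappedByteBuffer mapped =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());

        System.out.println(heap.isDirect());    // false
        System.out.println(direct.isDirect());  // true
        System.out.println(mapped.isDirect());  // true
        raf.close();
    }
}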

	Andy

RE: benchmarking

Posted by Dave Reynolds <da...@gmail.com>.
On Tue, 2011-09-20 at 17:22 +0000, David Jordan wrote: 
> I guess I had the wrong impression of how TDB is implemented. I thought on 64 bit environments that some of the files were memory-mapped directly into the virtual memory space of the application process.

They are memory mapped but that doesn't mean they are all in memory
unless you have enough memory available, have a cooperative OS, and have
set appropriate options such as -XX:MaxDirectMemorySize depending on
your JVM.

> A few more questions:
> 1. What I am seeing is that there is a LONG wait the first time I access the OntModel. It seems that all of the reasoning/inferencing work is being done in bulk at that time, because after that first call, things are fast.

Actually not quite. In the rules there is a split between forward (up-front,
eager) inference and backward (on-demand) inference, though it errs
towards forward inference since that makes post-prepare queries
more predictable.

You can create custom rule sets using entirely backward rules.

> Is this upfront all-at-once reasoning an aspect of all OWL reasoners, i.e. is it a necessity of the OWL language itself,
> or is this just specific to the built-in Jena reasoners?

It is not a necessity; there are lots of ways you can perform the inference,
and different trade-offs are possible.

> Are there third party Jena reasoners that do more of a "lazy evaluation" of entailments?

C&P's new Stardog store does query-time reasoning, I believe; certainly
for OWL QL.

Virtuoso can support some backward rules, but I don't know about their
OWL support.

> Or are there reasoners that do it upfront, but are much faster than the built-in reasoners?

For DL, Pellet can outperform the built-in OWL_MICRO for complex
ontologies, though the reverse can also be true in simpler cases where
OWL_MICRO's coverage is sufficient.

There is also BigOWLIM, though I don't know the state of its Jena
support.

> 2. The ICD9 ontology is a fully self-contained ontology. It does not depend on anything outside of itself.
> I put this in its own model, I also created an OntModel of it, used writeAll to output the OntModel, then placed this in a new model in Jena. I did this because I thought it would speed things up considerably. There should not be any additional reasoning that needs to be done on this particular model. It is read-only and fully "reasoned". 

"read only" isn't meaningful or relevant here, the reasoners don't
modify the original data.

> It would be really nice if there was a performance hint that could be given to the OntModel that this particular ICD9 model is fully reasoned, self-contained and requires no further reasoning. I understand this is not currently supported, but I am suggesting that this may be a very useful feature for improving performance.

I can see the attraction, but it's not obvious how to implement it :(

> We will eventually be including many other similar biomedical ontologies that are read-only and fully self-contained, for which we can generate a reasoned model as I have done for ICD9.

If your aim is to reason over patient data with those as background, then
it may be that you want a custom rule set: perform just the inferences you
want, using the precomputed closures, rather than doing complete OWL
inference over the entire merged ontology set plus instance data. That's
an option that has worked for me in the past, though I don't have an
automated way of setting such a thing up.
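
Something along these lines, for example (an untested sketch - the rule, 
the URIs and the TDB locations are only placeholders for whatever 
vocabulary and storage you actually use):

import com.hp.hpl.jena.rdf.model.InfModel;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.reasoner.rulesys.GenericRuleReasoner;
import com.hp.hpl.jena.reasoner.rulesys.Rule;
import com.hp.hpl.jena.tdb.TDBFactory;

public class CustomRules {
    public static void main(String[] args) {
        // One backward (on-demand) rule: a patient whose diagnosis is in the
        // precomputed Prostate_Cancer class is a member of Cohort1.
        String rules =
            "[cohort1: (?p rdf:type <http://example.org/Cohort1>) <- " +
            "   (?p <http://example.org/patient#hasDiagnosis> ?d), " +
            "   (?d rdf:type <http://example.org/Prostate_Cancer>) ]";

        GenericRuleReasoner reasoner =
                new GenericRuleReasoner(Rule.parseRules(rules));
        reasoner.setMode(GenericRuleReasoner.BACKWARD);

        // Bind the precomputed ICD9 closure as the schema, then reason over
        // the patient data only when queries arrive.
        Model icd9Closure = TDBFactory.createModel("/data/tdb/icd9");
        Model patients = TDBFactory.createModel("/data/tdb/patients");
        InfModel inf = ModelFactory.createInfModel(
                reasoner.bindSchema(icd9Closure), patients);

        System.out.println(inf.size());
    }
}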

Dave


> -----Original Message-----
> From: Dave Reynolds [mailto:dave.e.reynolds@gmail.com] 
> Sent: Tuesday, September 20, 2011 9:23 AM
> To: jena-users@incubator.apache.org
> Subject: RE: benchmarking
> 
> Hi,
> 
> On Mon, 2011-09-19 at 18:16 +0000, David Jordan wrote: 
> > One question is what level of inferencing is really necessary for things like the following:
> > It is not very clear to me yet which OWL constructs require particular inferencing levels.
> > 
> > :Prostate_Cancer a owl:Class ;
> > 	owl:unionOf (
> > 		HOM_ICD9:HOM_ICD_10566	# V10.46  These identify ICD9 codes for Prostate Cancer
> > 		HOM_ICD9:HOM_ICD_1767	# 233.4     which exist in a large class hierarchy
> > 		HOM_ICD9:HOM_ICD_1343	# 185
> > 	) .
> > 
> > :Cohort1 a owl:Class ;
> > 	owl:equivalentClass [
> > 		a owl:Restriction ;
> > 		owl:onProperty patient:hasDiagnosis ;  # which patients have a diagnosis associated with prostate cancer?
> > 		owl:someValuesFrom :Prostate_Cancer
> > 	] .
> > 
> > Except for taking the ICD9 class hierarchy into account, this is not really much more than a simple database query.
> 
> Actually it is a lot more than a database query. For example, if you now assert (:eg a :Cohort1) then a DL reasoner will "know" that :eg has a diagnosis which is within one of the three members of the :Prostate_Cancer union but not which one.  If you then add additional information that constrains some of the other cases you may be able to determine which ICD9 code it is. By using owl:equivalentClass you are enabling reasoning in either direction across that equivalence, including reasoning by cases.
> 
> To determine the level of inference required you either have to assume complete inference and profile the language (DL, EL, QL etc) or you need to *also* specify what specific queries you might want to ask.
> 
> > The nice aspect of doing this in OWL is that we can define these sets, like :Prostate_Cancer and :Cohort1, and then ask other questions of these sets.
> 
> Agreed.
> 
> > I thought that with TDB running on a 64 bit Linux box, doing memory mapped I/O, that TDB could efficiently pull everything into memory quickly, avoiding doing lots of fine grained SQL calls to a MySQL server.
> 
> Think of it as more like a halfway house. Each query to TDB involves walking over the appropriate B-tree indexes, just like in a database.
> Given enough memory and an operating system good enough at disc block caching you might end up with everything, or near everything, paged into memory, but that is not guaranteed. Lots of factors come into play.
> 
> > I did use writeAll for writing the OntModel.
> > 
> > Relative to your suggestion of
> > (1) Precompute all inferences, store those, then at runtime work with plain (no inference at all) models over that stored closure.
> > 
> > Would I need to do this for EVERYTHING, including the declarations above for Prostate_Cancer and Cohort1?
> 
> Yes. Essentially create an in-memory model containing everything you want to reason over. Then either materialize everything or, if you know the query patterns you are interested in, then ask those queries and materialize the results.
> 
> Alternatively you may want to consider the commercial triple stores that offer inference at scale with Jena compatibility.
> 
> Dave
> 
> > 
> > 
> > -----Original Message-----
> > From: Dave Reynolds [mailto:dave.e.reynolds@gmail.com]
> > Sent: Monday, September 19, 2011 11:39 AM
> > To: jena-users@incubator.apache.org
> > Subject: Re: benchmarking
> > 
> > Hi,
> > 
> > On Mon, 2011-09-19 at 15:04 +0000, David Jordan wrote: 
> > > I have switched over from SDB to TDB to see if I can get better performance.
> > > In the following, Database is a class of mine that insulates the code from knowing if it is SDB or TDB.
> > > 
> > > I do the following, which combines 2 models I have stored in TDB and then reads a third small model from a file that contains some classes I want to “test”. I then have some code that times how long it takes to get a particular class and list its instances.
> > > 
> > > Model model1 = Database.getICD9inferredModel();
> > > Model model2 = Database.getPatientModel();
> > > OntModel omodel = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM_MICRO_RULE_INF, model1);
> > > omodel.add(model2);
> > 
> > That is running a full rule reasoner over the TDB model. As I've mentioned before the rule inference engines store everything in memory so that doesn't give you any scaling over simply loading the file into memory and doing inference over that, it just goes very very slowly!
> > 
> > > InputStream in = FileManager.get().open(fileName); omodel.read(in, 
> > > baseName, "TURTLE");
> > > 
> > > OntClass oclass = omodel.getOntClass(line);   // access the class
> > > 
> > > On the first call to getOntClass, I have been seeing a VERY long wait (around an hour) before I get a response.
> > > Then after that first call, subsequent calls are much faster.
> > > But I started looking at the CPU utilization. After the call to getOntClass, CPU utilization is very close to 0.
> > > Is this to be expected?
> > 
> > Seems plausible; the inference engines are in effect doing a huge number of triple queries to TDB, which will spend most of its time waiting for the disk.
> > 
> > If you really need to run live inference over the entire dataset then load it into a memory model first, then construct your inference model over that.
> > 
> > > Is there any form of tracing/logging that can be turned on to determine what (if anything) is happening?
> > > 
> > > Is there something I am doing wrong in setting up my models?
> > > For the ICD9 ontology I am using, I had read in the OWL data, created an OntModel with it, wrote this OntModel data out.
> > > Then I store the data from the OntModel into TDB, so it supposedly does not have to do as much work at runtime.
> > 
> > As Chris says, make sure you are using writeAll, not just plain write, to store the OntModel.
> > 
> > That aside, this doesn't necessarily save you much work because the rules are having to run anyway, they are just not discovering anything much new.
> > 
> > In the absence of a highly scalable inference solution for Jena (something which can't be done without resourcing) then your two good options are:
> > 
> > (1) Precompute all inferences, store those, then at runtime work with plain (no inference at all) models over that stored closure.
> > 
> > (2) Load all the data into memory and run inference over that.
> > 
> > Dave
> > 
> > 
> > 
> 
> 
> 
> 




RE: benchmarking

Posted by David Jordan <Da...@sas.com>.
I guess I had the wrong impression of how TDB is implemented. I thought on 64 bit environments that some of the files were memory-mapped directly into the virtual memory space of the application process.

I'll be having a conversation tomorrow with the architect of AllegroGraph.

A few more questions:
1. What I am seeing is that there is a LONG wait the first time I access the OntModel. It seems that all of the reasoning/inferencing work is being done in bulk at that time, because after that first call, things are fast.

Is this upfront all-at-once reasoning an aspect of all OWL reasoners, i.e. is it a necessity of the OWL language itself,
or is this just specific to the built-in Jena reasoners?
Are there third party Jena reasoners that do more of a "lazy evaluation" of entailments?
Or are there reasoners that do it upfront, but are much faster than the built-in reasoners?

2. The ICD9 ontology is a fully self-contained ontology. It does not depend on anything outside of itself.
I put this in its own model, I also created an OntModel of it, used writeAll to output the OntModel, then placed this in a new model in Jena. I did this because I thought it would speed things up considerably. There should not be any additional reasoning that needs to be done on this particular model. It is read-only and fully "reasoned".  It would be really nice if there was a performance hint that could be given to the OntModel that this particular ICD9 model is fully reasoned, self-contained and requires no further reasoning. I understand this is not currently supported, but I am suggesting that this may be a very useful feature for improving performance. We will eventually be including many other similar biomedical ontologies that are read-only and fully self-contained, for which we can generate a reasoned model as I have done for ICD9.


-----Original Message-----
From: Dave Reynolds [mailto:dave.e.reynolds@gmail.com] 
Sent: Tuesday, September 20, 2011 9:23 AM
To: jena-users@incubator.apache.org
Subject: RE: benchmarking

Hi,

On Mon, 2011-09-19 at 18:16 +0000, David Jordan wrote: 
> One question is what level of inferencing is really necessary for things like the following:
> It is not very clear to me yet which OWL constructs require particular inferencing levels.
> 
> :Prostate_Cancer a owl:Class ;
> 	owl:unionOf (
> 		HOM_ICD9:HOM_ICD_10566	# V10.46  These identify ICD9 codes for Prostate Cancer
> 		HOM_ICD9:HOM_ICD_1767	# 233.4     which exist in a large class hierarchy
> 		HOM_ICD9:HOM_ICD_1343	# 185
> 	) .
> 
> :Cohort1 a owl:Class ;
> 	owl:equivalentClass [
> 		a owl:Restriction ;
> 		owl:onProperty patient:hasDiagnosis ;  # which patients have a diagnosis associated with prostate cancer?
> 		owl:someValuesFrom :Prostate_Cancer
> 	] .
> 
> Except for taking the ICD9 class hierarchy into account, this is not really much more than a simple database query.

Actually it is a lot more than a database query. For example, if you now assert (:eg a :Cohort1) then a DL reasoner will "know" that :eg has a diagnosis which is within one of the three members of the :Prostate_Cancer union but not which one.  If you then add additional information that constrains some of the other cases you may be able to determine which ICD9 code it is. By using owl:equivalentClass you are enabling reasoning in either direction across that equivalence, including reasoning by cases.

To determine the level of inference required you either have to assume complete inference and profile the language (DL, EL, QL etc) or you need to *also* specify what specific queries you might want to ask.

> The nice aspect of doing this in OWL is that we can define these sets, like :Prostate_Cancer and :Cohort1, and then ask other questions of these sets.

Agreed.

> I thought that with TDB running on a 64 bit Linux box, doing memory mapped I/O, that TDB could efficiently pull everything into memory quickly, avoiding doing lots of fine grained SQL calls to a MySQL server.

Think of it as more like a halfway house. Each query to TDB involves walking over the appropriate B-tree indexes, just like in a database.
Given enough memory and an operating system good enough at disc block caching you might end up with everything, or near everything, paged into memory, but that is not guaranteed. Lots of factors come into play.

> I did use writeAll for writing the OntModel.
> 
> Relative to your suggestion of
> (1) Precompute all inferences, store those, then at runtime work with plain (no inference at all) models over that stored closure.
> 
> Would I need to do this for EVERYTHING, including the declarations above for Prostate_Cancer and Cohort1?

Yes. Essentially create an in-memory model containing everything you want to reason over. Then either materialize everything or, if you know the query patterns you are interested in, then ask those queries and materialize the results.

Alternatively you may want to consider the commercial triple stores that offer inference at scale with Jena compatibility.

Dave

> 
> 
> -----Original Message-----
> From: Dave Reynolds [mailto:dave.e.reynolds@gmail.com]
> Sent: Monday, September 19, 2011 11:39 AM
> To: jena-users@incubator.apache.org
> Subject: Re: benchmarking
> 
> Hi,
> 
> On Mon, 2011-09-19 at 15:04 +0000, David Jordan wrote: 
> > I have switched over from SDB to TDB to see if I can get better performance.
> > In the following, Database is a class of mine that insulates the code from knowing if it is SDB or TDB.
> > 
> > I do the following, which combines 2 models I have stored in TDB and then reads a third small model from a file that contains some classes I want to “test”. I then have some code that times how long it takes to get a particular class and list its instances.
> > 
> > Model model1 = Database.getICD9inferredModel();
> > Model model2 = Database.getPatientModel();
> > OntModel omodel = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM_MICRO_RULE_INF, model1);
> > omodel.add(model2);
> 
> That is running a full rule reasoner over the TDB model. As I've mentioned before the rule inference engines store everything in memory so that doesn't give you any scaling over simply loading the file into memory and doing inference over that, it just goes very very slowly!
> 
> > InputStream in = FileManager.get().open(fileName); omodel.read(in, 
> > baseName, "TURTLE");
> > 
> > OntClass oclass = omodel.getOntClass(line);   // access the class
> > 
> > On the first call to getOntClass, I have been seeing a VERY long wait (around an hour) before I get a response.
> > Then after that first call, subsequent calls are much faster.
> > But I started looking at the CPU utilization. After the call to getOntClass, CPU utilization is very close to 0.
> > Is this to be expected?
> 
> Seems plausible; the inference engines are in effect doing a huge number of triple queries to TDB, which will spend most of its time waiting for the disk.
> 
> If you really need to run live inference over the entire dataset then load it into a memory model first, then construct your inference model over that.
> 
> > Is there any form of tracing/logging that can be turned on to determine what (if anything) is happening?
> > 
> > Is there something I am doing wrong in setting up my models?
> > For the ICD9 ontology I am using, I had read in the OWL data, created an OntModel with it, wrote this OntModel data out.
> > Then I store the data from the OntModel into TDB, so it supposedly does not have to do as much work at runtime.
> 
> As Chris says, make sure you are using writeAll, not just plain write, to store the OntModel.
> 
> That aside, this doesn't necessarily save you much work because the rules are having to run anyway, they are just not discovering anything much new.
> 
> In the absence of a highly scalable inference solution for Jena (something which can't be done without resourcing) then your two good options are:
> 
> (1) Precompute all inferences, store those, then at runtime work with plain (no inference at all) models over that stored closure.
> 
> (2) Load all the data into memory and run inference over that.
> 
> Dave
> 
> 
> 





RE: benchmarking

Posted by Dave Reynolds <da...@gmail.com>.
Hi,

On Mon, 2011-09-19 at 18:16 +0000, David Jordan wrote: 
> One question is what level of inferencing is really necessary for things like the following:
> It is not very clear to me yet which OWL constructs require particular inferencing levels.
> 
> :Prostate_Cancer a owl:Class ;
> 	owl:unionOf (
> 		HOM_ICD9:HOM_ICD_10566	# V10.46  These identify ICD9 codes for Prostate Cancer
> 		HOM_ICD9:HOM_ICD_1767	# 233.4     which exist in a large class hierarchy
> 		HOM_ICD9:HOM_ICD_1343	# 185
> 	) .
> 
> :Cohort1 a owl:Class ;
> 	owl:equivalentClass [
> 		a owl:Restriction ;
> 		owl:onProperty patient:hasDiagnosis ;  # which patients have a diagnosis associated with prostate cancer?
> 		owl:someValuesFrom :Prostate_Cancer
> 	] .
> 
> Except for taking the ICD9 class hierarchy into account, this is not really much more than a simple database query.

Actually it is a lot more than a database query. For example, if you now
assert (:eg a :Cohort1) then a DL reasoner will "know" that :eg has a
diagnosis which is within one of the three members of
the :Prostate_Cancer union but not which one.  If you then add
additional information that constrains some of the other cases you may
be able to determine which ICD9 code it is. By using owl:equivalentClass
you are enabling reasoning in either direction across that equivalence,
including reasoning by cases.

To determine the level of inference required you either have to assume
complete inference and profile the language (DL, EL, QL etc) or you need
to *also* specify what specific queries you might want to ask.

> The nice aspect of doing this in OWL is that we can define these sets, like :Prostate_Cancer and :Cohort1, and then ask other questions of these sets.

Agreed.

> I thought that with TDB running on a 64 bit Linux box, doing memory mapped I/O, that TDB could efficiently pull everything into memory quickly, avoiding doing lots of fine grained SQL calls to a MySQL server.

Think of it as more like a halfway house. Each query to TDB involves
walking over the appropriate B-tree indexes, just like in a database.
Given enough memory and an operating system good enough at disc block
caching you might end up with everything, or near everything, paged into
memory, but that is not guaranteed. Lots of factors come into play.

> I did use writeAll for writing the OntModel.
> 
> Relative to your suggestion of
> (1) Precompute all inferences, store those, then at runtime work with plain (no inference at all) models over that stored closure.
> 
> Would I need to do this for EVERYTHING, including the declarations above for Prostate_Cancer and Cohort1?

Yes. Essentially create an in-memory model containing everything you
want to reason over. Then either materialize everything or, if you know
the query patterns you are interested in, then ask those queries and
materialize the results.
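
For example (again just an untested sketch; the file names, the query and 
the TDB location are placeholders):

import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.tdb.TDBFactory;
import com.hp.hpl.jena.util.FileManager;

public class MaterializeResults {
    public static void main(String[] args) {
        // 1. One in-memory model holding everything to reason over.
        Model data = ModelFactory.createDefaultModel();
        FileManager.get().readModel(data, "ICD9-inferred.ttl");
        FileManager.get().readModel(data, "patients.ttl");
        FileManager.get().readModel(data, "cohorts.ttl");

        OntModel inf = ModelFactory.createOntologyModel(
                OntModelSpec.OWL_MEM_MICRO_RULE_INF, data);

        // 2. Ask only the query you care about and materialise its results.
        String query =
            "PREFIX ex: <http://example.org/> " +
            "CONSTRUCT { ?p a ex:Cohort1 } WHERE { ?p a ex:Cohort1 }";
        QueryExecution qe = QueryExecutionFactory.create(query, inf);
        Model cohortMembers = qe.execConstruct();
        qe.close();

        // 3. Store the materialised triples; runtime queries against this
        //    store then need no reasoner at all.
        Model store = TDBFactory.createModel("/data/tdb/materialized");
        store.add(cohortMembers);
        store.close();
    }
}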

Alternatively you may want to consider the commercial triple stores that
offer inference at scale with Jena compatibility.

Dave

> 
> 
> -----Original Message-----
> From: Dave Reynolds [mailto:dave.e.reynolds@gmail.com] 
> Sent: Monday, September 19, 2011 11:39 AM
> To: jena-users@incubator.apache.org
> Subject: Re: benchmarking
> 
> Hi,
> 
> On Mon, 2011-09-19 at 15:04 +0000, David Jordan wrote: 
> > I have switched over from SDB to TDB to see if I can get better performance.
> > In the following, Database is a class of mine that insulates the code from knowing if it is SDB or TDB.
> > 
> > I do the following, which combines 2 models I have stored in TDB and then reads a third small model from a file that contains some classes I want to “test”. I then have some code that times how long it takes to get a particular class and list its instances.
> > 
> > Model model1 = Database.getICD9inferredModel(); Model model2 = 
> > Database.getPatientModel(); OntModel omodel = 
> > ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM_MICRO_RULE_INF, 
> > model1); omodel.add(model2);
> 
> That is running a full rule reasoner over the TDB model. As I've mentioned before the rule inference engines store everything in memory so that doesn't give you any scaling over simply loading the file into memory and doing inference over that, it just goes very very slowly!
> 
> > InputStream in = FileManager.get().open(fileName); omodel.read(in, 
> > baseName, "TURTLE");
> > 
> > OntClass oclass = omodel.getOntClass(line);   // access the class
> > 
> > On the first call to getOntClass, I have been seeing a VERY long wait (around an hour) before I get a response.
> > Then after that first call, subsequent calls are much faster.
> > But I started looking at the CPU utilization. After the call to getOntClass, CPU utilization is very close to 0.
> > Is this to be expected?
> 
> Seems plausible; the inference engines are in effect doing a huge number of triple queries to TDB, which will spend most of its time waiting for the disk.
> 
> If you really need to run live inference over the entire dataset then load it into a memory model first, then construct your inference model over that.
> 
> > Is there any form of tracing/logging that can be turned on to determine what (if anything) is happening?
> > 
> > Is there something I am doing wrong in setting up my models?
> > For the ICD9 ontology I am using, I had read in the OWL data, created an OntModel with it, wrote this OntModel data out.
> > Then I store the data from the OntModel into TDB, so it supposedly does not have to do as much work at runtime.
> 
> As Chris says, make sure you are using writeAll, not just plain write, to store the OntModel.
> 
> That aside, this doesn't necessarily save you much work because the rules are having to run anyway, they are just not discovering anything much new.
> 
> In the absence of a highly scalable inference solution for Jena (something which can't be done without resourcing) then your two good options are:
> 
> (1) Precompute all inferences, store those, then at runtime work with plain (no inference at all) models over that stored closure.
> 
> (2) Load all the data into memory and run inference over that.
> 
> Dave
> 
> 
> 




RE: benchmarking

Posted by David Jordan <Da...@sas.com>.
One question is what level of inferencing is really necessary for things like the following:
It is not very clear to me yet which OWL constructs require particular inferencing levels.

:Prostate_Cancer a owl:Class ;
	owl:unionOf (
		HOM_ICD9:HOM_ICD_10566	# V10.46  These identify ICD9 codes for Prostate Cancer
		HOM_ICD9:HOM_ICD_1767	# 233.4     which exist in a large class hierarchy
		HOM_ICD9:HOM_ICD_1343	# 185
	) .

:Cohort1 a owl:Class ;
	owl:equivalentClass [
		a owl:Restriction ;
		owl:onProperty patient:hasDiagnosis ;  # which patients have a diagnosis associated with prostate cancer?
		owl:someValuesFrom :Prostate_Cancer
	] .

Except for taking the ICD9 class hierarchy into account, this is not really much more than a simple database query.
The nice aspect of doing this in OWL is that we can define these sets, like :Prostate_Cancer and :Cohort1, and then ask other questions of these sets.

I thought that with TDB running on a 64-bit Linux box, doing memory-mapped I/O, TDB could efficiently pull everything into memory quickly, avoiding lots of fine-grained SQL calls to a MySQL server.


I did use writeAll for writing the OntModel.

Relative to your suggestion of
(1) Precompute all inferences, store those, then at runtime work with plain (no inference at all) models over that stored closure.

Would I need to do this for EVERYTHING, including the declarations above for Prostate_Cancer and Cohort1?


-----Original Message-----
From: Dave Reynolds [mailto:dave.e.reynolds@gmail.com] 
Sent: Monday, September 19, 2011 11:39 AM
To: jena-users@incubator.apache.org
Subject: Re: benchmarking

Hi,

On Mon, 2011-09-19 at 15:04 +0000, David Jordan wrote: 
> I have switched over from SDB to TDB to see if I can get better performance.
> In the following, Database is a class of mine that insulates the code from knowing if it is SDB or TDB.
> 
> I do the following, which combines 2 models I have stored in TDB and then reads a third small model from a file that contains some classes I want to “test”. I then have some code that times how long it takes to get a particular class and list its instances.
> 
> Model model1 = Database.getICD9inferredModel(); Model model2 = 
> Database.getPatientModel(); OntModel omodel = 
> ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM_MICRO_RULE_INF, 
> model1); omodel.add(model2);

That is running a full rule reasoner over the TDB model. As I've mentioned before the rule inference engines store everything in memory so that doesn't give you any scaling over simply loading the file into memory and doing inference over that, it just goes very very slowly!

> InputStream in = FileManager.get().open(fileName); omodel.read(in, 
> baseName, "TURTLE");
> 
> OntClass oclass = omodel.getOntClass(line);   // access the class
> 
> On the first call to getOntClass, I have been seeing a VERY long wait (around an hour) before I get a response.
> Then after that first call, subsequent calls are much faster.
> But I started looking at the CPU utilization. After the call to getOntClass, CPU utilization is very close to 0.
> Is this to be expected?

Seems plausible; the inference engines are in effect doing a huge number of triple queries to TDB, which will spend most of its time waiting for the disk.

If you really need to run live inference over the entire dataset then load it into a memory model first, then construct your inference model over that.

> Is there any form of tracing/logging that can be turned on to determine what (if anything) is happening?
> 
> Is there something I am doing wrong in setting up my models?
> For the ICD9 ontology I am using, I had read in the OWL data, created an OntModel with it, wrote this OntModel data out.
> Then I store the data from the OntModel into TDB, so it supposedly does not have to do as much work at runtime.

As Chris says, make sure you are using writeAll, not just plain write, to store the OntModel.

That aside, this doesn't necessarily save you much work because the rules are having to run anyway, they are just not discovering anything much new.

In the absence of a highly scalable inference solution for Jena (something which can't be done without resourcing) then your two good options are:

(1) Precompute all inferences, store those, then at runtime work with plain (no inference at all) models over that stored closure.

(2) Load all the data into memory and run inference over that.

Dave




Re: benchmarking

Posted by Dave Reynolds <da...@gmail.com>.
Hi,

On Mon, 2011-09-19 at 15:04 +0000, David Jordan wrote: 
> I have switched over from SDB to TDB to see if I can get better performance.
> In the following, Database is a class of mine that insulates the code from knowing if it is SDB or TDB.
> 
> I do the following, which combines 2 models I have stored in TDB and then reads a third small model from a file that contains some classes I want to “test”. I then have some code that times how long it takes to get a particular class and list its instances.
> 
> Model model1 = Database.getICD9inferredModel();
> Model model2 = Database.getPatientModel();
> OntModel omodel = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM_MICRO_RULE_INF, model1);
> omodel.add(model2);

That is running a full rule reasoner over the TDB model. As I've
mentioned before the rule inference engines store everything in memory
so that doesn't give you any scaling over simply loading the file into
memory and doing inference over that, it just goes very very slowly!

> InputStream in = FileManager.get().open(fileName);
> omodel.read(in, baseName, "TURTLE");
> 
> OntClass oclass = omodel.getOntClass(line);   // access the class
> 
> On the first call to getOntClass, I have been seeing a VERY long wait (around an hour) before I get a response.
> Then after that first call, subsequent calls are much faster.
> But I started looking at the CPU utilization. After the call to getOntClass, CPU utilization is very close to 0.
> Is this to be expected?

Seems plausible; the inference engines are in effect doing a huge number
of triple queries to TDB, which will spend most of its time waiting
for the disk.

If you really need to run live inference over the entire dataset then
load it into a memory model first, then construct your inference model
over that.
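
Roughly (an untested sketch; the TDB directories and the class URI are 
placeholders):

import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.tdb.TDBFactory;

public class InMemoryInference {
    public static void main(String[] args) {
        // Copy the TDB-backed models into a single in-memory model so that the
        // rule engine's many small triple lookups hit RAM rather than disk.
        Model mem = ModelFactory.createDefaultModel();
        mem.add(TDBFactory.createModel("/data/tdb/icd9"));
        mem.add(TDBFactory.createModel("/data/tdb/patients"));

        // Only then construct the inference model, over the in-memory copy.
        OntModel omodel = ModelFactory.createOntologyModel(
                OntModelSpec.OWL_MEM_MICRO_RULE_INF, mem);

        System.out.println(omodel.getOntClass("http://example.org/Cohort1"));
    }
}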

> Is there any form of tracing/logging that can be turned on to determine what (if anything) is happening?
> 
> Is there something I am doing wrong in setting up my models?
> For the ICD9 ontology I am using, I had read in the OWL data, created an OntModel with it, wrote this OntModel data out.
> Then I store the data from the OntModel into TDB, so it supposedly does not have to do as much work at runtime.

As Chris says, make sure you are using writeAll, not just plain write, to
store the OntModel.

That aside, this doesn't necessarily save you much work because the
rules are having to run anyway, they are just not discovering anything
much new.

In the absence of a highly scalable inference solution for Jena
(something which can't be done without resourcing) then your two good
options are:

(1) Precompute all inferences, store those, then at runtime work with
plain (no inference at all) models over that stored closure.

(2) Load all the data into memory and run inference over that.

Dave



Re: benchmarking

Posted by Chris Dollin <ch...@epimorphics.com>.
David Jordan wrote:

> Is there something I am doing wrong in setting up my models?
> For the ICD9 ontology I am using, I had read in the OWL data, 
> created an OntModel with it, wrote this OntModel data out.

How? It matters.

(Because OntModel.write deliberately doesn't write out inferred
statements.)
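
For example (a sketch; "ICD9.owl" stands in for your actual file):

import java.io.FileOutputStream;
import java.io.OutputStream;

import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.util.FileManager;

public class WriteVersusWriteAll {
    public static void main(String[] args) throws Exception {
        OntModel omodel = ModelFactory.createOntologyModel(
                OntModelSpec.OWL_MEM_MICRO_RULE_INF,
                FileManager.get().loadModel("ICD9.owl"));

        // write(): the base (asserted) statements only - inferred triples
        // are deliberately left out.
        OutputStream base = new FileOutputStream("icd9-base.ttl");
        omodel.write(base, "TURTLE");
        base.close();

        // writeAll(): base plus inferred statements.
        OutputStream all = new FileOutputStream("icd9-all.ttl");
        omodel.writeAll(all, "TURTLE", null);
        all.close();
    }
}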

In general, don't provide prose descriptions of suspect code; show
us the actual code. The problem may live in the details the prose
skips over.

Chris

-- 
RIP Diana Wynne Jones, 1934 - 2011.

Epimorphics Ltd, http://www.epimorphics.com
Registered address: Court Lodge, 105 High Street, Portishead, Bristol BS20 6PT
Epimorphics Ltd. is a limited company registered in England (number 7016688)