Posted to users@jena.apache.org by Brandon Sara <br...@collectivemedicaltech.com.INVALID> on 2021/09/13 23:42:09 UTC

Re: Subclass caching has some problems on Fuseki startup

I have been able to create an easily reproducible scenario that others can use to replicate and test the issues that I’m seeing:

1. Start Fuseki using the config that I’ve listed below.
2. Attempt to load the latest version of ICD-10-CM as provided freely by BioPortal: https://bioportal.bioontology.org/ontologies/ICD10CM

If inference is enabled, I can’t even get the Turtle file to load in its entirety. If I load the Turtle file without inference, the load completes, but after restarting the server and submitting a request, the service doesn’t finish processing the request in any reasonable amount of time, no matter how simple the query is (at least for any query that actually reads data from the dataset).

Config:

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX fuseki: <http://jena.apache.org/fuseki#>
PREFIX ja: <http://jena.hpl.hp.com/2005/11/Assembler#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX tdb2: <http://jena.apache.org/2016/tdb#>
PREFIX text: <http://jena.apache.org/text#>

[] rdf:type fuseki:Server ;
  fuseki:pingEP true ;
  fuseki:statsEP true ;
  fuseki:metricsEP true ;
  fuseki:compactEP true ;

  ja:context [
    ja:cxtName "arq:queryTimeout" ;
    ja:cxtValue "10000,60000" ;
  ] ;
.

<#kgService> a fuseki:Service ;
  fuseki:name "kg" ;
  fuseki:dataset <#kgIndexedDataset> ;
  fuseki:endpoint [ fuseki:operation fuseki:query; ] ;
  fuseki:endpoint [ fuseki:operation fuseki:update; ] ;
  fuseki:endpoint [ fuseki:operation fuseki:gsp_r; ] ;
  fuseki:endpoint [ fuseki:operation fuseki:gsp_rw; fuseki:name "data"; ] ;
.

<#kgIndexedDataset> rdf:type text:TextDataset ;
  text:dataset <#kgInferredDataset> ;
  text:index <#kgIndex> ;
.

<#kgIndex> a text:TextIndexLucene ;
  text:directory <file:/fuseki/databases/kg.index> ;
  text:entityMap <#kgEntityMap> ;
  text:storeValues true ;
  text:queryParser [ a text:ComplexPhraseQueryParser ]
.

<#kgEntityMap> a text:EntityMap ;
  text:defaultField "label" ;
  text:entityField "uri" ;
  text:uidField "uid" ;
  text:langField "lang" ;
  text:graphField "graph" ;
  text:map (
    [ text:field "id" ;
      text:predicate dcterms:identifier ]

    [ text:field "label" ;
      text:predicate rdfs:label ]
  ) ;
.

<#kgInferredDataset> a ja:RDFDataset ;
  ja:defaultGraph <#kgInferenceModel> ;
.

<#kgInferenceModel> a ja:InfModel ;
  ja:baseModel <#kgTdbGraph> ;
  ja:reasoner [
    ja:reasonerURL <http://jena.hpl.hp.com/2003/OWLMicroFBRuleReasoner>
  ] ;
.

<#kgTdbGraph> a tdb2:GraphTDB2 ;
  tdb2:dataset <#kgTdbDataset> ;
.

<#kgTdbDataset> a tdb2:DatasetTDB2 ;
  tdb2:location "/fuseki/databases/kg" ;
.




Re: Subclass caching has some problems on Fuseki startup

Posted by Andy Seaborne <an...@apache.org>.


On 21/09/2021 23:42, Ryan Stokes wrote:
> Thanks for giving this some more thought, Andy.
> 
> We could consider different ways of doing both inference and updating. I
> think the basic requirements are that a collection of common medical
> datasets (ICD-10, RxNorm, and the like) be treated as a high-performance
> ontology - updated at most daily from various sources. We could use another
> writable model, small but with more frequent changes, which also needs to
> be very fast for queries and build on the ontology and simple (RDFS)
> inference there.
> 
> Would you recommend a layered configuration so that the ontology model can
> remain read-only? I recall documentation on it, but haven't encountered an
> example in use, limited as my search has been so far.

If you want to navigate the ontology AND apply it to data, then you may
need two copies, one with and one without inference. Once subclass
closure has been applied, you can't easily see what the immediate parent
of a concept is (an ontology browsing task).
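
As a rough, untested sketch of that idea, the fragment below exposes the
same TDB2 data twice: once raw (for browsing direct parents) and once
through an inference model. It uses two views over one stored copy rather
than two physical copies, which is an assumption on top of the suggestion
above. The service names "kg-raw" and "kg-inf" are hypothetical; the
dataset names and location are taken from Brandon's config.

```turtle
PREFIX fuseki: <http://jena.apache.org/fuseki#>
PREFIX ja:     <http://jena.hpl.hp.com/2005/11/Assembler#>
PREFIX tdb2:   <http://jena.apache.org/2016/tdb#>

# Hypothetical service: the stored triples only, no inference (ontology browsing).
<#kgRawService> a fuseki:Service ;
  fuseki:name "kg-raw" ;
  fuseki:dataset <#kgTdbDataset> ;
  fuseki:endpoint [ fuseki:operation fuseki:query ; ] ;
.

# Hypothetical service: the same data seen through a reasoner.
<#kgInfService> a fuseki:Service ;
  fuseki:name "kg-inf" ;
  fuseki:dataset <#kgInferredDataset> ;
  fuseki:endpoint [ fuseki:operation fuseki:query ; ] ;
.

<#kgInferredDataset> a ja:RDFDataset ;
  ja:defaultGraph <#kgInferenceModel> ;
.

<#kgInferenceModel> a ja:InfModel ;
  ja:baseModel <#kgTdbGraph> ;
  ja:reasoner [
    # RDFS rule reasoner, which was reported in this thread to load and query fine.
    ja:reasonerURL <http://jena.hpl.hp.com/2003/RDFSExptRuleReasoner>
  ] ;
.

<#kgTdbGraph> a tdb2:GraphTDB2 ;
  tdb2:dataset <#kgTdbDataset> ;
.

<#kgTdbDataset> a tdb2:DatasetTDB2 ;
  tdb2:location "/fuseki/databases/kg" ;
.
```

Only query endpoints are shown; update and text-index wiring are left out
to keep the sketch small.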

> As for OWL, I haven't looked closer into the reasoners other than to find
> that owl-fb-mini.rules has this (?x ?p ?y) rule in it:
> 
> # This one could be omitted since the results are not really very interesting!
> #[rdf1and4: (?x ?p ?y) -> (?p rdf:type rdf:Property), (?x rdf:type rdfs:Resource), (?y rdf:type rdfs:Resource)]
> [rdf4: (?x ?p ?y) -> (?p rdf:type rdf:Property)]
> 
> I'm going to run it without that rule to see if we can blame it.

Should be interesting.

> Thanks for
> the pointer.

Tuning the inference to what the application needs is going to help.
From what I take from these discussions and from looking at the data,
ICD10CM only needs rdfs:subClassOf; it does not mention domain and
range, nor does it use OWL-specific inference.
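
A minimal sketch of such tuning, assuming the rest of Brandon's config
stays as posted: only the reasoner URL in the inference model changes.
The two URLs are Jena's built-in transitive and RDFS rule reasoners; the
fragment itself is untested here.

```turtle
PREFIX ja: <http://jena.hpl.hp.com/2005/11/Assembler#>

<#kgInferenceModel> a ja:InfModel ;
  ja:baseModel <#kgTdbGraph> ;   # defined elsewhere in the config
  ja:reasoner [
    # Transitive/reflexive closure of rdfs:subClassOf and rdfs:subPropertyOf only.
    ja:reasonerURL <http://jena.hpl.hp.com/2003/TransitiveReasoner>
    # Alternative if inferred rdf:type triples are needed:
    #   <http://jena.hpl.hp.com/2003/RDFSExptRuleReasoner>
  ] ;
.
```

The transitive reasoner handles only the class and property hierarchies,
so it does not derive rdf:type for instances; the RDFS rule reasoner is
the next step up if that is required.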

     Andy


Re: Subclass caching has some problems on Fuseki startup

Posted by Ryan Stokes <rq...@gmail.com>.

Thanks for giving this some more thought, Andy.

We could consider different ways of doing both inference and updating. I
think the basic requirements are that a collection of common medical
datasets (ICD-10, RxNorm, and the like) be treated as a high-performance
ontology - updated at most daily from various sources. We could use another
writable model, small but with more frequent changes, which also needs to
be very fast for queries and build on the ontology and simple (RDFS)
inference there.

Would you recommend a layered configuration so that the ontology model can
remain read-only? I recall documentation on it, but haven't encountered an
example in use, limited as my search has been so far.

As for OWL, I haven't looked closer into the reasoners other than to find
that owl-fb-mini.rules has this (?x ?p ?y) rule in it:

# This one could be omitted since the results are not really very interesting!
#[rdf1and4: (?x ?p ?y) -> (?p rdf:type rdf:Property), (?x rdf:type rdfs:Resource), (?y rdf:type rdfs:Resource)]
[rdf4: (?x ?p ?y) -> (?p rdf:type rdf:Property)]

I'm going to run it without that rule to see if we can blame it. Thanks for
the pointer.
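
One possible way to try that from the Fuseki config, sketched under
assumptions: load an edited copy of the rules file through the generic
rule reasoner. The file path and name below are hypothetical, and this is
not an exact stand-in for the built-in OWL reasoners, which may do extra
setup beyond the rules file.

```turtle
PREFIX ja: <http://jena.hpl.hp.com/2005/11/Assembler#>

<#kgInferenceModel> a ja:InfModel ;
  ja:baseModel <#kgTdbGraph> ;   # defined elsewhere in the config
  ja:reasoner [
    ja:reasonerURL <http://jena.hpl.hp.com/2003/GenericRuleReasoner> ;
    # Hypothetical path: a copy of owl-fb-mini.rules with the rdf4 rule commented out.
    ja:rulesFrom <file:/fuseki/config/owl-fb-mini-no-rdf4.rules> ;
  ] ;
.
```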

~Ryan


Re: Subclass caching has some problems on Fuseki startup

Posted by Andy Seaborne <an...@apache.org>.

Hi Ryan,

On 17/09/2021 16:22, Ryan Stokes wrote:
> Hi Andy,
> 
> By way of introduction I've been exploring ontology solutions
> with Brandon recently using Jena and Fuseki and come to
> appreciate your capable stewardship and responsive
> engagement with this community. Thank you.
> 
> I was able to replicate Brandon's problem loading the ICD-10
> dataset using any of the built-in OWL reasoners without search
> indexing. However it did successfully load and respond fast to
> queries using RDFSRuleReasoner, as well as Transitive and Generic.

OK - we're getting closer.

That "pump" loop could well be the cause if it comes from a rule with
(?x ?p ?y) in it. Rule 'rdf1and4' - I think the default reasoner for RDFS
omits that rule. This dataset is only 800K triples.

The rules engine copes with the schema and data changing during runtime
by minimising re-computation, at the expense of a lot more initial work
and, crucially, of tracking in-memory state. I guess it is doing all the
setup work on first touch.

[Later: It is not specific to TDB - seems to happen with any base 
storage including both in-memory kinds.]

> Brandon is better able to say whether we need OWL for other
> reasons, but we do want to use ICD-10-CM with data for inference.
> Would *Data with RDFS Inferencing* have advantages over using the
> built-in RDFSRuleReasoner for that?

Maybe :-)

Data+RDFS is different - it's not trying to be a replacement for the 
rules engine for RDFS. We have the rules engine for complete adherence 
to RDFS.

Data+RDFS:
1/ It is a fixed RDFS (subclass/subproperty/domain/range).
    No axioms. No x:directSubClassOf.
2/ Applies to every graph in the dataset.
3/ Assumes the schema is fixed - no update to the schema at runtime.
4/ The schema is invisible - the app sees data and inferred triples.

but it should scale and work with persistent databases.

[ The "no update to the schema" could be changed. Programming needed 
though. ]
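
For orientation, a sketch of what a Data+RDFS setup could look like in
the assembler config, following the pattern on the Jena RDFS
documentation page. The schema file path is hypothetical (a file copy of
ICD10CM acting as the fixed schema); the TDB2 location is the one from
Brandon's config. Untested against this dataset.

```turtle
PREFIX ja:   <http://jena.hpl.hp.com/2005/11/Assembler#>
PREFIX tdb2: <http://jena.apache.org/2016/tdb#>

<#kgRdfsDataset> a ja:DatasetRDFS ;
  # Hypothetical path: the schema is read from this file and treated as fixed.
  ja:rdfsSchema <file:/fuseki/config/ICD10CM.ttl> ;
  # The data the schema is applied to (every graph in this dataset).
  ja:dataset <#kgTdbDataset> ;
.

<#kgTdbDataset> a tdb2:DatasetTDB2 ;
  tdb2:location "/fuseki/databases/kg" ;
.
```

A fuseki:Service would then point its fuseki:dataset at <#kgRdfsDataset>
instead of the ja:InfModel-based dataset.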

So - Ryan, Brandon - what inference does your usage need? Is the 
schema/ontology updated during runtime?

     Andy


Re: Subclass caching has some problems on Fuseki startup

Posted by Ryan Stokes <rq...@gmail.com>.

Hi Andy,

By way of introduction, I've been exploring ontology solutions
with Brandon recently using Jena and Fuseki, and have come to
appreciate your capable stewardship and responsive
engagement with this community. Thank you.

I was able to replicate Brandon's problem loading the ICD-10
dataset using any of the built-in OWL reasoners without search
indexing. However, it did load successfully and respond quickly to
queries using RDFSRuleReasoner, as well as Transitive and Generic.

Brandon is better able to say whether we need OWL for other
reasons, but we do want to use ICD-10-CM with data for inference.
Would *Data with RDFS Inferencing* have advantages over using the
built-in RDFSRuleReasoner for that?

Thanks again for any help in advance,

Ryan

JFYI, the Transitive- and RDFSRuleReasoners inferred 570k :subClassOf
and an additional 192k :type triples over the base 96k of each relation,
respectively.

Profiling the OWL reasoner with VisualVM, I was able to see that it seems
to cycle without end through

Generator.pump() -> LPInterpreter.next() -> LPInterpreter.run() -> Node.sameValueAs().

I have yet to try this on a reduced dataset to see if I can find the
minimum necessary to replicate the spin.

On Fri, Sep 17, 2021 at 7:04 AM Andy Seaborne <an...@apache.org> wrote:

> Hi Brandon,
>
> The configuration is quite complex - it's likely due to the inference
> layer but it would be worth trying without the text index to confirm
> that especially for the loading.
>
> Do you need all that
> <http://jena.hpl.hp.com/2003/OWLMicroFBRuleReasoner>
> offers or is all you want RDFS subclass?
>
> Because there is
>    https://jena.apache.org/documentation/rdfs/
> (give ICD10CM as both data and also in a file to be the schema).
>
> The schema is assumed to be fixed which might not work for you long term
> but it is another data point to understand the situation.
>
> About ICD10CM itself - are you wanting to navigate its structure or use
> it with data for inference? If it is to navigate its structure do you
> even want inference?
>
>      Andy
>
> On 14/09/2021 00:42, Brandon Sara wrote:
> > I have been able to create an easily reproducible scenario that others can use to replicate and test the issues that I’m seeing:
> >
> > 1. Start fuseki using the config that I’ve listed below.
> > 2. Attempt to load the latest version of ICD-10 CM as provided freely by BioPortal: https://bioportal.bioontology.org/ontologies/ICD10CM
> >
> > If inference is enabled, then I can’t even get the turtle file to load in its entirety. If I load the turtle file without inference, then the load completes, but upon restarting the server and submitting a request, the service doesn’t finish processing the request in any reasonable amount of time, no matter how simple the query of the request is (one that actually queries data from the dataset at least).
> >
> > Config:
> >
> > PREFIX dcterms: <http://purl.org/dc/terms/>
> > PREFIX fuseki: <http://jena.apache.org/fuseki#>
> > PREFIX ja: <http://jena.hpl.hp.com/2005/11/Assembler#>
> > PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> > PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
> > PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
> > PREFIX tdb2: <http://jena.apache.org/2016/tdb#>
> > PREFIX text: <http://jena.apache.org/text#>
> >
> > [] rdf:type fuseki:Server ;
> >    fuseki:pingEP true ;
> >    fuseki:statsEP true ;
> >    fuseki:metricsEP true ;
> >    fuseki:compactEP true ;
> >
> >    ja:context [
> >      ja:cxtName "arq:queryTimeout" ;
> >      ja:cxtValue "10000,60000" ;
> >    ] ;
> > .
> >
> > <#kgService> a fuseki:Service ;
> >    fuseki:name "kg" ;
> >    fuseki:dataset <#kgIndexedDataset> ;
> >    fuseki:endpoint [ fuseki:operation fuseki:query; ] ;
> >    fuseki:endpoint [ fuseki:operation fuseki:update; ] ;
> >    fuseki:endpoint [ fuseki:operation fuseki:gsp_r; ] ;
> >    fuseki:endpoint [ fuseki:operation fuseki:gsp_rw; fuseki:name "data"; ] ;
> > .
> >
> > <#kgIndexedDataset> rdf:type text:TextDataset ;
> >    text:dataset <#kgInferredDataset> ;
> >    text:index <#kgIndex> ;
> > .
> >
> > <#kgIndex> a text:TextIndexLucene ;
> >    text:directory <file:/fuseki/databases/kg.index> ;
> >    text:entityMap <#kgEntityMap> ;
> >    text:storeValues true ;
> >    text:queryParser [ a text:ComplexPhraseQueryParser ]
> > .
> >
> > <#kgEntityMap> a text:EntityMap ;
> >    text:defaultField "label" ;
> >    text:entityField "uri" ;
> >    text:uidField "uid" ;
> >    text:langField "lang" ;
> >    text:graphField "graph" ;
> >    text:map (
> >      [ text:field "id" ;
> >        text:predicate dcterms:identifier ]
> >
> >      [ text:field "label" ;
> >        text:predicate rdfs:label ]
> >    ) ;
> > .
> >
> > <#kgInferredDataset> a ja:RDFDataset ;
> >    ja:defaultGraph <#kgInferenceModel> ;
> > .
> >
> > <#kgInferenceModel> a ja:InfModel ;
> >    ja:baseModel <#kgTdbGraph> ;
> >    ja:reasoner [
> >      ja:reasonerURL <http://jena.hpl.hp.com/2003/OWLMicroFBRuleReasoner>
> >    ] ;
> > .
> >
> > <#kgTdbGraph> a tdb2:GraphTDB2 ;
> >    tdb2:dataset <#kgTdbDataset> ;
> > .
> >
> > <#kgTdbDataset> a tdb2:DatasetTDB2 ;
> >    tdb2:location "/fuseki/databases/kg" ;
> > .
> >
>

Re: Subclass caching has some problems on Fuseki startup

Posted by Andy Seaborne <an...@apache.org>.

Hi Brandon,

The configuration is quite complex. The problem is likely due to the
inference layer, but it would be worth trying without the text index
to confirm that, especially for the loading.
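
One way to take the text index out of the picture is to point the service straight at the inference dataset. This is only a sketch built from the names in the posted config; everything else in that config is assumed to stay as it is.

```turtle
PREFIX fuseki: <http://jena.apache.org/fuseki#>

# Same service as in the posted config, but bound directly to the
# inference dataset so that the Lucene text index is not involved.
<#kgService> a fuseki:Service ;
  fuseki:name "kg" ;
  fuseki:dataset <#kgInferredDataset> ;   # was <#kgIndexedDataset>
  fuseki:endpoint [ fuseki:operation fuseki:query; ] ;
  fuseki:endpoint [ fuseki:operation fuseki:update; ] ;
  fuseki:endpoint [ fuseki:operation fuseki:gsp_r; ] ;
  fuseki:endpoint [ fuseki:operation fuseki:gsp_rw; fuseki:name "data"; ] ;
.
```

If loading still stalls with this variant, the text index is unlikely to be the cause.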

Do you need all that
<http://jena.hpl.hp.com/2003/OWLMicroFBRuleReasoner>
offers or is all you want RDFS subclass?

Because there is
   https://jena.apache.org/documentation/rdfs/
(give ICD10CM as both data and also in a file to be the schema).

The schema is assumed to be fixed which might not work for you long term 
but it is another data point to understand the situation.
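
A minimal sketch of that setup, following the pattern on the RDFS documentation page. The schema file path is hypothetical, the dataset and service names are reused from the posted config, and only a query endpoint is shown; treat it as a starting point, not a tested configuration.

```turtle
PREFIX fuseki: <http://jena.apache.org/fuseki#>
PREFIX ja:     <http://jena.hpl.hp.com/2005/11/Assembler#>
PREFIX tdb2:   <http://jena.apache.org/2016/tdb#>

<#kgService> a fuseki:Service ;
  fuseki:name "kg" ;
  fuseki:dataset <#kgRdfsDataset> ;
  fuseki:endpoint [ fuseki:operation fuseki:query; ] ;
.

# RDFS view over the stored data. The schema is read from a file
# and is treated as fixed.
<#kgRdfsDataset> a ja:DatasetRDFS ;
  ja:rdfsSchema <file:/fuseki/schema/ICD10CM.ttl> ;   # hypothetical path
  ja:dataset <#kgTdbDataset> ;
.

# ICD10CM is also loaded here as ordinary data.
<#kgTdbDataset> a tdb2:DatasetTDB2 ;
  tdb2:location "/fuseki/databases/kg" ;
.
```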

About ICD10CM itself - are you wanting to navigate its structure or use
it with data for inference? If it is to navigate its structure do you
even want inference?

     Andy

On 14/09/2021 00:42, Brandon Sara wrote:
> I have been able to create an easily reproducible scenario that others can use to replicate and test the issues that I’m seeing:
> 
> 1. Start fuseki using the config that I’ve listed below.
> 2. Attempt to load the latest version of ICD-10 CM as provided freely by BioPortal: https://bioportal.bioontology.org/ontologies/ICD10CM
> 
> If inference is enabled, then I can’t even get the turtle file to load in its entirety. If I load the turtle file without inference, then the load completes, but upon restarting the server and submitting a request, the service doesn’t finish processing the request in any reasonable amount of time, no matter how simple the query of the request is (one that actually queries data from the dataset at least).
> 
> Config:
> 
> PREFIX dcterms: <http://purl.org/dc/terms/>
> PREFIX fuseki: <http://jena.apache.org/fuseki#>
> PREFIX ja: <http://jena.hpl.hp.com/2005/11/Assembler#>
> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
> PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
> PREFIX tdb2: <http://jena.apache.org/2016/tdb#>
> PREFIX text: <http://jena.apache.org/text#>
> 
> [] rdf:type fuseki:Server ;
>    fuseki:pingEP true ;
>    fuseki:statsEP true ;
>    fuseki:metricsEP true ;
>    fuseki:compactEP true ;
> 
>    ja:context [
>      ja:cxtName "arq:queryTimeout" ;
>      ja:cxtValue "10000,60000" ;
>    ] ;
> .
> 
> <#kgService> a fuseki:Service ;
>    fuseki:name "kg" ;
>    fuseki:dataset <#kgIndexedDataset> ;
>    fuseki:endpoint [ fuseki:operation fuseki:query; ] ;
>    fuseki:endpoint [ fuseki:operation fuseki:update; ] ;
>    fuseki:endpoint [ fuseki:operation fuseki:gsp_r; ] ;
>    fuseki:endpoint [ fuseki:operation fuseki:gsp_rw; fuseki:name "data"; ] ;
> .
> 
> <#kgIndexedDataset> rdf:type text:TextDataset ;
>    text:dataset <#kgInferredDataset> ;
>    text:index <#kgIndex> ;
> .
> 
> <#kgIndex> a text:TextIndexLucene ;
>    text:directory <file:/fuseki/databases/kg.index> ;
>    text:entityMap <#kgEntityMap> ;
>    text:storeValues true ;
>    text:queryParser [ a text:ComplexPhraseQueryParser ]
> .
> 
> <#kgEntityMap> a text:EntityMap ;
>    text:defaultField "label" ;
>    text:entityField "uri" ;
>    text:uidField "uid" ;
>    text:langField "lang" ;
>    text:graphField "graph" ;
>    text:map (
>      [ text:field "id" ;
>        text:predicate dcterms:identifier ]
> 
>      [ text:field "label" ;
>        text:predicate rdfs:label ]
>    ) ;
> .
> 
> <#kgInferredDataset> a ja:RDFDataset ;
>    ja:defaultGraph <#kgInferenceModel> ;
> .
> 
> <#kgInferenceModel> a ja:InfModel ;
>    ja:baseModel <#kgTdbGraph> ;
>    ja:reasoner [
>      ja:reasonerURL <http://jena.hpl.hp.com/2003/OWLMicroFBRuleReasoner>
>    ] ;
> .
> 
> <#kgTdbGraph> a tdb2:GraphTDB2 ;
>    tdb2:dataset <#kgTdbDataset> ;
> .
> 
> <#kgTdbDataset> a tdb2:DatasetTDB2 ;
>    tdb2:location "/fuseki/databases/kg" ;
> .
> 
>