Posted to users@jena.apache.org by Martynas Jusevičius <ma...@graphity.org> on 2016/06/20 21:18:03 UTC

Forward/backward rules (and reasoner memory leaks)

Hey,

after using GenericRuleReasoner and InfModel more extensively, we
started experiencing memory leaks that eventually kill our webapp
because it runs out of heap space. Jena version is 2.11.0.

After some profiling, it seems that RETEEngine.clauseIndex and/or
RETEEngine.infGraph are retaining a lot of references. It might be
related to this report, but I'm not sure:
https://mail-archives.apache.org/mod_mbox/jena-users/201403.mbox/%3C5319B4E0.4060106@gmail.com%3E

The suggestion was to use backward rules instead of forward rules.
I have read the following:
https://jena.apache.org/documentation/inference/#rules

But still I fail to understand in which situations backward rules
can/should be used instead of forward rules? I guess simply replacing
-> with <- will not be enough? The actual rules in question look like
this:

[gp:    (?class rdf:type <http://www.w3.org/2000/01/rdf-schema#Class>),
        (?class ?p ?o),
        (?p rdf:type owl:AnnotationProperty),
        (?p rdfs:isDefinedBy <http://graphity.org/gp#>),
        (?subClass rdfs:subClassOf ?class),
        (?subClass rdf:type <http://www.w3.org/2000/01/rdf-schema#Class>),
        noValue(?subClass ?p)
        -> (?subClass ?p ?o) ]
[gcdm:  (?template rdf:type <http://graphity.org/gp#Template>),
        (?template <http://graphity.org/gc#defaultMode> ?o),
        (?subClass rdfs:subClassOf ?template),
        (?subClass rdf:type <http://graphity.org/gp#Template>),
        noValue(?subClass <http://graphity.org/gc#defaultMode>)
        -> (?subClass <http://graphity.org/gc#defaultMode> ?o) ]
[gcsm:  (?template rdf:type <http://graphity.org/gp#Template>),
        (?template <http://graphity.org/gc#supportedMode> ?supportedMode),
        (?subClass rdfs:subClassOf ?template),
        (?subClass rdf:type <http://graphity.org/gp#Template>)
        -> (?subClass <http://graphity.org/gc#supportedMode> ?supportedMode) ]
[rdfs9: (?x rdfs:subClassOf ?y), (?a rdf:type ?x) -> (?a rdf:type ?y)]

Can these be rewritten as backward rules instead? Does it involve code
changes, such as calling reset() etc?

I would appreciate any help.


Martynas
atomgraph.com

Re: Forward/backward rules (and reasoner memory leaks)

Posted by Martynas Jusevičius <ma...@atomgraph.com>.
What is the status of JENA-650 by the way?
https://issues.apache.org/jira/browse/JENA-650


Re: Forward/backward rules (and reasoner memory leaks)

Posted by Dave Reynolds <da...@gmail.com>.
Hi Martynas,

If it really is a different schema and data each time then you can't
cache.

If you only have a small number of schemas then you could use bindSchema 
to generate a set of partially-evaluated reasoners, cache those, and 
pick the right one to use for a given set of message headers.
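
The caching idea above can be sketched in plain Java. This is only an illustration under assumptions: the class name BoundReasonerCache and the string keys are hypothetical, and the lambda stands in for the expensive reasoner.bindSchema(schema) call. In real code the cached value would be the Reasoner that bindSchema returns, which is then passed to ModelFactory.createInfModel together with the per-request data.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Caches one expensive-to-build value per schema key.
 * Hypothetical helper: K identifies a schema (for example an ontology URI
 * taken from the response headers), V is the schema-bound reasoner.
 */
class BoundReasonerCache<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> binder;

    BoundReasonerCache(Function<K, V> binder) {
        this.binder = binder;
    }

    /** Returns the cached value for the key, building it on first use only. */
    V get(K schemaKey) {
        return cache.computeIfAbsent(schemaKey, binder);
    }

    int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        int[] bindCalls = {0};
        BoundReasonerCache<String, String> cache = new BoundReasonerCache<>(uri -> {
            bindCalls[0]++; // stands in for the costly bindSchema call
            return "reasoner bound to " + uri;
        });
        cache.get("http://example.org/schema-a");
        cache.get("http://example.org/schema-a"); // served from the cache
        cache.get("http://example.org/schema-b");
        System.out.println("bind calls: " + bindCalls[0]);
    }
}
```

This only helps when the number of distinct schemas is small; with an unbounded set of schemas the map itself would grow without limit and need an eviction policy.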

The other option (apart from stopping using rules) would be to use 
backward rules. As already discussed the forward engine does all the 
inferences up front whereas the backward rules do them on demand.
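
To make the up-front versus on-demand difference concrete, here is a toy sketch in plain Java of the rdfs9 rule evaluated both ways. It does not use Jena and is not how the RETE or LP engines are implemented; the class Rdfs9Demo and its methods are made up for illustration. materialise() plays the role of the forward engine (derive everything, keep the result), hasType() plays the role of a backward query (answer one goal, keep nothing).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy illustration of rdfs9 evaluated forward (eagerly) and backward (on demand). */
class Rdfs9Demo {
    // subClassOf edges: class -> direct superclasses
    final Map<String, Set<String>> superClasses = new HashMap<>();
    // asserted rdf:type facts: instance -> classes
    final Map<String, Set<String>> assertedTypes = new HashMap<>();

    void subClassOf(String sub, String sup) {
        superClasses.computeIfAbsent(sub, k -> new HashSet<>()).add(sup);
    }

    void type(String instance, String cls) {
        assertedTypes.computeIfAbsent(instance, k -> new HashSet<>()).add(cls);
    }

    /** Forward style: derive every (instance, class) pair up front and return them all. */
    Map<String, Set<String>> materialise() {
        Map<String, Set<String>> all = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : assertedTypes.entrySet()) {
            Set<String> closure = new HashSet<>();
            for (String cls : e.getValue()) {
                collect(cls, closure);
            }
            all.put(e.getKey(), closure);
        }
        return all;
    }

    /** Backward style: answer the single goal (instance rdf:type cls), keeping no state. */
    boolean hasType(String instance, String cls) {
        for (String asserted : assertedTypes.getOrDefault(instance, Collections.emptySet())) {
            Set<String> closure = new HashSet<>();
            collect(asserted, closure);
            if (closure.contains(cls)) {
                return true;
            }
        }
        return false;
    }

    // Adds cls and all of its superclasses to seen; the seen set guards against cycles.
    private void collect(String cls, Set<String> seen) {
        if (!seen.add(cls)) {
            return;
        }
        for (String sup : superClasses.getOrDefault(cls, Collections.emptySet())) {
            collect(sup, seen);
        }
    }

    public static void main(String[] args) {
        Rdfs9Demo kb = new Rdfs9Demo();
        kb.subClassOf("Dog", "Mammal");
        kb.subClassOf("Mammal", "Animal");
        kb.type("rex", "Dog");
        System.out.println(kb.materialise());
        System.out.println(kb.hasType("rex", "Animal"));
    }
}
```

The forward variant pays its cost once and holds the whole closure in memory; the backward variant holds nothing between calls but repeats the walk for every query.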

Dave

On 23/06/16 21:38, Martynas Jusevičius wrote:
> Hey again,
>
> I have profiled the CPU time, and it seems that a lot of it (93.5%
> after some 22500 HTTP requests) is spent in the following methods:
>
> com.hp.hpl.jena.rdf.model.ModelFactory.createInfModel
> (com.hp.hpl.jena.reasoner.Reasoner, com.hp.hpl.jena.rdf.model.Model,
> com.hp.hpl.jena.rdf.model.Model)
>    com.hp.hpl.jena.reasoner.rulesys.GenericRuleReasoner.bindSchema
> (com.hp.hpl.jena.graph.Graph)
>      com.hp.hpl.jena.reasoner.rulesys.FBRuleInfGraph.prepare ()
>        com.hp.hpl.jena.reasoner.rulesys.impl.RETEEngine.fastInit
> (com.hp.hpl.jena.reasoner.Finder)
>
> Probably not so smart to create an InfModel with every
> request/response. But in my case it is created using HTTP response
> body and metadata only: Model from response body, and schema OntModel
> from headers metadata, so I'm not sure how it could be cached. Here is
> the code:
> https://github.com/AtomGraph/Processor/blob/master/src/main/java/org/graphity/processor/filter/response/HypermediaFilter.java#L107
>
> I would appreciate suggestions on how to improve performance.
>
> Martynas


Re: Forward/backward rules (and reasoner memory leaks)

Posted by Martynas Jusevičius <ma...@graphity.org>.
Maybe I should evaluate whether I need an InfModel there in the first place...


Re: Forward/backward rules (and reasoner memory leaks)

Posted by Martynas Jusevičius <ma...@graphity.org>.
Hey again,

I have profiled the CPU time, and it seems that a lot of it (93.5%
after some 22500 HTTP requests) is spent in the following methods:

com.hp.hpl.jena.rdf.model.ModelFactory.createInfModel(com.hp.hpl.jena.reasoner.Reasoner, com.hp.hpl.jena.rdf.model.Model, com.hp.hpl.jena.rdf.model.Model)
  com.hp.hpl.jena.reasoner.rulesys.GenericRuleReasoner.bindSchema(com.hp.hpl.jena.graph.Graph)
    com.hp.hpl.jena.reasoner.rulesys.FBRuleInfGraph.prepare()
      com.hp.hpl.jena.reasoner.rulesys.impl.RETEEngine.fastInit(com.hp.hpl.jena.reasoner.Finder)

Probably not so smart to create an InfModel with every
request/response. But in my case it is created using HTTP response
body and metadata only: Model from response body, and schema OntModel
from headers metadata, so I'm not sure how it could be cached. Here is
the code:
https://github.com/AtomGraph/Processor/blob/master/src/main/java/org/graphity/processor/filter/response/HypermediaFilter.java#L107

I would appreciate suggestions on how to improve performance.

Martynas

On Tue, Jun 21, 2016 at 10:28 AM, Dave Reynolds
<da...@gmail.com> wrote:
> Hi Martynas,
>
> On 20/06/16 22:18, Martynas Jusevičius wrote:
>>
>> Hey,
>>
>> after using GenericRuleReasoner and InfModel more extensively, we
>> started experiencing memory leaks that eventually kill our webapp
>> because it runs out of heap space. Jena version is 2.11.0.
>>
>> After some profiling, it seems that RETEEngine.clauseIndex and/or
>> RETEEngine.infGraph are retaining a lot of references. It might be
>> related to this report, but I'm not sure:
>>
>> https://mail-archives.apache.org/mod_mbox/jena-users/201403.mbox/%3C5319B4E0.4060106@gmail.com%3E
>
>
> If it is related to that then it is not a leak; it is "just" memory use.
>
> A leak implies that when you turn over data then unused internal state
> objects are not reclaimed. Are you continuously adding and deleting data? If
> so then the delete should release the whole of the RETEEngine state and
> start over. If that isn't happening then that's a bug but you could work
> around with an explicit reset() or even delete and recreate your InfGraph at
> that stage. A delete loses all the state anyway.
>
>> The suggestion was to use backward rules instead of forward rules.
>> I have read the following:
>> https://jena.apache.org/documentation/inference/#rules
>>
>> But still I fail to understand in which situations backward rules
>> can/should be used instead of forward rules?
>
>
> Forward rules are generally faster because they keep all that partially
> matched state. So if you have stable data or just add triples monotonically,
> and have a lot of queries, then generally use forward rules for performance.
>
> Backward rules (without tabling) keep no state so there's less memory
> overhead and no cost for delete but they are slow and have to redo the work
> for every query.
>
> Strictly the performance trade-off is a bit more subtle than that. Forward
> rules will try to work out all the entailments whereas backward rules are
> just responding to specific queries. So if your queries only touch a small
> part of the possible space then backward rules could be more efficient.
> However in practice RDF rules seem to involve a lot of unground terms and lots
> of rules match nearly every query.
>
> Tabling allows you to selectively cache certain predicates which can enable
> you to get more reasonable performance while keeping memory use under
> control. You can also do some tuning of how the rules execute by testing if
> variables are bound or not and using different clause orderings for
> different query patterns.
>
>>  I guess simply replacing
>> -> with <- will not be enough?
>
>
> Unless you use non-monotonic predicates (which, sadly, you do) then that
> would be enough to get something working. In fact you don't even need to do
> that. If you create a pure backward reasoner instance (as opposed to the
> hybrid reasoner) it'll read forward syntax rules but treat them as backward.
>
>> The actual rules in question look like
>> this:
>>
>> [gp:    (?class rdf:type
>> <http://www.w3.org/2000/01/rdf-schema#Class>), (?class ?p ?o), (?p
>> rdf:type owl:AnnotationProperty), (?p rdfs:isDefinedBy
>> <http://graphity.org/gp#>), (?subClass rdfs:subClassOf ?class),
>> (?subClass rdf:type <http://www.w3.org/2000/01/rdf-schema#Class>),
>> noValue(?subClass ?p) -> (?subClass ?p ?o) ]
>
>
> That's a horrible rule from the engine's point of view. The head is
> completely ungrounded so when running backwards then it will need to run for
> *every* triple pattern. [It also makes no sense to me as a use of
> owl:AnnotationProperty but whatever.] You could try it backwards but put the
> clauses in a more efficient order:
>
> (?subClass ?p ?o) <-
>      (?p rdf:type owl:AnnotationProperty),
>      (?p rdfs:isDefinedBy <http://graphity.org/gp#>),
>      (?subClass rdfs:subClassOf ?class), (?class ?p ?o) .
>
> The rdf:type rdfs:Class constraints are pointless since those are implied by
> rdfs:subClassOf anyway. The noValue check is probably best avoided for both
> cases.
>
> Alternatively, depending on the nature of your space leak you could use
> hybrid rules:
>
>   (?p rdf:type owl:AnnotationProperty),
>   (?p rdfs:isDefinedBy <http://graphity.org/gp#>)
>     ->
>       [ (?subClass ?p ?o) <- (?subClass rdfs:subClassOf ?class),
>                              (?class ?p ?o) ]
>
> That way the forward engine is only looking at your annotations and the
> backward engine then has rules that have grounded predicates. You could also
> table those predicates:
>
>   (?p rdf:type owl:AnnotationProperty),
>   (?p rdfs:isDefinedBy <http://graphity.org/gp#>)
>     ->
>       table(?p),
>       [ (?subClass ?p ?o) <- (?subClass rdfs:subClassOf ?class),
>                               (?class ?p ?o) ]
>
>> [gcdm:  (?template rdf:type <http://graphity.org/gp#Template>),
>> (?template <http://graphity.org/gc#defaultMode> ?o), (?subClass
>> rdfs:subClassOf ?template), (?subClass rdf:type
>> <http://graphity.org/gp#Template>), noValue(?subClass
>> <http://graphity.org/gc#defaultMode>) -> (?subClass
>> <http://graphity.org/gc#defaultMode> ?o) ]
>> [gcsm:  (?template rdf:type <http://graphity.org/gp#Template>),
>> (?template <http://graphity.org/gc#supportedMode> ?supportedMode),
>> (?subClass rdfs:subClassOf ?template), (?subClass rdf:type
>> <http://graphity.org/gp#Template>) -> (?subClass
>> <http://graphity.org/gc#supportedMode> ?supportedMode) ]
>
>
> These two are more reasonable and could be used backwards or hybrid.
>
>> [rdfs9: (?x rdfs:subClassOf ?y), (?a rdf:type ?x) -> (?a rdf:type ?y)]
>
>
> That would work backwards. Depending on the scale of your data you might
> want to table rdf:type for performance/space tradeoff.
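
A sketch of what that could look like in Jena rule syntax, assuming a backward or hybrid GenericRuleReasoner is in use. The rule is simply rdfs9 from above turned around, and the table directive follows the form shown in the Jena inference documentation; whether tabling rdf:type pays off depends on the size of the data, so treat this as a starting point for tuning rather than a recommendation.

```
-> table(rdf:type).
[rdfs9: (?a rdf:type ?y) <- (?x rdfs:subClassOf ?y), (?a rdf:type ?x)]
```

Because the body of this rule asks for rdf:type again, the rule is recursive; tabling the predicate lets the engine reuse answers it has already produced for a goal, at the price of keeping those answers in memory.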
>
>> Can these be rewritten as backward rules instead?
>
>
> Sure, the challenge is performance tuning as noted above.
>
>> Does it involve code changes, such as calling reset() etc?
>
> Shouldn't do.
>
> Dave

Re: Forward/backward rules (and reasoner memory leaks)

Posted by Dave Reynolds <da...@gmail.com>.
That's not a cache but a table. Don't think it's guaranteed safe to 
delete from it, but may be misremembering - it was a long time ago!

Dave

On 24/06/16 13:43, Stian Soiland-Reyes wrote:
> I rebased and solved the FIXMEs. The memory leaks are still there in a
> way, but the guava cache would flush them out once it reaches the
> configured maximum (I set the default to 512k goals, but the memory
> usage per goal could vary a lot depending on the rules)
>
> On 21 June 2016 at 09:59, Andy Seaborne <an...@apache.org> wrote:
>> We have outstanding:
>>
>> https://github.com/apache/jena/pull/47
>>
>> which changes the cache to LRU from fixed.
>> That does not fix any memory leaks but might mitigate them.
>>
>> There are two FIXME in the PR which could do with looking at.
>>
>>      Andy
>>
>>
>> On 21/06/16 09:28, Dave Reynolds wrote:
>>>
>>> Hi Martynas,
>>>
>>> On 20/06/16 22:18, Martynas Jusevičius wrote:
>>>>
>>>> Hey,
>>>>
>>>> after using GenericRuleReasoner and InfModel more extensively, we
>>>> started experiencing memory leaks that eventually kill our webapp
>>>> because it runs out of heap space. Jena version is 2.11.0.
>>>>
>>>> After some profiling, it seems that RETEEngine.clauseIndex and/or
>>>> RETEEngine.infGraph are retaining a lot of references. It might be
>>>> related to this report, but I'm not sure:
>>>>
>>>> https://mail-archives.apache.org/mod_mbox/jena-users/201403.mbox/%3C5319B4E0.4060106@gmail.com%3E
>>>>
>>>
>>> If it is related to that then it is not a leak; it is "just" memory use.
>>>
>>> A leak implies that when you turn over data then unused internal state
>>> objects are not reclaimed. Are you continuously adding and deleting
>>> data? If so then the delete should release the whole of the RETEEngine
>>> state and start over. If that isn't happening then that's a bug but you
>>> could work around with an explicit reset() or even delete and recreate
>>> your InfGraph at that stage. A delete loses all the state anyway.
>>>
>>>> The suggestion was to use backward rules instead of forward rules.
>>>> I have read the following:
>>>> https://jena.apache.org/documentation/inference/#rules
>>>>
>>>> But still I fail to understand in which situations backward rules
>>>> can/should be used instead of forward rules?
>>>
>>>
>>> Forward rules are generally faster because they keep all that partially
>>> matched state. So if you have stable data or just add triples
>>> monotonically, and have a lot of queries, then generally use forward
>>> rules for performance.
>>>
>>> Backward rules (without tabling) keep no state so there's less memory
>>> overhead and no cost for delete but they are slow and have to redo the
>>> work for every query.
>>>
>>> Strictly the performance trade-off is a bit more subtle than that.
>>> Forward rules will try to work out all the entailments whereas backward
>>> rules are just responding to specific queries. So if your queries only
>>> touch a small part of the possible space then backward rules could be
>>> more efficient. However in practice RDF rules seem involve a lot of
>>> unground terms and lots of rules match nearly every query.
>>>
>>> Tabling allows you to selectively cache certain predicates which can
>>> enable you to get more reasonable performance while keeping memory use
>>> under control. You can also do some tuning of how the rules execute by
>>> testing if variables are bound or not and using different clause
>>> orderings for different query patterns.
>>>
>>>>   I guess simply replacing
>>>> -> with <- will not be enough?
>>>
>>>
>>> Unless you use non-monotonic predicates (which, sadly, you do) then that
>>> would be enough to get something working. In fact you don't even need to
>>> do that. If you create a pure backward reasoner instances (as opposed to
>>> the hybrid) reasoner it'll read forward syntax rules but treat them as
>>> backward.
>>>
>>>> The actual rules in question look like
>>>> this:
>>>>
>>>> [gp:    (?class rdf:type
>>>> <http://www.w3.org/2000/01/rdf-schema#Class>), (?class ?p ?o), (?p
>>>> rdf:type owl:AnnotationProperty), (?p rdfs:isDefinedBy
>>>> <http://graphity.org/gp#>), (?subClass rdfs:subClassOf ?class),
>>>> (?subClass rdf:type <http://www.w3.org/2000/01/rdf-schema#Class>),
>>>> noValue(?subClass ?p) -> (?subClass ?p ?o) ]
>>>
>>>
>>> That's a horrible rule from the engine's point of view. The head is
>>> completely ungrounded so when running backwards then it will need to run
>>> for *every* triple pattern. [It also makes no sense to me as a use of
>>> owl:AnnotationProperty but whatever.] You could try it backwards but put
>>> the clauses in a more efficient order:
>>>
>>> (?subClass ?p ?o) <-
>>>        (?p rdf:type owl:AnnotationProperty),
>>>        (?p rdfs:isDefinedBy <http://graphity.org/gp#>),
>>>        (?subClass rdfs:subClassOf ?class), (?class ?p ?o) .
>>>
>>> The rdf:type rdfs:Class constraints are pointless since those are
>>> implied by rdfs:subClassOf anyway. The noValue check is probably best
>>> avoided for both cases.
>>>
>>> Alternatively, depending on the nature of your space leak you could use
>>> hybrid rules:
>>>
>>>     (?p rdf:type owl:AnnotationProperty),
>>>     (?p rdfs:isDefinedBy <http://graphity.org/gp#>)
>>>       ->
>>>         [ (?subClass ?p ?o) <- (?subClass rdfs:subClassOf ?class),
>>>                                (?class ?p ?o) ]
>>>
>>> That way the forward engine is only looking at your annotations and the
>>> backward engine then has rules that have grounded predicates. You could
>>> also table those predicates:
>>>
>>>     (?p rdf:type owl:AnnotationProperty),
>>>     (?p rdfs:isDefinedBy <http://graphity.org/gp#>)
>>>       ->
>>>         table(?p),
>>>         [ (?subClass ?p ?o) <- (?subClass rdfs:subClassOf ?class),
>>>                                 (?class ?p ?o) ]
>>>
>>>> [gcdm:  (?template rdf:type <http://graphity.org/gp#Template>),
>>>> (?template <http://graphity.org/gc#defaultMode> ?o), (?subClass
>>>> rdfs:subClassOf ?template), (?subClass rdf:type
>>>> <http://graphity.org/gp#Template>), noValue(?subClass
>>>> <http://graphity.org/gc#defaultMode>) -> (?subClass
>>>> <http://graphity.org/gc#defaultMode> ?o) ]
>>>> [gcsm:  (?template rdf:type <http://graphity.org/gp#Template>),
>>>> (?template <http://graphity.org/gc#supportedMode> ?supportedMode),
>>>> (?subClass rdfs:subClassOf ?template), (?subClass rdf:type
>>>> <http://graphity.org/gp#Template>) -> (?subClass
>>>> <http://graphity.org/gc#supportedMode> ?supportedMode) ]
>>>
>>>
>>> These two are more reasonable and could be used backwards or hybrid.
>>>
>>>> [rdfs9: (?x rdfs:subClassOf ?y), (?a rdf:type ?x) -> (?a rdf:type ?y)]
>>>
>>>
>>> That would work backwards. Depending on the scale of your data you might
>>> want to table rdf:type for performance/space tradeoff.
>>>
>>>> Can these be rewritten as backward rules instead?
>>>
>>>
>>> Sure, the challenge is performance tuning as noted above.
>>>
>>>   > Does it involve code changes, such as calling reset() etc?
>>>
>>> Shouldn't do.
>>>
>>> Dave
>>
>>
>
>
>

Re: Forward/backward rules (and reasoner memory leaks)

Posted by Stian Soiland-Reyes <st...@apache.org>.
I rebased and solved the FIXMEs. The memory leaks are still there in a
way, but the Guava cache would flush them out once it reaches the
configured maximum (I set the default to 512k goals, but the memory
usage per goal could vary a lot depending on the rules).

On 21 June 2016 at 09:59, Andy Seaborne <an...@apache.org> wrote:
> We have outstanding:
>
> https://github.com/apache/jena/pull/47
>
> which changes the cache to LRU from fixed.
> That does not fix any memory leaks but might mitigate them.
>
> There are two FIXME in the PR which could do with looking at.
>
>     Andy



-- 
Stian Soiland-Reyes
Apache Taverna (incubating), Apache Commons
http://orcid.org/0000-0001-9842-9718

Re: Forward/backward rules (and reasoner memory leaks)

Posted by Andy Seaborne <an...@apache.org>.
On 21/06/16 21:20, Martynas Jusevičius wrote:
> What about https://issues.apache.org/jira/browse/JENA-650?

It was a GSoC project and provides some useful prototyping - that was 
the project goal and it was successful.

It isn't in a state to integrate into the release - how about trying it
out?

     Andy



Re: Forward/backward rules (and reasoner memory leaks)

Posted by Martynas Jusevičius <ma...@graphity.org>.
What about https://issues.apache.org/jira/browse/JENA-650?

On Tue, Jun 21, 2016 at 10:59 AM, Andy Seaborne <an...@apache.org> wrote:
> We have outstanding:
>
> https://github.com/apache/jena/pull/47
>
> which changes the cache to LRU from fixed.
> That does not fix any memory leaks but might mitigate them.
>
> There are two FIXME in the PR which could do with looking at.
>
>     Andy

Re: Forward/backward rules (and reasoner memory leaks)

Posted by Andy Seaborne <an...@apache.org>.
We have outstanding:

https://github.com/apache/jena/pull/47

which changes the cache to LRU from fixed.
That does not fix any memory leaks but might mitigate them.

There are two FIXME in the PR which could do with looking at.

     Andy

On 21/06/16 09:28, Dave Reynolds wrote:
> Hi Martynas,
>
> On 20/06/16 22:18, Martynas Jusevičius wrote:
>> Hey,
>>
>> after using GenericRuleReasoner and InfModel more extensively, we
>> started experiencing memory leaks that eventually kill our webapp
>> because it runs out of heap space. Jena version is 2.11.0.
>>
>> After some profiling, it seems that RETEEngine.clauseIndex and/or
>> RETEEngine.infGraph are retaining a lot of references. It might be
>> related to this report, but I'm not sure:
>> https://mail-archives.apache.org/mod_mbox/jena-users/201403.mbox/%3C5319B4E0.4060106@gmail.com%3E
>>
>
> If it is related to that then it is not a leak it is "just" memory use.
>
> A leak implies that when you turn over data then unused internal state
> objects are not reclaimed. Are you continuously adding and deleting
> data? If so then the delete should release the whole of the RETEEngine
> state and start over. If that isn't happening then that's a bug but you
> could work around with an explicit reset() or even delete and recreate
> your InfGraph at that stage. A delete loses all the state anyway.
>
>> The suggestion was to use backward rules instead of forward rules.
>> I have read the following:
>> https://jena.apache.org/documentation/inference/#rules
>>
>> But still I fail to understand in which situations backward rules
>> can/should be used instead of forward rules?
>
> Forward rules are generally faster because they keep all that partially
> matched state. So if you have stable data or just add triples
> monotonically, and have a lot of queries, then generally use forward
> rules for performance.
>
> Backward rules (without tabling) keep no state so there's less memory
> overhead and no cost for delete but they are slow and have to redo the
> work for every query.
>
> Strictly the performance trade-off is a bit more subtle than that.
> Forward rules will try to work out all the entailments whereas backward
> rules are just responding to specific queries. So if your queries only
> touch a small part of the possible space then backward rules could be
> more efficient. However, in practice RDF rules seem to involve a lot of
> unground terms and lots of rules match nearly every query.
>
> Tabling allows you to selectively cache certain predicates which can
> enable you to get more reasonable performance while keeping memory use
> under control. You can also do some tuning of how the rules execute by
> testing if variables are bound or not and using different clause
> orderings for different query patterns.
>
>>  I guess simply replacing
>> -> with <- will not be enough?
>
> Unless you use non-monotonic predicates (which, sadly, you do) then that
> would be enough to get something working. In fact you don't even need to
> do that. If you create a pure backward reasoner instance (as opposed to
> the hybrid reasoner) it'll read forward syntax rules but treat them as
> backward.
>
>> The actual rules in question look like
>> this:
>>
>> [gp:    (?class rdf:type
>> <http://www.w3.org/2000/01/rdf-schema#Class>), (?class ?p ?o), (?p
>> rdf:type owl:AnnotationProperty), (?p rdfs:isDefinedBy
>> <http://graphity.org/gp#>), (?subClass rdfs:subClassOf ?class),
>> (?subClass rdf:type <http://www.w3.org/2000/01/rdf-schema#Class>),
>> noValue(?subClass ?p) -> (?subClass ?p ?o) ]
>
> That's a horrible rule from the engine's point of view. The head is
> completely ungrounded so when running backwards then it will need to run
> for *every* triple pattern. [It also makes no sense to me as a use of
> owl:AnnotationProperty but whatever.] You could try it backwards but put
> the clauses in a more efficient order:
>
> (?subClass ?p ?o) <-
>       (?p rdf:type owl:AnnotationProperty),
>       (?p rdfs:isDefinedBy <http://graphity.org/gp#>),
>       (?subClass rdfs:subClassOf ?class), (?class ?p ?o) .
>
> The rdf:type rdfs:Class constraints are pointless since those are
> implied by rdfs:subClassOf anyway. The noValue check is probably best
> avoided for both cases.
>
> Alternatively, depending on the nature of your space leak you could use
> hybrid rules:
>
>    (?p rdf:type owl:AnnotationProperty),
>    (?p rdfs:isDefinedBy <http://graphity.org/gp#>)
>      ->
>        [ (?subClass ?p ?o) <- (?subClass rdfs:subClassOf ?class),
>                               (?class ?p ?o) ]
>
> That way the forward engine is only looking at your annotations and the
> backward engine then has rules that have grounded predicates. You could
> also table those predicates:
>
>    (?p rdf:type owl:AnnotationProperty),
>    (?p rdfs:isDefinedBy <http://graphity.org/gp#>)
>      ->
>        table(?p),
>        [ (?subClass ?p ?o) <- (?subClass rdfs:subClassOf ?class),
>                                (?class ?p ?o) ]
>
>> [gcdm:  (?template rdf:type <http://graphity.org/gp#Template>),
>> (?template <http://graphity.org/gc#defaultMode> ?o), (?subClass
>> rdfs:subClassOf ?template), (?subClass rdf:type
>> <http://graphity.org/gp#Template>), noValue(?subClass
>> <http://graphity.org/gc#defaultMode>) -> (?subClass
>> <http://graphity.org/gc#defaultMode> ?o) ]
>> [gcsm:  (?template rdf:type <http://graphity.org/gp#Template>),
>> (?template <http://graphity.org/gc#supportedMode> ?supportedMode),
>> (?subClass rdfs:subClassOf ?template), (?subClass rdf:type
>> <http://graphity.org/gp#Template>) -> (?subClass
>> <http://graphity.org/gc#supportedMode> ?supportedMode) ]
>
> These two are more reasonable and could be used backwards or hybrid.
>
>> [rdfs9: (?x rdfs:subClassOf ?y), (?a rdf:type ?x) -> (?a rdf:type ?y)]
>
> That would work backwards. Depending on the scale of your data you might
> want to table rdf:type for performance/space tradeoff.
>
>> Can these be rewritten as backward rules instead?
>
> Sure, the challenge is performance tuning as noted above.
>
>  > Does it involve code changes, such as calling reset() etc?
>
> Shouldn't do.
>
> Dave


Re: Forward/backward rules (and reasoner memory leaks)

Posted by Dave Reynolds <da...@gmail.com>.
Hi Martynas,

On 20/06/16 22:18, Martynas Jusevičius wrote:
> Hey,
>
> after using GenericRuleReasoner and InfModel more extensively, we
> started experiencing memory leaks that eventually kill our webapp
> because it runs out of heap space. Jena version is 2.11.0.
>
> After some profiling, it seems that RETEEngine.clauseIndex and/or
> RETEEngine.infGraph are retaining a lot of references. It might be
> related to this report, but I'm not sure:
> https://mail-archives.apache.org/mod_mbox/jena-users/201403.mbox/%3C5319B4E0.4060106@gmail.com%3E

If it is related to that then it is not a leak; it is "just" memory use.

A leak implies that when you turn over data then unused internal state 
objects are not reclaimed. Are you continuously adding and deleting 
data? If so then the delete should release the whole of the RETEEngine 
state and start over. If that isn't happening then that's a bug but you 
could work around with an explicit reset() or even delete and recreate 
your InfGraph at that stage. A delete loses all the state anyway.

> The suggestion was to use use backward rules instead of forward rules.
> I have read the following:
> https://jena.apache.org/documentation/inference/#rules
>
> But still I fail to understand in which situations backward rules
> can/should be used instead of forward rules?

Forward rules are generally faster because they keep all that partially 
matched state. So if you have stable data or just add triples 
monotonically, and have a lot of queries, then generally use forward 
rules for performance.

Backward rules (without tabling) keep no state so there's less memory 
overhead and no cost for delete but they are slow and have to redo the 
work for every query.

Strictly the performance trade-off is a bit more subtle than that. 
Forward rules will try to work out all the entailments whereas backward 
rules are just responding to specific queries. So if your queries only 
touch a small part of the possible space then backward rules could be 
more efficient. However, in practice RDF rules seem to involve a lot of 
unground terms and lots of rules match nearly every query.

Tabling allows you to selectively cache certain predicates which can 
enable you to get more reasonable performance while keeping memory use 
under control. You can also do some tuning of how the rules execute by 
testing if variables are bound or not and using different clause 
orderings for different query patterns.
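
As a rough sketch of that kind of tuning (not taken from your rule set: 
the rule names are made up, and it assumes a hybrid reasoner so that the 
table directive gets processed), rdf:type propagation could be tabled 
and split on whether the class is bound in the query:

```
# Cache rdf:type goals in the backward engine.
-> table(rdf:type).

# Class known: find its subclasses first, then their instances.
[typeByClass: (?a rdf:type ?y) <-
    bound(?y), (?x rdfs:subClassOf ?y), (?a rdf:type ?x) ]

# Class unknown: look up the types of ?a first, then walk up to superclasses.
[typeByInstance: (?a rdf:type ?y) <-
    unbound(?y), (?a rdf:type ?x), (?x rdfs:subClassOf ?y) ]
```

Whether the split helps depends on the query mix and the shape of the 
data, so it is worth measuring before keeping it.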

>  I guess simply replacing
> -> with <- will not be enough?

Unless you use non-monotonic predicates (which, sadly, you do), that 
would be enough to get something working. In fact you don't even need to 
do that. If you create a pure backward reasoner instance (as opposed to 
the hybrid reasoner) it'll read forward syntax rules but treat them as 
backward.

> The actual rules in question look like
> this:
>
> [gp:    (?class rdf:type
> <http://www.w3.org/2000/01/rdf-schema#Class>), (?class ?p ?o), (?p
> rdf:type owl:AnnotationProperty), (?p rdfs:isDefinedBy
> <http://graphity.org/gp#>), (?subClass rdfs:subClassOf ?class),
> (?subClass rdf:type <http://www.w3.org/2000/01/rdf-schema#Class>),
> noValue(?subClass ?p) -> (?subClass ?p ?o) ]

That's a horrible rule from the engine's point of view. The head is 
completely ungrounded, so when running backwards it will need to run 
for *every* triple pattern. [It also makes no sense to me as a use of 
owl:AnnotationProperty but whatever.] You could try it backwards but put 
the clauses in a more efficient order:

(?subClass ?p ?o) <-
      (?p rdf:type owl:AnnotationProperty),
      (?p rdfs:isDefinedBy <http://graphity.org/gp#>),
      (?subClass rdfs:subClassOf ?class), (?class ?p ?o) .

The rdf:type rdfs:Class constraints are pointless since those are 
implied by rdfs:subClassOf anyway. The noValue check is probably best 
avoided for both cases.

Alternatively, depending on the nature of your space leak you could use 
hybrid rules:

   (?p rdf:type owl:AnnotationProperty),
   (?p rdfs:isDefinedBy <http://graphity.org/gp#>)
     ->
       [ (?subClass ?p ?o) <- (?subClass rdfs:subClassOf ?class),
                              (?class ?p ?o) ]

That way the forward engine is only looking at your annotations and the 
backward engine then has rules that have grounded predicates. You could 
also table those predicates:

   (?p rdf:type owl:AnnotationProperty),
   (?p rdfs:isDefinedBy <http://graphity.org/gp#>)
     ->
       table(?p),
       [ (?subClass ?p ?o) <- (?subClass rdfs:subClassOf ?class),
                               (?class ?p ?o) ]

> [gcdm:  (?template rdf:type <http://graphity.org/gp#Template>),
> (?template <http://graphity.org/gc#defaultMode> ?o), (?subClass
> rdfs:subClassOf ?template), (?subClass rdf:type
> <http://graphity.org/gp#Template>), noValue(?subClass
> <http://graphity.org/gc#defaultMode>) -> (?subClass
> <http://graphity.org/gc#defaultMode> ?o) ]
> [gcsm:  (?template rdf:type <http://graphity.org/gp#Template>),
> (?template <http://graphity.org/gc#supportedMode> ?supportedMode),
> (?subClass rdfs:subClassOf ?template), (?subClass rdf:type
> <http://graphity.org/gp#Template>) -> (?subClass
> <http://graphity.org/gc#supportedMode> ?supportedMode) ]

These two are more reasonable and could be used backwards or hybrid.
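
For illustration, gcsm might look like this as a backward rule. This is 
an untested sketch: the clause order is only a guess at what is 
selective in your data, and the table directive is optional and assumes 
a hybrid reasoner.

```
# Optional: table the predicate so inherited values are cached.
-> table(<http://graphity.org/gc#supportedMode>).

[gcsm: (?subClass <http://graphity.org/gc#supportedMode> ?supportedMode) <-
    (?subClass rdfs:subClassOf ?template),
    (?subClass rdf:type <http://graphity.org/gp#Template>),
    (?template rdf:type <http://graphity.org/gp#Template>),
    (?template <http://graphity.org/gc#supportedMode> ?supportedMode) ]
```

The last clause calls the same rule again, so without tabling this 
relies on the rdfs:subClassOf data being acyclic and not reflexive. gcdm 
has the same shape, but its noValue guard is non-monotonic and needs 
separate thought.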

> [rdfs9: (?x rdfs:subClassOf ?y), (?a rdf:type ?x) -> (?a rdf:type ?y)]

That would work backwards. Depending on the scale of your data you might 
want to table rdf:type for performance/space tradeoff.

> Can these be rewritten as backward rules instead?

Sure, the challenge is performance tuning as noted above.

 > Does it involve code changes, such as calling reset() etc?

Shouldn't do.

Dave