Posted to dev@jena.apache.org by Andy Seaborne <an...@apache.org> on 2012/07/20 13:52:22 UTC

NodeCache : keep or remove?

In JENA-279, the issue of whether the NodeCache serves any useful 
purpose these days has come up.

Proposal: Remove the node cache
Proposal: Remove the triple cache

Node cache:

There are two reasons for the cache: time saving (object creation costs)
and space saving (reuse nodes).  I'm not sure either of these applies much
nowadays.  Java has moved on; parsers should be doing the caching, and then
the cache is per-run.

TDB does its own thing because it is caching the node file and the
cache is NodeId to Node.

RIOT, for IRIs, does its own thing because it is coupled with caching
IRI parsing, which is expensive because it's picky.

A quick test: parsing a file:
- - - - - - - - - - - - - -
With node cache:
bsbm-25m.nt.gz : 183.27 sec  25,000,250 triples  136,415.85 TPS

Without node cache:
Node.cache(false) ;
bsbm-25m.nt.gz : 179.19 sec  25,000,250 triples  139,514.99 TPS
- - - - - - - - - - - - - -

so I think that it is better to remove the Node cache and Triple caches 
and put reuse of Nodes (space saving, if any) as the responsibility of 
the creation code (which is a parser or persistent-to-memory storage 
unit typically).
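
To illustrate what "reuse as the responsibility of the creation code" might look like, here is a minimal sketch of a bounded cache that a parser could own for the length of one run. PerRunCache and getOrCreate are hypothetical names, not Jena classes, and the LRU eviction policy is an assumption made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical per-run cache owned by one parser instance; not a Jena class.
final class PerRunCache<K, V> {
    private final Map<K, V> lru;

    PerRunCache(final int maxEntries) {
        // accessOrder = true keeps the least recently used entry first.
        this.lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Return the cached value for key, creating it once if absent.
    V getOrCreate(K key, Function<K, V> maker) {
        V value = lru.get(key);
        if (value == null) {
            value = maker.apply(key);
            lru.put(key, value);
        }
        return value;
    }

    int size() {
        return lru.size();
    }
}
```

Because the cache is an ordinary object held by the parser, it needs no global state or locking and is discarded with the parser when the run ends.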

I will check ARP to see what it does (unless anyone knows ...)

There are other caches at the Resource level so there is some overlap there.

Triple cache:

There is a Triple cache as well, although a lot of code goes directly to
new Triple()

But any storage layer already checks for a triple on insertion, so
there is no space saving within one graph.  The rules engine has two graphs
so there is not much saving there either.  In fact, the cache overhead
is a net cost!
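
A small sketch of the point about storage layers, assuming a set-backed in-memory graph. SimpleTriple and SimpleGraph are hypothetical stand-ins rather than Jena classes; the example only shows that an equal triple is rejected on insertion, so a second object for the same triple is never retained by the graph.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical value-style triple; stands in for a real triple class.
final class SimpleTriple {
    final String s, p, o;

    SimpleTriple(String s, String p, String o) {
        this.s = s;
        this.p = p;
        this.o = o;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof SimpleTriple)) return false;
        SimpleTriple t = (SimpleTriple) other;
        return s.equals(t.s) && p.equals(t.p) && o.equals(t.o);
    }

    @Override
    public int hashCode() {
        return Objects.hash(s, p, o);
    }
}

// Hypothetical in-memory graph that checks for the triple on insertion.
final class SimpleGraph {
    private final Set<SimpleTriple> triples = new HashSet<>();

    // Returns false, and stores nothing, if an equal triple is already present.
    boolean add(SimpleTriple t) {
        return triples.add(t);
    }

    int size() {
        return triples.size();
    }
}
```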

There is no Quad cache.

	Andy


Re: NodeCache : keep or remove?

Posted by Stephen Allen <sa...@apache.org>.
Additionally, a fun thing to look at for Node would be to have a
static cache pre-populated with commonly used resources (RDF, RDFS,
OWL, etc.), similar to Java's Integer.valueOf(int) method.  That could
be useful.
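
A rough sketch of that idea, in the spirit of Integer.valueOf(int): a fixed table filled once at class initialization and only read afterwards, so lookups need no locking. IriNode and WellKnownNodes are hypothetical names, not Jena's Node API, and the handful of IRIs listed is just an illustrative selection.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an IRI node; not Jena's Node class.
final class IriNode {
    final String iri;

    IriNode(String iri) {
        this.iri = iri;
    }
}

final class WellKnownNodes {
    // Filled once in the static initializer and never modified afterwards.
    private static final Map<String, IriNode> FIXED = new HashMap<>();

    static {
        String[] common = {
            "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
            "http://www.w3.org/2000/01/rdf-schema#label",
            "http://www.w3.org/2000/01/rdf-schema#subClassOf",
            "http://www.w3.org/2002/07/owl#Class",
            "http://www.w3.org/2002/07/owl#sameAs"
        };
        for (String iri : common) {
            FIXED.put(iri, new IriNode(iri));
        }
    }

    private WellKnownNodes() {}

    // Shared instance for a well-known term; a fresh object for anything else.
    static IriNode valueOf(String iri) {
        IriNode node = FIXED.get(iri);
        return (node != null) ? node : new IriNode(iri);
    }
}
```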

-Stephen


On Fri, Jul 20, 2012 at 8:31 AM, Stephen Allen <sa...@apache.org> wrote:
> +1 on removal of both the node and triple caches.

Re: NodeCache : keep or remove?

Posted by Stephen Allen <sa...@apache.org>.
+1 on removal of both the node and triple caches.

In addition to the reasons already discussed, there is also the fact
that Node.create() uses a global lock, which is going to be really bad
for concurrency!

Triple.create() doesn't do any locking, which appears to work out OK
in this specific instance because the cache never tries to remove
anything (the worst that could happen would be for two identical
triples to be floating around when two threads tried to insert the
same triple at the same time).
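
The two locking styles can be contrasted with a simplified sketch. This is not the actual Jena code: Item, GlobalLockCache and UnlockedSlotCache are hypothetical, and the slot-array layout of the second cache is an assumption made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical immutable value; its only field is final, so instances are safe to share.
final class Item {
    final String label;

    Item(String label) {
        this.label = label;
    }
}

// Style 1: one global lock guards the map, so every caller on every thread queues here.
final class GlobalLockCache {
    private static final Map<String, Item> CACHE = new HashMap<>();

    static synchronized Item create(String label) {
        Item item = CACHE.get(label);
        if (item == null) {
            item = new Item(label);
            CACHE.put(label, item);
        }
        return item;
    }
}

// Style 2: fixed array of slots, no locking; slots are only ever overwritten.
final class UnlockedSlotCache {
    private static final Item[] SLOTS = new Item[1024];

    static Item create(String label) {
        int i = (label.hashCode() & 0x7fffffff) % SLOTS.length;
        Item item = SLOTS[i];              // unsynchronized read
        if (item != null && item.label.equals(label)) {
            return item;
        }
        item = new Item(label);
        SLOTS[i] = item;                   // unsynchronized write; another thread may do the same
        return item;
    }
}
```

In the second style, two threads that miss on the same slot at the same time each build their own Item, so two equal objects can coexist; callers must therefore compare by value rather than by identity.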

-Stephen


On Fri, Jul 20, 2012 at 7:59 AM, Dave Reynolds
<da...@gmail.com> wrote:
> Agreed.

Re: NodeCache : keep or remove?

Posted by Dave Reynolds <da...@gmail.com>.
Agreed.

Primary value of a node cache from my POV is space saving for in-memory 
models. But that could indeed be done by ARP (if it isn't already) and 
is probably better done at the resource level.

I wouldn't expect any significant effect on the rules engines from
scrapping these caches.

Dave


On 20/07/12 12:52, Andy Seaborne wrote:
> In JENA-279, the issue of whether the NodeCache serves any useful
> purpose these days has come up.