Posted to dev@jena.apache.org by "A. Soroka" <aj...@virginia.edu> on 2015/06/29 15:04:58 UTC

Fuseki and ETags

A quick discussion of ETags in the "backup admin" PR that was sent by Yang Yuanzhe led me to this issue:

https://issues.apache.org/jira/browse/JENA-388

for "Make Fuseki responses cacheable", which has been around for a little while. I was wondering about a few potential approaches here and thought I would run them down:

1) ETag-per-Dataset: this is a single ETag value per Dataset, shared by all requests and updated whenever a mutating request completes. Any change whatsoever to a Dataset that comes through Fuseki would then invalidate all ETag-based caching on that Dataset. This seems to be where Andy Seaborne and Rob Vesse were heading, but I obviously can't speak for them. Advantage: relatively simple. Disadvantages: changes to the indexes that are not performed by Fuseki will not be reflected properly, and the approach is only useful for instances that receive the right pattern of changes (meaning mutations that aren't too "evenly sprinkled" amongst queries, which would keep the cache invalidated most of the time).

2) Constant Expires: Rob Vesse discusses this a bit in the issue. It's an Expires header that is configurable to allow some admin adjustment, but is constant during runtime. Advantage: dead simple. Disadvantage: unless the usage scenario is very tightly controlled, there's going to be some leakage of stale data. That may or may not be a big problem for an integrator, depending on use case. It would have to be carefully documented, I think, to avoid nasty surprises.

3) Per-query ETag: This would mean some kind of map from request to ETag, from which ETag headers are supplied for every request. The problem with this is that it implies some kind of reasonable algorithm for determining when an arbitrary update makes sufficient changes in an arbitrary graph to affect another arbitrary query, or it would imply stretching the meaning of "weak" ETag to a point that is probably not useful or correct for a query endpoint. This doesn't seem very practical.

4) Per-query-for-some-queries ETag. The idea here would be to cut down option 3 to a tranche of queries for which there actually _does_ exist some reasonable algorithm for detecting changes in the query results. The example that comes to mind here would be simple DESCRIBE queries. Since it seems that ARQ deals with DESCRIBE using only relationships "outbound" from the things described, this approach could use an expiring map from URIs to ETags, which could be updated (perhaps using a StatementListener) when a change directly affects a URI or a blank node in the CBD (Concise Bounded Description) of that URI. This could be expensive, but it might be worth it for some use cases, for example where integrators are using software like Pubby to publish RDF. There might be other examples of query patterns where changes are practically calculable.
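
To make the "expiring map from URIs to ETags" in option 4 a little more concrete, here is a minimal Java sketch. Everything in it is hypothetical: ResourceETags, touch, etagFor and evictExpired are names invented for illustration, not Jena or Fuseki API. It also assumes that some listener (a StatementListener, say) has already worked out which described URI a changed statement belongs to, including the blank-node case, which is the expensive part and is not shown.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper for illustration; not part of Jena or Fuseki.
final class ResourceETags {
    // One version number per described URI, plus the time it was assigned.
    private static final class Entry {
        final long version;
        final long setAtMillis;

        Entry(long version, long setAtMillis) {
            this.version = version;
            this.setAtMillis = setAtMillis;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    // Global counter, so a re-created entry never reuses an old version.
    private final AtomicLong nextVersion = new AtomicLong();
    private final long ttlMillis;

    ResourceETags(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    private boolean expired(Entry e, long nowMillis) {
        return nowMillis - e.setAtMillis > ttlMillis;
    }

    // Called by the (assumed) change listener when a statement about this
    // URI, or about a blank node in its description, is added or removed.
    void touch(String uri, long nowMillis) {
        entries.put(uri, new Entry(nextVersion.incrementAndGet(), nowMillis));
    }

    // ETag to send with the response to "DESCRIBE <uri>". Unknown or expired
    // URIs get a fresh version, so an old client tag can never match.
    String etagFor(String uri, long nowMillis) {
        Entry e = entries.compute(uri, (key, old) ->
                (old == null || expired(old, nowMillis))
                        ? new Entry(nextVersion.incrementAndGet(), nowMillis)
                        : old);
        return "\"" + Long.toHexString(e.version) + "\"";
    }

    // Periodic cleanup that keeps the map from growing without bound.
    void evictExpired(long nowMillis) {
        entries.values().removeIf(e -> expired(e, nowMillis));
    }
}
```

In this sketch, expiry errs on the safe side: an entry that ages out simply gets a new tag, which costs a client one unnecessary re-fetch but should never produce a false "not modified".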

Whether (and how far) any of these are worth pursuing depends a good bit on the use case in hand. For example, for my use cases, option 2 isn't really practical, because one of the applications taking results from Fuseki would be using them to present live-editing pages. Option 1 would work, and it would give some advantage. Option 4 isn't interesting because very few of the queries in play will be simple DESCRIBE queries. But that's all based on my use case.

Do you think any of these are worth pursuing?

---
A. Soroka
The University of Virginia Library

Re: Fuseki and ETags

Posted by Andy Seaborne <an...@apache.org>.

On 30/06/15 09:48, Rob Vesse wrote:
> FWIW dotNetRDF has supported and used E-Tags when retrieving graphs from
> remote URIs since 2010, i.e. for 5 years.
>
> So yes, these things are used in the real world, at least for simple data
> retrieval. I honestly haven't ever used them for other SPARQL operations.
>
> We even support caching local copies of remote data based on E-Tags so
> that we avoid unnecessary data transfer if the locally cached version is
> still the same as the one offered by the server.
>
> Rob
>
> On 29/06/2015 17:33, "ajs6f@virginia.edu" <aj...@virginia.edu> wrote:
>
>> I can only speak for the use cases I actually know about. ETags would get
>> used, because the most important web app in my concern that is
>> potentially a client to Fuseki would be able to use them. But that is
>> just one case.
>>
>> JENA-626 would be great in any regard.
>>
>> ---
>> A. Soroka
>> The University of Virginia Library

Good to know someone would find it useful!

	Andy


Re: Fuseki and ETags

Posted by Rob Vesse <rv...@dotnetrdf.org>.

FWIW dotNetRDF has supported and used E-Tags when retrieving graphs from
remote URIs since 2010, i.e. for 5 years.

So yes, these things are used in the real world, at least for simple data
retrieval. I honestly haven't ever used them for other SPARQL operations.

We even support caching local copies of remote data based on E-Tags so
that we avoid unnecessary data transfer if the locally cached version is
still the same as the one offered by the server.

Rob

On 29/06/2015 17:33, "ajs6f@virginia.edu" <aj...@virginia.edu> wrote:

>I can only speak for the use cases I actually know about. ETags would get
>used, because the most important web app in my concern that is
>potentially a client to Fuseki would be able to use them. But that is
>just one case.
>
>JENA-626 would be great in any regard.
>
>---
>A. Soroka
>The University of Virginia Library

Re: Fuseki and ETags

Posted by "ajs6f@virginia.edu" <aj...@virginia.edu>.

I can only speak for the use cases I actually know about. ETags would get used, because the most important web app in my concern that is potentially a client to Fuseki would be able to use them. But that is just one case.

JENA-626 would be great in any regard. 

---
A. Soroka
The University of Virginia Library

On Jun 29, 2015, at 12:20 PM, Andy Seaborne <an...@apache.org> wrote:

> There is no case of external modification of the database which Fuseki is running.  A disaster will occur otherwise.  [Modifying externally while running requires a different approach (e.g. switching between two copies of the database ... maybe ... so many ways to corrupt a database ... ).]
> 
> 
> E-tags are quite a technical solution - will any system actually use them for real, even if they are the right solution?  We wouldn't want to find out that ETag support does not get used.  For the SPARQL Protocol case (with query strings), it might not really get used.  Has caching of requests that include a query string rolled out to any degree? (a point from the discussion in JENA-388).
> 
> If query strings currently cause no caching by intermediaries in practice, will clients cache, which is the case of one client reissuing the same query? Possible, but is it likely?
> 
> See also JENA-626 "SPARQL Query Caching".  That would make a difference - different client apps starting up often ask the same query to get started.
> 
> 	Andy

Re: Fuseki and ETags

Posted by Andy Seaborne <an...@apache.org>.

There is no case of external modification of the database which Fuseki 
is running.  A disaster will occur otherwise.  [Modifying externally 
while running requires a different approach (e.g. switching between two 
copies of the database ... maybe ... so many ways to corrupt a database 
... ).]


E-tags are quite a technical solution - will any system actually use them
for real, even if they are the right solution?  We wouldn't want to find out
that ETag support does not get used.  For the SPARQL Protocol case
(with query strings), it might not really get used.  Has caching of
requests that include a query string rolled out to any degree? (a point from
the discussion in JENA-388).

If query strings currently cause no caching by intermediaries in
practice, will clients cache, which is the case of one client reissuing
the same query? Possible, but is it likely?

See also JENA-626 "SPARQL Query Caching".  That would make a difference 
- different client apps starting up often ask the same query to get started.

	Andy

On 29/06/15 16:03, Claude Warren wrote:
> I am not familiar with how the indexing interplays with the rest of the
> Jena system.  My assumption is, like you, that we only want the ETag in the
> Fuseki layer.  However, to generate an ETag it seems like Fuseki will need
> to be able to ask the underlying dataset when the last change occurred, but
> then you also want to know if indexing has changed so that results may be
> changed as well.
>
> If we consider ETag generation separate from the Dataset then the ETag
> generator could register as a listener to the dataset and react whenever a
> change occurs to the model.  This doesn't solve the problem of responding
> to index updates.  However, whatever interface the listener uses to trigger
> an ETag change could just as well be done by an indexer.  Is there an
> indexer listener interface (à la Model/Graph listeners)?  In this solution
> the ETag gets input from any registered component.  I think that each
> registered component should have a "name" and a "value".  The ETag
> generator would retain the most recent value for each registered component
> and generate a new ETag when a value changes.  So I see a class with 2
> methods
>
> void ETagGenerator.change( String name, String value )
> and
> String ETagGenerator.getTag(); // to retrieve the current tag.
>
> Claude

Re: Fuseki and ETags

Posted by Claude Warren <cl...@xenei.com>.

I am not familiar with how the indexing interplays with the rest of the
Jena system.  My assumption is, like you, that we only want the ETag in the
Fuseki layer.  However, to generate an ETag it seems like Fuseki will need
to be able to ask the underlying dataset when the last change occurred, but
then you also want to know if indexing has changed so that results may be
changed as well.

If we consider ETag generation separate from the Dataset then the ETag
generator could register as a listener to the dataset and react whenever a
change occurs to the model.  This doesn't solve the problem of responding
to index updates.  However, whatever interface the listener uses to trigger
an ETag change could just as well be done by an indexer.  Is there an
indexer listener interface (à la Model/Graph listeners)?  In this solution
the ETag gets input from any registered component.  I think that each
registered component should have a "name" and a "value".  The ETag
generator would retain the most recent value for each registered component
and generate a new ETag when a value changes.  So I see a class with 2
methods

void ETagGenerator.change( String name, String value )
and
String ETagGenerator.getTag(); // to retrieve the current tag.
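
As a rough illustration of the two-method class described above, here is a minimal Java sketch. It is only one possible reading of the proposal: the choice to derive the tag by hashing all current name/value pairs with SHA-256, and to truncate the digest to 16 hex characters, is an assumption made for the example and not something settled in this thread.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

// Sketch of the proposed generator; not an existing Jena or Fuseki class.
final class ETagGenerator {
    // Most recent value reported by each registered component, kept sorted
    // by name so the derived tag does not depend on registration order.
    private final Map<String, String> values = new TreeMap<>();
    private String tag;

    ETagGenerator() {
        this.tag = compute();
    }

    // A component (dataset listener, indexer, ...) reports its new value.
    // The tag is recomputed only when the value actually differs.
    synchronized void change(String name, String value) {
        Objects.requireNonNull(name, "name");
        Objects.requireNonNull(value, "value");
        String previous = values.put(name, value);
        if (!value.equals(previous)) {
            tag = compute();
        }
    }

    // Retrieve the current tag, already quoted for use in an ETag header.
    synchronized String getTag() {
        return tag;
    }

    // Assumed scheme: SHA-256 over all name/value pairs, first 8 bytes as hex.
    private String compute() {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (Map.Entry<String, String> e : values.entrySet()) {
                md.update(e.getKey().getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator between name and value
                md.update(e.getValue().getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // separator between pairs
            }
            byte[] digest = md.digest();
            StringBuilder sb = new StringBuilder("\"");
            for (int i = 0; i < 8; i++) {
                sb.append(String.format("%02x", digest[i] & 0xff));
            }
            return sb.append('"').toString();
        } catch (NoSuchAlgorithmException ex) {
            throw new IllegalStateException("SHA-256 not available", ex);
        }
    }
}
```

With this scheme, a component that reports the same value twice leaves the tag untouched, and a component whose value returns to an earlier state brings the earlier tag back. That is only appropriate if each value fully identifies the component's state, for example a monotonically increasing change counter.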

Claude

On Mon, Jun 29, 2015 at 2:50 PM, ajs6f@virginia.edu <aj...@virginia.edu>
wrote:

> On Jun 29, 2015, at 9:33 AM, Claude Warren <cl...@xenei.com> wrote:
> > If there were an ETag per dataset and a method on the dataset to force
> > an ETag reset would this address the index issue in that the indexer could
> > reset the ETag when it deemed appropriate?
>
> It might-- for that indexer. I would be concerned about setups in which
> another process acted against the data "out of sight" of Fuseki. But would
> the ETag be on ARQ's Dataset itself? If I understand what's going on here
> correctly (debatable at best), Dataset should not have any HTTP concerns
> mixed into it. ETag would be on something closer to Fuseki's DataService,
> which I do not think would normally be accessible to an indexer which is
> only aware of what's on disk… but this is all from my understanding of the
> architecture, which is pretty minimal. {grin} Maybe some kind of "last
> changed" timestamp could reasonably go on Dataset to support this kind of
> function?
>
> > In any case I would go with the first choice.
>
> It definitely seems like the most bang for the least buck.
>
> > Is there anything that prohibits sending both an ETag and a constant
> > Expires?  I haven't looked but I recall they are not mutually exclusive.
>
> Yes, I think you are correct. I suppose a bad ETag will never be known to
> be such as long as it is "inside" the range of a still-good Expires, but
> that is a question for the administrator configuring Fuseki, it seems to
> me. There is also Cache-Control, of course, in the same field of
> functionality.
>
> ---
> A. Soroka
> The University of Virginia Library
>
>

-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: Fuseki and ETags

Posted by "ajs6f@virginia.edu" <aj...@virginia.edu>.

On Jun 29, 2015, at 9:33 AM, Claude Warren <cl...@xenei.com> wrote:
> If there were an ETag per dataset and a method on the dataset to force an ETag reset would this address the index issue in that the indexer could reset the ETag when it deemed appropriate?

It might-- for that indexer. I would be concerned about setups in which another process acted against the data "out of sight" of Fuseki. But would the ETag be on ARQ's Dataset itself? If I understand what's going on here correctly (debatable at best), Dataset should not have any HTTP concerns mixed into it. ETag would be on something closer to Fuseki's DataService, which I do not think would normally be accessible to an indexer which is only aware of what's on disk… but this is all from my understanding of the architecture, which is pretty minimal. {grin} Maybe some kind of "last changed" timestamp could reasonably go on Dataset to support this kind of function?

> In any case I would go with the first choice.

It definitely seems like the most bang for the least buck.

> Is there anything that prohibits sending both an ETag and a constant Expires?  I haven't looked but I recall they are not mutually exclusive.

Yes, I think you are correct. I suppose a bad ETag will never be known to be such as long as it is "inside" the range of a still-good Expires, but that is a question for the administrator configuring Fuseki, it seems to me. There is also Cache-Control, of course, in the same field of functionality.
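
A small Java sketch may help show how the headers combine. CacheHeaders, build and cacheWillRevalidate are hypothetical names used only for this example; nothing here is Fuseki configuration or API. The sketch emits an ETag together with a constant freshness lifetime (as both Cache-Control: max-age and Expires), and spells out the rule a cache applies: it only goes back to the server, where the ETag can be checked, once the stored response is no longer fresh.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; not part of Fuseki.
final class CacheHeaders {
    // HTTP dates use a fixed English format and are always expressed in GMT.
    private static final DateTimeFormatter HTTP_DATE =
            DateTimeFormatter.ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.US)
                    .withZone(ZoneOffset.UTC);

    // Response headers carrying both a validator (ETag) and a constant
    // freshness lifetime chosen by the administrator.
    static Map<String, String> build(String etag, Instant now, long maxAgeSeconds) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("ETag", etag);
        headers.put("Cache-Control", "max-age=" + maxAgeSeconds);
        headers.put("Expires", HTTP_DATE.format(now.plusSeconds(maxAgeSeconds)));
        return headers;
    }

    // While a stored response is fresh (age below the lifetime) the cache
    // answers by itself; only afterwards does it revalidate with the ETag.
    static boolean cacheWillRevalidate(long ageSeconds, long maxAgeSeconds) {
        return ageSeconds >= maxAgeSeconds;
    }
}
```

So with a lifetime of 60 seconds, an update 10 seconds after a response was cached can go unnoticed by that cache for up to 50 more seconds, whatever the server's current ETag says. How long that window may be is the administrator's trade-off.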

---
A. Soroka
The University of Virginia Library

Re: Fuseki and ETags

Posted by Claude Warren <cl...@xenei.com>.

If there were an ETag per dataset and a method on the dataset to force an
ETag reset, would this address the index issue, in that the indexer could
reset the ETag when it deemed appropriate?

In any case I would go with the first choice.

Is there anything that prohibits sending both an ETag and a constant
Expires?  I haven't looked but I recall they are not mutually exclusive.

Claude

On Mon, Jun 29, 2015 at 2:04 PM, A. Soroka <aj...@virginia.edu> wrote:

> A quick discussion of ETags in the "backup admin" PR that was sent by Yang
> Yuanzhe led me to this issue:
>
> https://issues.apache.org/jira/browse/JENA-388
>
> for "Make Fuseki responses cacheable" and which has been around for a
> little while. I was wondering about a couple of potential approaches here
> and thought I would run them down:
>
> 1) ETag-per-Dataset: this is a single ETag value for any Dataset for all
> requests, updated whenever a mutating request completes. This would work by
> letting any change on a Dataset whatsoever that comes through Fuseki
> invalidate all ETag-based caching on that Dataset. This seems to be where
> Andy Seaborne and Rob Vesse were heading, but I obviously can't speak for
> them. Advantage: relatively simple. Disadvantages: changes in the indexes
> not performed by Fuseki will not be reflected properly, only useful for
> instances that receive the right patterns of changes (meaning for which
> mutations aren't too "evenly sprinkled" amongst queries, thus keeping the
> cache often invalidated).
>
> 2) Constant Expires: Rob Vesse discusses this a bit in the issue. It's an
> Expires header that is configurable to allow some admin adjustment, but is
> constant during runtime. Advantage: dead simple. Disadvantage: unless the
> usage scenario is very tightly controlled, there's going to be some leakage
> of stale data. That may or may not be a big problem for an integrator,
> depending on use case. It would have to be carefully documented, I think,
> to avoid nasty surprises.
>
> 3) Per-query ETag: This would mean some kind of map from request to
> ETag from which ETag headers are supplied for every request. The problem
> with this is that it implies some kind of reasonable algorithm for
> determining when an arbitrary update makes sufficient changes in an
> arbitrary graph to affect another arbitrary query, or it would imply
> stretching the meaning of "weak" ETag to a point that is probably not
> useful or correct for a query endpoint. This doesn't seem very practical.
>
> 4) Per-query-for-some-queries ETag. The idea here would be to cut down
> option 3 to a tranche of queries for which there actually _does_ exist some
> reasonable algorithm for detecting changes in the query-results. The
> example that comes to mind here would be simple DESCRIBE queries. Since it
> seems that ARQ deals with DESCRIBE using only relationships "outbound" from
> the things described, this approach could use an expiring map from URIs to
> ETags which could be updated (perhaps using a StatementListener) when a
> change directly affects a URI or a blank node in the CBD of that URI. This
> could be expensive, but it might be worth it for some use cases, for
> example where integrators are using software like Pubby to publish RDF.
> There might be other examples of query pattern where changes are
> practically calculable.
>
> Whether (and how far) any of these are worth pursuing depends a good bit
> on the use case in hand. For example, for my use cases, option 2 isn't
> really practical, because one of the applications taking results from
> Fuseki would be using them to present live-editing pages. Option 1 would
> work, and it would give some advantage. Option 4 isn't interesting because
> very few of the queries in play will be simple DESCRIBE queries. But that's
> all based on my use case.
>
> Do you think any of these are worth pursuing?
>
> ---
> A. Soroka
> The University of Virginia Library
>
>


-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: Fuskei and ETags

Posted by Osma Suominen <os...@helsinki.fi>.

29.06.2015, 16:04, A. Soroka wrote:

> 1) ETag-per-Dataset: this is a single ETag value for any Dataset for
> all requests, updated whenever a mutating request completes. This
> would work by letting any change on a Dataset whatsoever that comes
> through Fuseki invalidate all ETag-based caching on that Dataset.
> This seems to be where Andy Seaborne and Rob Vesse were heading, but
> I obviously can't speak for them. Advantage: relatively simple.
> Disadvantages: changes in the indexes not performed by Fuseki will
> not be reflected properly, only useful for instances that receive
> the right patterns of changes (meaning for which mutations aren't
> too "evenly sprinkled" amongst queries, thus keeping the cache often
> invalidated).

+1 for either this, or a Last-Modified header which works the same way
(a per-dataset timestamp that is updated on any change). ETag is more
opaque than a Last-Modified timestamp; a timestamp might allow some more
intelligent choices to be made by a cache, as it can see whether the
data is only seconds old, vs. hours or days.
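
As a rough illustration of the per-dataset idea above, here is a minimal
sketch in Python rather than actual Fuseki code. The class name
DatasetValidator and its methods are hypothetical; it assumes the server
can call record_update() whenever a mutating request on the dataset
completes, and it would not notice changes made outside the server.

```python
import threading
from email.utils import formatdate


class DatasetValidator:
    """Hypothetical per-dataset validator state; not part of Fuseki."""

    def __init__(self, now):
        self._lock = threading.Lock()
        self._version = 0      # bumped on every completed update
        self._modified = now   # seconds since the epoch

    def record_update(self, now):
        # Assumed to be called after each mutating request completes.
        with self._lock:
            self._version += 1
            self._modified = now

    def headers(self):
        # The same pair of validators is sent on every response for the dataset.
        with self._lock:
            return {
                "ETag": '"%d"' % self._version,
                "Last-Modified": formatdate(self._modified, usegmt=True),
            }
```

Both headers come from the same state, so a server could send either or
both. The ETag only says "changed or not", while the timestamp also tells
a cache how old the data is.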

We currently use a Varnish cache in front of Fuseki (see [1] for
details), which is configured for a long expiry time. We manually
invalidate the Varnish cache after any updates to Fuseki data. This
means that frequently occurring queries will be answered by Varnish
directly without going to Fuseki at all. This works and performs well,
but the downside is having to do the manual invalidation, which then
throws away the whole cache. We usually only update the data once per
day so this is OK for now.

But having Fuseki respond with ETag/Last-Modified would enable another
mode of operation, which might be suitable for more dynamic data. In
this model, Varnish (or another HTTP cache such as nginx) would keep the
data for a shorter period (a few minutes), during which it would serve
cached responses without consulting Fuseki. After this period, it would
still keep the data, but when a new query comes in, it would ask Fuseki
whether its cached data is still valid based on ETag or Last-Modified. 
If it's still valid, it could keep serving it for a few more minutes 
etc. This would still be much better than asking Fuseki every time, and 
also better than throwing away the data completely after a few minutes, 
which are currently the main options besides using a long expiry time 
and manual invalidation.

So I'd definitely consider using this if Fuseki gets the support.
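
The short-expiry-plus-revalidation mode described above can be sketched
as a toy simulation. This is not Varnish or Fuseki configuration: Origin
and RevalidatingCache are hypothetical stand-ins, the cache holds a
single query result, and the origin is assumed to answer a matching
If-None-Match with 304 without evaluating the query.

```python
class Origin:
    """Stand-in for a server exposing a per-dataset ETag (hypothetical)."""

    def __init__(self):
        self.version = 0
        self.full_responses = 0  # counts responses that needed query evaluation

    def get(self, if_none_match=None):
        etag = '"%d"' % self.version
        if if_none_match == etag:
            return 304, etag, None  # validator matches: no body sent
        self.full_responses += 1
        return 200, etag, "results@v%d" % self.version


class RevalidatingCache:
    """Serves from cache while fresh, then revalidates instead of discarding."""

    def __init__(self, origin, ttl):
        self.origin = origin
        self.ttl = ttl
        self.entry = None  # (etag, body, fresh_until)

    def get(self, now):
        if self.entry and now < self.entry[2]:
            return self.entry[1]  # still fresh: origin is not contacted
        etag = self.entry[0] if self.entry else None
        status, new_etag, body = self.origin.get(if_none_match=etag)
        if status == 304:
            body = self.entry[1]  # unchanged: keep the stored body
        self.entry = (new_etag, body, now + self.ttl)
        return body
```

With ttl=300, an update at the origin can go unnoticed for up to five
minutes, but an unchanged dataset costs only a cheap 304 exchange once
per period rather than a full query.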

> 2) Constant Expires: Rob Vesse discusses this a bit in the issue.
> It's an Expires header that is configurable to allow some admin
> adjustment, but is constant during runtime. Advantage: dead simple.
> Disadvantage: unless the usage scenario is very tightly controlled,
> there's going to be some leakage of stale data. That may or may not
> be a big problem for an integrator, depending on use case. It would
> have to be carefully documented, I think, to avoid nasty surprises.

This is compatible with 1). I'd give it a +1 too, although it's not as
important as 1). We currently set the long expiry time in Varnish
configuration, but it would be more elegant to be able to do this in
Fuseki as a per-dataset, constant-during-runtime option. Apache
mod_expires [2] does something similar and I've used that to set expiry
times for static content. It works very well so I'd recommend looking at
mod_expires documentation for inspiration.
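
A per-dataset, constant-during-runtime expiry could look roughly like
the sketch below. The EXPIRY_SECONDS table, the "/ds" dataset name and
the expiry_headers function are all made up for illustration and do not
correspond to any existing Fuseki setting. Cache-Control: max-age is
included alongside Expires because HTTP/1.1 caches give it precedence
when both are present.

```python
from email.utils import formatdate

# Hypothetical per-dataset lifetimes in seconds, read once at startup.
EXPIRY_SECONDS = {"/ds": 3600}


def expiry_headers(dataset, now):
    """Return caching headers for a response served at time `now` (epoch seconds)."""
    ttl = EXPIRY_SECONDS.get(dataset)
    if ttl is None:
        return {}  # nothing configured: emit no caching headers
    return {
        "Cache-Control": "max-age=%d" % ttl,
        "Expires": formatdate(now + ttl, usegmt=True),
    }
```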

-Osma

[1] https://github.com/NatLibFi/Skosmos/wiki/FusekiTuning#http-caching

[2] http://httpd.apache.org/docs/2.2/mod/mod_expires.html

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi