You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@cocoon.apache.org by Reinhard Pötz <re...@apache.org> on 2009/12/09 19:38:42 UTC

[c3] Conditional GET

Steven Dolg wrote:
> Reinhard Pötz schrieb:
>> Thomas Markus wrote:
>>  
>>> i missed the aop part
>>>
>>> it would be great to move all servlet independent parts outside of
>>> cocoon-servlet (CallStack ...), maybe to cocoon-pipeline
>>>
>>> currently i cant use cocoon-servlet as dependency (and dont want).
>>>
>>> org.springframework.beans.factory.BeanCreationException: Error creating
>>> bean with name
>>> 'org.apache.cocoon.blockdeployment.BlockContextURLStreamHandlerFactory'
>>> defined in class path resource
>>> [META-INF/cocoon/spring/cocoon-blockdeployment-protocol.xml]:
>>> Initialization of bean failed; nested exception is
>>> java.lang.ClassCastException:
>>> org.springframework.context.support.ClassPathXmlApplicationContext
>>> cannot be cast to org.springframework.web.context.WebApplicationContext
>>>
>>> i try to run a sitemap inside a junit test without any servlet specific
>>> parts. this works fine without coocon-servlet. after copying CallFrame,
>>> CallStack, MimeTypeCollector and spring config (ugly i know) mimetype
>>> is ok.
>>>     
>>
>> I refactored all *Collector aspects and merged them into
>> o.a.c.servlet.collector.ResponseHeaderCollector. This collector uses
>> data store that implements o.a.c.servlet.collector.CollectorDataStore
>> and provided two implementations: one uses a thread local the other the
>> CallStack of the ServletService framework.
>>   
> 
> That sounds good.
> 
>> This means that the ResponseHeaderCollector doesn't have a dependency on
>> the SSF or the Servlet API any longer.
>>   
> 
> That sounds even better.
> 
>> The last step would be moving it into cocoon-sitemap but I have to find
>> a way to ensure that there is no other class in cocoon-sitemap that
>> depends on it. Alternatively I could create an own module but it seems
>> odd to me doing this for a single class.
>>   
> 
> Sitemap should be the right place for this.
> Pipeline is definitely too low, iow. the infrastructure is missing to
> ensure that those things are actually happening.
> 
> 
> The purpose of the pipeline module is to allow direct programmatic
> creation and execution of Cocoon pipelines - wherever you want and no
> matter which other frameworks you might use.
> (Just like you would use commons-lang or commons-logging).
> 
> The sitemap is the first layer that actually creates some kind of
> predictable environment for the Cocoon pipelines and thus is the first
> layer where we can actually rely on assumptions like
> having AOP or access to certain information.

I'm still not sure where the ResponseHeaderCollector should go to but I
agree with you that cocoon-pipeline is too low.

But let me broaden the picture: Based on our work from about two weeks
ago, I created another aspect which implements the support for
conditional GET requests and also takes care that a pipeline isn't
executed unless it is really necessary. I was also able to fix all
failing test cases. I created an issue that contains a patch:
https://issues.apache.org/jira/browse/COCOON3-47

Additionally there is also another feature that I would like to add: The
current patch only takes care of 'If-Modified-Since' requests. I also
want to support 'If-None-Match' requests that are based on the 'ETag'
response header. (see http://en.wikipedia.org/wiki/HTTP_ETag).

Using ETag has the advantage that we could support conditional GET
requests also in the case where we can't use a timestamp based approach
 (e.g. when using o.a.c.pipeline.caching.ParameterCacheKey) or to
provide conditional GET support in REST controllers.

As an ETag value we could use the hash code of a pipeline's cache key.

Finally back to the initial question: Maybe it's easier to decide where
things should go to after the conditional GET support works as outlined
above.

WDYT?

-- 
Reinhard Pötz                           Managing Director, {Indoqa} GmbH
                         http://www.indoqa.com/en/people/reinhard.poetz/

Member of the Apache Software Foundation
Apache Cocoon Committer, PMC member                  reinhard@apache.org
________________________________________________________________________

Re: [c3] Conditional GET

Posted by Sylvain Wallez <sy...@apache.org>.
Reinhard Pötz wrote:
> When coming back to my original question about the implementation
> conditional get support based on cache keys, the important part of your
> comments is that you suggest using a _strong_ hash key. Is there any
> concrete implementation that you can recommend? (regarding speed and
> potential collisions and of course distributed under some 'friendly'
> license)
>   

I've been using 64-bit FNV hash [1] for a similar purpose (computing an 
ETag for a full response in dynamic pages, see [2]). It's fast and was 
designed to have a low collision rate even for small changes in the 
input data.

There is a Java ASL-licensed implementation available at 
http://www.getopt.org/

I've also noticed a recent newcomer on the "hash scene", MurmurHash, 
that seems to have even better speed and non-collision than FNV. There's 
also an implementation at getopt.org.

Sylvain

[1] http://en.wikipedia.org/wiki/Fowler_Noll_Vo_hash
[2] http://bluxte.net/musings/2008/04/30/speeding-mobile-web-applications

-- 
Sylvain Wallez - http://bluxte.net


Re: [c3] Conditional GET

Posted by Reinhard Pötz <re...@apache.org>.
Sylvain Wallez wrote:
> Steven Dolg wrote:
>> Sylvain Wallez schrieb:
>>> Reinhard Pötz wrote:

<snip/>

>> This additional layer is used to perform validity checks when
>> necessary and/or desired and not check the validity if not.
>> The intention here is to reuse the abstraction layer and not have this
>> kind of (critical) logic scattered in the individual cache provider
>> adaptors.
> 
> Agree. A cache storage should not have to do much more than get(key),
> put(key, value) and maybe delete(key) and clear().
> 
>> So it is possible to check if a CacheKey is pointing to the same
>> resource *and* if that cached data is still valid - even tho the
>> underlying cache provider has no means of performing the second check
>> (validity).
> 
> I totally agree. Now this doesn't solve the issue: to implement put(key,
> value) on an arbitrary non-java store, I wouldn't trust the CacheKey's
> hashcode() method to produce a uniform distribution that would avoid
> conflicts (same hashcode for different CacheKeys). The solution would be
> to to serialize the CacheKey and either use this as the store key if
> it's not too long, or use a strong hash (e.g. MD5 or FNV) of this
> serialized representation otherwise.
> 
> Mixing key and validity in a single CacheKey object means having several
> store keys (which is different from _cache_ keys in this case) for
> CacheKeys that are equal(), leading to the problems I outlined.

I leave it to Steven to comment on how C3 caching works in detail.

When coming back to my original question about the implementation
conditional get support based on cache keys, the important part of your
comments is that you suggest using a _strong_ hash key. Is there any
concrete implementation that you can recommend? (regarding speed and
potential collisions and of course distributed under some 'friendly'
license)

>>> And BTW, what is the "jmxGroupName" property on CacheKey used for?
>>
>> The jmxGroupName is used for making them accessible via JMX, no?
> 
> Well, I guessed this was somehow related to JMX ;-) Did my homework and
> found its use in cocoon-monitoring's CacheEntrysMonitorInitializer. Nice
> stuff, but I'm wondering if exposing the full key set of a big cache to
> JMX actually scales.

cocoon-monitoring is Dariusz' work, our 2009 GSoC student. I was
thinking about this issue too but decided to keep the code so that
others can improve it .It was my idea that it is probably easier to make
things better than to start them from scratch.

> And there are some cache implementations (again, memcached) that don't
> expose their key set. But since this is mostly used for monitoring
> AFAIU, returning an empty set in these cases should be acceptable.

yes, I think so too.

-- 
Reinhard Pötz                           Managing Director, {Indoqa} GmbH
                         http://www.indoqa.com/en/people/reinhard.poetz/

Member of the Apache Software Foundation
Apache Cocoon Committer, PMC member                  reinhard@apache.org
________________________________________________________________________

Re: [c3] Conditional GET

Posted by Sylvain Wallez <sy...@apache.org>.
Steven Dolg wrote:
> Sylvain Wallez schrieb:
>> Reinhard Pötz wrote:
>>
>> <snip/>
>>
>>> But let me broaden the picture: Based on our work from about two weeks
>>> ago, I created another aspect which implements the support for
>>> conditional GET requests and also takes care that a pipeline isn't
>>> executed unless it is really necessary. I was also able to fix all
>>> failing test cases. I created an issue that contains a patch:
>>> https://issues.apache.org/jira/browse/COCOON3-47
>>>
>>> Additionally there is also another feature that I would like to add: 
>>> The
>>> current patch only takes care of 'If-Modified-Since' requests. I also
>>> want to support 'If-None-Match' requests that are based on the 'ETag'
>>> response header. (see http://en.wikipedia.org/wiki/HTTP_ETag).
>>>
>>> Using ETag has the advantage that we could support conditional GET
>>> requests also in the case where we can't use a timestamp based approach
>>>  (e.g. when using o.a.c.pipeline.caching.ParameterCacheKey) or to
>>> provide conditional GET support in REST controllers.
>>>
>>> As an ETag value we could use the hash code of a pipeline's cache key.
>>>   
>>
>> I don't fully get the context of this conversation, but this last 
>> sentence triggered a question to me: how can we validate a cache 
>> entry with its _key_? Looking at the code, I see that CacheKey holds 
>> both the identifier information (the actual key) and the validity 
>> information.
>>
>> There is a naming issue here which leads to some confusion between 
>> key and key-and-validity that we can see it in the code: 
>> ExpiresCacheKey doesn't include the validity information in 
>> hashcode() and equals() whereas ParameterCacheKey does. What is the 
>> right contract?
>
> I'm not sure I understand what you mean.
> The implementations of hashCode() in ExpiresCacheKey and 
> ParameterCacheKey are as similar in both the code and actual behaviour 
> as they can be.
> Neither of them performs any operations necessary to check their 
> validity in the hashCode() or equals() methods.

Hmm... ok, so parameter values are part of the key. When reading the 
code, and because of this mixing of validity and key (and lack of docs) 
I thought the parameter keys were defining the identity and their values 
were defining the validity.

> Your confusion might arrise from the point that ParameterCacheKey 
> cannot become invalid because the same parameter value means always 
> the same parameter value, there is no way this can become invalid (as 
> opposed to a cache file contents which can become invalid when the 
> file is changed, even if it is still the same file)
> So the isValid() method basically performs the equals check, since 
> this is a required condition for being valid. (valid = equal & not 
> expired; since expired = false here: valid = equal).

Ok. So with the definition of parameter keys and values being part of 
the identity, ExpiresCacheKey and ParameterCacheKey effectively behave 
consistenly.

Now you'll understand that this becomes really confusing, and if there's 
not a very well defined contract for equals and hashcode (and even with 
that) there's a big opportunity for people to implement wrongly their 
CacheKey.

> The ExpiresCacheKey performs an additional check in its isValid() 
> method, namely checking the expiresTimestamp.
> This is not done in either the hashcode() or equals() method. So here 
> valid := equals & not expired.
>
> This principle holds true for each and every CacheKey currently 
> implemented (unless there is faulty implementation).
> And this is also the answer to your question:
> CacheKey contains information to check its validity, but this 
> information is not used for identifying (iow, equals() and hashCode() 
> methods) CacheKeys.
> Which means frequently invalidated CacheKeys will not fill the cache 
> but instead overwrite each other.

This is only true for cache implementations that rely on hashcode and 
equals(), i.e. that keep an index in a Map or a Set (ehcache does this 
for its DiskStore).

But if you use a non-java cache or persistent store, you have no other 
solution than serializing the key and its validity information. This is 
for example the case with memcache which requires the key to be a 
String. And this is where the problem arises if you don't want or can't 
keep an in-memory index, e.g. because of size or distribution.

>> As a side note, both classes include the class' hashcode in the 
>> instance's hash code, which means hash codes will be different a 
>> every JVM restart, or across JVM instances in a cluster, and is 
>> likely to break persistent and distributed caches.
>
> That is a good hint.
> We will want to look into that and amend things if necessary.

Well, I would call it a bug that needs fixing, because it's basically 
equivalent to clearing persistent caches at every JVM restart or class 
reloading.

>> That being said, I'm wondering if this aggregation of key and 
>> validity won't cause other kinds of problems with distributed cache 
>> implementations. For example, Java memcached clients serialize the 
>> cache key and use this result as the memcache key. If the key 
>> includes validity information, the memcache key will change every 
>> time the underlying data changes (e.g. a file's timestamp).
>>
>> At first sight, this can sound good as it means we will have a cache 
>> miss when the validity has changed, and will even avoid having to 
>> compare the validity of cached content. But this can have a 
>> desastrous impact on the cache efficiency in situations where you 
>> have some often requested content that changes frequently: the cache 
>> will quickly fill up with obsolete versions of this content under 
>> different key values, that will lead older content to be evicted, 
>> reducing the overall cache efficiency. Whereas a key that's only an 
>> indentifier will lead the entry to be _replaced_ and not a new one 
>> being added.
>>
>> So in the end, my feeling is that key and validity information really 
>> should be separated.
>>
>> Now going back to the ETag discussion, using the pipeline's cache key 
>> won't work IMHO because of the implementation of some key's 
>> hashcode() using only the identifier part of the key and not the 
>> validity. Confusion, I told you ;-)
>
> We (intend to) use a layer for integrating caches since we don't want 
> to compile directly against the API of one specific provider and then 
> have to stick with that provider till the end of time (Avalon, anyone?)

Avalon certainly had its problems, but never mandated a particular cache 
API beyond the one that we, Cocoon devs, defined in Excalibur 
(org.apache.excalibur.store.Store). But even the Excalibur Store wasn't 
a requirement, since it was just the store abstraction used by a 
particular implementation of org.apache.cocoon.caching.Cache.

> This additional layer is used to perform validity checks when 
> necessary and/or desired and not check the validity if not.
> The intention here is to reuse the abstraction layer and not have this 
> kind of (critical) logic scattered in the individual cache provider 
> adaptors.

Agree. A cache storage should not have to do much more than get(key), 
put(key, value) and maybe delete(key) and clear().

> So it is possible to check if a CacheKey is pointing to the same 
> resource *and* if that cached data is still valid - even tho the 
> underlying cache provider has no means of performing the second check 
> (validity).

I totally agree. Now this doesn't solve the issue: to implement put(key, 
value) on an arbitrary non-java store, I wouldn't trust the CacheKey's 
hashcode() method to produce a uniform distribution that would avoid 
conflicts (same hashcode for different CacheKeys). The solution would be 
to to serialize the CacheKey and either use this as the store key if 
it's not too long, or use a strong hash (e.g. MD5 or FNV) of this 
serialized representation otherwise.

Mixing key and validity in a single CacheKey object means having several 
store keys (which is different from _cache_ keys in this case) for 
CacheKeys that are equal(), leading to the problems I outlined.

>> And BTW, what is the "jmxGroupName" property on CacheKey used for?
>
> The jmxGroupName is used for making them accessible via JMX, no?

Well, I guessed this was somehow related to JMX ;-) Did my homework and 
found its use in cocoon-monitoring's CacheEntrysMonitorInitializer. Nice 
stuff, but I'm wondering if exposing the full key set of a big cache to 
JMX actually scales.

And there are some cache implementations (again, memcached) that don't 
expose their key set. But since this is mostly used for monitoring 
AFAIU, returning an empty set in these cases should be acceptable.

Sylvain

-- 
Sylvain Wallez - http://bluxte.net


Re: [c3] Conditional GET

Posted by Steven Dolg <st...@indoqa.com>.
Sylvain Wallez schrieb:
> Reinhard Pötz wrote:
>
> <snip/>
>
>> But let me broaden the picture: Based on our work from about two weeks
>> ago, I created another aspect which implements the support for
>> conditional GET requests and also takes care that a pipeline isn't
>> executed unless it is really necessary. I was also able to fix all
>> failing test cases. I created an issue that contains a patch:
>> https://issues.apache.org/jira/browse/COCOON3-47
>>
>> Additionally there is also another feature that I would like to add: The
>> current patch only takes care of 'If-Modified-Since' requests. I also
>> want to support 'If-None-Match' requests that are based on the 'ETag'
>> response header. (see http://en.wikipedia.org/wiki/HTTP_ETag).
>>
>> Using ETag has the advantage that we could support conditional GET
>> requests also in the case where we can't use a timestamp based approach
>>  (e.g. when using o.a.c.pipeline.caching.ParameterCacheKey) or to
>> provide conditional GET support in REST controllers.
>>
>> As an ETag value we could use the hash code of a pipeline's cache key.
>>   
>
> I don't fully get the context of this conversation, but this last 
> sentence triggered a question to me: how can we validate a cache entry 
> with its _key_? Looking at the code, I see that CacheKey holds both 
> the identifier information (the actual key) and the validity information.
>
> There is a naming issue here which leads to some confusion between key 
> and key-and-validity that we can see it in the code: ExpiresCacheKey 
> doesn't include the validity information in hashcode() and equals() 
> whereas ParameterCacheKey does. What is the right contract?

I'm not sure I understand what you mean.
The implementations of hashCode() in ExpiresCacheKey and 
ParameterCacheKey are as similar in both the code and actual behaviour 
as they can be.
Neither of them performs any operations necessary to check their 
validity in the hashCode() or equals() methods.

Your confusion might arrise from the point that ParameterCacheKey cannot 
become invalid because the same parameter value means always the same 
parameter value, there is no way this can become invalid (as opposed to 
a cache file contents which can become invalid when the file is changed, 
even if it is still the same file)
So the isValid() method basically performs the equals check, since this 
is a required condition for being valid. (valid = equal & not expired; 
since expired = false here: valid = equal).

The ExpiresCacheKey performs an additional check in its isValid() 
method, namely checking the expiresTimestamp.
This is not done in either the hashcode() or equals() method. So here 
valid := equals & not expired.

This principle holds true for each and every CacheKey currently 
implemented (unless there is faulty implementation).
And this is also the answer to your question:
CacheKey contains information to check its validity, but this 
information is not used for identifying (iow, equals() and hashCode() 
methods) CacheKeys.
Which means frequently invalidated CacheKeys will not fill the cache but 
instead overwrite each other.


>
> As a side note, both classes include the class' hashcode in the 
> instance's hash code, which means hash codes will be different a every 
> JVM restart, or across JVM instances in a cluster, and is likely to 
> break persistent and distributed caches.

That is a good hint.
We will want to look into that and amend things if necessary.
Thanks

>
> That being said, I'm wondering if this aggregation of key and validity 
> won't cause other kinds of problems with distributed cache 
> implementations. For example, Java memcached clients serialize the 
> cache key and use this result as the memcache key. If the key includes 
> validity information, the memcache key will change every time the 
> underlying data changes (e.g. a file's timestamp).
>
> At first sight, this can sound good as it means we will have a cache 
> miss when the validity has changed, and will even avoid having to 
> compare the validity of cached content. But this can have a desastrous 
> impact on the cache efficiency in situations where you have some often 
> requested content that changes frequently: the cache will quickly fill 
> up with obsolete versions of this content under different key values, 
> that will lead older content to be evicted, reducing the overall cache 
> efficiency. Whereas a key that's only an indentifier will lead the 
> entry to be _replaced_ and not a new one being added.
>
> So in the end, my feeling is that key and validity information really 
> should be separated.
>
> Now going back to the ETag discussion, using the pipeline's cache key 
> won't work IMHO because of the implementation of some key's hashcode() 
> using only the identifier part of the key and not the validity. 
> Confusion, I told you ;-)

We (intend to) use a layer for integrating caches since we don't want to 
compile directly against the API of one specific provider and then have 
to stick with that provider till the end of time (Avalon, anyone?)

This additional layer is used to perform validity checks when necessary 
and/or desired and not check the validity if not.
The intention here is to reuse the abstraction layer and not have this 
kind of (critical) logic scattered in the individual cache provider 
adaptors.

So it is possible to check if a CacheKey is pointing to the same 
resource *and* if that cached data is still valid - even tho the 
underlying cache provider has no means of performing the second check 
(validity).

For details you might want to look at 
org.apache.cocoon.pipeline.caching.AbstractCache

>
> And BTW, what is the "jmxGroupName" property on CacheKey used for?

The jmxGroupName is used for making them accessible via JMX, no?

>
> Sylvain
>


Steven

Re: [c3] Conditional GET

Posted by Sylvain Wallez <sy...@apache.org>.
Reinhard Pötz wrote:

<snip/>

> But let me broaden the picture: Based on our work from about two weeks
> ago, I created another aspect which implements the support for
> conditional GET requests and also takes care that a pipeline isn't
> executed unless it is really necessary. I was also able to fix all
> failing test cases. I created an issue that contains a patch:
> https://issues.apache.org/jira/browse/COCOON3-47
>
> Additionally there is also another feature that I would like to add: The
> current patch only takes care of 'If-Modified-Since' requests. I also
> want to support 'If-None-Match' requests that are based on the 'ETag'
> response header. (see http://en.wikipedia.org/wiki/HTTP_ETag).
>
> Using ETag has the advantage that we could support conditional GET
> requests also in the case where we can't use a timestamp based approach
>  (e.g. when using o.a.c.pipeline.caching.ParameterCacheKey) or to
> provide conditional GET support in REST controllers.
>
> As an ETag value we could use the hash code of a pipeline's cache key.
>   

I don't fully get the context of this conversation, but this last 
sentence triggered a question to me: how can we validate a cache entry 
with its _key_? Looking at the code, I see that CacheKey holds both the 
identifier information (the actual key) and the validity information.

There is a naming issue here which leads to some confusion between key 
and key-and-validity that we can see it in the code: ExpiresCacheKey 
doesn't include the validity information in hashcode() and equals() 
whereas ParameterCacheKey does. What is the right contract?

As a side note, both classes include the class' hashcode in the 
instance's hash code, which means hash codes will be different a every 
JVM restart, or across JVM instances in a cluster, and is likely to 
break persistent and distributed caches.

That being said, I'm wondering if this aggregation of key and validity 
won't cause other kinds of problems with distributed cache 
implementations. For example, Java memcached clients serialize the cache 
key and use this result as the memcache key. If the key includes 
validity information, the memcache key will change every time the 
underlying data changes (e.g. a file's timestamp).

At first sight, this can sound good as it means we will have a cache 
miss when the validity has changed, and will even avoid having to 
compare the validity of cached content. But this can have a desastrous 
impact on the cache efficiency in situations where you have some often 
requested content that changes frequently: the cache will quickly fill 
up with obsolete versions of this content under different key values, 
that will lead older content to be evicted, reducing the overall cache 
efficiency. Whereas a key that's only an indentifier will lead the entry 
to be _replaced_ and not a new one being added.

So in the end, my feeling is that key and validity information really 
should be separated.

Now going back to the ETag discussion, using the pipeline's cache key 
won't work IMHO because of the implementation of some key's hashcode() 
using only the identifier part of the key and not the validity. 
Confusion, I told you ;-)

And BTW, what is the "jmxGroupName" property on CacheKey used for?

Sylvain

-- 
Sylvain Wallez - http://bluxte.net