You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@shindig.apache.org by John Hjelmstad <fa...@google.com> on 2008/09/09 21:11:19 UTC

Where to cache rewritten content

As discussed on a few threads and tracked in JIRA issue (
http://issues.apache.org/jira/browse/SHINDIG-579), we need to move rewriting
logic out of AbstractHttpCache. Yet we should maintain rewritten content
caching capability. The question is where to put it.
I see two options, at a high level:
1. In code that calls ContentRewriterRegistry.rewrite(HttpResponse|Gadget).
Eg. MakeRequestHandler, ProxyHandler, ViewContentFetcher, GadgetServer, and
the near-future Renderer and Preloader. This allows finer-grained control
over caching behavior in context, at the cost of distributing caching logic
in various places.
2. In ContentRewriterRegistry.rewrite(HttpResponse|Gadget) itself, if so
chosen. Caching logic can be consolidated in CachingContentRewriterRegistry,
for instance (which will no longer subclass CachingWebRetrievalFactory), and
be considered an optimization to rewriting.

I'm inclined to go #2. Rewriters themselves can be augmented with caching
hints if necessary, and be assumed deterministic for a given cache key in
the meantime. Consolidating rewriting logic makes it easier to share the
cache itself.

Still, I might be missing situations in which additional context inherent to
the calling context is needed to make a caching decision.

--John

Re: Where to cache rewritten content

Posted by Kevin Brown <et...@google.com>.

This isn't a problem if we incorporate Brian's signing fetch patch and make
it so that the HttpCacheKey can generate the key without any other objects.

On Tue, Sep 9, 2008 at 3:18 PM, John Hjelmstad <fa...@google.com> wrote:

> This might be a problem -- the rewriter registry has no knowledge of its
> calling context, and I'm loath to add specific caching APIs/parameters to
> its methods unless absolutely necessary. Can you say more about why this is
> required? In what way can't the rewriting behavior in this case key on
> <rewriters>+<input string hash>?
> John
>
> On Tue, Sep 9, 2008 at 3:10 PM, Brian Eaton <be...@google.com> wrote:
>
> > On Tue, Sep 9, 2008 at 12:11 PM, John Hjelmstad <fa...@google.com>
> wrote:
> > > Still, I might be missing situations in which additional context
> inherent
> > to
> > > the calling context is needed to make a caching decision.
> >
> > OAuthFetcher needs to have input in to the generation of the cache key
> > for any authenticated requests.  There is a function getCacheKey in
> > OAuthFetcher that is currently private, I think you'll want to make it
> > public.
> >
>

Re: Where to cache rewritten content

Posted by Brian Eaton <be...@google.com>.

On Tue, Sep 9, 2008 at 4:59 PM, Louis Ryan <lr...@google.com> wrote:
> I would think MD5/SHA1 would be perfectly fine.
>
> Brian are we worried about someone generating enough random variations of
> content to force collisions and get someone elses content from the cache? A
> brute force attack would require generating so many requests to the server &
> cache that it seems unfeasible. This is a closed system so the hashes
> themselves are never exposed publicly. Am I missing something? It seems like
> we would only care about the functional requirement of a hash that has a
> very low probability of collision and a collision detection mechanism.

Yeah, you're right, md5 or sha1 would be fine.  I was thinking about
this in terms of an attacker who already knew the hash they were
looking for, but that doesn't seem likely.

Re: Where to cache rewritten content

Posted by Louis Ryan <lr...@google.com>.

I would think MD5/SHA1 would be perfectly fine.

Brian are we worried about someone generating enough random variations of
content to force collisions and get someone elses content from the cache? A
brute force attack would require generating so many requests to the server &
cache that it seems unfeasible. This is a closed system so the hashes
themselves are never exposed publicly. Am I missing something? It seems like
we would only care about the functional requirement of a hash that has a
very low probability of collision and a collision detection mechanism.

On Tue, Sep 9, 2008 at 4:38 PM, Brian Eaton <be...@google.com> wrote:

> On Tue, Sep 9, 2008 at 4:25 PM, John Hjelmstad <fa...@google.com> wrote:
> > I briefly considered String hashCode, but quickly recognized that was a
> bad
> > idea. MD5 of contents sounds reasonable. Brian, thoughts?
>
> I suspect using the entire input body contents is out of the question,
> though that was my initial thought.
>
> Don't use MD5.  Nobody knows how to attack it for this kind of
> application, yet, but a lot of progress has been made.  SHA1 is
> probably OK.  SHA-256 would be great, HMAC-SHA1 would be great, except
> then you have to worry about keying, which is a pain.  This cache is
> potentially shared across multiple servers, right?
>
> If it's a single server cache, HMAC-SHA1 with a random key.
>
> The cache key generated by the HTTP content fetchers might be useful
> for this as well, assuming you can get ahold of it somehow.
>

Re: Where to cache rewritten content

Posted by Ben Laurie <be...@google.com>.

\

On Wed, Sep 10, 2008 at 12:38 AM, Brian Eaton <be...@google.com> wrote:
> On Tue, Sep 9, 2008 at 4:25 PM, John Hjelmstad <fa...@google.com> wrote:
>> I briefly considered String hashCode, but quickly recognized that was a bad
>> idea. MD5 of contents sounds reasonable. Brian, thoughts?
>
> I suspect using the entire input body contents is out of the question,
> though that was my initial thought.
>
> Don't use MD5.  Nobody knows how to attack it for this kind of
> application, yet, but a lot of progress has been made.  SHA1 is
> probably OK.  SHA-256 would be great, HMAC-SHA1 would be great, except
> then you have to worry about keying, which is a pain.  This cache is
> potentially shared across multiple servers, right?

Keyczar could help with this...

>
> If it's a single server cache, HMAC-SHA1 with a random key.
>
> The cache key generated by the HTTP content fetchers might be useful
> for this as well, assuming you can get ahold of it somehow.
>

Re: Where to cache rewritten content

Posted by John Hjelmstad <fa...@google.com>.

Using full body contents is probably unavoidable. This may be - depending on
implementation - a distributed cache, so we should code for that.
I'm inclined to agree with Louis that MD5/SHA1 seems fine... HashUtils (in
common) provides the former, while DigestUtils in apache.commons.codec
proffers sha (and another md5 impl, it seems).

--John

On Tue, Sep 9, 2008 at 4:45 PM, Kevin Brown <et...@google.com> wrote:

> On Tue, Sep 9, 2008 at 4:38 PM, Brian Eaton <be...@google.com> wrote:
>
> > On Tue, Sep 9, 2008 at 4:25 PM, John Hjelmstad <fa...@google.com> wrote:
> > > I briefly considered String hashCode, but quickly recognized that was a
> > bad
> > > idea. MD5 of contents sounds reasonable. Brian, thoughts?
> >
> > I suspect using the entire input body contents is out of the question,
> > though that was my initial thought.
> >
> > Don't use MD5.  Nobody knows how to attack it for this kind of
> > application, yet, but a lot of progress has been made.  SHA1 is
> > probably OK.  SHA-256 would be great, HMAC-SHA1 would be great, except
> > then you have to worry about keying, which is a pain.  This cache is
> > potentially shared across multiple servers, right?
> >
> > If it's a single server cache, HMAC-SHA1 with a random key.
> >
> > The cache key generated by the HTTP content fetchers might be useful
> > for this as well, assuming you can get ahold of it somehow.
>
>
> There's a utility checked in to produce a base32 encoded SHA1 checked into
> common that can be used for this.
>

Re: Where to cache rewritten content

Posted by Kevin Brown <et...@google.com>.

On Tue, Sep 9, 2008 at 4:38 PM, Brian Eaton <be...@google.com> wrote:

> On Tue, Sep 9, 2008 at 4:25 PM, John Hjelmstad <fa...@google.com> wrote:
> > I briefly considered String hashCode, but quickly recognized that was a
> bad
> > idea. MD5 of contents sounds reasonable. Brian, thoughts?
>
> I suspect using the entire input body contents is out of the question,
> though that was my initial thought.
>
> Don't use MD5.  Nobody knows how to attack it for this kind of
> application, yet, but a lot of progress has been made.  SHA1 is
> probably OK.  SHA-256 would be great, HMAC-SHA1 would be great, except
> then you have to worry about keying, which is a pain.  This cache is
> potentially shared across multiple servers, right?
>
> If it's a single server cache, HMAC-SHA1 with a random key.
>
> The cache key generated by the HTTP content fetchers might be useful
> for this as well, assuming you can get ahold of it somehow.


There's a utility checked in to produce a base32 encoded SHA1 checked into
common that can be used for this.

Re: Where to cache rewritten content

Posted by Brian Eaton <be...@google.com>.

On Tue, Sep 9, 2008 at 4:25 PM, John Hjelmstad <fa...@google.com> wrote:
> I briefly considered String hashCode, but quickly recognized that was a bad
> idea. MD5 of contents sounds reasonable. Brian, thoughts?

I suspect using the entire input body contents is out of the question,
though that was my initial thought.

Don't use MD5.  Nobody knows how to attack it for this kind of
application, yet, but a lot of progress has been made.  SHA1 is
probably OK.  SHA-256 would be great, HMAC-SHA1 would be great, except
then you have to worry about keying, which is a pain.  This cache is
potentially shared across multiple servers, right?

If it's a single server cache, HMAC-SHA1 with a random key.

The cache key generated by the HTTP content fetchers might be useful
for this as well, assuming you can get ahold of it somehow.

Re: Where to cache rewritten content

Posted by John Hjelmstad <fa...@google.com>.

I briefly considered String hashCode, but quickly recognized that was a bad
idea. MD5 of contents sounds reasonable. Brian, thoughts?

On Tue, Sep 9, 2008 at 4:23 PM, Kevin Brown <et...@google.com> wrote:

> On Tue, Sep 9, 2008 at 4:07 PM, John Hjelmstad <fa...@google.com> wrote:
>
> > In general, it _can_ be based on any metadata in the Gadget or
> HttpResponse
> > and HttpRequest objects.
> > Since ultimately rewriting is just running a succession of
> ContentRewriters
> > over some object(s), with String input and String output, at present I'm
> > thinking the key is:
> >
> >
> <list-of-contributing-rewriters-with-version-numbers>+<input-string-hashcode>
> >
> > The cache itself is String -> String. The main limitation with this is
> that
> > it assumes that rewriting only occurs on "payload" content that can be
> > expressed in String form. If other modifications to Gadget or
> HttpResponse
> > are required (ie. to HttpResponse headers) in the future, those will need
> > to
> > participate in the key.
>
>
> Is that input stream hashcode or a string fingerprint? String.hashCode is
> only 32 bits long, which would make for many collisions. An md5 of the body
> contents will avoid most of that.
>
> >
> >
> > --John
> >
> > On Tue, Sep 9, 2008 at 3:58 PM, Brian Eaton <be...@google.com> wrote:
> >
> > > On Tue, Sep 9, 2008 at 3:18 PM, John Hjelmstad <fa...@google.com>
> wrote:
> > > > This might be a problem -- the rewriter registry has no knowledge of
> > its
> > > > calling context, and I'm loath to add specific caching
> APIs/parameters
> > to
> > > > its methods unless absolutely necessary. Can you say more about why
> > this
> > > is
> > > > required? In what way can't the rewriting behavior in this case key
> on
> > > > <rewriters>+<input string hash>?
> > >
> > > Hmm.  Maybe it isn't necessary.  Can you explain the caching algorithm
> > > for the rewriters?  I thought it must be based on URL, but if it
> > > isn't, maybe there's no problem.
> > >
> >
>

Re: Where to cache rewritten content

Posted by Kevin Brown <et...@google.com>.

On Tue, Sep 9, 2008 at 4:07 PM, John Hjelmstad <fa...@google.com> wrote:

> In general, it _can_ be based on any metadata in the Gadget or HttpResponse
> and HttpRequest objects.
> Since ultimately rewriting is just running a succession of ContentRewriters
> over some object(s), with String input and String output, at present I'm
> thinking the key is:
>
> <list-of-contributing-rewriters-with-version-numbers>+<input-string-hashcode>
>
> The cache itself is String -> String. The main limitation with this is that
> it assumes that rewriting only occurs on "payload" content that can be
> expressed in String form. If other modifications to Gadget or HttpResponse
> are required (ie. to HttpResponse headers) in the future, those will need
> to
> participate in the key.


Is that input stream hashcode or a string fingerprint? String.hashCode is
only 32 bits long, which would make for many collisions. An md5 of the body
contents will avoid most of that.

>
>
> --John
>
> On Tue, Sep 9, 2008 at 3:58 PM, Brian Eaton <be...@google.com> wrote:
>
> > On Tue, Sep 9, 2008 at 3:18 PM, John Hjelmstad <fa...@google.com> wrote:
> > > This might be a problem -- the rewriter registry has no knowledge of
> its
> > > calling context, and I'm loath to add specific caching APIs/parameters
> to
> > > its methods unless absolutely necessary. Can you say more about why
> this
> > is
> > > required? In what way can't the rewriting behavior in this case key on
> > > <rewriters>+<input string hash>?
> >
> > Hmm.  Maybe it isn't necessary.  Can you explain the caching algorithm
> > for the rewriters?  I thought it must be based on URL, but if it
> > isn't, maybe there's no problem.
> >
>

Re: Where to cache rewritten content

Posted by John Hjelmstad <fa...@google.com>.

Well technically it's not, as I haven't implemented it yet. I was going to
ask if you have a suggestion on this.

On Tue, Sep 9, 2008 at 4:14 PM, Brian Eaton <be...@google.com> wrote:

> On Tue, Sep 9, 2008 at 4:07 PM, John Hjelmstad <fa...@google.com> wrote:
> > In general, it _can_ be based on any metadata in the Gadget or
> HttpResponse
> > and HttpRequest objects.
> > Since ultimately rewriting is just running a succession of
> ContentRewriters
> > over some object(s), with String input and String output, at present I'm
> > thinking the key is:
> >
> <list-of-contributing-rewriters-with-version-numbers>+<input-string-hashcode>
>
> How is input-string-hashcode generated?
>

Re: Where to cache rewritten content

Posted by Brian Eaton <be...@google.com>.

On Tue, Sep 9, 2008 at 4:07 PM, John Hjelmstad <fa...@google.com> wrote:
> In general, it _can_ be based on any metadata in the Gadget or HttpResponse
> and HttpRequest objects.
> Since ultimately rewriting is just running a succession of ContentRewriters
> over some object(s), with String input and String output, at present I'm
> thinking the key is:
> <list-of-contributing-rewriters-with-version-numbers>+<input-string-hashcode>

How is input-string-hashcode generated?

Re: Where to cache rewritten content

Posted by John Hjelmstad <fa...@google.com>.

In general, it _can_ be based on any metadata in the Gadget or HttpResponse
and HttpRequest objects.
Since ultimately rewriting is just running a succession of ContentRewriters
over some object(s), with String input and String output, at present I'm
thinking the key is:
<list-of-contributing-rewriters-with-version-numbers>+<input-string-hashcode>

The cache itself is String -> String. The main limitation with this is that
it assumes that rewriting only occurs on "payload" content that can be
expressed in String form. If other modifications to Gadget or HttpResponse
are required (ie. to HttpResponse headers) in the future, those will need to
participate in the key.

--John

On Tue, Sep 9, 2008 at 3:58 PM, Brian Eaton <be...@google.com> wrote:

> On Tue, Sep 9, 2008 at 3:18 PM, John Hjelmstad <fa...@google.com> wrote:
> > This might be a problem -- the rewriter registry has no knowledge of its
> > calling context, and I'm loath to add specific caching APIs/parameters to
> > its methods unless absolutely necessary. Can you say more about why this
> is
> > required? In what way can't the rewriting behavior in this case key on
> > <rewriters>+<input string hash>?
>
> Hmm.  Maybe it isn't necessary.  Can you explain the caching algorithm
> for the rewriters?  I thought it must be based on URL, but if it
> isn't, maybe there's no problem.
>

Re: Where to cache rewritten content

Posted by Brian Eaton <be...@google.com>.

On Tue, Sep 9, 2008 at 3:18 PM, John Hjelmstad <fa...@google.com> wrote:
> This might be a problem -- the rewriter registry has no knowledge of its
> calling context, and I'm loath to add specific caching APIs/parameters to
> its methods unless absolutely necessary. Can you say more about why this is
> required? In what way can't the rewriting behavior in this case key on
> <rewriters>+<input string hash>?

Hmm.  Maybe it isn't necessary.  Can you explain the caching algorithm
for the rewriters?  I thought it must be based on URL, but if it
isn't, maybe there's no problem.

Re: Where to cache rewritten content

Posted by John Hjelmstad <fa...@google.com>.

This might be a problem -- the rewriter registry has no knowledge of its
calling context, and I'm loath to add specific caching APIs/parameters to
its methods unless absolutely necessary. Can you say more about why this is
required? In what way can't the rewriting behavior in this case key on
<rewriters>+<input string hash>?
John

On Tue, Sep 9, 2008 at 3:10 PM, Brian Eaton <be...@google.com> wrote:

> On Tue, Sep 9, 2008 at 12:11 PM, John Hjelmstad <fa...@google.com> wrote:
> > Still, I might be missing situations in which additional context inherent
> to
> > the calling context is needed to make a caching decision.
>
> OAuthFetcher needs to have input in to the generation of the cache key
> for any authenticated requests.  There is a function getCacheKey in
> OAuthFetcher that is currently private, I think you'll want to make it
> public.
>

Re: Where to cache rewritten content

Posted by Brian Eaton <be...@google.com>.

On Tue, Sep 9, 2008 at 12:11 PM, John Hjelmstad <fa...@google.com> wrote:
> Still, I might be missing situations in which additional context inherent to
> the calling context is needed to make a caching decision.

OAuthFetcher needs to have input in to the generation of the cache key
for any authenticated requests.  There is a function getCacheKey in
OAuthFetcher that is currently private, I think you'll want to make it
public.

Re: Where to cache rewritten content

Posted by John Hjelmstad <fa...@google.com>.

Got it, thanks for the clarification. Sounds like a good optimization to
me... I'll impl it in a follow-up to just getting the base infrastructure
in.
John

On Wed, Sep 10, 2008 at 1:19 PM, Louis Ryan <lr...@google.com> wrote:

> The overwhelming majority of the lookups will be for the rewritten versions
> of cached Http fetches. Those are still cached as HTTP to maintain their
> cache control semantics. If you stuff the MD5 in the originally cached HTTP
> response you can use it to perform the lookup of the rewritten content.
>
> On Wed, Sep 10, 2008 at 1:14 PM, John Hjelmstad <fa...@google.com> wrote:
>
> > Perhaps I'm missing something, but note that in order to share a cache
> the
> > data I'm storing are Strings, not HttpResponses (or Gadgets).
> > As to hash method, MD5 it is.
> >
> > On Wed, Sep 10, 2008 at 11:18 AM, Kevin Brown <et...@google.com> wrote:
> >
> > > On Wed, Sep 10, 2008 at 10:57 AM, Louis Ryan <lr...@google.com> wrote:
> > >
> > > > A little reading would indicate the MD5 is about 40% faster to
> compute
> > > than
> > > > SHA1 so I would suggest we use it. It would also make sense to stuff
> > the
> > > > MD5
> > > > of the cached content into an HTTP header on the cached version of
> the
> > > > original content so we can avoid re-computing it.
> > >
> > >
> > > Since HTTP response is immutable, we can do this at construction time
> in
> > > pretty much the same way that it's done for GadgetSpec (i.e. adding a
> > > getChecksum method).
> > >
> > >
> > > >
> > > >
> > > > On Tue, Sep 9, 2008 at 6:28 PM, Louis Ryan <lr...@google.com> wrote:
> > > >
> > > > > I would be a very wary of relying on VM implementation specifics
> like
> > > > > XXXRweriter.class.hashCode(). You certainly don't want the same
> class
> > > > > generating different coes on different instances in a cluster where
> > the
> > > > > cache is shared. Also dont use serialVersionUID as that is only
> > > intended
> > > > for
> > > > > serialization compatability. Reading the class as an InputStream
> from
> > > the
> > > > > ClassLoader and then computing a hash might work but seems more
> > trouble
> > > > than
> > > > > its worth.
> > > > >
> > > > > In general I think a a manually maintained version no. might
> actually
> > > be
> > > > > your best bet.
> > > > >
> > > > >
> > > > > On Tue, Sep 9, 2008 at 5:29 PM, John Hjelmstad <fa...@google.com>
> > > wrote:
> > > > >
> > > > >> Back to this sub-thread...
> > > > >> Adding a getVersion() parameter seems error prone and a little
> > > > burdensome.
> > > > >> In theory the version needs to change when any modification in
> > > potential
> > > > >> output for a given input is made, ie. not just optimization.
> That's
> > > not
> > > > >> always the easiest thing to analyze and get right. Meanwhile,
> > > > invalidating
> > > > >> a
> > > > >> bunch of cache keys (by registering a new rewriter impl) isn't
> > > > especially
> > > > >> costly: if the cache is useful at all, it will be hit often, ie.
> > > > refreshed
> > > > >> quickly.
> > > > >>
> > > > >> --John
> > > > >>
> > > > >> On Tue, Sep 9, 2008 at 1:50 PM, Louis Ryan <lr...@google.com>
> > wrote:
> > > > >>
> > > > >> > The versioning should be more explicit than that I think. Maybe
> > add
> > > a
> > > > >> > getVersion function to the rewriter interface so they can manage
> > > their
> > > > >> own
> > > > >> > changes
> > > > >> >
> > > > >> > On Tue, Sep 9, 2008 at 12:39 PM, John Hjelmstad <
> fargo@google.com
> > >
> > > > >> wrote:
> > > > >> >
> > > > >> > > Interesting suggestion. I can include the rewriter class names
> > and
> > > > >> their
> > > > >> > > class hash codes or some other such versioning construct.
> > > > >> > >
> > > > >> > > On Tue, Sep 9, 2008 at 12:32 PM, Ben Laurie <be...@google.com>
> > > > wrote:
> > > > >> > >
> > > > >> > > > On Tue, Sep 9, 2008 at 8:17 PM, John Hjelmstad <
> > > fargo@google.com>
> > > > >> > wrote:
> > > > >> > > > > Excellent, agreed. CLs forthcoming.
> > > > >> > > >
> > > > >> > > > Can you include the version numbers of the rewriters in the
> > > cache
> > > > >> key?
> > > > >> > > >
> > > > >> > > > >
> > > > >> > > > > On Tue, Sep 9, 2008 at 12:14 PM, Kevin Brown <
> > etnu@google.com
> > > >
> > > > >> > wrote:
> > > > >> > > > >
> > > > >> > > > >> #2 is the only really viable option. If we have to put
> > > caching
> > > > >> logic
> > > > >> > > in
> > > > >> > > > 10
> > > > >> > > > >> different places we'll screw it up 9 different times :).
> > > > >> > > > >>
> > > > >> > > > >> On Tue, Sep 9, 2008 at 12:11 PM, John Hjelmstad <
> > > > >> fargo@google.com>
> > > > >> > > > wrote:
> > > > >> > > > >>
> > > > >> > > > >> > As discussed on a few threads and tracked in JIRA issue
> (
> > > > >> > > > >> > http://issues.apache.org/jira/browse/SHINDIG-579), we
> > need
> > > > to
> > > > >> > move
> > > > >> > > > >> > rewriting
> > > > >> > > > >> > logic out of AbstractHttpCache. Yet we should maintain
> > > > >> rewritten
> > > > >> > > > content
> > > > >> > > > >> > caching capability. The question is where to put it.
> > > > >> > > > >> > I see two options, at a high level:
> > > > >> > > > >> > 1. In code that calls
> > > > >> > > > >> ContentRewriterRegistry.rewrite(HttpResponse|Gadget).
> > > > >> > > > >> > Eg. MakeRequestHandler, ProxyHandler,
> ViewContentFetcher,
> > > > >> > > > GadgetServer,
> > > > >> > > > >> and
> > > > >> > > > >> > the near-future Renderer and Preloader. This allows
> > > > >> finer-grained
> > > > >> > > > control
> > > > >> > > > >> > over caching behavior in context, at the cost of
> > > distributing
> > > > >> > > caching
> > > > >> > > > >> logic
> > > > >> > > > >> > in various places.
> > > > >> > > > >> > 2. In
> > ContentRewriterRegistry.rewrite(HttpResponse|Gadget)
> > > > >> itself,
> > > > >> > > if
> > > > >> > > > so
> > > > >> > > > >> > chosen. Caching logic can be consolidated in
> > > > >> > > > >> > CachingContentRewriterRegistry,
> > > > >> > > > >> > for instance (which will no longer subclass
> > > > >> > > > CachingWebRetrievalFactory),
> > > > >> > > > >> > and
> > > > >> > > > >> > be considered an optimization to rewriting.
> > > > >> > > > >> >
> > > > >> > > > >> > I'm inclined to go #2. Rewriters themselves can be
> > > augmented
> > > > >> with
> > > > >> > > > caching
> > > > >> > > > >> > hints if necessary, and be assumed deterministic for a
> > > given
> > > > >> cache
> > > > >> > > key
> > > > >> > > > in
> > > > >> > > > >> > the meantime. Consolidating rewriting logic makes it
> > easier
> > > > to
> > > > >> > share
> > > > >> > > > the
> > > > >> > > > >> > cache itself.
> > > > >> > > > >> >
> > > > >> > > > >> > Still, I might be missing situations in which
> additional
> > > > >> context
> > > > >> > > > inherent
> > > > >> > > > >> > to
> > > > >> > > > >> > the calling context is needed to make a caching
> decision.
> > > > >> > > > >> >
> > > > >> > > > >> > --John
> > > > >> > > > >> >
> > > > >> > > > >>
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: Where to cache rewritten content

Posted by Louis Ryan <lr...@google.com>.

The overwhelming majority of the lookups will be for the rewritten versions
of cached Http fetches. Those are still cached as HTTP to maintain their
cache control semantics. If you stuff the MD5 in the originally cached HTTP
response you can use it to perform the lookup of the rewritten content.

On Wed, Sep 10, 2008 at 1:14 PM, John Hjelmstad <fa...@google.com> wrote:

> Perhaps I'm missing something, but note that in order to share a cache the
> data I'm storing are Strings, not HttpResponses (or Gadgets).
> As to hash method, MD5 it is.
>
> On Wed, Sep 10, 2008 at 11:18 AM, Kevin Brown <et...@google.com> wrote:
>
> > On Wed, Sep 10, 2008 at 10:57 AM, Louis Ryan <lr...@google.com> wrote:
> >
> > > A little reading would indicate the MD5 is about 40% faster to compute
> > than
> > > SHA1 so I would suggest we use it. It would also make sense to stuff
> the
> > > MD5
> > > of the cached content into an HTTP header on the cached version of the
> > > original content so we can avoid re-computing it.
> >
> >
> > Since HTTP response is immutable, we can do this at construction time in
> > pretty much the same way that it's done for GadgetSpec (i.e. adding a
> > getChecksum method).
> >
> >
> > >
> > >
> > > On Tue, Sep 9, 2008 at 6:28 PM, Louis Ryan <lr...@google.com> wrote:
> > >
> > > > I would be a very wary of relying on VM implementation specifics like
> > > > XXXRweriter.class.hashCode(). You certainly don't want the same class
> > > > generating different coes on different instances in a cluster where
> the
> > > > cache is shared. Also dont use serialVersionUID as that is only
> > intended
> > > for
> > > > serialization compatability. Reading the class as an InputStream from
> > the
> > > > ClassLoader and then computing a hash might work but seems more
> trouble
> > > than
> > > > its worth.
> > > >
> > > > In general I think a a manually maintained version no. might actually
> > be
> > > > your best bet.
> > > >
> > > >
> > > > On Tue, Sep 9, 2008 at 5:29 PM, John Hjelmstad <fa...@google.com>
> > wrote:
> > > >
> > > >> Back to this sub-thread...
> > > >> Adding a getVersion() parameter seems error prone and a little
> > > burdensome.
> > > >> In theory the version needs to change when any modification in
> > potential
> > > >> output for a given input is made, ie. not just optimization. That's
> > not
> > > >> always the easiest thing to analyze and get right. Meanwhile,
> > > invalidating
> > > >> a
> > > >> bunch of cache keys (by registering a new rewriter impl) isn't
> > > especially
> > > >> costly: if the cache is useful at all, it will be hit often, ie.
> > > refreshed
> > > >> quickly.
> > > >>
> > > >> --John
> > > >>
> > > >> On Tue, Sep 9, 2008 at 1:50 PM, Louis Ryan <lr...@google.com>
> wrote:
> > > >>
> > > >> > The versioning should be more explicit than that I think. Maybe
> add
> > a
> > > >> > getVersion function to the rewriter interface so they can manage
> > their
> > > >> own
> > > >> > changes
> > > >> >
> > > >> > On Tue, Sep 9, 2008 at 12:39 PM, John Hjelmstad <fargo@google.com
> >
> > > >> wrote:
> > > >> >
> > > >> > > Interesting suggestion. I can include the rewriter class names
> and
> > > >> their
> > > >> > > class hash codes or some other such versioning construct.
> > > >> > >
> > > >> > > On Tue, Sep 9, 2008 at 12:32 PM, Ben Laurie <be...@google.com>
> > > wrote:
> > > >> > >
> > > >> > > > On Tue, Sep 9, 2008 at 8:17 PM, John Hjelmstad <
> > fargo@google.com>
> > > >> > wrote:
> > > >> > > > > Excellent, agreed. CLs forthcoming.
> > > >> > > >
> > > >> > > > Can you include the version numbers of the rewriters in the
> > cache
> > > >> key?
> > > >> > > >
> > > >> > > > >
> > > >> > > > > On Tue, Sep 9, 2008 at 12:14 PM, Kevin Brown <
> etnu@google.com
> > >
> > > >> > wrote:
> > > >> > > > >
> > > >> > > > >> #2 is the only really viable option. If we have to put
> > caching
> > > >> logic
> > > >> > > in
> > > >> > > > 10
> > > >> > > > >> different places we'll screw it up 9 different times :).
> > > >> > > > >>
> > > >> > > > >> On Tue, Sep 9, 2008 at 12:11 PM, John Hjelmstad <
> > > >> fargo@google.com>
> > > >> > > > wrote:
> > > >> > > > >>
> > > >> > > > >> > As discussed on a few threads and tracked in JIRA issue (
> > > >> > > > >> > http://issues.apache.org/jira/browse/SHINDIG-579), we
> need
> > > to
> > > >> > move
> > > >> > > > >> > rewriting
> > > >> > > > >> > logic out of AbstractHttpCache. Yet we should maintain
> > > >> rewritten
> > > >> > > > content
> > > >> > > > >> > caching capability. The question is where to put it.
> > > >> > > > >> > I see two options, at a high level:
> > > >> > > > >> > 1. In code that calls
> > > >> > > > >> ContentRewriterRegistry.rewrite(HttpResponse|Gadget).
> > > >> > > > >> > Eg. MakeRequestHandler, ProxyHandler, ViewContentFetcher,
> > > >> > > > GadgetServer,
> > > >> > > > >> and
> > > >> > > > >> > the near-future Renderer and Preloader. This allows
> > > >> finer-grained
> > > >> > > > control
> > > >> > > > >> > over caching behavior in context, at the cost of
> > distributing
> > > >> > > caching
> > > >> > > > >> logic
> > > >> > > > >> > in various places.
> > > >> > > > >> > 2. In
> ContentRewriterRegistry.rewrite(HttpResponse|Gadget)
> > > >> itself,
> > > >> > > if
> > > >> > > > so
> > > >> > > > >> > chosen. Caching logic can be consolidated in
> > > >> > > > >> > CachingContentRewriterRegistry,
> > > >> > > > >> > for instance (which will no longer subclass
> > > >> > > > CachingWebRetrievalFactory),
> > > >> > > > >> > and
> > > >> > > > >> > be considered an optimization to rewriting.
> > > >> > > > >> >
> > > >> > > > >> > I'm inclined to go #2. Rewriters themselves can be
> > augmented
> > > >> with
> > > >> > > > caching
> > > >> > > > >> > hints if necessary, and be assumed deterministic for a
> > given
> > > >> cache
> > > >> > > key
> > > >> > > > in
> > > >> > > > >> > the meantime. Consolidating rewriting logic makes it
> easier
> > > to
> > > >> > share
> > > >> > > > the
> > > >> > > > >> > cache itself.
> > > >> > > > >> >
> > > >> > > > >> > Still, I might be missing situations in which additional
> > > >> context
> > > >> > > > inherent
> > > >> > > > >> > to
> > > >> > > > >> > the calling context is needed to make a caching decision.
> > > >> > > > >> >
> > > >> > > > >> > --John
> > > >> > > > >> >
> > > >> > > > >>
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>

Re: Where to cache rewritten content

Posted by Brian Eaton <be...@google.com>.

On Wed, Sep 10, 2008 at 1:14 PM, John Hjelmstad <fa...@google.com> wrote:
> Perhaps I'm missing something, but note that in order to share a cache the
> data I'm storing are Strings, not HttpResponses (or Gadgets).
> As to hash method, MD5 it is.

Ugh.  Using MD5 is bad crypto hygiene.  But it probably won't kill us here.

Re: Where to cache rewritten content

Posted by Kevin Brown <et...@google.com>.

On Wed, Sep 10, 2008 at 3:57 PM, John Hjelmstad <fa...@google.com> wrote:

> Now that I'm actually getting into it...
> It seems that what we want is for the hash to be generated only before
> HttpResponse is cached. We shouldn't waste time generating MD5s every time
> we generate an HttpResponse object.
>
> That suggests:
> HttpResponse.getChecksum() {
>  if (checksum != null) {
>    checksum = HashUtil.checksum(getResponseAsString());
>  }
>  return checksum;
> }
>
> ...which also requires that caching code knows to cache the checksum as
> well. So I see Louis' point in storing the MD5 in a header. The helper
> method is a nice convenience though.
>
> In either case we need to generate the MD5 in the AbstractHttpCache layer,
> which is sort of ugly (ie. calling getChecksum() just to make sure it's
> stored as a side-effect), but that seems unavoidable.


I think we're worrying way too much about the cost of an md5 here, and there
are far worse wastes of CPU time in the current Shindig code base that we
should address before we get into a micro optimization like caching the md5
results.


>
> --John
>
> On Wed, Sep 10, 2008 at 2:17 PM, John Hjelmstad <fa...@google.com> wrote:
>
> > Yep, exactly, I like that way of structuring it.
> >
> >
> > On Wed, Sep 10, 2008 at 1:32 PM, Kevin Brown <et...@google.com> wrote:
> >
> >> I was implying that you could keep the checksum on the HttpResponse
> >> object,
> >> yielding:
> >>
> >> cache.addEntry(response.getChecksum(), rewrittenContent);
> >>
> >> On Wed, Sep 10, 2008 at 1:14 PM, John Hjelmstad <fa...@google.com>
> wrote:
> >>
> >> > Perhaps I'm missing something, but note that in order to share a cache
> >> the
> >> > data I'm storing are Strings, not HttpResponses (or Gadgets).
> >> > As to hash method, MD5 it is.
> >> >
> >> > On Wed, Sep 10, 2008 at 11:18 AM, Kevin Brown <et...@google.com>
> wrote:
> >> >
> >> > > On Wed, Sep 10, 2008 at 10:57 AM, Louis Ryan <lr...@google.com>
> >> wrote:
> >> > >
> >> > > > A little reading would indicate the MD5 is about 40% faster to
> >> compute
> >> > > than
> >> > > > SHA1 so I would suggest we use it. It would also make sense to
> stuff
> >> > the
> >> > > > MD5
> >> > > > of the cached content into an HTTP header on the cached version of
> >> the
> >> > > > original content so we can avoid re-computing it.
> >> > >
> >> > >
> >> > > Since HTTP response is immutable, we can do this at construction
> time
> >> in
> >> > > pretty much the same way that it's done for GadgetSpec (i.e. adding
> a
> >> > > getChecksum method).
> >> > >
> >> > >
> >> > > >
> >> > > >
> >> > > > On Tue, Sep 9, 2008 at 6:28 PM, Louis Ryan <lr...@google.com>
> >> wrote:
> >> > > >
> >> > > > > I would be a very wary of relying on VM implementation specifics
> >> like
> >> > > > > XXXRweriter.class.hashCode(). You certainly don't want the same
> >> class
> >> > > > > generating different coes on different instances in a cluster
> >> where
> >> > the
> >> > > > > cache is shared. Also dont use serialVersionUID as that is only
> >> > > intended
> >> > > > for
> >> > > > > serialization compatability. Reading the class as an InputStream
> >> from
> >> > > the
> >> > > > > ClassLoader and then computing a hash might work but seems more
> >> > trouble
> >> > > > than
> >> > > > > its worth.
> >> > > > >
> >> > > > > In general I think a a manually maintained version no. might
> >> actually
> >> > > be
> >> > > > > your best bet.
> >> > > > >
> >> > > > >
> >> > > > > On Tue, Sep 9, 2008 at 5:29 PM, John Hjelmstad <
> fargo@google.com>
> >> > > wrote:
> >> > > > >
> >> > > > >> Back to this sub-thread...
> >> > > > >> Adding a getVersion() parameter seems error prone and a little
> >> > > > burdensome.
> >> > > > >> In theory the version needs to change when any modification in
> >> > > potential
> >> > > > >> output for a given input is made, ie. not just optimization.
> >> That's
> >> > > not
> >> > > > >> always the easiest thing to analyze and get right. Meanwhile,
> >> > > > invalidating
> >> > > > >> a
> >> > > > >> bunch of cache keys (by registering a new rewriter impl) isn't
> >> > > > especially
> >> > > > >> costly: if the cache is useful at all, it will be hit often,
> ie.
> >> > > > refreshed
> >> > > > >> quickly.
> >> > > > >>
> >> > > > >> --John
> >> > > > >>
> >> > > > >> On Tue, Sep 9, 2008 at 1:50 PM, Louis Ryan <lr...@google.com>
> >> > wrote:
> >> > > > >>
> >> > > > >> > The versioning should be more explicit than that I think.
> Maybe
> >> > add
> >> > > a
> >> > > > >> > getVersion function to the rewriter interface so they can
> >> manage
> >> > > their
> >> > > > >> own
> >> > > > >> > changes
> >> > > > >> >
> >> > > > >> > On Tue, Sep 9, 2008 at 12:39 PM, John Hjelmstad <
> >> fargo@google.com
> >> > >
> >> > > > >> wrote:
> >> > > > >> >
> >> > > > >> > > Interesting suggestion. I can include the rewriter class
> >> names
> >> > and
> >> > > > >> their
> >> > > > >> > > class hash codes or some other such versioning construct.
> >> > > > >> > >
> >> > > > >> > > On Tue, Sep 9, 2008 at 12:32 PM, Ben Laurie <
> benl@google.com
> >> >
> >> > > > wrote:
> >> > > > >> > >
> >> > > > >> > > > On Tue, Sep 9, 2008 at 8:17 PM, John Hjelmstad <
> >> > > fargo@google.com>
> >> > > > >> > wrote:
> >> > > > >> > > > > Excellent, agreed. CLs forthcoming.
> >> > > > >> > > >
> >> > > > >> > > > Can you include the version numbers of the rewriters in
> the
> >> > > cache
> >> > > > >> key?
> >> > > > >> > > >
> >> > > > >> > > > >
> >> > > > >> > > > > On Tue, Sep 9, 2008 at 12:14 PM, Kevin Brown <
> >> > etnu@google.com
> >> > > >
> >> > > > >> > wrote:
> >> > > > >> > > > >
> >> > > > >> > > > >> #2 is the only really viable option. If we have to put
> >> > > caching
> >> > > > >> logic
> >> > > > >> > > in
> >> > > > >> > > > 10
> >> > > > >> > > > >> different places we'll screw it up 9 different times
> :).
> >> > > > >> > > > >>
> >> > > > >> > > > >> On Tue, Sep 9, 2008 at 12:11 PM, John Hjelmstad <
> >> > > > >> fargo@google.com>
> >> > > > >> > > > wrote:
> >> > > > >> > > > >>
> >> > > > >> > > > >> > As discussed on a few threads and tracked in JIRA
> >> issue (
> >> > > > >> > > > >> > http://issues.apache.org/jira/browse/SHINDIG-579),
> we
> >> > need
> >> > > > to
> >> > > > >> > move
> >> > > > >> > > > >> > rewriting
> >> > > > >> > > > >> > logic out of AbstractHttpCache. Yet we should
> maintain
> >> > > > >> rewritten
> >> > > > >> > > > content
> >> > > > >> > > > >> > caching capability. The question is where to put it.
> >> > > > >> > > > >> > I see two options, at a high level:
> >> > > > >> > > > >> > 1. In code that calls
> >> > > > >> > > > >> ContentRewriterRegistry.rewrite(HttpResponse|Gadget).
> >> > > > >> > > > >> > Eg. MakeRequestHandler, ProxyHandler,
> >> ViewContentFetcher,
> >> > > > >> > > > GadgetServer,
> >> > > > >> > > > >> and
> >> > > > >> > > > >> > the near-future Renderer and Preloader. This allows
> >> > > > >> finer-grained
> >> > > > >> > > > control
> >> > > > >> > > > >> > over caching behavior in context, at the cost of
> >> > > distributing
> >> > > > >> > > caching
> >> > > > >> > > > >> logic
> >> > > > >> > > > >> > in various places.
> >> > > > >> > > > >> > 2. In
> >> > ContentRewriterRegistry.rewrite(HttpResponse|Gadget)
> >> > > > >> itself,
> >> > > > >> > > if
> >> > > > >> > > > so
> >> > > > >> > > > >> > chosen. Caching logic can be consolidated in
> >> > > > >> > > > >> > CachingContentRewriterRegistry,
> >> > > > >> > > > >> > for instance (which will no longer subclass
> >> > > > >> > > > CachingWebRetrievalFactory),
> >> > > > >> > > > >> > and
> >> > > > >> > > > >> > be considered an optimization to rewriting.
> >> > > > >> > > > >> >
> >> > > > >> > > > >> > I'm inclined to go #2. Rewriters themselves can be
> >> > > augmented
> >> > > > >> with
> >> > > > >> > > > caching
> >> > > > >> > > > >> > hints if necessary, and be assumed deterministic for
> a
> >> > > given
> >> > > > >> cache
> >> > > > >> > > key
> >> > > > >> > > > in
> >> > > > >> > > > >> > the meantime. Consolidating rewriting logic makes it
> >> > easier
> >> > > > to
> >> > > > >> > share
> >> > > > >> > > > the
> >> > > > >> > > > >> > cache itself.
> >> > > > >> > > > >> >
> >> > > > >> > > > >> > Still, I might be missing situations in which
> >> additional
> >> > > > >> context
> >> > > > >> > > > inherent
> >> > > > >> > > > >> > to
> >> > > > >> > > > >> > the calling context is needed to make a caching
> >> decision.
> >> > > > >> > > > >> >
> >> > > > >> > > > >> > --John
> >> > > > >> > > > >> >
> >> > > > >> > > > >>
> >> > > > >> > > > >
> >> > > > >> > > >
> >> > > > >> > >
> >> > > > >> >
> >> > > > >>
> >> > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Re: Where to cache rewritten content

Posted by John Hjelmstad <fa...@google.com>.

Now that I'm actually getting into it...
It seems that what we want is for the hash to be generated only before
HttpResponse is cached. We shouldn't waste time generating MD5s every time
we generate an HttpResponse object.

That suggests:
HttpResponse.getChecksum() {
  if (checksum != null) {
    checksum = HashUtil.checksum(getResponseAsString());
  }
  return checksum;
}

...which also requires that caching code knows to cache the checksum as
well. So I see Louis' point in storing the MD5 in a header. The helper
method is a nice convenience though.

In either case we need to generate the MD5 in the AbstractHttpCache layer,
which is sort of ugly (ie. calling getChecksum() just to make sure it's
stored as a side-effect), but that seems unavoidable.

--John

On Wed, Sep 10, 2008 at 2:17 PM, John Hjelmstad <fa...@google.com> wrote:

> Yep, exactly, I like that way of structuring it.
>
>
> On Wed, Sep 10, 2008 at 1:32 PM, Kevin Brown <et...@google.com> wrote:
>
>> I was implying that you could keep the checksum on the HttpResponse
>> object,
>> yielding:
>>
>> cache.addEntry(response.getChecksum(), rewrittenContent);
>>
>> On Wed, Sep 10, 2008 at 1:14 PM, John Hjelmstad <fa...@google.com> wrote:
>>
>> > Perhaps I'm missing something, but note that in order to share a cache
>> the
>> > data I'm storing are Strings, not HttpResponses (or Gadgets).
>> > As to hash method, MD5 it is.
>> >
>> > On Wed, Sep 10, 2008 at 11:18 AM, Kevin Brown <et...@google.com> wrote:
>> >
>> > > On Wed, Sep 10, 2008 at 10:57 AM, Louis Ryan <lr...@google.com>
>> wrote:
>> > >
>> > > > A little reading would indicate the MD5 is about 40% faster to
>> compute
>> > > than
>> > > > SHA1 so I would suggest we use it. It would also make sense to stuff
>> > the
>> > > > MD5
>> > > > of the cached content into an HTTP header on the cached version of
>> the
>> > > > original content so we can avoid re-computing it.
>> > >
>> > >
>> > > Since HTTP response is immutable, we can do this at construction time
>> in
>> > > pretty much the same way that it's done for GadgetSpec (i.e. adding a
>> > > getChecksum method).
>> > >
>> > >
>> > > >
>> > > >
>> > > > On Tue, Sep 9, 2008 at 6:28 PM, Louis Ryan <lr...@google.com>
>> wrote:
>> > > >
>> > > > > I would be a very wary of relying on VM implementation specifics
>> like
>> > > > > XXXRweriter.class.hashCode(). You certainly don't want the same
>> class
>> > > > > generating different coes on different instances in a cluster
>> where
>> > the
>> > > > > cache is shared. Also dont use serialVersionUID as that is only
>> > > intended
>> > > > for
>> > > > > serialization compatability. Reading the class as an InputStream
>> from
>> > > the
>> > > > > ClassLoader and then computing a hash might work but seems more
>> > trouble
>> > > > than
>> > > > > its worth.
>> > > > >
>> > > > > In general I think a a manually maintained version no. might
>> actually
>> > > be
>> > > > > your best bet.
>> > > > >
>> > > > >
>> > > > > On Tue, Sep 9, 2008 at 5:29 PM, John Hjelmstad <fa...@google.com>
>> > > wrote:
>> > > > >
>> > > > >> Back to this sub-thread...
>> > > > >> Adding a getVersion() parameter seems error prone and a little
>> > > > burdensome.
>> > > > >> In theory the version needs to change when any modification in
>> > > potential
>> > > > >> output for a given input is made, ie. not just optimization.
>> That's
>> > > not
>> > > > >> always the easiest thing to analyze and get right. Meanwhile,
>> > > > invalidating
>> > > > >> a
>> > > > >> bunch of cache keys (by registering a new rewriter impl) isn't
>> > > > especially
>> > > > >> costly: if the cache is useful at all, it will be hit often, ie.
>> > > > refreshed
>> > > > >> quickly.
>> > > > >>
>> > > > >> --John
>> > > > >>
>> > > > >> On Tue, Sep 9, 2008 at 1:50 PM, Louis Ryan <lr...@google.com>
>> > wrote:
>> > > > >>
>> > > > >> > The versioning should be more explicit than that I think. Maybe
>> > add
>> > > a
>> > > > >> > getVersion function to the rewriter interface so they can
>> manage
>> > > their
>> > > > >> own
>> > > > >> > changes
>> > > > >> >
>> > > > >> > On Tue, Sep 9, 2008 at 12:39 PM, John Hjelmstad <
>> fargo@google.com
>> > >
>> > > > >> wrote:
>> > > > >> >
>> > > > >> > > Interesting suggestion. I can include the rewriter class
>> names
>> > and
>> > > > >> their
>> > > > >> > > class hash codes or some other such versioning construct.
>> > > > >> > >
>> > > > >> > > On Tue, Sep 9, 2008 at 12:32 PM, Ben Laurie <benl@google.com
>> >
>> > > > wrote:
>> > > > >> > >
>> > > > >> > > > On Tue, Sep 9, 2008 at 8:17 PM, John Hjelmstad <
>> > > fargo@google.com>
>> > > > >> > wrote:
>> > > > >> > > > > Excellent, agreed. CLs forthcoming.
>> > > > >> > > >
>> > > > >> > > > Can you include the version numbers of the rewriters in the
>> > > cache
>> > > > >> key?
>> > > > >> > > >
>> > > > >> > > > >
>> > > > >> > > > > On Tue, Sep 9, 2008 at 12:14 PM, Kevin Brown <
>> > etnu@google.com
>> > > >
>> > > > >> > wrote:
>> > > > >> > > > >
>> > > > >> > > > >> #2 is the only really viable option. If we have to put
>> > > caching
>> > > > >> logic
>> > > > >> > > in
>> > > > >> > > > 10
>> > > > >> > > > >> different places we'll screw it up 9 different times :).
>> > > > >> > > > >>
>> > > > >> > > > >> On Tue, Sep 9, 2008 at 12:11 PM, John Hjelmstad <
>> > > > >> fargo@google.com>
>> > > > >> > > > wrote:
>> > > > >> > > > >>
>> > > > >> > > > >> > As discussed on a few threads and tracked in JIRA
>> issue (
>> > > > >> > > > >> > http://issues.apache.org/jira/browse/SHINDIG-579), we
>> > need
>> > > > to
>> > > > >> > move
>> > > > >> > > > >> > rewriting
>> > > > >> > > > >> > logic out of AbstractHttpCache. Yet we should maintain
>> > > > >> rewritten
>> > > > >> > > > content
>> > > > >> > > > >> > caching capability. The question is where to put it.
>> > > > >> > > > >> > I see two options, at a high level:
>> > > > >> > > > >> > 1. In code that calls
>> > > > >> > > > >> ContentRewriterRegistry.rewrite(HttpResponse|Gadget).
>> > > > >> > > > >> > Eg. MakeRequestHandler, ProxyHandler,
>> ViewContentFetcher,
>> > > > >> > > > GadgetServer,
>> > > > >> > > > >> and
>> > > > >> > > > >> > the near-future Renderer and Preloader. This allows
>> > > > >> finer-grained
>> > > > >> > > > control
>> > > > >> > > > >> > over caching behavior in context, at the cost of
>> > > distributing
>> > > > >> > > caching
>> > > > >> > > > >> logic
>> > > > >> > > > >> > in various places.
>> > > > >> > > > >> > 2. In
>> > ContentRewriterRegistry.rewrite(HttpResponse|Gadget)
>> > > > >> itself,
>> > > > >> > > if
>> > > > >> > > > so
>> > > > >> > > > >> > chosen. Caching logic can be consolidated in
>> > > > >> > > > >> > CachingContentRewriterRegistry,
>> > > > >> > > > >> > for instance (which will no longer subclass
>> > > > >> > > > CachingWebRetrievalFactory),
>> > > > >> > > > >> > and
>> > > > >> > > > >> > be considered an optimization to rewriting.
>> > > > >> > > > >> >
>> > > > >> > > > >> > I'm inclined to go #2. Rewriters themselves can be
>> > > augmented
>> > > > >> with
>> > > > >> > > > caching
>> > > > >> > > > >> > hints if necessary, and be assumed deterministic for a
>> > > given
>> > > > >> cache
>> > > > >> > > key
>> > > > >> > > > in
>> > > > >> > > > >> > the meantime. Consolidating rewriting logic makes it
>> > easier
>> > > > to
>> > > > >> > share
>> > > > >> > > > the
>> > > > >> > > > >> > cache itself.
>> > > > >> > > > >> >
>> > > > >> > > > >> > Still, I might be missing situations in which
>> additional
>> > > > >> context
>> > > > >> > > > inherent
>> > > > >> > > > >> > to
>> > > > >> > > > >> > the calling context is needed to make a caching
>> decision.
>> > > > >> > > > >> >
>> > > > >> > > > >> > --John
>> > > > >> > > > >> >
>> > > > >> > > > >>
>> > > > >> > > > >
>> > > > >> > > >
>> > > > >> > >
>> > > > >> >
>> > > > >>
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: Where to cache rewritten content

Posted by John Hjelmstad <fa...@google.com>.

Yep, exactly, I like that way of structuring it.

On Wed, Sep 10, 2008 at 1:32 PM, Kevin Brown <et...@google.com> wrote:

> I was implying that you could keep the checksum on the HttpResponse object,
> yielding:
>
> cache.addEntry(response.getChecksum(), rewrittenContent);
>
> On Wed, Sep 10, 2008 at 1:14 PM, John Hjelmstad <fa...@google.com> wrote:
>
> > Perhaps I'm missing something, but note that in order to share a cache
> the
> > data I'm storing are Strings, not HttpResponses (or Gadgets).
> > As to hash method, MD5 it is.
> >
> > On Wed, Sep 10, 2008 at 11:18 AM, Kevin Brown <et...@google.com> wrote:
> >
> > > On Wed, Sep 10, 2008 at 10:57 AM, Louis Ryan <lr...@google.com> wrote:
> > >
> > > > A little reading would indicate the MD5 is about 40% faster to
> compute
> > > than
> > > > SHA1 so I would suggest we use it. It would also make sense to stuff
> > the
> > > > MD5
> > > > of the cached content into an HTTP header on the cached version of
> the
> > > > original content so we can avoid re-computing it.
> > >
> > >
> > > Since HTTP response is immutable, we can do this at construction time
> in
> > > pretty much the same way that it's done for GadgetSpec (i.e. adding a
> > > getChecksum method).
> > >
> > >
> > > >
> > > >
> > > > On Tue, Sep 9, 2008 at 6:28 PM, Louis Ryan <lr...@google.com> wrote:
> > > >
> > > > > I would be a very wary of relying on VM implementation specifics
> like
> > > > > XXXRweriter.class.hashCode(). You certainly don't want the same
> class
> > > > > generating different coes on different instances in a cluster where
> > the
> > > > > cache is shared. Also dont use serialVersionUID as that is only
> > > intended
> > > > for
> > > > > serialization compatability. Reading the class as an InputStream
> from
> > > the
> > > > > ClassLoader and then computing a hash might work but seems more
> > trouble
> > > > than
> > > > > its worth.
> > > > >
> > > > > In general I think a a manually maintained version no. might
> actually
> > > be
> > > > > your best bet.
> > > > >
> > > > >
> > > > > On Tue, Sep 9, 2008 at 5:29 PM, John Hjelmstad <fa...@google.com>
> > > wrote:
> > > > >
> > > > >> Back to this sub-thread...
> > > > >> Adding a getVersion() parameter seems error prone and a little
> > > > burdensome.
> > > > >> In theory the version needs to change when any modification in
> > > potential
> > > > >> output for a given input is made, ie. not just optimization.
> That's
> > > not
> > > > >> always the easiest thing to analyze and get right. Meanwhile,
> > > > invalidating
> > > > >> a
> > > > >> bunch of cache keys (by registering a new rewriter impl) isn't
> > > > especially
> > > > >> costly: if the cache is useful at all, it will be hit often, ie.
> > > > refreshed
> > > > >> quickly.
> > > > >>
> > > > >> --John
> > > > >>
> > > > >> On Tue, Sep 9, 2008 at 1:50 PM, Louis Ryan <lr...@google.com>
> > wrote:
> > > > >>
> > > > >> > The versioning should be more explicit than that I think. Maybe
> > add
> > > a
> > > > >> > getVersion function to the rewriter interface so they can manage
> > > their
> > > > >> own
> > > > >> > changes
> > > > >> >
> > > > >> > On Tue, Sep 9, 2008 at 12:39 PM, John Hjelmstad <
> fargo@google.com
> > >
> > > > >> wrote:
> > > > >> >
> > > > >> > > Interesting suggestion. I can include the rewriter class names
> > and
> > > > >> their
> > > > >> > > class hash codes or some other such versioning construct.
> > > > >> > >
> > > > >> > > On Tue, Sep 9, 2008 at 12:32 PM, Ben Laurie <be...@google.com>
> > > > wrote:
> > > > >> > >
> > > > >> > > > On Tue, Sep 9, 2008 at 8:17 PM, John Hjelmstad <
> > > fargo@google.com>
> > > > >> > wrote:
> > > > >> > > > > Excellent, agreed. CLs forthcoming.
> > > > >> > > >
> > > > >> > > > Can you include the version numbers of the rewriters in the
> > > cache
> > > > >> key?
> > > > >> > > >
> > > > >> > > > >
> > > > >> > > > > On Tue, Sep 9, 2008 at 12:14 PM, Kevin Brown <
> > etnu@google.com
> > > >
> > > > >> > wrote:
> > > > >> > > > >
> > > > >> > > > >> #2 is the only really viable option. If we have to put
> > > caching
> > > > >> logic
> > > > >> > > in
> > > > >> > > > 10
> > > > >> > > > >> different places we'll screw it up 9 different times :).
> > > > >> > > > >>
> > > > >> > > > >> On Tue, Sep 9, 2008 at 12:11 PM, John Hjelmstad <
> > > > >> fargo@google.com>
> > > > >> > > > wrote:
> > > > >> > > > >>
> > > > >> > > > >> > As discussed on a few threads and tracked in JIRA issue
> (
> > > > >> > > > >> > http://issues.apache.org/jira/browse/SHINDIG-579), we
> > need
> > > > to
> > > > >> > move
> > > > >> > > > >> > rewriting
> > > > >> > > > >> > logic out of AbstractHttpCache. Yet we should maintain
> > > > >> rewritten
> > > > >> > > > content
> > > > >> > > > >> > caching capability. The question is where to put it.
> > > > >> > > > >> > I see two options, at a high level:
> > > > >> > > > >> > 1. In code that calls
> > > > >> > > > >> ContentRewriterRegistry.rewrite(HttpResponse|Gadget).
> > > > >> > > > >> > Eg. MakeRequestHandler, ProxyHandler,
> ViewContentFetcher,
> > > > >> > > > GadgetServer,
> > > > >> > > > >> and
> > > > >> > > > >> > the near-future Renderer and Preloader. This allows
> > > > >> finer-grained
> > > > >> > > > control
> > > > >> > > > >> > over caching behavior in context, at the cost of
> > > distributing
> > > > >> > > caching
> > > > >> > > > >> logic
> > > > >> > > > >> > in various places.
> > > > >> > > > >> > 2. In
> > ContentRewriterRegistry.rewrite(HttpResponse|Gadget)
> > > > >> itself,
> > > > >> > > if
> > > > >> > > > so
> > > > >> > > > >> > chosen. Caching logic can be consolidated in
> > > > >> > > > >> > CachingContentRewriterRegistry,
> > > > >> > > > >> > for instance (which will no longer subclass
> > > > >> > > > CachingWebRetrievalFactory),
> > > > >> > > > >> > and
> > > > >> > > > >> > be considered an optimization to rewriting.
> > > > >> > > > >> >
> > > > >> > > > >> > I'm inclined to go #2. Rewriters themselves can be
> > > augmented
> > > > >> with
> > > > >> > > > caching
> > > > >> > > > >> > hints if necessary, and be assumed deterministic for a
> > > given
> > > > >> cache
> > > > >> > > key
> > > > >> > > > in
> > > > >> > > > >> > the meantime. Consolidating rewriting logic makes it
> > easier
> > > > to
> > > > >> > share
> > > > >> > > > the
> > > > >> > > > >> > cache itself.
> > > > >> > > > >> >
> > > > >> > > > >> > Still, I might be missing situations in which
> additional
> > > > >> context
> > > > >> > > > inherent
> > > > >> > > > >> > to
> > > > >> > > > >> > the calling context is needed to make a caching
> decision.
> > > > >> > > > >> >
> > > > >> > > > >> > --John
> > > > >> > > > >> >
> > > > >> > > > >>
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: Where to cache rewritten content

Posted by Kevin Brown <et...@google.com>.

I was implying that you could keep the checksum on the HttpResponse object,
yielding:

cache.addEntry(response.getChecksum(), rewrittenContent);

On Wed, Sep 10, 2008 at 1:14 PM, John Hjelmstad <fa...@google.com> wrote:

> Perhaps I'm missing something, but note that in order to share a cache the
> data I'm storing are Strings, not HttpResponses (or Gadgets).
> As to hash method, MD5 it is.
>
> On Wed, Sep 10, 2008 at 11:18 AM, Kevin Brown <et...@google.com> wrote:
>
> > On Wed, Sep 10, 2008 at 10:57 AM, Louis Ryan <lr...@google.com> wrote:
> >
> > > A little reading would indicate the MD5 is about 40% faster to compute
> > than
> > > SHA1 so I would suggest we use it. It would also make sense to stuff
> the
> > > MD5
> > > of the cached content into an HTTP header on the cached version of the
> > > original content so we can avoid re-computing it.
> >
> >
> > Since HTTP response is immutable, we can do this at construction time in
> > pretty much the same way that it's done for GadgetSpec (i.e. adding a
> > getChecksum method).
> >
> >
> > >
> > >
> > > On Tue, Sep 9, 2008 at 6:28 PM, Louis Ryan <lr...@google.com> wrote:
> > >
> > > > I would be a very wary of relying on VM implementation specifics like
> > > > XXXRweriter.class.hashCode(). You certainly don't want the same class
> > > > generating different coes on different instances in a cluster where
> the
> > > > cache is shared. Also dont use serialVersionUID as that is only
> > intended
> > > for
> > > > serialization compatability. Reading the class as an InputStream from
> > the
> > > > ClassLoader and then computing a hash might work but seems more
> trouble
> > > than
> > > > its worth.
> > > >
> > > > In general I think a a manually maintained version no. might actually
> > be
> > > > your best bet.
> > > >
> > > >
> > > > On Tue, Sep 9, 2008 at 5:29 PM, John Hjelmstad <fa...@google.com>
> > wrote:
> > > >
> > > >> Back to this sub-thread...
> > > >> Adding a getVersion() parameter seems error prone and a little
> > > burdensome.
> > > >> In theory the version needs to change when any modification in
> > potential
> > > >> output for a given input is made, ie. not just optimization. That's
> > not
> > > >> always the easiest thing to analyze and get right. Meanwhile,
> > > invalidating
> > > >> a
> > > >> bunch of cache keys (by registering a new rewriter impl) isn't
> > > especially
> > > >> costly: if the cache is useful at all, it will be hit often, ie.
> > > refreshed
> > > >> quickly.
> > > >>
> > > >> --John
> > > >>
> > > >> On Tue, Sep 9, 2008 at 1:50 PM, Louis Ryan <lr...@google.com>
> wrote:
> > > >>
> > > >> > The versioning should be more explicit than that I think. Maybe
> add
> > a
> > > >> > getVersion function to the rewriter interface so they can manage
> > their
> > > >> own
> > > >> > changes
> > > >> >
> > > >> > On Tue, Sep 9, 2008 at 12:39 PM, John Hjelmstad <fargo@google.com
> >
> > > >> wrote:
> > > >> >
> > > >> > > Interesting suggestion. I can include the rewriter class names
> and
> > > >> their
> > > >> > > class hash codes or some other such versioning construct.
> > > >> > >
> > > >> > > On Tue, Sep 9, 2008 at 12:32 PM, Ben Laurie <be...@google.com>
> > > wrote:
> > > >> > >
> > > >> > > > On Tue, Sep 9, 2008 at 8:17 PM, John Hjelmstad <
> > fargo@google.com>
> > > >> > wrote:
> > > >> > > > > Excellent, agreed. CLs forthcoming.
> > > >> > > >
> > > >> > > > Can you include the version numbers of the rewriters in the
> > cache
> > > >> key?
> > > >> > > >
> > > >> > > > >
> > > >> > > > > On Tue, Sep 9, 2008 at 12:14 PM, Kevin Brown <
> etnu@google.com
> > >
> > > >> > wrote:
> > > >> > > > >
> > > >> > > > >> #2 is the only really viable option. If we have to put
> > caching
> > > >> logic
> > > >> > > in
> > > >> > > > 10
> > > >> > > > >> different places we'll screw it up 9 different times :).
> > > >> > > > >>
> > > >> > > > >> On Tue, Sep 9, 2008 at 12:11 PM, John Hjelmstad <
> > > >> fargo@google.com>
> > > >> > > > wrote:
> > > >> > > > >>
> > > >> > > > >> > As discussed on a few threads and tracked in JIRA issue (
> > > >> > > > >> > http://issues.apache.org/jira/browse/SHINDIG-579), we
> need
> > > to
> > > >> > move
> > > >> > > > >> > rewriting
> > > >> > > > >> > logic out of AbstractHttpCache. Yet we should maintain
> > > >> rewritten
> > > >> > > > content
> > > >> > > > >> > caching capability. The question is where to put it.
> > > >> > > > >> > I see two options, at a high level:
> > > >> > > > >> > 1. In code that calls
> > > >> > > > >> ContentRewriterRegistry.rewrite(HttpResponse|Gadget).
> > > >> > > > >> > Eg. MakeRequestHandler, ProxyHandler, ViewContentFetcher,
> > > >> > > > GadgetServer,
> > > >> > > > >> and
> > > >> > > > >> > the near-future Renderer and Preloader. This allows
> > > >> finer-grained
> > > >> > > > control
> > > >> > > > >> > over caching behavior in context, at the cost of
> > distributing
> > > >> > > caching
> > > >> > > > >> logic
> > > >> > > > >> > in various places.
> > > >> > > > >> > 2. In
> ContentRewriterRegistry.rewrite(HttpResponse|Gadget)
> > > >> itself,
> > > >> > > if
> > > >> > > > so
> > > >> > > > >> > chosen. Caching logic can be consolidated in
> > > >> > > > >> > CachingContentRewriterRegistry,
> > > >> > > > >> > for instance (which will no longer subclass
> > > >> > > > CachingWebRetrievalFactory),
> > > >> > > > >> > and
> > > >> > > > >> > be considered an optimization to rewriting.
> > > >> > > > >> >
> > > >> > > > >> > I'm inclined to go #2. Rewriters themselves can be
> > augmented
> > > >> with
> > > >> > > > caching
> > > >> > > > >> > hints if necessary, and be assumed deterministic for a
> > given
> > > >> cache
> > > >> > > key
> > > >> > > > in
> > > >> > > > >> > the meantime. Consolidating rewriting logic makes it
> easier
> > > to
> > > >> > share
> > > >> > > > the
> > > >> > > > >> > cache itself.
> > > >> > > > >> >
> > > >> > > > >> > Still, I might be missing situations in which additional
> > > >> context
> > > >> > > > inherent
> > > >> > > > >> > to
> > > >> > > > >> > the calling context is needed to make a caching decision.
> > > >> > > > >> >
> > > >> > > > >> > --John
> > > >> > > > >> >
> > > >> > > > >>
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>

Re: Where to cache rewritten content

Posted by John Hjelmstad <fa...@google.com>.

Perhaps I'm missing something, but note that in order to share a cache the
data I'm storing are Strings, not HttpResponses (or Gadgets).
As to hash method, MD5 it is.

On Wed, Sep 10, 2008 at 11:18 AM, Kevin Brown <et...@google.com> wrote:

> On Wed, Sep 10, 2008 at 10:57 AM, Louis Ryan <lr...@google.com> wrote:
>
> > A little reading would indicate the MD5 is about 40% faster to compute
> than
> > SHA1 so I would suggest we use it. It would also make sense to stuff the
> > MD5
> > of the cached content into an HTTP header on the cached version of the
> > original content so we can avoid re-computing it.
>
>
> Since HTTP response is immutable, we can do this at construction time in
> pretty much the same way that it's done for GadgetSpec (i.e. adding a
> getChecksum method).
>
>
> >
> >
> > On Tue, Sep 9, 2008 at 6:28 PM, Louis Ryan <lr...@google.com> wrote:
> >
> > > I would be a very wary of relying on VM implementation specifics like
> > > XXXRweriter.class.hashCode(). You certainly don't want the same class
> > > generating different coes on different instances in a cluster where the
> > > cache is shared. Also dont use serialVersionUID as that is only
> intended
> > for
> > > serialization compatability. Reading the class as an InputStream from
> the
> > > ClassLoader and then computing a hash might work but seems more trouble
> > than
> > > its worth.
> > >
> > > In general I think a a manually maintained version no. might actually
> be
> > > your best bet.
> > >
> > >
> > > On Tue, Sep 9, 2008 at 5:29 PM, John Hjelmstad <fa...@google.com>
> wrote:
> > >
> > >> Back to this sub-thread...
> > >> Adding a getVersion() parameter seems error prone and a little
> > burdensome.
> > >> In theory the version needs to change when any modification in
> potential
> > >> output for a given input is made, ie. not just optimization. That's
> not
> > >> always the easiest thing to analyze and get right. Meanwhile,
> > invalidating
> > >> a
> > >> bunch of cache keys (by registering a new rewriter impl) isn't
> > especially
> > >> costly: if the cache is useful at all, it will be hit often, ie.
> > refreshed
> > >> quickly.
> > >>
> > >> --John
> > >>
> > >> On Tue, Sep 9, 2008 at 1:50 PM, Louis Ryan <lr...@google.com> wrote:
> > >>
> > >> > The versioning should be more explicit than that I think. Maybe add
> a
> > >> > getVersion function to the rewriter interface so they can manage
> their
> > >> own
> > >> > changes
> > >> >
> > >> > On Tue, Sep 9, 2008 at 12:39 PM, John Hjelmstad <fa...@google.com>
> > >> wrote:
> > >> >
> > >> > > Interesting suggestion. I can include the rewriter class names and
> > >> their
> > >> > > class hash codes or some other such versioning construct.
> > >> > >
> > >> > > On Tue, Sep 9, 2008 at 12:32 PM, Ben Laurie <be...@google.com>
> > wrote:
> > >> > >
> > >> > > > On Tue, Sep 9, 2008 at 8:17 PM, John Hjelmstad <
> fargo@google.com>
> > >> > wrote:
> > >> > > > > Excellent, agreed. CLs forthcoming.
> > >> > > >
> > >> > > > Can you include the version numbers of the rewriters in the
> cache
> > >> key?
> > >> > > >
> > >> > > > >
> > >> > > > > On Tue, Sep 9, 2008 at 12:14 PM, Kevin Brown <etnu@google.com
> >
> > >> > wrote:
> > >> > > > >
> > >> > > > >> #2 is the only really viable option. If we have to put
> caching
> > >> logic
> > >> > > in
> > >> > > > 10
> > >> > > > >> different places we'll screw it up 9 different times :).
> > >> > > > >>
> > >> > > > >> On Tue, Sep 9, 2008 at 12:11 PM, John Hjelmstad <
> > >> fargo@google.com>
> > >> > > > wrote:
> > >> > > > >>
> > >> > > > >> > As discussed on a few threads and tracked in JIRA issue (
> > >> > > > >> > http://issues.apache.org/jira/browse/SHINDIG-579), we need
> > to
> > >> > move
> > >> > > > >> > rewriting
> > >> > > > >> > logic out of AbstractHttpCache. Yet we should maintain
> > >> rewritten
> > >> > > > content
> > >> > > > >> > caching capability. The question is where to put it.
> > >> > > > >> > I see two options, at a high level:
> > >> > > > >> > 1. In code that calls
> > >> > > > >> ContentRewriterRegistry.rewrite(HttpResponse|Gadget).
> > >> > > > >> > Eg. MakeRequestHandler, ProxyHandler, ViewContentFetcher,
> > >> > > > GadgetServer,
> > >> > > > >> and
> > >> > > > >> > the near-future Renderer and Preloader. This allows
> > >> finer-grained
> > >> > > > control
> > >> > > > >> > over caching behavior in context, at the cost of
> distributing
> > >> > > caching
> > >> > > > >> logic
> > >> > > > >> > in various places.
> > >> > > > >> > 2. In ContentRewriterRegistry.rewrite(HttpResponse|Gadget)
> > >> itself,
> > >> > > if
> > >> > > > so
> > >> > > > >> > chosen. Caching logic can be consolidated in
> > >> > > > >> > CachingContentRewriterRegistry,
> > >> > > > >> > for instance (which will no longer subclass
> > >> > > > CachingWebRetrievalFactory),
> > >> > > > >> > and
> > >> > > > >> > be considered an optimization to rewriting.
> > >> > > > >> >
> > >> > > > >> > I'm inclined to go #2. Rewriters themselves can be
> augmented
> > >> with
> > >> > > > caching
> > >> > > > >> > hints if necessary, and be assumed deterministic for a
> given
> > >> cache
> > >> > > key
> > >> > > > in
> > >> > > > >> > the meantime. Consolidating rewriting logic makes it easier
> > to
> > >> > share
> > >> > > > the
> > >> > > > >> > cache itself.
> > >> > > > >> >
> > >> > > > >> > Still, I might be missing situations in which additional
> > >> context
> > >> > > > inherent
> > >> > > > >> > to
> > >> > > > >> > the calling context is needed to make a caching decision.
> > >> > > > >> >
> > >> > > > >> > --John
> > >> > > > >> >
> > >> > > > >>
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: Where to cache rewritten content

Posted by Kevin Brown <et...@google.com>.

On Wed, Sep 10, 2008 at 10:57 AM, Louis Ryan <lr...@google.com> wrote:

> A little reading would indicate the MD5 is about 40% faster to compute than
> SHA1 so I would suggest we use it. It would also make sense to stuff the
> MD5
> of the cached content into an HTTP header on the cached version of the
> original content so we can avoid re-computing it.


Since HTTP response is immutable, we can do this at construction time in
pretty much the same way that it's done for GadgetSpec (i.e. adding a
getChecksum method).


>
>
> On Tue, Sep 9, 2008 at 6:28 PM, Louis Ryan <lr...@google.com> wrote:
>
> > I would be a very wary of relying on VM implementation specifics like
> > XXXRweriter.class.hashCode(). You certainly don't want the same class
> > generating different coes on different instances in a cluster where the
> > cache is shared. Also dont use serialVersionUID as that is only intended
> for
> > serialization compatability. Reading the class as an InputStream from the
> > ClassLoader and then computing a hash might work but seems more trouble
> than
> > its worth.
> >
> > In general I think a a manually maintained version no. might actually be
> > your best bet.
> >
> >
> > On Tue, Sep 9, 2008 at 5:29 PM, John Hjelmstad <fa...@google.com> wrote:
> >
> >> Back to this sub-thread...
> >> Adding a getVersion() parameter seems error prone and a little
> burdensome.
> >> In theory the version needs to change when any modification in potential
> >> output for a given input is made, ie. not just optimization. That's not
> >> always the easiest thing to analyze and get right. Meanwhile,
> invalidating
> >> a
> >> bunch of cache keys (by registering a new rewriter impl) isn't
> especially
> >> costly: if the cache is useful at all, it will be hit often, ie.
> refreshed
> >> quickly.
> >>
> >> --John
> >>
> >> On Tue, Sep 9, 2008 at 1:50 PM, Louis Ryan <lr...@google.com> wrote:
> >>
> >> > The versioning should be more explicit than that I think. Maybe add a
> >> > getVersion function to the rewriter interface so they can manage their
> >> own
> >> > changes
> >> >
> >> > On Tue, Sep 9, 2008 at 12:39 PM, John Hjelmstad <fa...@google.com>
> >> wrote:
> >> >
> >> > > Interesting suggestion. I can include the rewriter class names and
> >> their
> >> > > class hash codes or some other such versioning construct.
> >> > >
> >> > > On Tue, Sep 9, 2008 at 12:32 PM, Ben Laurie <be...@google.com>
> wrote:
> >> > >
> >> > > > On Tue, Sep 9, 2008 at 8:17 PM, John Hjelmstad <fa...@google.com>
> >> > wrote:
> >> > > > > Excellent, agreed. CLs forthcoming.
> >> > > >
> >> > > > Can you include the version numbers of the rewriters in the cache
> >> key?
> >> > > >
> >> > > > >
> >> > > > > On Tue, Sep 9, 2008 at 12:14 PM, Kevin Brown <et...@google.com>
> >> > wrote:
> >> > > > >
> >> > > > >> #2 is the only really viable option. If we have to put caching
> >> logic
> >> > > in
> >> > > > 10
> >> > > > >> different places we'll screw it up 9 different times :).
> >> > > > >>
> >> > > > >> On Tue, Sep 9, 2008 at 12:11 PM, John Hjelmstad <
> >> fargo@google.com>
> >> > > > wrote:
> >> > > > >>
> >> > > > >> > As discussed on a few threads and tracked in JIRA issue (
> >> > > > >> > http://issues.apache.org/jira/browse/SHINDIG-579), we need
> to
> >> > move
> >> > > > >> > rewriting
> >> > > > >> > logic out of AbstractHttpCache. Yet we should maintain
> >> rewritten
> >> > > > content
> >> > > > >> > caching capability. The question is where to put it.
> >> > > > >> > I see two options, at a high level:
> >> > > > >> > 1. In code that calls
> >> > > > >> ContentRewriterRegistry.rewrite(HttpResponse|Gadget).
> >> > > > >> > Eg. MakeRequestHandler, ProxyHandler, ViewContentFetcher,
> >> > > > GadgetServer,
> >> > > > >> and
> >> > > > >> > the near-future Renderer and Preloader. This allows
> >> finer-grained
> >> > > > control
> >> > > > >> > over caching behavior in context, at the cost of distributing
> >> > > caching
> >> > > > >> logic
> >> > > > >> > in various places.
> >> > > > >> > 2. In ContentRewriterRegistry.rewrite(HttpResponse|Gadget)
> >> itself,
> >> > > if
> >> > > > so
> >> > > > >> > chosen. Caching logic can be consolidated in
> >> > > > >> > CachingContentRewriterRegistry,
> >> > > > >> > for instance (which will no longer subclass
> >> > > > CachingWebRetrievalFactory),
> >> > > > >> > and
> >> > > > >> > be considered an optimization to rewriting.
> >> > > > >> >
> >> > > > >> > I'm inclined to go #2. Rewriters themselves can be augmented
> >> with
> >> > > > caching
> >> > > > >> > hints if necessary, and be assumed deterministic for a given
> >> cache
> >> > > key
> >> > > > in
> >> > > > >> > the meantime. Consolidating rewriting logic makes it easier
> to
> >> > share
> >> > > > the
> >> > > > >> > cache itself.
> >> > > > >> >
> >> > > > >> > Still, I might be missing situations in which additional
> >> context
> >> > > > inherent
> >> > > > >> > to
> >> > > > >> > the calling context is needed to make a caching decision.
> >> > > > >> >
> >> > > > >> > --John
> >> > > > >> >
> >> > > > >>
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Re: Where to cache rewritten content

Posted by Louis Ryan <lr...@google.com>.

A little reading would indicate the MD5 is about 40% faster to compute than
SHA1 so I would suggest we use it. It would also make sense to stuff the MD5
of the cached content into an HTTP header on the cached version of the
original content so we can avoid re-computing it.

On Tue, Sep 9, 2008 at 6:28 PM, Louis Ryan <lr...@google.com> wrote:

> I would be a very wary of relying on VM implementation specifics like
> XXXRweriter.class.hashCode(). You certainly don't want the same class
> generating different coes on different instances in a cluster where the
> cache is shared. Also dont use serialVersionUID as that is only intended for
> serialization compatability. Reading the class as an InputStream from the
> ClassLoader and then computing a hash might work but seems more trouble than
> its worth.
>
> In general I think a a manually maintained version no. might actually be
> your best bet.
>
>
> On Tue, Sep 9, 2008 at 5:29 PM, John Hjelmstad <fa...@google.com> wrote:
>
>> Back to this sub-thread...
>> Adding a getVersion() parameter seems error prone and a little burdensome.
>> In theory the version needs to change when any modification in potential
>> output for a given input is made, ie. not just optimization. That's not
>> always the easiest thing to analyze and get right. Meanwhile, invalidating
>> a
>> bunch of cache keys (by registering a new rewriter impl) isn't especially
>> costly: if the cache is useful at all, it will be hit often, ie. refreshed
>> quickly.
>>
>> --John
>>
>> On Tue, Sep 9, 2008 at 1:50 PM, Louis Ryan <lr...@google.com> wrote:
>>
>> > The versioning should be more explicit than that I think. Maybe add a
>> > getVersion function to the rewriter interface so they can manage their
>> own
>> > changes
>> >
>> > On Tue, Sep 9, 2008 at 12:39 PM, John Hjelmstad <fa...@google.com>
>> wrote:
>> >
>> > > Interesting suggestion. I can include the rewriter class names and
>> their
>> > > class hash codes or some other such versioning construct.
>> > >
>> > > On Tue, Sep 9, 2008 at 12:32 PM, Ben Laurie <be...@google.com> wrote:
>> > >
>> > > > On Tue, Sep 9, 2008 at 8:17 PM, John Hjelmstad <fa...@google.com>
>> > wrote:
>> > > > > Excellent, agreed. CLs forthcoming.
>> > > >
>> > > > Can you include the version numbers of the rewriters in the cache
>> key?
>> > > >
>> > > > >
>> > > > > On Tue, Sep 9, 2008 at 12:14 PM, Kevin Brown <et...@google.com>
>> > wrote:
>> > > > >
>> > > > >> #2 is the only really viable option. If we have to put caching
>> logic
>> > > in
>> > > > 10
>> > > > >> different places we'll screw it up 9 different times :).
>> > > > >>
>> > > > >> On Tue, Sep 9, 2008 at 12:11 PM, John Hjelmstad <
>> fargo@google.com>
>> > > > wrote:
>> > > > >>
>> > > > >> > As discussed on a few threads and tracked in JIRA issue (
>> > > > >> > http://issues.apache.org/jira/browse/SHINDIG-579), we need to
>> > move
>> > > > >> > rewriting
>> > > > >> > logic out of AbstractHttpCache. Yet we should maintain
>> rewritten
>> > > > content
>> > > > >> > caching capability. The question is where to put it.
>> > > > >> > I see two options, at a high level:
>> > > > >> > 1. In code that calls
>> > > > >> ContentRewriterRegistry.rewrite(HttpResponse|Gadget).
>> > > > >> > Eg. MakeRequestHandler, ProxyHandler, ViewContentFetcher,
>> > > > GadgetServer,
>> > > > >> and
>> > > > >> > the near-future Renderer and Preloader. This allows
>> finer-grained
>> > > > control
>> > > > >> > over caching behavior in context, at the cost of distributing
>> > > caching
>> > > > >> logic
>> > > > >> > in various places.
>> > > > >> > 2. In ContentRewriterRegistry.rewrite(HttpResponse|Gadget)
>> itself,
>> > > if
>> > > > so
>> > > > >> > chosen. Caching logic can be consolidated in
>> > > > >> > CachingContentRewriterRegistry,
>> > > > >> > for instance (which will no longer subclass
>> > > > CachingWebRetrievalFactory),
>> > > > >> > and
>> > > > >> > be considered an optimization to rewriting.
>> > > > >> >
>> > > > >> > I'm inclined to go #2. Rewriters themselves can be augmented
>> with
>> > > > caching
>> > > > >> > hints if necessary, and be assumed deterministic for a given
>> cache
>> > > key
>> > > > in
>> > > > >> > the meantime. Consolidating rewriting logic makes it easier to
>> > share
>> > > > the
>> > > > >> > cache itself.
>> > > > >> >
>> > > > >> > Still, I might be missing situations in which additional
>> context
>> > > > inherent
>> > > > >> > to
>> > > > >> > the calling context is needed to make a caching decision.
>> > > > >> >
>> > > > >> > --John
>> > > > >> >
>> > > > >>
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: Where to cache rewritten content

Posted by John Hjelmstad <fa...@google.com>.

Just the class, but acknowledged, that wouldn't cover every use case. I
suppose a configuration-driven versioning scheme is reasonable. If not used,
everything with the same class gets the same version/key.

On Wed, Sep 10, 2008 at 1:17 PM, Louis Ryan <lr...@google.com> wrote:

> Would you hash just the class or all its dependent utility and nested
> classes? The codes not all that hard it just unlikely to achieve what you
> want. A signature to retrieve the version seems much safer as it avoids
> thrash from refactoring that maintains the same semantics etc. You can
> always make the version externally defined in a property file and bind it
> at
> runtime if we want it to be managed by configuration and not by code.
>
> On Wed, Sep 10, 2008 at 1:12 PM, John Hjelmstad <fa...@google.com> wrote:
>
> > Reading again the contract of hashCode(), it seems this comes down to
> > whether or not two instances of the same class (derived from
> > byte-equivalent
> > JAR or what have you) on different VMs are .equals(...) to one another. I
> > haven't found any decent resources to suggest that would even
> theoretically
> > be so, given it can't be tested.
> > I'm sold on either reading the class or supporting a manual version
> number.
> > Given that these values are computed only once, is it really that
> difficult
> > to do the class-reading technique you describe?
> >
> > --John
> >
> > On Tue, Sep 9, 2008 at 6:28 PM, Louis Ryan <lr...@google.com> wrote:
> >
> > > I would be a very wary of relying on VM implementation specifics like
> > > XXXRweriter.class.hashCode(). You certainly don't want the same class
> > > generating different coes on different instances in a cluster where the
> > > cache is shared. Also dont use serialVersionUID as that is only
> intended
> > > for
> > > serialization compatability. Reading the class as an InputStream from
> the
> > > ClassLoader and then computing a hash might work but seems more trouble
> > > than
> > > its worth.
> > >
> > > In general I think a a manually maintained version no. might actually
> be
> > > your best bet.
> > >
> > > On Tue, Sep 9, 2008 at 5:29 PM, John Hjelmstad <fa...@google.com>
> wrote:
> > >
> > > > Back to this sub-thread...
> > > > Adding a getVersion() parameter seems error prone and a little
> > > burdensome.
> > > > In theory the version needs to change when any modification in
> > potential
> > > > output for a given input is made, ie. not just optimization. That's
> not
> > > > always the easiest thing to analyze and get right. Meanwhile,
> > > invalidating
> > > > a
> > > > bunch of cache keys (by registering a new rewriter impl) isn't
> > especially
> > > > costly: if the cache is useful at all, it will be hit often, ie.
> > > refreshed
> > > > quickly.
> > > >
> > > > --John
> > > >
> > > > On Tue, Sep 9, 2008 at 1:50 PM, Louis Ryan <lr...@google.com> wrote:
> > > >
> > > > > The versioning should be more explicit than that I think. Maybe add
> a
> > > > > getVersion function to the rewriter interface so they can manage
> > their
> > > > own
> > > > > changes
> > > > >
> > > > > On Tue, Sep 9, 2008 at 12:39 PM, John Hjelmstad <fa...@google.com>
> > > > wrote:
> > > > >
> > > > > > Interesting suggestion. I can include the rewriter class names
> and
> > > > their
> > > > > > class hash codes or some other such versioning construct.
> > > > > >
> > > > > > On Tue, Sep 9, 2008 at 12:32 PM, Ben Laurie <be...@google.com>
> > wrote:
> > > > > >
> > > > > > > On Tue, Sep 9, 2008 at 8:17 PM, John Hjelmstad <
> fargo@google.com
> > >
> > > > > wrote:
> > > > > > > > Excellent, agreed. CLs forthcoming.
> > > > > > >
> > > > > > > Can you include the version numbers of the rewriters in the
> cache
> > > > key?
> > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Sep 9, 2008 at 12:14 PM, Kevin Brown <
> etnu@google.com>
> > > > > wrote:
> > > > > > > >
> > > > > > > >> #2 is the only really viable option. If we have to put
> caching
> > > > logic
> > > > > > in
> > > > > > > 10
> > > > > > > >> different places we'll screw it up 9 different times :).
> > > > > > > >>
> > > > > > > >> On Tue, Sep 9, 2008 at 12:11 PM, John Hjelmstad <
> > > fargo@google.com
> > > > >
> > > > > > > wrote:
> > > > > > > >>
> > > > > > > >> > As discussed on a few threads and tracked in JIRA issue (
> > > > > > > >> > http://issues.apache.org/jira/browse/SHINDIG-579), we
> need
> > to
> > > > > move
> > > > > > > >> > rewriting
> > > > > > > >> > logic out of AbstractHttpCache. Yet we should maintain
> > > rewritten
> > > > > > > content
> > > > > > > >> > caching capability. The question is where to put it.
> > > > > > > >> > I see two options, at a high level:
> > > > > > > >> > 1. In code that calls
> > > > > > > >> ContentRewriterRegistry.rewrite(HttpResponse|Gadget).
> > > > > > > >> > Eg. MakeRequestHandler, ProxyHandler, ViewContentFetcher,
> > > > > > > GadgetServer,
> > > > > > > >> and
> > > > > > > >> > the near-future Renderer and Preloader. This allows
> > > > finer-grained
> > > > > > > control
> > > > > > > >> > over caching behavior in context, at the cost of
> > distributing
> > > > > > caching
> > > > > > > >> logic
> > > > > > > >> > in various places.
> > > > > > > >> > 2. In ContentRewriterRegistry.rewrite(HttpResponse|Gadget)
> > > > itself,
> > > > > > if
> > > > > > > so
> > > > > > > >> > chosen. Caching logic can be consolidated in
> > > > > > > >> > CachingContentRewriterRegistry,
> > > > > > > >> > for instance (which will no longer subclass
> > > > > > > CachingWebRetrievalFactory),
> > > > > > > >> > and
> > > > > > > >> > be considered an optimization to rewriting.
> > > > > > > >> >
> > > > > > > >> > I'm inclined to go #2. Rewriters themselves can be
> augmented
> > > > with
> > > > > > > caching
> > > > > > > >> > hints if necessary, and be assumed deterministic for a
> given
> > > > cache
> > > > > > key
> > > > > > > in
> > > > > > > >> > the meantime. Consolidating rewriting logic makes it
> easier
> > to
> > > > > share
> > > > > > > the
> > > > > > > >> > cache itself.
> > > > > > > >> >
> > > > > > > >> > Still, I might be missing situations in which additional
> > > context
> > > > > > > inherent
> > > > > > > >> > to
> > > > > > > >> > the calling context is needed to make a caching decision.
> > > > > > > >> >
> > > > > > > >> > --John
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Where to cache rewritten content

Posted by Louis Ryan <lr...@google.com>.

Would you hash just the class or all its dependent utility and nested
classes? The codes not all that hard it just unlikely to achieve what you
want. A signature to retrieve the version seems much safer as it avoids
thrash from refactoring that maintains the same semantics etc. You can
always make the version externally defined in a property file and bind it at
runtime if we want it to be managed by configuration and not by code.

On Wed, Sep 10, 2008 at 1:12 PM, John Hjelmstad <fa...@google.com> wrote:

> Reading again the contract of hashCode(), it seems this comes down to
> whether or not two instances of the same class (derived from
> byte-equivalent
> JAR or what have you) on different VMs are .equals(...) to one another. I
> haven't found any decent resources to suggest that would even theoretically
> be so, given it can't be tested.
> I'm sold on either reading the class or supporting a manual version number.
> Given that these values are computed only once, is it really that difficult
> to do the class-reading technique you describe?
>
> --John
>
> On Tue, Sep 9, 2008 at 6:28 PM, Louis Ryan <lr...@google.com> wrote:
>
> > I would be a very wary of relying on VM implementation specifics like
> > XXXRweriter.class.hashCode(). You certainly don't want the same class
> > generating different coes on different instances in a cluster where the
> > cache is shared. Also dont use serialVersionUID as that is only intended
> > for
> > serialization compatability. Reading the class as an InputStream from the
> > ClassLoader and then computing a hash might work but seems more trouble
> > than
> > its worth.
> >
> > In general I think a a manually maintained version no. might actually be
> > your best bet.
> >
> > On Tue, Sep 9, 2008 at 5:29 PM, John Hjelmstad <fa...@google.com> wrote:
> >
> > > Back to this sub-thread...
> > > Adding a getVersion() parameter seems error prone and a little
> > burdensome.
> > > In theory the version needs to change when any modification in
> potential
> > > output for a given input is made, ie. not just optimization. That's not
> > > always the easiest thing to analyze and get right. Meanwhile,
> > invalidating
> > > a
> > > bunch of cache keys (by registering a new rewriter impl) isn't
> especially
> > > costly: if the cache is useful at all, it will be hit often, ie.
> > refreshed
> > > quickly.
> > >
> > > --John
> > >
> > > On Tue, Sep 9, 2008 at 1:50 PM, Louis Ryan <lr...@google.com> wrote:
> > >
> > > > The versioning should be more explicit than that I think. Maybe add a
> > > > getVersion function to the rewriter interface so they can manage
> their
> > > own
> > > > changes
> > > >
> > > > On Tue, Sep 9, 2008 at 12:39 PM, John Hjelmstad <fa...@google.com>
> > > wrote:
> > > >
> > > > > Interesting suggestion. I can include the rewriter class names and
> > > their
> > > > > class hash codes or some other such versioning construct.
> > > > >
> > > > > On Tue, Sep 9, 2008 at 12:32 PM, Ben Laurie <be...@google.com>
> wrote:
> > > > >
> > > > > > On Tue, Sep 9, 2008 at 8:17 PM, John Hjelmstad <fargo@google.com
> >
> > > > wrote:
> > > > > > > Excellent, agreed. CLs forthcoming.
> > > > > >
> > > > > > Can you include the version numbers of the rewriters in the cache
> > > key?
> > > > > >
> > > > > > >
> > > > > > > On Tue, Sep 9, 2008 at 12:14 PM, Kevin Brown <et...@google.com>
> > > > wrote:
> > > > > > >
> > > > > > >> #2 is the only really viable option. If we have to put caching
> > > logic
> > > > > in
> > > > > > 10
> > > > > > >> different places we'll screw it up 9 different times :).
> > > > > > >>
> > > > > > >> On Tue, Sep 9, 2008 at 12:11 PM, John Hjelmstad <
> > fargo@google.com
> > > >
> > > > > > wrote:
> > > > > > >>
> > > > > > >> > As discussed on a few threads and tracked in JIRA issue (
> > > > > > >> > http://issues.apache.org/jira/browse/SHINDIG-579), we need
> to
> > > > move
> > > > > > >> > rewriting
> > > > > > >> > logic out of AbstractHttpCache. Yet we should maintain
> > rewritten
> > > > > > content
> > > > > > >> > caching capability. The question is where to put it.
> > > > > > >> > I see two options, at a high level:
> > > > > > >> > 1. In code that calls
> > > > > > >> ContentRewriterRegistry.rewrite(HttpResponse|Gadget).
> > > > > > >> > Eg. MakeRequestHandler, ProxyHandler, ViewContentFetcher,
> > > > > > GadgetServer,
> > > > > > >> and
> > > > > > >> > the near-future Renderer and Preloader. This allows
> > > finer-grained
> > > > > > control
> > > > > > >> > over caching behavior in context, at the cost of
> distributing
> > > > > caching
> > > > > > >> logic
> > > > > > >> > in various places.
> > > > > > >> > 2. In ContentRewriterRegistry.rewrite(HttpResponse|Gadget)
> > > itself,
> > > > > if
> > > > > > so
> > > > > > >> > chosen. Caching logic can be consolidated in
> > > > > > >> > CachingContentRewriterRegistry,
> > > > > > >> > for instance (which will no longer subclass
> > > > > > CachingWebRetrievalFactory),
> > > > > > >> > and
> > > > > > >> > be considered an optimization to rewriting.
> > > > > > >> >
> > > > > > >> > I'm inclined to go #2. Rewriters themselves can be augmented
> > > with
> > > > > > caching
> > > > > > >> > hints if necessary, and be assumed deterministic for a given
> > > cache
> > > > > key
> > > > > > in
> > > > > > >> > the meantime. Consolidating rewriting logic makes it easier
> to
> > > > share
> > > > > > the
> > > > > > >> > cache itself.
> > > > > > >> >
> > > > > > >> > Still, I might be missing situations in which additional
> > context
> > > > > > inherent
> > > > > > >> > to
> > > > > > >> > the calling context is needed to make a caching decision.
> > > > > > >> >
> > > > > > >> > --John
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Where to cache rewritten content

Posted by John Hjelmstad <fa...@google.com>.

Reading again the contract of hashCode(), it seems this comes down to
whether or not two instances of the same class (derived from byte-equivalent
JAR or what have you) on different VMs are .equals(...) to one another. I
haven't found any decent resources to suggest that would even theoretically
be so, given it can't be tested.
I'm sold on either reading the class or supporting a manual version number.
Given that these values are computed only once, is it really that difficult
to do the class-reading technique you describe?

--John

On Tue, Sep 9, 2008 at 6:28 PM, Louis Ryan <lr...@google.com> wrote:

> I would be a very wary of relying on VM implementation specifics like
> XXXRweriter.class.hashCode(). You certainly don't want the same class
> generating different coes on different instances in a cluster where the
> cache is shared. Also dont use serialVersionUID as that is only intended
> for
> serialization compatability. Reading the class as an InputStream from the
> ClassLoader and then computing a hash might work but seems more trouble
> than
> its worth.
>
> In general I think a a manually maintained version no. might actually be
> your best bet.
>
> On Tue, Sep 9, 2008 at 5:29 PM, John Hjelmstad <fa...@google.com> wrote:
>
> > Back to this sub-thread...
> > Adding a getVersion() parameter seems error prone and a little
> burdensome.
> > In theory the version needs to change when any modification in potential
> > output for a given input is made, ie. not just optimization. That's not
> > always the easiest thing to analyze and get right. Meanwhile,
> invalidating
> > a
> > bunch of cache keys (by registering a new rewriter impl) isn't especially
> > costly: if the cache is useful at all, it will be hit often, ie.
> refreshed
> > quickly.
> >
> > --John
> >
> > On Tue, Sep 9, 2008 at 1:50 PM, Louis Ryan <lr...@google.com> wrote:
> >
> > > The versioning should be more explicit than that I think. Maybe add a
> > > getVersion function to the rewriter interface so they can manage their
> > own
> > > changes
> > >
> > > On Tue, Sep 9, 2008 at 12:39 PM, John Hjelmstad <fa...@google.com>
> > wrote:
> > >
> > > > Interesting suggestion. I can include the rewriter class names and
> > their
> > > > class hash codes or some other such versioning construct.
> > > >
> > > > On Tue, Sep 9, 2008 at 12:32 PM, Ben Laurie <be...@google.com> wrote:
> > > >
> > > > > On Tue, Sep 9, 2008 at 8:17 PM, John Hjelmstad <fa...@google.com>
> > > wrote:
> > > > > > Excellent, agreed. CLs forthcoming.
> > > > >
> > > > > Can you include the version numbers of the rewriters in the cache
> > key?
> > > > >
> > > > > >
> > > > > > On Tue, Sep 9, 2008 at 12:14 PM, Kevin Brown <et...@google.com>
> > > wrote:
> > > > > >
> > > > > >> #2 is the only really viable option. If we have to put caching
> > logic
> > > > in
> > > > > 10
> > > > > >> different places we'll screw it up 9 different times :).
> > > > > >>
> > > > > >> On Tue, Sep 9, 2008 at 12:11 PM, John Hjelmstad <
> fargo@google.com
> > >
> > > > > wrote:
> > > > > >>
> > > > > >> > As discussed on a few threads and tracked in JIRA issue (
> > > > > >> > http://issues.apache.org/jira/browse/SHINDIG-579), we need to
> > > move
> > > > > >> > rewriting
> > > > > >> > logic out of AbstractHttpCache. Yet we should maintain
> rewritten
> > > > > content
> > > > > >> > caching capability. The question is where to put it.
> > > > > >> > I see two options, at a high level:
> > > > > >> > 1. In code that calls
> > > > > >> ContentRewriterRegistry.rewrite(HttpResponse|Gadget).
> > > > > >> > Eg. MakeRequestHandler, ProxyHandler, ViewContentFetcher,
> > > > > GadgetServer,
> > > > > >> and
> > > > > >> > the near-future Renderer and Preloader. This allows
> > finer-grained
> > > > > control
> > > > > >> > over caching behavior in context, at the cost of distributing
> > > > caching
> > > > > >> logic
> > > > > >> > in various places.
> > > > > >> > 2. In ContentRewriterRegistry.rewrite(HttpResponse|Gadget)
> > itself,
> > > > if
> > > > > so
> > > > > >> > chosen. Caching logic can be consolidated in
> > > > > >> > CachingContentRewriterRegistry,
> > > > > >> > for instance (which will no longer subclass
> > > > > CachingWebRetrievalFactory),
> > > > > >> > and
> > > > > >> > be considered an optimization to rewriting.
> > > > > >> >
> > > > > >> > I'm inclined to go #2. Rewriters themselves can be augmented
> > with
> > > > > caching
> > > > > >> > hints if necessary, and be assumed deterministic for a given
> > cache
> > > > key
> > > > > in
> > > > > >> > the meantime. Consolidating rewriting logic makes it easier to
> > > share
> > > > > the
> > > > > >> > cache itself.
> > > > > >> >
> > > > > >> > Still, I might be missing situations in which additional
> context
> > > > > inherent
> > > > > >> > to
> > > > > >> > the calling context is needed to make a caching decision.
> > > > > >> >
> > > > > >> > --John
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Where to cache rewritten content

Posted by Louis Ryan <lr...@google.com>.

I would be a very wary of relying on VM implementation specifics like
XXXRweriter.class.hashCode(). You certainly don't want the same class
generating different coes on different instances in a cluster where the
cache is shared. Also dont use serialVersionUID as that is only intended for
serialization compatability. Reading the class as an InputStream from the
ClassLoader and then computing a hash might work but seems more trouble than
its worth.

In general I think a a manually maintained version no. might actually be
your best bet.

On Tue, Sep 9, 2008 at 5:29 PM, John Hjelmstad <fa...@google.com> wrote:

> Back to this sub-thread...
> Adding a getVersion() parameter seems error prone and a little burdensome.
> In theory the version needs to change when any modification in potential
> output for a given input is made, ie. not just optimization. That's not
> always the easiest thing to analyze and get right. Meanwhile, invalidating
> a
> bunch of cache keys (by registering a new rewriter impl) isn't especially
> costly: if the cache is useful at all, it will be hit often, ie. refreshed
> quickly.
>
> --John
>
> On Tue, Sep 9, 2008 at 1:50 PM, Louis Ryan <lr...@google.com> wrote:
>
> > The versioning should be more explicit than that I think. Maybe add a
> > getVersion function to the rewriter interface so they can manage their
> own
> > changes
> >
> > On Tue, Sep 9, 2008 at 12:39 PM, John Hjelmstad <fa...@google.com>
> wrote:
> >
> > > Interesting suggestion. I can include the rewriter class names and
> their
> > > class hash codes or some other such versioning construct.
> > >
> > > On Tue, Sep 9, 2008 at 12:32 PM, Ben Laurie <be...@google.com> wrote:
> > >
> > > > On Tue, Sep 9, 2008 at 8:17 PM, John Hjelmstad <fa...@google.com>
> > wrote:
> > > > > Excellent, agreed. CLs forthcoming.
> > > >
> > > > Can you include the version numbers of the rewriters in the cache
> key?
> > > >
> > > > >
> > > > > On Tue, Sep 9, 2008 at 12:14 PM, Kevin Brown <et...@google.com>
> > wrote:
> > > > >
> > > > >> #2 is the only really viable option. If we have to put caching
> logic
> > > in
> > > > 10
> > > > >> different places we'll screw it up 9 different times :).
> > > > >>
> > > > >> On Tue, Sep 9, 2008 at 12:11 PM, John Hjelmstad <fargo@google.com
> >
> > > > wrote:
> > > > >>
> > > > >> > As discussed on a few threads and tracked in JIRA issue (
> > > > >> > http://issues.apache.org/jira/browse/SHINDIG-579), we need to
> > move
> > > > >> > rewriting
> > > > >> > logic out of AbstractHttpCache. Yet we should maintain rewritten
> > > > content
> > > > >> > caching capability. The question is where to put it.
> > > > >> > I see two options, at a high level:
> > > > >> > 1. In code that calls
> > > > >> ContentRewriterRegistry.rewrite(HttpResponse|Gadget).
> > > > >> > Eg. MakeRequestHandler, ProxyHandler, ViewContentFetcher,
> > > > GadgetServer,
> > > > >> and
> > > > >> > the near-future Renderer and Preloader. This allows
> finer-grained
> > > > control
> > > > >> > over caching behavior in context, at the cost of distributing
> > > caching
> > > > >> logic
> > > > >> > in various places.
> > > > >> > 2. In ContentRewriterRegistry.rewrite(HttpResponse|Gadget)
> itself,
> > > if
> > > > so
> > > > >> > chosen. Caching logic can be consolidated in
> > > > >> > CachingContentRewriterRegistry,
> > > > >> > for instance (which will no longer subclass
> > > > CachingWebRetrievalFactory),
> > > > >> > and
> > > > >> > be considered an optimization to rewriting.
> > > > >> >
> > > > >> > I'm inclined to go #2. Rewriters themselves can be augmented
> with
> > > > caching
> > > > >> > hints if necessary, and be assumed deterministic for a given
> cache
> > > key
> > > > in
> > > > >> > the meantime. Consolidating rewriting logic makes it easier to
> > share
> > > > the
> > > > >> > cache itself.
> > > > >> >
> > > > >> > Still, I might be missing situations in which additional context
> > > > inherent
> > > > >> > to
> > > > >> > the calling context is needed to make a caching decision.
> > > > >> >
> > > > >> > --John
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: Where to cache rewritten content

Posted by John Hjelmstad <fa...@google.com>.

Back to this sub-thread...
Adding a getVersion() parameter seems error prone and a little burdensome.
In theory the version needs to change when any modification in potential
output for a given input is made, ie. not just optimization. That's not
always the easiest thing to analyze and get right. Meanwhile, invalidating a
bunch of cache keys (by registering a new rewriter impl) isn't especially
costly: if the cache is useful at all, it will be hit often, ie. refreshed
quickly.

--John

On Tue, Sep 9, 2008 at 1:50 PM, Louis Ryan <lr...@google.com> wrote:

> The versioning should be more explicit than that I think. Maybe add a
> getVersion function to the rewriter interface so they can manage their own
> changes
>
> On Tue, Sep 9, 2008 at 12:39 PM, John Hjelmstad <fa...@google.com> wrote:
>
> > Interesting suggestion. I can include the rewriter class names and their
> > class hash codes or some other such versioning construct.
> >
> > On Tue, Sep 9, 2008 at 12:32 PM, Ben Laurie <be...@google.com> wrote:
> >
> > > On Tue, Sep 9, 2008 at 8:17 PM, John Hjelmstad <fa...@google.com>
> wrote:
> > > > Excellent, agreed. CLs forthcoming.
> > >
> > > Can you include the version numbers of the rewriters in the cache key?
> > >
> > > >
> > > > On Tue, Sep 9, 2008 at 12:14 PM, Kevin Brown <et...@google.com>
> wrote:
> > > >
> > > >> #2 is the only really viable option. If we have to put caching logic
> > in
> > > 10
> > > >> different places we'll screw it up 9 different times :).
> > > >>
> > > >> On Tue, Sep 9, 2008 at 12:11 PM, John Hjelmstad <fa...@google.com>
> > > wrote:
> > > >>
> > > >> > As discussed on a few threads and tracked in JIRA issue (
> > > >> > http://issues.apache.org/jira/browse/SHINDIG-579), we need to
> move
> > > >> > rewriting
> > > >> > logic out of AbstractHttpCache. Yet we should maintain rewritten
> > > content
> > > >> > caching capability. The question is where to put it.
> > > >> > I see two options, at a high level:
> > > >> > 1. In code that calls
> > > >> ContentRewriterRegistry.rewrite(HttpResponse|Gadget).
> > > >> > Eg. MakeRequestHandler, ProxyHandler, ViewContentFetcher,
> > > GadgetServer,
> > > >> and
> > > >> > the near-future Renderer and Preloader. This allows finer-grained
> > > control
> > > >> > over caching behavior in context, at the cost of distributing
> > caching
> > > >> logic
> > > >> > in various places.
> > > >> > 2. In ContentRewriterRegistry.rewrite(HttpResponse|Gadget) itself,
> > if
> > > so
> > > >> > chosen. Caching logic can be consolidated in
> > > >> > CachingContentRewriterRegistry,
> > > >> > for instance (which will no longer subclass
> > > CachingWebRetrievalFactory),
> > > >> > and
> > > >> > be considered an optimization to rewriting.
> > > >> >
> > > >> > I'm inclined to go #2. Rewriters themselves can be augmented with
> > > caching
> > > >> > hints if necessary, and be assumed deterministic for a given cache
> > key
> > > in
> > > >> > the meantime. Consolidating rewriting logic makes it easier to
> share
> > > the
> > > >> > cache itself.
> > > >> >
> > > >> > Still, I might be missing situations in which additional context
> > > inherent
> > > >> > to
> > > >> > the calling context is needed to make a caching decision.
> > > >> >
> > > >> > --John
> > > >> >
> > > >>
> > > >
> > >
> >
>

Re: Where to cache rewritten content

Posted by Louis Ryan <lr...@google.com>.

The versioning should be more explicit than that I think. Maybe add a
getVersion function to the rewriter interface so they can manage their own
changes

On Tue, Sep 9, 2008 at 12:39 PM, John Hjelmstad <fa...@google.com> wrote:

> Interesting suggestion. I can include the rewriter class names and their
> class hash codes or some other such versioning construct.
>
> On Tue, Sep 9, 2008 at 12:32 PM, Ben Laurie <be...@google.com> wrote:
>
> > On Tue, Sep 9, 2008 at 8:17 PM, John Hjelmstad <fa...@google.com> wrote:
> > > Excellent, agreed. CLs forthcoming.
> >
> > Can you include the version numbers of the rewriters in the cache key?
> >
> > >
> > > On Tue, Sep 9, 2008 at 12:14 PM, Kevin Brown <et...@google.com> wrote:
> > >
> > >> #2 is the only really viable option. If we have to put caching logic
> in
> > 10
> > >> different places we'll screw it up 9 different times :).
> > >>
> > >> On Tue, Sep 9, 2008 at 12:11 PM, John Hjelmstad <fa...@google.com>
> > wrote:
> > >>
> > >> > As discussed on a few threads and tracked in JIRA issue (
> > >> > http://issues.apache.org/jira/browse/SHINDIG-579), we need to move
> > >> > rewriting
> > >> > logic out of AbstractHttpCache. Yet we should maintain rewritten
> > content
> > >> > caching capability. The question is where to put it.
> > >> > I see two options, at a high level:
> > >> > 1. In code that calls
> > >> ContentRewriterRegistry.rewrite(HttpResponse|Gadget).
> > >> > Eg. MakeRequestHandler, ProxyHandler, ViewContentFetcher,
> > GadgetServer,
> > >> and
> > >> > the near-future Renderer and Preloader. This allows finer-grained
> > control
> > >> > over caching behavior in context, at the cost of distributing
> caching
> > >> logic
> > >> > in various places.
> > >> > 2. In ContentRewriterRegistry.rewrite(HttpResponse|Gadget) itself,
> if
> > so
> > >> > chosen. Caching logic can be consolidated in
> > >> > CachingContentRewriterRegistry,
> > >> > for instance (which will no longer subclass
> > CachingWebRetrievalFactory),
> > >> > and
> > >> > be considered an optimization to rewriting.
> > >> >
> > >> > I'm inclined to go #2. Rewriters themselves can be augmented with
> > caching
> > >> > hints if necessary, and be assumed deterministic for a given cache
> key
> > in
> > >> > the meantime. Consolidating rewriting logic makes it easier to share
> > the
> > >> > cache itself.
> > >> >
> > >> > Still, I might be missing situations in which additional context
> > inherent
> > >> > to
> > >> > the calling context is needed to make a caching decision.
> > >> >
> > >> > --John
> > >> >
> > >>
> > >
> >
>

Re: Where to cache rewritten content

Posted by John Hjelmstad <fa...@google.com>.

Interesting suggestion. I can include the rewriter class names and their
class hash codes or some other such versioning construct.

On Tue, Sep 9, 2008 at 12:32 PM, Ben Laurie <be...@google.com> wrote:

> On Tue, Sep 9, 2008 at 8:17 PM, John Hjelmstad <fa...@google.com> wrote:
> > Excellent, agreed. CLs forthcoming.
>
> Can you include the version numbers of the rewriters in the cache key?
>
> >
> > On Tue, Sep 9, 2008 at 12:14 PM, Kevin Brown <et...@google.com> wrote:
> >
> >> #2 is the only really viable option. If we have to put caching logic in
> 10
> >> different places we'll screw it up 9 different times :).
> >>
> >> On Tue, Sep 9, 2008 at 12:11 PM, John Hjelmstad <fa...@google.com>
> wrote:
> >>
> >> > As discussed on a few threads and tracked in JIRA issue (
> >> > http://issues.apache.org/jira/browse/SHINDIG-579), we need to move
> >> > rewriting
> >> > logic out of AbstractHttpCache. Yet we should maintain rewritten
> content
> >> > caching capability. The question is where to put it.
> >> > I see two options, at a high level:
> >> > 1. In code that calls
> >> ContentRewriterRegistry.rewrite(HttpResponse|Gadget).
> >> > Eg. MakeRequestHandler, ProxyHandler, ViewContentFetcher,
> GadgetServer,
> >> and
> >> > the near-future Renderer and Preloader. This allows finer-grained
> control
> >> > over caching behavior in context, at the cost of distributing caching
> >> logic
> >> > in various places.
> >> > 2. In ContentRewriterRegistry.rewrite(HttpResponse|Gadget) itself, if
> so
> >> > chosen. Caching logic can be consolidated in
> >> > CachingContentRewriterRegistry,
> >> > for instance (which will no longer subclass
> CachingWebRetrievalFactory),
> >> > and
> >> > be considered an optimization to rewriting.
> >> >
> >> > I'm inclined to go #2. Rewriters themselves can be augmented with
> caching
> >> > hints if necessary, and be assumed deterministic for a given cache key
> in
> >> > the meantime. Consolidating rewriting logic makes it easier to share
> the
> >> > cache itself.
> >> >
> >> > Still, I might be missing situations in which additional context
> inherent
> >> > to
> >> > the calling context is needed to make a caching decision.
> >> >
> >> > --John
> >> >
> >>
> >
>

Re: Where to cache rewritten content

Posted by Ben Laurie <be...@google.com>.

On Tue, Sep 9, 2008 at 8:17 PM, John Hjelmstad <fa...@google.com> wrote:
> Excellent, agreed. CLs forthcoming.

Can you include the version numbers of the rewriters in the cache key?

>
> On Tue, Sep 9, 2008 at 12:14 PM, Kevin Brown <et...@google.com> wrote:
>
>> #2 is the only really viable option. If we have to put caching logic in 10
>> different places we'll screw it up 9 different times :).
>>
>> On Tue, Sep 9, 2008 at 12:11 PM, John Hjelmstad <fa...@google.com> wrote:
>>
>> > As discussed on a few threads and tracked in JIRA issue (
>> > http://issues.apache.org/jira/browse/SHINDIG-579), we need to move
>> > rewriting
>> > logic out of AbstractHttpCache. Yet we should maintain rewritten content
>> > caching capability. The question is where to put it.
>> > I see two options, at a high level:
>> > 1. In code that calls
>> ContentRewriterRegistry.rewrite(HttpResponse|Gadget).
>> > Eg. MakeRequestHandler, ProxyHandler, ViewContentFetcher, GadgetServer,
>> and
>> > the near-future Renderer and Preloader. This allows finer-grained control
>> > over caching behavior in context, at the cost of distributing caching
>> logic
>> > in various places.
>> > 2. In ContentRewriterRegistry.rewrite(HttpResponse|Gadget) itself, if so
>> > chosen. Caching logic can be consolidated in
>> > CachingContentRewriterRegistry,
>> > for instance (which will no longer subclass CachingWebRetrievalFactory),
>> > and
>> > be considered an optimization to rewriting.
>> >
>> > I'm inclined to go #2. Rewriters themselves can be augmented with caching
>> > hints if necessary, and be assumed deterministic for a given cache key in
>> > the meantime. Consolidating rewriting logic makes it easier to share the
>> > cache itself.
>> >
>> > Still, I might be missing situations in which additional context inherent
>> > to
>> > the calling context is needed to make a caching decision.
>> >
>> > --John
>> >
>>
>

Re: Where to cache rewritten content

Posted by John Hjelmstad <fa...@google.com>.

Excellent, agreed. CLs forthcoming.

On Tue, Sep 9, 2008 at 12:14 PM, Kevin Brown <et...@google.com> wrote:

> #2 is the only really viable option. If we have to put caching logic in 10
> different places we'll screw it up 9 different times :).
>
> On Tue, Sep 9, 2008 at 12:11 PM, John Hjelmstad <fa...@google.com> wrote:
>
> > As discussed on a few threads and tracked in JIRA issue (
> > http://issues.apache.org/jira/browse/SHINDIG-579), we need to move
> > rewriting
> > logic out of AbstractHttpCache. Yet we should maintain rewritten content
> > caching capability. The question is where to put it.
> > I see two options, at a high level:
> > 1. In code that calls
> ContentRewriterRegistry.rewrite(HttpResponse|Gadget).
> > Eg. MakeRequestHandler, ProxyHandler, ViewContentFetcher, GadgetServer,
> and
> > the near-future Renderer and Preloader. This allows finer-grained control
> > over caching behavior in context, at the cost of distributing caching
> logic
> > in various places.
> > 2. In ContentRewriterRegistry.rewrite(HttpResponse|Gadget) itself, if so
> > chosen. Caching logic can be consolidated in
> > CachingContentRewriterRegistry,
> > for instance (which will no longer subclass CachingWebRetrievalFactory),
> > and
> > be considered an optimization to rewriting.
> >
> > I'm inclined to go #2. Rewriters themselves can be augmented with caching
> > hints if necessary, and be assumed deterministic for a given cache key in
> > the meantime. Consolidating rewriting logic makes it easier to share the
> > cache itself.
> >
> > Still, I might be missing situations in which additional context inherent
> > to
> > the calling context is needed to make a caching decision.
> >
> > --John
> >
>

Re: Where to cache rewritten content

Posted by Kevin Brown <et...@google.com>.

#2 is the only really viable option. If we have to put caching logic in 10
different places we'll screw it up 9 different times :).

On Tue, Sep 9, 2008 at 12:11 PM, John Hjelmstad <fa...@google.com> wrote:

> As discussed on a few threads and tracked in JIRA issue (
> http://issues.apache.org/jira/browse/SHINDIG-579), we need to move
> rewriting
> logic out of AbstractHttpCache. Yet we should maintain rewritten content
> caching capability. The question is where to put it.
> I see two options, at a high level:
> 1. In code that calls ContentRewriterRegistry.rewrite(HttpResponse|Gadget).
> Eg. MakeRequestHandler, ProxyHandler, ViewContentFetcher, GadgetServer, and
> the near-future Renderer and Preloader. This allows finer-grained control
> over caching behavior in context, at the cost of distributing caching logic
> in various places.
> 2. In ContentRewriterRegistry.rewrite(HttpResponse|Gadget) itself, if so
> chosen. Caching logic can be consolidated in
> CachingContentRewriterRegistry,
> for instance (which will no longer subclass CachingWebRetrievalFactory),
> and
> be considered an optimization to rewriting.
>
> I'm inclined to go #2. Rewriters themselves can be augmented with caching
> hints if necessary, and be assumed deterministic for a given cache key in
> the meantime. Consolidating rewriting logic makes it easier to share the
> cache itself.
>
> Still, I might be missing situations in which additional context inherent
> to
> the calling context is needed to make a caching decision.
>
> --John
>