Posted to dev@shindig.apache.org by John Hjelmstad <fa...@google.com> on 2008/10/03 02:46:58 UTC

Serializing parsed content and caching GadgetHtmlParsers

All,
We've had a number of discussions on this list regarding our ability to get
rid of rewritten-content caching altogether. The primary cost savings
associated with that caching, by percentage, come from avoiding the
re-parsing of gadget contents in order to apply rewriter passes on them
(the passes themselves are typically very cheap, in the sub-1ms range for
reasonably large input).

With this in mind, I've written and submitted r701267, which provides custom
serialization and deserialization routines for parsed content, along with a
helper base class for any GadgetHtmlParser choosing to support caching.
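
For a rough sketch of the caching shape (the class name, the in-memory map,
and the SHA-1 keying below are illustrative stand-ins, not the actual
r701267 API):

import java.security.MessageDigest;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch of a caching parser base class: hash the source, try the cache,
// fall back to a real parse and store the serialized tree for next time.
public abstract class CachingHtmlParser<T> {
  private final ConcurrentMap<String, byte[]> cache =
      new ConcurrentHashMap<String, byte[]>();

  protected abstract T parse(String source);       // the expensive parse
  protected abstract byte[] serialize(T tree);     // custom byte-packed form
  protected abstract T deserialize(byte[] bytes);

  public T parseWithCache(String source) throws Exception {
    String key = contentKey(source);
    byte[] cached = cache.get(key);
    if (cached != null) {
      return deserialize(cached);                  // fast path: no parse
    }
    T tree = parse(source);
    cache.put(key, serialize(tree));
    return tree;
  }

  private static String contentKey(String source) throws Exception {
    byte[] digest =
        MessageDigest.getInstance("SHA-1").digest(source.getBytes("UTF-8"));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) {
      hex.append(String.format("%02x", b));
    }
    return hex.toString();
  }
}

In practice the cache would be an LRU-bounded or memcache-backed store
rather than an unbounded in-memory map, but the fast path is the point.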

In coming to this solution, I implemented three mechanisms: Java
serialization, overridden Java serialization routines
(writeObject/readObject), and finally a simplified, ad hoc byte-packed
routine. Standard and overridden Java serialization results were virtually
identical.
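
Concretely, the byte-packed approach amounts to a depth-first,
length-prefixed dump of the tree over DataOutputStream, with no class
metadata at all. The node shape and field names below are invented for the
sketch and do not match the committed code:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.List;

// Simplified stand-in for a parsed HTML node.
class SketchNode {
  String tagName;            // null for text nodes
  String text;               // text content when tagName is null
  List<String[]> attributes; // {name, value} pairs
  List<SketchNode> children;
}

class SketchNodeSerializer {
  // Depth-first, length-prefixed encoding: a type byte, then strings and
  // counts, then children. Deserialization mirrors this with readByte,
  // readUTF and readShort.
  static void write(SketchNode n, DataOutputStream out) throws IOException {
    if (n.tagName == null) {
      out.writeByte(0);                  // text node
      out.writeUTF(n.text);
      return;
    }
    out.writeByte(1);                    // element node
    out.writeUTF(n.tagName);
    out.writeShort(n.attributes.size());
    for (String[] attr : n.attributes) {
      out.writeUTF(attr[0]);
      out.writeUTF(attr[1]);
    }
    out.writeShort(n.children.size());
    for (SketchNode child : n.children) {
      write(child, out);
    }
  }

  static byte[] serialize(SketchNode root) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    write(root, out);
    out.flush();
    return bytes.toByteArray();
  }
}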

I ran each serialization/deserialization routine across a variety of gadget
contents. In sum:
* Custom serialization was 10-30% more space-efficient. The space savings
largely come from omitting Java class information and other metadata, so
they are more pronounced for highly structured content.
* Custom serialization was 30-40% faster than Java's, and custom
deserialization was 40-50% faster.

As one example, I took the BuddyPoke gadget's canvas view contents and ran
them through these routines, as well as through CajaHtmlParser. Results:
* CajaHtmlParser average parse time = 25ms.
* Java serialization average = 2.25ms; deserialization = 3.35ms; size =
35kB.
* Custom serialization average = 1.25ms; deserialization = 2.3ms; size =
30kB.
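
For anyone wanting to reproduce this kind of number, a crude harness along
these lines is enough; treat the results as relative, and swap the stand-in
workload for the parser or serializer under test:

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;

public final class ParseBench {
  // Average wall-clock time of an operation in milliseconds, after warmup
  // so the JIT has compiled the hot path.
  static double averageMillis(int runs, Runnable op) {
    for (int i = 0; i < 1000; i++) {
      op.run();
    }
    long start = System.nanoTime();
    for (int i = 0; i < runs; i++) {
      op.run();
    }
    return (System.nanoTime() - start) / (runs * 1e6);
  }

  public static void main(String[] args) {
    // Stand-in workload: stock Java serialization of a small object graph.
    // Replace with the parse or (de)serialization call under test.
    final String[] sample = { "<div>", "hello, world", "</div>" };
    double ms = averageMillis(1000, new Runnable() {
      public void run() {
        try {
          ObjectOutputStream out =
              new ObjectOutputStream(new ByteArrayOutputStream());
          out.writeObject(sample);
          out.close();
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      }
    });
    System.out.println("average ms per run: " + ms);
  }
}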

So I removed the Java serialization impl and stuck with custom. This has
the minor side benefit that different tools can easily write and read the
same format - consider a cache warmer job, for instance.
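
A cache warmer, for example, reduces to a small offline loop. As a sketch
(the HtmlParser interface and the key scheme here are stand-ins, not
Shindig APIs):

import java.util.List;
import java.util.Map;

// Sketch of a cache-warming job: parse popular gadget contents offline and
// store the byte-packed trees so render-time requests only deserialize.
class CacheWarmer {
  interface HtmlParser {
    byte[] parseAndSerialize(String content);
  }

  static void warm(List<String> gadgetContents, HtmlParser parser,
      Map<String, byte[]> cache) {
    for (String content : gadgetContents) {
      // A real job would key on a content digest rather than hashCode().
      cache.put(Integer.toHexString(content.hashCode()),
          parser.parseAndSerialize(content));
    }
  }
}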

Given these results, combined with fast, relatively cheap caching by things
like memcache, I'm encouraged that we're getting close to where we can
remove rewritten content caching altogether. Per several previous comments,
many rewriting passes simply can't be cached anyway. The remainder are
extremely cheap given a low-cost parse tree.

The biggest risk with caching content in this way is the universe of
possible input. Now seems like the time to reduce that, by finally going
ahead with our long-proposed plan to allow hangman variable substitution
only in string contexts (HTML attribute values, CDATA sections, and text
nodes). Assuming we reach agreement on this, we can hook up parsed-content
caching and implement all existing rewriting operations in terms of a parse
tree at relatively low cost.
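
To make "string contexts" concrete: substitution over a cached tree would
touch only attribute values, text nodes, and CDATA sections, never tag or
attribute names. A sketch using the standard W3C DOM API for illustration
(Shindig's own parse-tree interfaces differ, and Substituter is a
placeholder for hangman expansion):

import org.w3c.dom.Attr;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

// Sketch: apply hangman (__UP_/__MSG_) substitution only where it cannot
// change document structure: attribute values, text, and CDATA.
final class StringContextSubstituter {
  interface Substituter {            // placeholder for hangman expansion
    String substitute(String s);
  }

  static void apply(Node node, Substituter sub) {
    switch (node.getNodeType()) {
      case Node.TEXT_NODE:
      case Node.CDATA_SECTION_NODE:
        node.setNodeValue(sub.substitute(node.getNodeValue()));
        break;
      case Node.ELEMENT_NODE:
        NamedNodeMap attrs = node.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
          Attr attr = (Attr) attrs.item(i);
          attr.setValue(sub.substitute(attr.getValue()));
        }
        break;
      default:
        break;
    }
    for (Node child = node.getFirstChild(); child != null;
        child = child.getNextSibling()) {
      apply(child, sub);
    }
  }
}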

In the meantime, I still plan to enable this for CajaHtmlParser, since the
parse tree is only used in opt-in fashion today by "new" gadgets that don't
use __UP substitution in structural elements. I'm also inclined to get rid
of rewritten-content caching, since it's largely useless today. I'd be
interested to hear others' opinions on this.

--John

Re: Serializing parsed content and caching GadgetHtmlParsers

Posted by Louis Ryan <lr...@google.com>.
Probably wouldn't hurt to look into some HTML parsers other than Caja as a
performance test. NekoHTML seems to be a good candidate: it's Apache 2
licensed and seems to be decent performance-wise. I took a quick scan of
the code and it looks pretty reasonable. See
http://nekohtml.svn.sourceforge.net/viewvc/nekohtml/trunk/doc/index.html?revision=194

Some comparative benchmarks and samples (usual disclaimer applies)

http://www.portletbridge.org/saxbenchmark/results.html
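
If anyone wants to try it, basic usage is small enough to drop into a quick
timing test. A sketch, assuming the nekohtml and xerces jars are on the
classpath:

import java.io.StringReader;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Quick NekoHTML trial: parse gadget content into a W3C DOM for comparison
// against CajaHtmlParser timings.
public class NekoParseTest {
  public static void main(String[] args) throws Exception {
    String html = "<div id=x><p>unclosed tags are fine<br></div>";
    DOMParser parser = new DOMParser();
    parser.parse(new InputSource(new StringReader(html)));
    Document doc = parser.getDocument();
    System.out.println(doc.getDocumentElement().getTagName());  // root element
  }
}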

On Fri, Oct 3, 2008 at 11:48 AM, Kevin Brown <et...@google.com> wrote:

> On Fri, Oct 3, 2008 at 10:07 AM, John Hjelmstad <fa...@google.com> wrote:
>
> > On Fri, Oct 3, 2008 at 9:50 AM, Paul Lindner <pl...@hi5.com> wrote:
> >
> > > It seems to me that parsing Gadget XML should be the least of our
> > worries,
> > > especially if you can insure that the content that's hitting the
> browser
> > > only varies by country and language and view.
> >
> >
> > Correct me if I'm wrong, but that would seem to be thrown out the window
> > with proxied rendering.
> >
> >
> > >
> > >
> > > By moving the parentUrl to the hash I've been able to accomplish this
> for
> > > hi5 -- iframes are cached at the browser and CDN level.
> > >
> > > That said this leave out UserPref support, but I'm fine with that as a
> > > tradeoff.
> > >
> > > Perhaps we should focus on delivering the application as one cacheable
> > > chunk and the per-user/preload data in a second chunk?
> >
> >
> > That's actually what type=html applications do, even with __UP
> substitution
> > (since the UPs are passed on the querystring). Likewise any
> > OpenSocial-templated applications. But proxied throws that out as well,
> > unless we add some caching headers that proxied content rendering can
> pass
> > back telling us the content is cacheable for all users (eg. it contains
> > only
> > substitution constructs such as templates).
>
>
> You can't ever do that because you need a security token to do the proxied
> rendering.
>
>
> > Thinking about this some more:
> > 1. It seems unlikely that parsing cost, even if 25ms, will be a
> substantial
> > component to proxied content rendering latency. Getting the actual,
> > uncacheable data from the app server will be, unless it's specially
> > indicated as cacheable in the first place - at which point we may as well
> > cache the parsed tree as a small optimization. IMO that just calls for
> > passing isCacheable to a caching parser.
>
>
> The data itself is frequently cacheable, especially if it's owner keyed --
> but you MUST validate the security token for this data, because it contains
> user information. If you visit the same profile 100x in a row, the data
> from
> the remote site is still cached, it's just that the iframe isn't.
>
>
> > 2. FWIW, the parsing time of 25ms for BuddyPoke (as with the other
> numbers)
> > comes from my developer workstation running the test in Eclipse. The
> > numbers
> > are intended to be relative. Kevin -- under which environment did you see
> > 10ms cajoling?
>
>
> Running shindig on my own workstation through the YourKit java profiler I
> got 22ms for link rewriting (after making modifications so that it worked
> correctly) and 9ms for cajoling.
>
>
> >
> >
> > Paul, what's your plan for dealing with proxied content latency?
> >
> > -John
> >
> >
> > >
> > >
> > >
> > > On Oct 3, 2008, at 9:35 AM, John Hjelmstad wrote:
> > >
> > >  Hi Ian:
> > >> You're right, it's the gadget XML parse prior to manipulation. It's
> > doing
> > >> DOM-based parsing, and I suspect you're right about the load of small
> > >> objects involved. At present I see that as a requirement, though, to
> > deal
> > >> with semi-well-formed input. We've talked about requiring XHTML or
> > >> something
> > >> close to it as a prerequisite for rewriting - which would make parsing
> > >> vastly easier and rather trivial to implement - but that's a spec
> issue
> > if
> > >> it's to be a general platform requirement.
> > >>
> > >> --John
> > >>
> > >> On Fri, Oct 3, 2008 at 12:53 AM, Ian Boston <ie...@tfd.co.uk> wrote:
> > >>
> > >>  I don't know the precise details of this conversation or exactly
> which
> > >>> parsing, CajaHtmlParser or XmlUtil.parse, you are talking about, but
> if
> > >>> its
> > >>> the Gadget XML parse prior to manipulation, and this is still using
> DOM
> > >>> based parsing, then its probably going to be slower than SAX under
> > load,
> > >>> and
> > >>> vastly slower than Stax. The reason I say under load, is that DOM
> > parsers
> > >>> tend emit lots of small objects which, once they get out of eden,
> > >>> overload
> > >>> the GC which will dominate as resources become scarce. Having said
> > that,
> > >>> gadget parse trees probably don't exist long enough to get out of
> eden.
> > >>>
> > >>> Ignore me if you are talking about some other parsing going on within
> > >>> gadgets.
> > >>> Ian
> > >>>
> > >>>
> > >>> On 3 Oct 2008, at 02:03, Kevin Brown wrote:
> > >>>
> > >>> The real thing we should be investigating is why it takes 25ms to use
> > the
> > >>>
> > >>>> parser on buddypoke when it only takes 10ms to cajole it.
> > >>>>
> > >>>>
> > >>>
> > >>>
> > > Paul Lindner
> > > plindner@hi5.com
> > >
> > >
> > >
> > >
> >
>

Re: Serializing parsed content and caching GadgetHtmlParsers

Posted by Kevin Brown <et...@google.com>.
On Fri, Oct 3, 2008 at 10:07 AM, John Hjelmstad <fa...@google.com> wrote:

> On Fri, Oct 3, 2008 at 9:50 AM, Paul Lindner <pl...@hi5.com> wrote:
>
> > It seems to me that parsing Gadget XML should be the least of our
> worries,
> > especially if you can insure that the content that's hitting the browser
> > only varies by country and language and view.
>
>
> Correct me if I'm wrong, but that would seem to be thrown out the window
> with proxied rendering.
>
>
> >
> >
> > By moving the parentUrl to the hash I've been able to accomplish this for
> > hi5 -- iframes are cached at the browser and CDN level.
> >
> > That said this leave out UserPref support, but I'm fine with that as a
> > tradeoff.
> >
> > Perhaps we should focus on delivering the application as one cacheable
> > chunk and the per-user/preload data in a second chunk?
>
>
> That's actually what type=html applications do, even with __UP substitution
> (since the UPs are passed on the querystring). Likewise any
> OpenSocial-templated applications. But proxied throws that out as well,
> unless we add some caching headers that proxied content rendering can pass
> back telling us the content is cacheable for all users (eg. it contains
> only
> substitution constructs such as templates).


You can't ever do that because you need a security token to do the proxied
rendering.


> Thinking about this some more:
> 1. It seems unlikely that parsing cost, even if 25ms, will be a substantial
> component to proxied content rendering latency. Getting the actual,
> uncacheable data from the app server will be, unless it's specially
> indicated as cacheable in the first place - at which point we may as well
> cache the parsed tree as a small optimization. IMO that just calls for
> passing isCacheable to a caching parser.


The data itself is frequently cacheable, especially if it's owner keyed --
but you MUST validate the security token for this data, because it contains
user information. If you visit the same profile 100x in a row, the data from
the remote site is still cached, it's just that the iframe isn't.


> 2. FWIW, the parsing time of 25ms for BuddyPoke (as with the other numbers)
> comes from my developer workstation running the test in Eclipse. The
> numbers
> are intended to be relative. Kevin -- under which environment did you see
> 10ms cajoling?


Running shindig on my own workstation through the YourKit java profiler I
got 22ms for link rewriting (after making modifications so that it worked
correctly) and 9ms for cajoling.


>
>
> Paul, what's your plan for dealing with proxied content latency?
>
> -John
>
>
> >
> >
> >
> > On Oct 3, 2008, at 9:35 AM, John Hjelmstad wrote:
> >
> >  Hi Ian:
> >> You're right, it's the gadget XML parse prior to manipulation. It's
> doing
> >> DOM-based parsing, and I suspect you're right about the load of small
> >> objects involved. At present I see that as a requirement, though, to
> deal
> >> with semi-well-formed input. We've talked about requiring XHTML or
> >> something
> >> close to it as a prerequisite for rewriting - which would make parsing
> >> vastly easier and rather trivial to implement - but that's a spec issue
> if
> >> it's to be a general platform requirement.
> >>
> >> --John
> >>
> >> On Fri, Oct 3, 2008 at 12:53 AM, Ian Boston <ie...@tfd.co.uk> wrote:
> >>
> >>  I don't know the precise details of this conversation or exactly which
> >>> parsing, CajaHtmlParser or XmlUtil.parse, you are talking about, but if
> >>> its
> >>> the Gadget XML parse prior to manipulation, and this is still using DOM
> >>> based parsing, then its probably going to be slower than SAX under
> load,
> >>> and
> >>> vastly slower than Stax. The reason I say under load, is that DOM
> parsers
> >>> tend emit lots of small objects which, once they get out of eden,
> >>> overload
> >>> the GC which will dominate as resources become scarce. Having said
> that,
> >>> gadget parse trees probably don't exist long enough to get out of eden.
> >>>
> >>> Ignore me if you are talking about some other parsing going on within
> >>> gadgets.
> >>> Ian
> >>>
> >>>
> >>> On 3 Oct 2008, at 02:03, Kevin Brown wrote:
> >>>
> >>> The real thing we should be investigating is why it takes 25ms to use
> the
> >>>
> >>>> parser on buddypoke when it only takes 10ms to cajole it.
> >>>>
> >>>>
> >>>
> >>>
> > Paul Lindner
> > plindner@hi5.com
> >
> >
> >
> >
>

Re: Serializing parsed content and caching GadgetHtmlParsers

Posted by John Hjelmstad <fa...@google.com>.
On Fri, Oct 3, 2008 at 9:50 AM, Paul Lindner <pl...@hi5.com> wrote:

> It seems to me that parsing Gadget XML should be the least of our worries,
> especially if you can insure that the content that's hitting the browser
> only varies by country and language and view.


Correct me if I'm wrong, but that would seem to be thrown out the window
with proxied rendering.


>
>
> By moving the parentUrl to the hash I've been able to accomplish this for
> hi5 -- iframes are cached at the browser and CDN level.
>
> That said this leave out UserPref support, but I'm fine with that as a
> tradeoff.
>
> Perhaps we should focus on delivering the application as one cacheable
> chunk and the per-user/preload data in a second chunk?


That's actually what type=html applications do, even with __UP substitution
(since the UPs are passed on the querystring). The same goes for any
OpenSocial-templated applications. But proxied rendering throws that out as
well, unless we add caching headers that proxied content rendering can pass
back telling us the content is cacheable for all users (e.g. it contains
only substitution constructs such as templates).

Thinking about this some more:
1. It seems unlikely that parsing cost, even at 25ms, will be a substantial
component of proxied content rendering latency. Getting the actual,
uncacheable data from the app server will be, unless it's specially
indicated as cacheable in the first place - at which point we may as well
cache the parsed tree as a small optimization. IMO that just calls for
passing isCacheable to a caching parser.

2. FWIW, the parsing time of 25ms for BuddyPoke (as with the other numbers)
comes from my developer workstation running the test in Eclipse. The numbers
are intended to be relative. Kevin -- under which environment did you see
10ms cajoling?

Paul, what's your plan for dealing with proxied content latency?

-John


>
>
>
> On Oct 3, 2008, at 9:35 AM, John Hjelmstad wrote:
>
>  Hi Ian:
>> You're right, it's the gadget XML parse prior to manipulation. It's doing
>> DOM-based parsing, and I suspect you're right about the load of small
>> objects involved. At present I see that as a requirement, though, to deal
>> with semi-well-formed input. We've talked about requiring XHTML or
>> something
>> close to it as a prerequisite for rewriting - which would make parsing
>> vastly easier and rather trivial to implement - but that's a spec issue if
>> it's to be a general platform requirement.
>>
>> --John
>>
>> On Fri, Oct 3, 2008 at 12:53 AM, Ian Boston <ie...@tfd.co.uk> wrote:
>>
>>  I don't know the precise details of this conversation or exactly which
>>> parsing, CajaHtmlParser or XmlUtil.parse, you are talking about, but if
>>> its
>>> the Gadget XML parse prior to manipulation, and this is still using DOM
>>> based parsing, then its probably going to be slower than SAX under load,
>>> and
>>> vastly slower than Stax. The reason I say under load, is that DOM parsers
>>> tend emit lots of small objects which, once they get out of eden,
>>> overload
>>> the GC which will dominate as resources become scarce. Having said that,
>>> gadget parse trees probably don't exist long enough to get out of eden.
>>>
>>> Ignore me if you are talking about some other parsing going on within
>>> gadgets.
>>> Ian
>>>
>>>
>>> On 3 Oct 2008, at 02:03, Kevin Brown wrote:
>>>
>>> The real thing we should be investigating is why it takes 25ms to use the
>>>
>>>> parser on buddypoke when it only takes 10ms to cajole it.
>>>>
>>>>
>>>
>>>
> Paul Lindner
> plindner@hi5.com
>
>
>
>

Re: Serializing parsed content and caching GadgetHtmlParsers

Posted by Paul Lindner <pl...@hi5.com>.
It seems to me that parsing Gadget XML should be the least of our
worries, especially if you can ensure that the content that's hitting
the browser only varies by country and language and view.

By moving the parentUrl to the hash I've been able to accomplish this  
for hi5 -- iframes are cached at the browser and CDN level.

That said, this leaves out UserPref support, but I'm fine with that as a
tradeoff.

Perhaps we should focus on delivering the application as one cacheable  
chunk and the per-user/preload data in a second chunk?
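
To illustrate with made-up host names: everything that varies per container
rides in the fragment, which browsers and CDNs ignore when keying the
cache, so the iframe src stays byte-identical across container pages:

import java.net.URLEncoder;

// Sketch: the iframe src is identical for every container page, so browsers
// and CDNs can cache it. parentUrl rides in the fragment, which is readable
// by client-side JS but never sent to the server or used in the cache key.
public class IframeUrl {
  public static void main(String[] args) throws Exception {
    String gadget = URLEncoder.encode("http://example.org/gadget.xml", "UTF-8");
    String parent = URLEncoder.encode("http://container.example.com/profile", "UTF-8");
    String src = "http://gadgets.example.com/gadgets/ifr"
        + "?url=" + gadget + "&country=US&lang=en&view=canvas"
        + "#parentUrl=" + parent;
    System.out.println(src);
  }
}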


On Oct 3, 2008, at 9:35 AM, John Hjelmstad wrote:

> Hi Ian:
> You're right, it's the gadget XML parse prior to manipulation. It's  
> doing
> DOM-based parsing, and I suspect you're right about the load of small
> objects involved. At present I see that as a requirement, though, to  
> deal
> with semi-well-formed input. We've talked about requiring XHTML or  
> something
> close to it as a prerequisite for rewriting - which would make parsing
> vastly easier and rather trivial to implement - but that's a spec  
> issue if
> it's to be a general platform requirement.
>
> --John
>
> On Fri, Oct 3, 2008 at 12:53 AM, Ian Boston <ie...@tfd.co.uk> wrote:
>
>> I don't know the precise details of this conversation or exactly  
>> which
>> parsing, CajaHtmlParser or XmlUtil.parse, you are talking about,  
>> but if its
>> the Gadget XML parse prior to manipulation, and this is still using  
>> DOM
>> based parsing, then its probably going to be slower than SAX under  
>> load, and
>> vastly slower than Stax. The reason I say under load, is that DOM  
>> parsers
>> tend emit lots of small objects which, once they get out of eden,  
>> overload
>> the GC which will dominate as resources become scarce. Having said  
>> that,
>> gadget parse trees probably don't exist long enough to get out of  
>> eden.
>>
>> Ignore me if you are talking about some other parsing going on within
>> gadgets.
>> Ian
>>
>>
>> On 3 Oct 2008, at 02:03, Kevin Brown wrote:
>>
>> The real thing we should be investigating is why it takes 25ms to  
>> use the
>>> parser on buddypoke when it only takes 10ms to cajole it.
>>>
>>
>>

Paul Lindner
plindner@hi5.com




Re: Serializing parsed content and caching GadgetHtmlParsers

Posted by John Hjelmstad <fa...@google.com>.
Hi Ian:
You're right, it's the gadget XML parse prior to manipulation. It's doing
DOM-based parsing, and I suspect you're right about the load of small
objects involved. At present I see that as a requirement, though, to deal
with semi-well-formed input. We've talked about requiring XHTML or something
close to it as a prerequisite for rewriting - which would make parsing
vastly easier and rather trivial to implement - but that's a spec issue if
it's to be a general platform requirement.

--John

On Fri, Oct 3, 2008 at 12:53 AM, Ian Boston <ie...@tfd.co.uk> wrote:

> I don't know the precise details of this conversation or exactly which
> parsing, CajaHtmlParser or XmlUtil.parse, you are talking about, but if its
> the Gadget XML parse prior to manipulation, and this is still using DOM
> based parsing, then its probably going to be slower than SAX under load, and
> vastly slower than Stax. The reason I say under load, is that DOM parsers
> tend emit lots of small objects which, once they get out of eden, overload
> the GC which will dominate as resources become scarce. Having said that,
> gadget parse trees probably don't exist long enough to get out of eden.
>
> Ignore me if you are talking about some other parsing going on within
> gadgets.
> Ian
>
>
> On 3 Oct 2008, at 02:03, Kevin Brown wrote:
>
>  The real thing we should be investigating is why it takes 25ms to use the
>> parser on buddypoke when it only takes 10ms to cajole it.
>>
>
>

Re: Serializing parsed content and caching GadgetHtmlParsers

Posted by Ian Boston <ie...@tfd.co.uk>.
I don't know the precise details of this conversation or exactly which
parsing, CajaHtmlParser or XmlUtil.parse, you are talking about, but if
it's the Gadget XML parse prior to manipulation, and this is still using
DOM-based parsing, then it's probably going to be slower than SAX under
load, and vastly slower than StAX. The reason I say under load is that DOM
parsers tend to emit lots of small objects which, once they get out of
eden, overload the GC, which will dominate as resources become scarce.
Having said that, gadget parse trees probably don't exist long enough to
get out of eden.
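
For reference, the pull style I have in mind looks roughly like this with
StAX (javax.xml.stream); the catch for gadget content is that it only works
once the input is well-formed XML/XHTML:

import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// StAX pull-parse sketch: no node objects are allocated per element; the
// application pulls events and keeps only what it needs.
public class StaxScan {
  public static void main(String[] args) throws Exception {
    String xhtml = "<div id=\"x\"><p>hello</p></div>";
    XMLStreamReader r = XMLInputFactory.newInstance()
        .createXMLStreamReader(new StringReader(xhtml));
    while (r.hasNext()) {
      if (r.next() == XMLStreamConstants.START_ELEMENT) {
        System.out.println(r.getLocalName());
      }
    }
    r.close();
  }
}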

Ignore me if you are talking about some other parsing going on within  
gadgets.
Ian

On 3 Oct 2008, at 02:03, Kevin Brown wrote:

> The real thing we should be investigating is why it takes 25ms to  
> use the
> parser on buddypoke when it only takes 10ms to cajole it.


Re: Serializing parsed content and caching GadgetHtmlParsers

Posted by John Hjelmstad <fa...@google.com>.
My interest here is for gadget rendering in its various forms, not for
makeRequest (which hopefully will stop being used for gadget rendering
purposes as proxied rendering continues to roll out). Proxied rendering
could derail all this, given it's likely to generate new content for each
user, and thus yield terrible cache hit rates. Sigh, a possible waste of
time here.

So let's ask the Caja folks. Do any of you have time to help figure out why
cajoling is faster than just using Caja's DomParser to yield a parse tree?

--John

On Thu, Oct 2, 2008 at 6:03 PM, Kevin Brown <et...@google.com> wrote:

> On Thu, Oct 2, 2008 at 5:46 PM, John Hjelmstad <fa...@google.com> wrote:
>
> > All,
> > We've had a number of discussions on this list regarding our ability to
> get
> > rid of rewritten-content caching altogether. The primary cost savings
> > associated with doing so, by percentage, comes from avoiding the
> re-parsing
> > of gadget contents in order to apply rewriter passes on them (which
> > themselves are typically very cheap, in the sub-1ms range for reasonable
> > large input).
>
>
> The primary cost is for parsing content that isn't cacheable to begin with
> because it changes every request (proxied gadget renders, makeRequest,
> etc.)
>
> Until we can get a very fast parser, we can't actually do the more complex
> optimizations that a parse tree facilitates, so we're stuck with
> string-based manipulations anyway.
>
> The real thing we should be investigating is why it takes 25ms to use the
> parser on buddypoke when it only takes 10ms to cajole it.
>
>
> >
> > With this in mind, I've written and submitted r701267, which provides
> > custom
> > serialization and deserialization routines for parsed content, along with
> a
> > helper base class for any GadgetHtmlParser choosing to support caching.
> >
> > In coming to this solution, I implemented three mechanisms: Java
> > serialization, overridden Java serialization routines
> > (writeObject/readObject), and finally a simplified, ad hoc byte-packed
> > routine. Standard and overridden Java serialization results were
> virtually
> > identical.
> >
> > I ran each serialization/deserialization routine across a variety of
> gadget
> > contents. In sum:
> > * Custom serialization measured 10-30% more efficient in space. Space
> > savings largely came from lack of Java class information and other
> > metadata,
> > so are more pronounced for highly structured content.
> > * Custom serialization measured 30-40% faster than Java's, and
> > deserialization was 40-50% faster.
> >
> > As one example, I took the BuddyPoke gadget's canvas view contents and
> ran
> > them through these routines, as well as through CajaHtmlParser. Results:
> > * CajaHtmlParser average parse time = 25ms.
> > * Java serialization average = 2.25ms; deserialization = 3.35ms; size =
> > 35kB.
> > * Custom serialization average = 1.25ms; deserialization = 2.3ms; size =
> > 30kB.
> >
> > So I removed the Java serialization impl and stuck with custom. This has
> > the
> > corollary minor benefit that different tools can easily write and read
> the
> > same format - consider a cache warmer job for instance.
> >
> > Given these results, combined with fast, relatively cheap caching by
> things
> > like memcache, I'm encouraged that we're getting close to where we can
> > remove rewritten content caching altogether. Per several previous
> comments,
> > many rewriting passes simply can't be cached anyway. The remainder are
> > extremely cheap given a low-cost parse tree.
> >
> > The biggest risk with caching content in this way is the universe of
> > possible input. Now seems like the time we should reduce that, by finally
> > going ahead with our long-proposed plan to allow hangman variable
> > substitution only in String contexts (HTML attributes, cdata, and text
> > nodes). Assuming we reach agreement on this, we can hook up parsed
> content
> > caching and implement all existing rewriting operations in terms of a
> parse
> > tree with relatively low cost.
> >
> > In the meantime, I still plan to enable this for CajaHtmlParser, since
> the
> > parse tree is only used in opt-in fashion today by "new" gadgets that
> don't
> > use __UP substitution in structural elements. I'm also inclined to get
> rid
> > of rewritten content caching, since it's largely useless today. I'd be
> > interested to hear others' opinions on this.
> >
> > --John
> >
>

Re: Serializing parsed content and caching GadgetHtmlParsers

Posted by Kevin Brown <et...@google.com>.
On Thu, Oct 2, 2008 at 5:46 PM, John Hjelmstad <fa...@google.com> wrote:

> All,
> We've had a number of discussions on this list regarding our ability to get
> rid of rewritten-content caching altogether. The primary cost savings
> associated with doing so, by percentage, comes from avoiding the re-parsing
> of gadget contents in order to apply rewriter passes on them (which
> themselves are typically very cheap, in the sub-1ms range for reasonable
> large input).


The primary cost is for parsing content that isn't cacheable to begin with
because it changes on every request (proxied gadget renders, makeRequest,
etc.).

Until we can get a very fast parser, we can't actually do the more complex
optimizations that a parse tree facilitates, so we're stuck with
string-based manipulations anyway.
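
String-based manipulation here means roughly the kind of regex pass
sketched below: fast and parse-free, but fragile. This is an illustration,
not the actual rewriter code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the string-based approach: rewrite src/href attributes through
// a proxy with a regex, no parse tree involved. A real rewriter would also
// URL-encode the proxied target.
public class StringRewrite {
  private static final Pattern LINK =
      Pattern.compile("(src|href)\\s*=\\s*[\"']([^\"']+)[\"']",
          Pattern.CASE_INSENSITIVE);

  static String rewrite(String html, String proxyBase) {
    Matcher m = LINK.matcher(html);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      m.appendReplacement(out, m.group(1) + "=\"" + proxyBase
          + Matcher.quoteReplacement(m.group(2)) + "\"");
    }
    m.appendTail(out);
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.println(
        rewrite("<img src='http://a.com/x.png'>", "http://proxy/?url="));
  }
}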

The real thing we should be investigating is why it takes 25ms to use the
parser on buddypoke when it only takes 10ms to cajole it.


>
> With this in mind, I've written and submitted r701267, which provides
> custom
> serialization and deserialization routines for parsed content, along with a
> helper base class for any GadgetHtmlParser choosing to support caching.
>
> In coming to this solution, I implemented three mechanisms: Java
> serialization, overridden Java serialization routines
> (writeObject/readObject), and finally a simplified, ad hoc byte-packed
> routine. Standard and overridden Java serialization results were virtually
> identical.
>
> I ran each serialization/deserialization routine across a variety of gadget
> contents. In sum:
> * Custom serialization measured 10-30% more efficient in space. Space
> savings largely came from lack of Java class information and other
> metadata,
> so are more pronounced for highly structured content.
> * Custom serialization measured 30-40% faster than Java's, and
> deserialization was 40-50% faster.
>
> As one example, I took the BuddyPoke gadget's canvas view contents and ran
> them through these routines, as well as through CajaHtmlParser. Results:
> * CajaHtmlParser average parse time = 25ms.
> * Java serialization average = 2.25ms; deserialization = 3.35ms; size =
> 35kB.
> * Custom serialization average = 1.25ms; deserialization = 2.3ms; size =
> 30kB.
>
> So I removed the Java serialization impl and stuck with custom. This has
> the
> corollary minor benefit that different tools can easily write and read the
> same format - consider a cache warmer job for instance.
>
> Given these results, combined with fast, relatively cheap caching by things
> like memcache, I'm encouraged that we're getting close to where we can
> remove rewritten content caching altogether. Per several previous comments,
> many rewriting passes simply can't be cached anyway. The remainder are
> extremely cheap given a low-cost parse tree.
>
> The biggest risk with caching content in this way is the universe of
> possible input. Now seems like the time we should reduce that, by finally
> going ahead with our long-proposed plan to allow hangman variable
> substitution only in String contexts (HTML attributes, cdata, and text
> nodes). Assuming we reach agreement on this, we can hook up parsed content
> caching and implement all existing rewriting operations in terms of a parse
> tree with relatively low cost.
>
> In the meantime, I still plan to enable this for CajaHtmlParser, since the
> parse tree is only used in opt-in fashion today by "new" gadgets that don't
> use __UP substitution in structural elements. I'm also inclined to get rid
> of rewritten content caching, since it's largely useless today. I'd be
> interested to hear others' opinions on this.
>
> --John
>