Posted to dev@shindig.apache.org by John Hjelmstad <fa...@google.com> on 2008/08/07 00:34:22 UTC

Content Rewriter Modularization: Design/Change

*Summary*

I would like to refactor Shindig Java's content rewriter to allow multiple
independent rewriter modules to be registered with a server and apply to a
request. Each particular CL should be of reasonably modest size, so while
I'd like to get coding, I wanted first to describe the overall structure I'm
working toward to A) solicit feedback, B) give other implementations a more
holistic view of the approach, and C) ensure people don't feel a lot of code
was introduced without discussion/documentation.

The enhancements I would like to add lead to a scenario in which:

   - The Gadget object maintains, and can be manipulated through, a parse
   tree (HTML/CSS/JS) of its Content in addition to a String representation.
   - A parse tree is generated by HTML/CSSParser objects.
   - A ContentRewriter implements one method: rewrite(...) (method given
   full request context and returns caching metadata)
   - Multiple ContentRewriters can be activated by a given rendering server.
   - Rewriting can either be forced (no developer opt-in) or
   signaled/configured by a <Require> directive.

This work involves *no* change to the semantics of the current content
rewriter, which simply becomes the default rewriter "module" provided by
Shindig.

My ask from the community is for any and all feedback on this idea, in
particular discussion on the overall approach (as I'm keen to get to
coding). I admit this proposal is lengthy and reasonably ambitious, so
thanks for reading (if you do! :)).

*Motivation/Rationale*

This proposal effectively enables the renderer to become a multi-pass
compiler for gadget content (essentially, arbitrary web content). Such a
compiler can provide several benefits: static optimization of gadget content
(auto-proxying of images, whitespace/comment removal, consolidation of CSS
blocks), security benefits (caja et al), new functionality (annotation of
content for stats, document analysis, container-specific features), etc. To
my knowledge no such infrastructure exists today (with the possible
exception of Caja itself, which I'd like to dovetail with this work).

While any of the benefits here could be achieved through direct injection of
code into the rendering code path, doing so both A) ties the new semantics
to sundry details of the current implementation and B) makes testing
difficult due to a lack of a clear boundary between the new rewriter and the
renderer. Together these raise the barrier of entry for those wishing to add
new rewriting capabilities, and make the code harder to maintain.

Rewriting is conceptually simple: modify web content (HTML/JS/CSS) given
some context (query params, gadget metadata). This proposal seeks to reflect
that in a clear API.

*Details*

I would like to make the following changes to Shindig [Java impl], in the
order specified. Each numbered point corresponds with one or more CLs of
reasonably modest size. By breaking up changes into manageable parts,
testing and verifying existing behavior along the way, I hope to avoid the
overhead associated with introduction of a new branch, though I'm open to
the idea if needed.

1. Make the Gadget object's Content that of the active View.

Discussion: Code manipulating a Gadget should have ready access to the
Content that applies to the request. Setters should apply to the Gadget
object rather than View, since Gadget is the intermediary representation for
all processing state, while View is metadata that feeds into that. A
corollary benefit to this change is that new GadgetSpecFactory instances can
be provided that only fetch a spec and don't reimplement rewriting (or,
don't have to subclass BasicGadgetSpecFactory).

  a. Inject ContainerConfig into GadgetServer.
  b. Construct Gadget using ContainerConfig object.
  c. Rename Gadget.getView(ContainerConfig) to Gadget.getCurrentView().
  d. Add Gadget.getContent(), Gadget.getRewrittenContent(), and
Gadget.setRewrittenContent(), while removing same from View.
  e. Migrate rewriter functionality out of BasicGadgetSpecFactory.getSpec()
and into GadgetServer.createGadgetFromSpec().

2. Define an abstract gadget Content parse tree and the HTMLParser that
generates it.

Discussion: Most truly interesting content manipulation is done to a parse
tree, as the Content Rewriter does with Caja's today. Such a parse tree
should be a first-class concept. It's essentially DOM-lite, but not DOM
itself for A) simplicity and B) ability to extend to gadgets-specific
concepts. It should attach to the Gadget object so that it can be kept in
sync with the equivalent String representation of content. It should also be
abstract to separate it from unnecessary dependencies and, more importantly,
to allow competing (ever-improving) parser implementations to be
transparently shifted in as they're available. This change purely introduces
new concepts and doesn't integrate with processing code.

  a. interface GadgetContentParseTree
    i. GadgetElement getBaseElement() - returns parsed base element of
Gadget contents
  b. interface GadgetElement
    i. List<GadgetElement> getChildren() - returns ordered list of children
    ii. boolean isText() - returns true if element contains only text
content (in the DOM sense, contains only 1 child of type TEXT)
    iii. String getText() - returns text content if (isText()), otherwise
null
    iv. void setText(content) - sets text content if (isText()), otherwise
[does nothing || replaces all child text nodes with one text node at the end
with content]
    v. List<GadgetAttribute> getAttributes() - returns unmodifiable list of
tag's attributes
    vi. String getAttribute(name) - Returns attribute with the given key, or
null if not present
    vii. void setAttribute(name, val) - Sets attribute with the given n/v
    viii. GadgetCssParseTree getCss() - Returns parsed CSS for contents of a
<style> block, or for a style="" attribute
  c. interface GadgetCssParseTree
    i. boolean isInline() - returns true if content was parsed from inline
style block
    ii. List<GadgetCssRule> getRules() - Returns an ordered list of CssRules
if (!isInline()), null otherwise
    iii. List<GadgetCssAttribute> getAttributes() - Returns an ordered list
of CssAttributes if (isInline()), null otherwise
  d. interface GadgetCssRule
    i. List<String> getSelectors() - returns ordered list of selectors for
this CSS rule
    ii. List<GadgetCssAttribute> getAttributes() - returns ordered list of
CssAttributes for the rule
  e. interface GadgetCssAttribute
    i. String getKey() - returns key of the attribute
    ii. String getValue() - returns value of the attribute
  f. interface GadgetContentParser
    i. GadgetContentParseTree parse(content) - Parses content into a parse
tree
  g. interface GadgetCssParser
    i. List<GadgetCssRule> parse(content) - Parses CSS rules from a string
    ii. List<GadgetCssAttribute> parseInline(content) - Parses inline style
block contents
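
To make the shape of these interfaces concrete, here is a minimal Java sketch,
not the final API. It trims GadgetElement to a few accessors, omits the CSS
types, and treats isText() as "this node is a text leaf", which is a
simplification of the definition above. SimpleElement and TreeWalker are
hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Trimmed version of the proposed GadgetElement (CSS accessors omitted).
interface GadgetElement {
  List<GadgetElement> getChildren();
  boolean isText();
  String getText();                        // null unless isText()
  String getAttribute(String name);        // null if not present
  void setAttribute(String name, String value);
}

interface GadgetContentParseTree {
  GadgetElement getBaseElement();
}

interface GadgetContentParser {
  GadgetContentParseTree parse(String content);
}

// Hypothetical in-memory node, used only to exercise the interface.
class SimpleElement implements GadgetElement {
  private final String text;  // non-null only for text leaves
  private final List<GadgetElement> children = new ArrayList<>();
  private final Map<String, String> attributes = new LinkedHashMap<>();

  SimpleElement(String text) { this.text = text; }

  SimpleElement add(GadgetElement child) {
    children.add(child);
    return this;
  }

  public List<GadgetElement> getChildren() { return children; }
  public boolean isText() { return text != null; }
  public String getText() { return text; }
  public String getAttribute(String name) { return attributes.get(name); }
  public void setAttribute(String name, String value) { attributes.put(name, value); }
}

class TreeWalker {
  // Depth-first concatenation of all text content below an element.
  static String collectText(GadgetElement element) {
    if (element.isText()) {
      return element.getText();
    }
    StringBuilder sb = new StringBuilder();
    for (GadgetElement child : element.getChildren()) {
      sb.append(collectText(child));
    }
    return sb.toString();
  }
}
```

A rewriter written against GadgetElement alone, like TreeWalker here, does not
need to know which parser produced the tree.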

3. Implement GadgetContentParser and GadgetCssParser using Caja (existing
impl).

Discussion: The existing content rewriter uses Caja's HTML and CSS lexers
for processing gadget contents. This change hides this functionality behind
the newly defined interfaces. Shindig should provide an out-of-the-box
default for each interface, and these are likely candidates. The Caja-based
CSS parser will probably be sufficient in perpetuity given CSS's much more
limited acceptable syntax, but I'd expect (and would like) other
GadgetContentParser implementations to surface that are based on Mozilla or
WebKit's much more lenient HTML parsers, ensuring that the server "sees"
what a browser would.

  a. Implement CajaContentParser using Caja's lexer.
  b. Implement CajaCssParser using Caja's lexer.

4. Improve and augment the Gadget object to maintain two forms of its
rendering state.

Discussion: A Gadget should be able to maintain its current state in two
ways: parse tree and String. Manipulation of either should result in changes
to the other. Without doubt, it will be optimal to ensure the fewest number
of conversions between the two, but behavior should be consistent in any
case. We should also clean up the API by removing the "rewrittenContent"
APIs, since rewritten content is just the current content having been
replaced. One important note is that the new rewriter programming model
supports only one form of rewriting at a time: parse tree or string. That
is, you can't manipulate the parse tree object after setting new content as
a String unless you retrieve a brand-new parse tree. This also underscores
that rewriters must run serially, which avoids a host of threading issues
anyway (and is consistent with current behavior).

  a. Remove Gadget.getRewrittenContent() and Gadget.setRewrittenContent(),
while adding Gadget.setContent(). Newly-set content is considered
"rewritten" if it changed.
    i. Update the rest of the implementation to use the new method (points
a. and a.i. reflected in separate CL).
  b. Add GadgetContentParser and GadgetCssParser to Gadget during
construction (each of these injected by Guice into GadgetServer).
  c. Add Gadget.getParsedContent().
    i. Returns a mutable GadgetContentParseTree used to manipulate Gadget
Contents.
    ii. Mutable tree calls back to the Gadget object indicating when any
change is made, and emits an error if setContent() has been called in the
interim.
    iii. setContent() calls back to Gadget if new content has been set.
    iv. getContent() serializes GadgetContentParseTree if modified, and vice
versa.
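
The String/parse-tree synchronization in point 4 could be sketched roughly as
follows. This is an illustration under stated assumptions, not the intended
implementation: TokenTree is a hypothetical stand-in for
GadgetContentParseTree (it just splits content on spaces), SketchGadget stands
in for Gadget, and the generation counter is one possible way to detect a
stale tree after setContent().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the parse tree: a mutable list of space-separated tokens.
class TokenTree {
  private final SketchGadget owner;
  private final int generation;
  private final List<String> tokens;

  TokenTree(SketchGadget owner, int generation, String content) {
    this.owner = owner;
    this.generation = generation;
    this.tokens = new ArrayList<>(Arrays.asList(content.split(" ")));
  }

  void set(int index, String token) {
    owner.onTreeModified(generation);  // throws if setContent() was called since retrieval
    tokens.set(index, token);
  }

  String serialize() { return String.join(" ", tokens); }
}

class SketchGadget {
  private String content;
  private TokenTree tree;     // parsed lazily on first request
  private boolean treeDirty;  // tree has edits not yet reflected in content
  private int generation;     // bumped on every setContent()

  SketchGadget(String content) { this.content = content; }

  void setContent(String newContent) {
    content = newContent;
    tree = null;       // any previously returned tree is now stale
    treeDirty = false;
    generation++;
  }

  TokenTree getParsedContent() {
    if (tree == null) {
      tree = new TokenTree(this, generation, getContent());
    }
    return tree;
  }

  String getContent() {
    if (treeDirty) {   // serialize only when the tree actually changed
      content = tree.serialize();
      treeDirty = false;
    }
    return content;
  }

  // Callback from the tree, as in point c.ii above.
  void onTreeModified(int treeGeneration) {
    if (treeGeneration != generation) {
      throw new IllegalStateException("stale parse tree: setContent() was called");
    }
    treeDirty = true;
  }
}
```

Conversions happen only on demand: editing the tree marks it dirty, and the
String is regenerated the next time getContent() is called.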

5. Simplify ContentRewriter interface to include only one method:
RewriteResults rewrite(GadgetContext, Gadget).

Discussion: The idea here is to simplify the rewriting API to its
essentials: given some context, manipulate a Gadget in some way. Then, tell
the server the properties of what was done: did rewriting happen? If so, did
it modify caching characteristics of the gadget? And so on. The
infrastructure takes care of the rest. The implementation of RewriteResults
in particular is designed to allow new "signals" to be introduced in
backward-compatible way should any become relevant.

  a. Remove existing ContentRewriter interface methods, replacing with:
RewriteResults rewrite(GadgetContext, Gadget).
  b. Add GadgetContext.getOptions() returning protocol-agnostic versions of
inbound request options.
    i. Separate out Options from HttpRequest as precursor.
  c. Define RewriteResults.
    i. boolean wasRewritten() -- returns true if some modification occurred
    ii. cache control stuff may be added in the future as needed
  d. Update DefaultContentRewriter to use the new interface
(DefaultContentRewriter still uses inline Caja impl).
  e. Update ContentRewriter calling code (in GadgetServer,
AbstractHttpCache, et al) to reference the new interface.
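
A rough Java sketch of the single-method interface from point 5 follows. The
GadgetContext and Gadget classes here are minimal stand-ins, the
Map<String, String> options type is an assumption, and
CommentStrippingRewriter (with its "debug" option) is a hypothetical example
rewriter rather than anything that exists in Shindig.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Minimal stand-ins for the real types; their fields are assumptions.
class GadgetContext {
  private final Map<String, String> options;
  GadgetContext(Map<String, String> options) { this.options = new HashMap<>(options); }
  Map<String, String> getOptions() { return Collections.unmodifiableMap(options); }
}

class Gadget {
  private String content;
  Gadget(String content) { this.content = content; }
  String getContent() { return content; }
  void setContent(String content) { this.content = content; }
}

// A result object rather than a bare boolean, so that new signals
// (for example cache hints) can be added later without breaking callers.
final class RewriteResults {
  private final boolean rewritten;
  private RewriteResults(boolean rewritten) { this.rewritten = rewritten; }
  static RewriteResults unchanged() { return new RewriteResults(false); }
  static RewriteResults rewritten() { return new RewriteResults(true); }
  boolean wasRewritten() { return rewritten; }
}

interface ContentRewriter {
  RewriteResults rewrite(GadgetContext context, Gadget gadget);
}

// Hypothetical rewriter: strips HTML comments unless the request sets debug=1.
class CommentStrippingRewriter implements ContentRewriter {
  @Override
  public RewriteResults rewrite(GadgetContext context, Gadget gadget) {
    if ("1".equals(context.getOptions().get("debug"))) {
      return RewriteResults.unchanged();
    }
    String before = gadget.getContent();
    String after = before.replaceAll("(?s)<!--.*?-->", "");
    if (after.equals(before)) {
      return RewriteResults.unchanged();
    }
    gadget.setContent(after);
    return RewriteResults.rewritten();
  }
}
```

The rewriter only reads context and mutates the Gadget; everything else
(caching, serialization) is left to the calling infrastructure.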

6. Update DefaultContentRewriter to use new Gadget syntax tree APIs.

Discussion: The default content rewriter should utilize the new
infrastructure put in place for robust HTML and CSS parsing to avoid
duplicating code and to demonstrate the new APIs. Rather than implementing
atop an HTML tokenizer, direct manipulation of the DOM-lite is used.

  a. Update DefaultContentRewriter and associated apparatus (LinkRewriter,
HtmlRewriter et al) to directly manipulate the parse tree.
  b. Update all tests to ensure no change in semantics has occurred.

7. Provide helper base classes for ContentRewriters of various types.

Discussion: Several patterns are likely to emerge with rewriters, notably
regarding when they apply. These classes provide helpers for new rewriters
while helping to clean up DefaultContentRewriter's ContentRewriterFeature
stuff, and help simplify testing somewhat.

  a. Define FeatureKeyedContentRewriter.
    i. Passes in Feature params to augmented rewrite(...) method.
    ii. boolean requires_feature supplied to constructor, indicating whether
a <Require> or <Optional> block must be present for the rewriter to be
active.
  b. ParamKeyedContentRewriter
    i. Activates rewriter based on request param info
  c. Make DefaultContentRewriter a subclass of FeatureKeyedContentRewriter
with requires_feature=true.
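
One way the FeatureKeyedContentRewriter helper could look is sketched below in
Java. To stay short it drops GadgetContext and returns a boolean instead of
RewriteResults; both are simplifications. The Gadget stand-in, its feature
map, the WhitespaceRewriter subclass, and the "collapse-whitespace" feature
name are all hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;

// Stand-in Gadget; the feature map models parsed <Require>/<Optional> blocks.
class Gadget {
  private String content;
  private final Map<String, Map<String, String>> features = new HashMap<>();

  Gadget(String content) { this.content = content; }
  String getContent() { return content; }
  void setContent(String content) { this.content = content; }
  void addFeature(String name, Map<String, String> params) { features.put(name, params); }
  Map<String, String> getFeatureParams(String name) { return features.get(name); }  // null if absent
}

abstract class FeatureKeyedContentRewriter {
  private final String featureName;
  private final boolean requiresFeature;

  protected FeatureKeyedContentRewriter(String featureName, boolean requiresFeature) {
    this.featureName = featureName;
    this.requiresFeature = requiresFeature;
  }

  // Decides whether the rewriter is active, then hands the feature params on.
  final boolean rewrite(Gadget gadget) {
    Map<String, String> params = gadget.getFeatureParams(featureName);
    if (params == null) {
      if (requiresFeature) {
        return false;  // developer did not opt in
      }
      params = Collections.emptyMap();  // forced rewriting, no params supplied
    }
    return rewrite(gadget, params);
  }

  // Augmented rewrite(...) receiving the feature params; returns true if content changed.
  protected abstract boolean rewrite(Gadget gadget, Map<String, String> params);
}

// Hypothetical subclass: collapses runs of whitespace.
class WhitespaceRewriter extends FeatureKeyedContentRewriter {
  WhitespaceRewriter(boolean requiresFeature) {
    super("collapse-whitespace", requiresFeature);
  }

  @Override
  protected boolean rewrite(Gadget gadget, Map<String, String> params) {
    String replacement = params.getOrDefault("replacement", " ");
    String before = gadget.getContent();
    String after = before.replaceAll("\\s+", Matcher.quoteReplacement(replacement));
    if (after.equals(before)) {
      return false;
    }
    gadget.setContent(after);
    return true;
  }
}
```

With requires_feature=true the rewriter is a no-op for gadgets that lack the
feature; with false it always runs and simply sees empty params.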

8. Modularize ContentRewriter with ContentRewriterRegistry.

Discussion: At this point rewriters can be written in isolation from one
another, independently tested, verified, and experimented with. A registry makes
this infrastructure a general extension mechanism for Shindig installations.
The registry tells the server what rewriters are supported, and each is
applied to requests that come in. It's essentially a way to generate a
composition of rewriters, and is itself injected into the server via Guice.

  a. interface ContentRewriterRegistry
    i. List<ContentRewriter> getRewriters() - mutable list of
ContentRewriters.
  b. Implement DefaultContentRewriterRegistry, which registers only
DefaultContentRewriter.
  c. Inject DefaultContentRewriterRegistry via Guice.
  d. Switch out use of single ContentRewriter with ContentRewriterRegistry in
GadgetServer code.
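
The registry and the serial application of rewriters might be sketched as
follows in Java. Guice wiring is left out, ContentRewriter is reduced to a
boolean-returning single-argument method for brevity, and
ListContentRewriterRegistry and RewriterPipeline are hypothetical names, not
proposed classes.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in Gadget holding only String content.
class Gadget {
  private String content;
  Gadget(String content) { this.content = content; }
  String getContent() { return content; }
  void setContent(String content) { this.content = content; }
}

// Simplified rewriter: returns true if it modified the gadget.
interface ContentRewriter {
  boolean rewrite(Gadget gadget);
}

interface ContentRewriterRegistry {
  List<ContentRewriter> getRewriters();  // mutable, ordered
}

// Hypothetical list-backed registry.
class ListContentRewriterRegistry implements ContentRewriterRegistry {
  private final List<ContentRewriter> rewriters = new ArrayList<>();

  @Override
  public List<ContentRewriter> getRewriters() { return rewriters; }
}

class RewriterPipeline {
  // Applies every registered rewriter in order, one at a time.
  // Returns true if at least one of them changed the gadget.
  static boolean applyAll(ContentRewriterRegistry registry, Gadget gadget) {
    boolean rewritten = false;
    for (ContentRewriter rewriter : registry.getRewriters()) {
      rewritten |= rewriter.rewrite(gadget);
    }
    return rewritten;
  }
}
```

Because rewriters run strictly in list order, each one sees the output of the
one before it, which matches the serial model described in point 4.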


If you've made it this far, thanks for reading. As mentioned above, I'm
interested in any and all feedback. I'll be tracking my progress, barring
any fundamental issues, in JIRA as well. In any case I fully expect some
details of this to change (particularly the abstract parse tree
representation), but think this impl will serve as a good base to build on
for cool new functionality.

Thanks!
John

Re: Content Rewriter Modularization: Design/Change

Posted by Louis Ryan <lr...@google.com>.
Once more with feeling....

On Tue, Aug 12, 2008 at 7:14 PM, Louis Ryan <lr...@google.com> wrote:

> Can we prove this out incrementally bottom-up? In general I think using DOM
> is the right thing to do from a rewriting standpoint. So here's how I
> propose we proceed
>
> 1. If the Caja dom is a little awkward wrap it, if not let's just use it as
> is. We can always resolve this later
> 2. Change the existing content rewriters to use the DOM instead of a lexer,
> should be pretty easy. Maybe add some fancier rewriting like moving CSS into
> HEAD
> 3. Do some perf testing, look into memory overhead of dom transformation
> etc.
> 4. Alter GadgetSpec's to retain the dom when they are cached
> 5. Alter the gadget rendering phase to serialize the content of the dom to
> output
> 6. Annotate the dom at parse time to make render time user-pref
> substituions faster, this should be easy enough too...
>
> This should be enough to prove out the pipeline end-to-end and identify any
> major perf niggles. Once this is done we can look into how to inject a
> rewriter pipeline into the parsing phase and the rendering phase.
>
> -Louis
>
>
>
>
> On Tue, Aug 12, 2008 at 5:57 PM, John Hjelmstad <fa...@google.com> wrote:
>
>> Re-responding in order to apply the last few exchanges to
>> google-caja-discuss@ (@gmail vs. @google membership issues).
>>
>> On Tue, Aug 12, 2008 at 4:48 PM, John Hjelmstad <fa...@google.com> wrote:
>>
>> > Hello,
>> >
>> > While beginning to refactor the rewriter APIs I've discovered that there
>> > unfortunately is one semantic difference inherent to moving getContent()
>> and
>> > setContent() methods into the Gadget object (replacing
>> > View.get/setRewrittenContent()): BasicGadgetSpecFactory no longer caches
>> > rewritten content.
>> >
>> > I've written a discussion of this in issue SHINDIG-500, which tracks
>> this
>> > implementation sub-task:
>> https://issues.apache.org/jira/browse/SHINDIG-500
>> >
>> > To summarize:
>> > 1. Is this change acceptable for the time being?
>> > 2. I suggest that we can, at a later date, move fetching of gadget specs
>> > into GadgetServer while injecting a Gadget(Spec) cache there as well,
>> > offering finer-tuned control over caching characteristics.
>> >
>> > Thanks,
>> > John
>> >
>> >
>> > On Mon, Aug 11, 2008 at 2:20 PM, John Hjelmstad <fa...@google.com>
>> wrote:
>> >
>> >> I understand these concerns, and should be clear that I don't (despite
>> my
>> >> personal interest in experimenting with the idea, agreed that we don't
>> have
>> >> time for it at the moment) have any plans to introduce this sort of RPC
>> >> anywhere - certainly not in Shindig itself, as any such call would be
>> hidden
>> >> behind an interface anyway.
>> >>
>> >> Putting the RPC hypothetical aside, I still feel that there's value to
>> >> implementing HTML parsing in terms of an interface:
>> >> * Clearer separation of concerns/boundary between projects.
>> >>   - Corollary simplicity in testing.
>> >> * Clearer API for content manipulation (that doesn't require knowledge
>> of
>> >> Caja).
>> >>
>> >> I could be convinced otherwise, but at this point the code involved
>> seems
>> >> of manageable size, so still worth doing. Thoughts?
>> >>
>> >> John
>> >>
>> >>
>> >>
>> >> On Mon, Aug 11, 2008 at 1:00 PM, Kevin Brown <et...@google.com> wrote:
>> >>
>> >>> I agree with Louis -- that's just not practical. Every rewriting
>> >>> operation
>> >>> must work in real time. Caja's existing html parser is adequate for
>> our
>> >>> needs, and we shouldn't go out of our way to tolerate every oddity of
>> >>> random
>> >>> web browsers (especially as it simply wouldn't work unless you farmed
>> it
>> >>> out
>> >>> to *every* browser). Any new code needs to be grounded in practical,
>> >>> current
>> >>> needs, not theoretical options. We can always change code later if we
>> >>> find a
>> >>> real need for something like that. We have real work to do in the
>> >>> meantime.
>> >>>
>> >>> On Mon, Aug 11, 2008 at 12:06 PM, Louis Ryan <lr...@google.com>
>> wrote:
>> >>>
>> >>> > John,
>> >>> >
>> >>> > From a practicality standpoint I'm a little nervous about this plan
>> to
>> >>> make
> >> >>> > RPC calls out of a Java process to a native process to fetch a
>> parse
>> >>> tree
>> >>> > for transformations that have to occur realtime. I don't think the
>> >>> > motivating factor here is to accept all inputs that browsers can.
>> >>> Gadget
>> >>> > developers will tailor their markup to the platform as they have
>> done
>> >>> > already. I would greatly prefer us to pick one 'good' parser and
>> stick
>> >>> with
>> >>> > it for all the manageability and consumability benefits that come
>> with
>> >>> that
> >> >>> > decision. Perhaps I'm missing something here?
>> >>> >
>> >>> > -Louis
>> >>> >
>> >>> > On Mon, Aug 11, 2008 at 11:59 AM, John Hjelmstad <fa...@google.com>
>> >>> wrote:
>> >>> >
>> >>> > > On Fri, Aug 8, 2008 at 6:10 AM, Ben Laurie <be...@google.com>
>> wrote:
>> >>> > >
>> >>> > > > [+google-caja-discuss]
>> >>> > > >
>> >>> > > > On Thu, Aug 7, 2008 at 9:27 PM, John Hjelmstad <
>> fargo@google.com>
>> >>> > wrote:
>> >>> > > > > On Thu, Aug 7, 2008 at 3:20 AM, Ben Laurie <be...@google.com>
>> >>> wrote:
>> >>> > > > >
>> >>> > > > >> On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <
>> >>> fargo@google.com>
>> >>> > > > wrote:
>> >>> > > > >> > This proposal effectively enables the renderer to become a
>> >>> > > multi-pass
>> >>> > > > >> > compiler for gadget content (essentially, arbitrary web
>> >>> content).
>> >>> > > Such
>> >>> > > > a
>> >>> > > > >> > compiler can provide several benefits: static optimization
>> of
>> >>> > gadget
>> >>> > > > >> content
>> >>> > > > >> > (auto-proxying of images, whitespace/comment removal,
>> >>> > consolidation
>> >>> > > of
>> >>> > > > >> CSS
>> >>> > > > >> > blocks), security benefits (caja et al), new functionality
>> >>> > > (annotation
>> >>> > > > of
>> >>> > > > >> > content for stats, document analysis, container-specific
>> >>> > features),
>> >>> > > > etc.
>> >>> > > > >> To
>> >>> > > > >> > my knowledge no such infrastructure exists today (with the
>> >>> > possible
>> >>> > > > >> > exception of Caja itself, which I'd like to dovetail with
>> this
>> >>> > > work).
>> >>> > > > >>
>> >>> > > > >> Caja clearly provides a large chunk of the code you'd need
>> for
>> >>> this.
>> >>> > > > >> I'd like to hear how we'd manage to avoid duplication between
>> >>> the
>> >>> > two
>> >>> > > > >> projects.
>> >>> > > > >>
>> >>> > > > >> A generalised framework for manipulating content sounds like
>> a
>> >>> great
>> >>> > > > >> idea, but probably should not live in either of the two
>> projects
>> >>> > (Caja
>> >>> > > > >> and Shindig) but rather should be shared by both of them, I
>> >>> suspect.
>> >>> > > > >
>> >>> > > > >
>> >>> > > > > I agree on both counts. As I mentioned, the piece of this idea
>> >>> that I
>> >>> > > > expect
>> >>> > > > > to change the most is the parse tree, and Caja's .parser.html
>> and
>> >>> > > > > .parser.css packages contain much of what I've thrown in here
>> as
>> >>> a
>> >>> > > base.
>> >>> > > > >
>> >>> > > > > My key requirements are:
>> >>> > > > > * Lightweight framework.
>> >>> > > > > * Parser modularity, mostly for HTML parsers (to re-use the
>> good
>> >>> work
>> >>> > > > done
>> >>> > > > > by WebKit or Gecko.. CSS/JS can come direct from Caja I'd bet)
>> >>> > > > > * Automatic maintenance of DOM<->String conversion.
>> >>> > > > > * Easy to manipulate structure.
>> >>> > > >
>> >>> > > > I'm not sure what the value of parser modularity is? If the
>> >>> resulting
>> >>> > > > tree is different, then that's a problem for people processing
>> the
>> >>> > > > tree. And if it is not, then why do we care?
>> >>> > >
>> >>> > >
>> >>> > > IMO the value of parser modularity is that the lenient parsers
>> native
>> >>> to
>> >>> > > browsers can be used in place of those that might not accept all
>> >>> inputs.
>> >>> > > One
>> >>> > > could (and I'd like to) adapt WebKit or Gecko's parsing code into
>> a
>> >>> > server
>> >>> > > that runs parallel to Shindig and provides a "local RPC" service
>> for
>> >>> > > parsing
>> >>> > > semi-structured HTML. The resulting tree for WebKit's parser might
>> be
>> >>> > > different than that for an XHTML parser, Gecko's parser, etc, but
>> if
>> >>> the
>> >>> > > algorithm implemented atop it is rule-based rather than
>> >>> strict-structure
>> >>> > > based that should be fine, no?
>> >>> > >
>> >>> > >
>> >>> > > >
>> >>> > > >
>> >>> > > > >
>> >>> > > > > I'd love to see both projects share the same base syntax tree
>> >>> > > > > representations. I considered .parser.html(.DomTree) and
>> >>> .parser.css
>> >>> > > for
>> >>> > > > > these, but at the moment these appeared to be a little more
>> tied
>> >>> to
>> >>> > > > Caja's
>> >>> > > > > lexer/parser implementation than I preferred (though I admit
>> >>> > > > > AbstractParseTreeNode contains most of what's needed).
>> >>> > > > >
>> >>> > > > > To be sure, I don't see this as an end-all-be-all
>> transformation
>> >>> > system
>> >>> > > > in
>> >>> > > > > any way. I'd just like to put *something* reasonable in place
>> >>> that we
>> >>> > > can
>> >>> > > > > play with, provide some benefit, and enhance into a truly
>> >>> > sophisticated
>> >>> > > > > vision of document rewriting.
>> >>> > > > >
>> >>> > > > >
>> >>> > > > >>
>> >>> > > > >>
>> >>> > > > >> >  c. Add Gadget.getParsedContent().
>> >>> > > > >> >    i. Returns a mutable GadgetContentParseTree used to
>> >>> manipulate
>> >>> > > > Gadget
>> >>> > > > >> > Contents.
>> >>> > > > >> >    ii. Mutable tree calls back to the Gadget object
>> indicating
>> >>> > when
>> >>> > > > any
>> >>> > > > >> > change is made, and emits an error if setContent() has been
>> >>> called
>> >>> > > in
>> >>> > > > the
>> >>> > > > >> > interim.
>> >>> > > > >>
>> >>> > > > >> In Caja we have been moving towards immutable trees...
>> >>> > > > >
>> >>> > > > >
>> >>> > > > > Interested to hear more about this. The whole idea is for the
>> >>> > gadget's
>> >>> > > > tree
>> >>> > > > > representation to be modifiable. Doing that with immutable
>> trees
>> >>> to
>> >>> > me
>> >>> > > > > suggests that a rewriter would have to create a completely new
>> >>> tree
>> >>> > and
>> >>> > > > set
>> >>> > > > > it as a representation of new content. That's convenient as
>> far
>> >>> as
>> >>> > the
>> >>> > > > > Gadget's maintenance of String<->Tree representations is
>> >>> concerned...
>> >>> > > but
>> >>> > > > > seems pretty heavyweight for many types of edits: in-situ
>> >>> > modifications
>> >>> > > > of
>> >>> > > > > text, content reordering, etc. That's particularly so in a
>> >>> > > > single-threaded
>> >>> > > > > (viz rewriting) environment.
>> >>> > > >
>> >>> > > > Never having been entirely sold on the concept, I'll let those
>> on
>> >>> the
>> >>> > > > Caja team who advocate immutability explain why.
>> >>> > > >
>> >>> > >
>> >>> >
>> >>>
>> >>
>> >>
>> >
>>
>
>

Re: [Caja] Re: Content Rewriter Modularization: Design/Change

Posted by John Hjelmstad <fa...@google.com>.
On Tue, Sep 2, 2008 at 8:45 AM, Ben Laurie <be...@google.com> wrote:

>
> On Wed, Aug 27, 2008 at 2:12 AM, John Hjelmstad <fa...@google.com> wrote:
> > Agreed with these conclusions. There are a number of avenues open to us
> for
> > rewriter optimization, and per stats, working with the Caja team on an
> > optimized, ideally purpose-built-for-Shindig parser is tops on the list.
>
> Caja also has optimisation on the road map. "Working" comes first, though
> :-)


Details, details. :)


>
>
> Curious what "purpose-built-for-Shindig" would entail?


In general this means that rather than creating the many intermediary
DomTree objects Caja's DomParser presently does, instead directly generating
Shindig's more basic parse-tree representation (ParsedHtmlNode et al).

The Caja-based HTML parser is currently in:
http://svn.apache.org/repos/asf/incubator/shindig/trunk/java/gadgets/src/main/java/org/apache/shindig/gadgets/parse/caja/CajaHtmlParser.java

Yet another optimization would be to directly generate a class that
subclasses:
http://svn.apache.org/repos/asf/incubator/shindig/trunk/java/gadgets/src/main/java/org/apache/shindig/gadgets/parse/GadgetHtmlNode.java

These are the mutable nodes that are created from parsed content.

--John


>
> > Re: making ParseTreeDefaultContentRewriter the default... thanks! I plan
> to
> > do this, but FYI I'll be making the actual change only after cleaning up
> > plumbing a bit: making GadgetSpec immutable, then cleaning up the
> rewriter
> > interface to consume a (mutable) Gadget, which maintains its own parse
> tree
> > (per the original plan). This in turn will allow
> > ParseTreeDefaultContentRewriter to be split into its constituent parts
> (link
> > rewriter, js consolidator, style rewriter), and more importantly allow
> other
> > rewriters to modularly take advantage of the parse tree. Until then, the
> > parse tree is hidden from other rewriters.
> > Expect these changes soon, along with resuscitation of the discussion on
> how
> > best to handle caching of rewritten content.
> > --John
> >
> > On Mon, Aug 25, 2008 at 5:21 PM, Kevin Brown <et...@google.com> wrote:
> >>
> >> I think any arguments one way or another contingent on caching should be
> >> ignored, because caching isn't always going to be possible (many
> >> makeRequest
> >> calls, proxied content, etc.)
> >>
> >> Instead, we should just make sure that the total time spent rewriting a
> >> given piece of content is under some target -- say 50-100ms. We can
> >> optimize
> >> aggressively to meet this goal, regardless of which approach we take.
> From
> >> the initial results, I think we can simply work with the caja team to
> >> improve the performance of the parser to bring it into that range.
> Having
> >> a
> >> parse tree is more useful than not having one, so we should go ahead and
> >> use
> >> the parser instead of the lexer and then just focus on making it better.
> >>
> >> On Mon, Aug 25, 2008 at 4:44 PM, Louis Ryan <lr...@google.com> wrote:
> >>
> >> > These numbers seem pretty solid so I believe it's fine to switch the
> >> > general
> >> > rewriter over to this new code and ditch the lexer based stuff. Nice
> >> > work.
> >> >
> >> > On Mon, Aug 25, 2008 at 2:27 PM, John Hjelmstad <fa...@google.com>
> >> > wrote:
> >> >
> >> > > I've implemented a parse-tree based ContentRewriter using the
> existing
> >> > > plumbing (same caching semantics et al), as
> >> > ParseTreeDefaultContentRewriter
> >> > > and ParseTreeHtmlRewriter, respectively. The latter contains
> >> > > essentially
> >> > > all
> >> > > rewriting functionality from the previous HtmlTagTransformer
> classes.
> >> > > The
> >> > > parse-tree based rewriter is now functionally equivalent to the
> >> > > previous
> >> > > rewriter. The new classes use a Caja-based HTML parser under the
> hood.
> >> > > This proves out the functional viability of a tree-based rewriter,
> but
> >> > > performance also needs to be assessed.
> >> > >
> >> > > I've gone ahead and profiled the comparative performance of each
> >> > rewriter,
> >> > > "Lexer" based and "ParseTree" based. To no surprise, "Lexer" wins
> out
> >> > every
> >> > > time essentially by definition, since obviously Caja's parser uses
> its
> >> > own
> >> > > lexer under the hood.
> >> > >
> >> > > Summary:
> >> > > * The performance of each is fundamentally O(n), though...
> >> > > * For any given input size, Lexer-based rewriting averages between
> 2.5
> >> > > -
> >> > > 3.5x faster than ParseTree-based (ie. c =~ 3.5 at worst).
> >> > > * By far, the majority of time involved in ParseTree-based
> >> > > optimization
> >> > is
> >> > > initial parsing: 75% of all processing.
> >> > >
> >> > > Details:
> >> > > 1. I wrote a simple rewriter/parser profiler which rewrites (the
> >> > > sample
> >> > > rewriter gadget's content * X repetitions) N times, recording the
> >> > resulting
> >> > > run time. The run time of parse-based rewriting degraded as N
> >> > > increased,
> >> > in
> >> > > all likelihood due to the additional cost of object management
> >> > (lexer-based
> >> > > rewriting involves few intermediate objects). Given that the results
> >> > > of
> >> > > rewriting will be variously cached, it's very unlikely that
> rewriting
> >> > will
> >> > > happen in immediate succession hundreds or thousands of times. As
> >> > > such, I
> >> > > fixed N = 1 to re-run the tests in relative isolation from one
> >> > > another.
> >> > > Results from a given run:
> >> > >
> >> > > LEX-BASED*100 rewriter, 1 runs in 177047 microsecs [177.04704]
> >> > > millis/run
> >> > >
> >> > > PARSE-BASED*100 rewriter, 1 runs in 609136 microsecs [609.136128]
> >> > > millis/run
> >> > >
> >> > > Parse/lex ratio: 3.4405327398939263
> >> > >
> >> > > LEX-BASED*50 rewriter, 1 runs in 43936 microsecs [43.936] millis/run
> >> > >
> >> > > PARSE-BASED*50 rewriter, 1 runs in 148980 microsecs [148.979968]
> >> > millis/run
> >> > >
> >> > > Parse/lex ratio: 3.3908412235979606
> >> > >
> >> > > LEX-BASED*10 rewriter, 1 runs in 3093 microsecs [3.092992]
> millis/run
> >> > >
> >> > > PARSE-BASED*10 rewriter, 1 runs in 11020 microsecs [11.020032]
> >> > > millis/run
> >> > >
> >> > > Parse/lex ratio: 3.5628839314581313
> >> > >
> >> > > LEX-BASED*1 rewriter, 1 runs in 600 microsecs [0.600064] millis/run
> >> > >
> >> > > PARSE-BASED*1 rewriter, 1 runs in 1819 microsecs [1.819136]
> millis/run
> >> > >
> >> > > Parse/lex ratio: 3.0316666666666667
> >> > >
> >> > >
> >> > > 2. Drilling down, I added simple operation profiling to each
> component
> >> > > of
> >> > > parse-tree rewriting: original parse (CajaHtmlParser); building
> >> > > mutable
> >> > > tree
> >> > > nodes; rewriting links; concatenating JS nodes; rewriting style
> >> > > blocks;
> >> > > rendering parse tree. I then reran the same tests.
> >> > >
> >> > > Results from subsequent run:
> >> > >
> >> > > LEX-BASED*100 rewriter, 1 runs in 165321 microsecs [165.32096]
> >> > > millis/run
> >> > >
> >> > > PARSE-BASED*100 rewriter, 1 runs in 646884 microsecs [646.88384]
> >> > millis/run
> >> > >
> >> > > Parse/lex ratio: 3.912896728183352
> >> > >
> >> > > [PARSE OPS]
> >> > >
> >> > > Op[style-rewrite] min:25.419ms, max:25.419ms, avg:25.419ms
> >> > >
> >> > > Op[render] min:36.851ms, max:36.851ms, avg:36.851ms
> >> > >
> >> > > Op[js-rewrite] min:53.983ms, max:53.983ms, avg:53.983ms
> >> > >
> >> > > Op[link-rewrite] min:31.136ms, max:31.136ms, avg:31.136ms
> >> > >
> >> > > Op[build-nodes] min:32.929ms, max:32.929ms, avg:32.929ms
> >> > >
> >> > > Op[parse] min:464.211ms, max:464.211ms, avg:464.211ms
> >> > >
> >> > >
> >> > > LEX-BASED*50 rewriter, 1 runs in 30684 microsecs [30.683904]
> >> > > millis/run
> >> > >
> >> > > PARSE-BASED*50 rewriter, 1 runs in 161132 microsecs [161.132032]
> >> > millis/run
> >> > >
> >> > > Parse/lex ratio: 5.251336201277539
> >> > >
> >> > > [PARSE OPS]
> >> > >
> >> > > Op[style-rewrite] min:8.581ms, max:8.581ms, avg:8.581ms
> >> > >
> >> > > Op[render] min:5.184ms, max:5.184ms, avg:5.184ms
> >> > >
> >> > > Op[js-rewrite] min:11.606ms, max:11.606ms, avg:11.606ms
> >> > >
> >> > > Op[link-rewrite] min:7.533ms, max:7.533ms, avg:7.533ms
> >> > >
> >> > > Op[build-nodes] min:3.41ms, max:3.41ms, avg:3.41ms
> >> > >
> >> > > Op[parse] min:121.367ms, max:121.367ms, avg:121.367ms
> >> > >
> >> > >
> >> > > LEX-BASED*10 rewriter, 1 runs in 3371 microsecs [3.371008]
> millis/run
> >> > >
> >> > > PARSE-BASED*10 rewriter, 1 runs in 10336 microsecs [10.336]
> millis/run
> >> > >
> >> > > Parse/lex ratio: 3.066152477009789
> >> > >
> >> > > [PARSE OPS]
> >> > >
> >> > > Op[style-rewrite] min:0.563ms, max:0.563ms, avg:0.563ms
> >> > >
> >> > > Op[render] min:0.678ms, max:0.678ms, avg:0.678ms
> >> > >
> >> > > Op[js-rewrite] min:1.374ms, max:1.374ms, avg:1.374ms
> >> > >
> >> > > Op[link-rewrite] min:0.718ms, max:0.718ms, avg:0.718ms
> >> > >
> >> > > Op[build-nodes] min:0.295ms, max:0.295ms, avg:0.295ms
> >> > >
> >> > > Op[parse] min:6.466ms, max:6.466ms, avg:6.466ms
> >> > >
> >> > >
> >> > > LEX-BASED*1 rewriter, 1 runs in 592 microsecs [0.592128] millis/run
> >> > >
> >> > > PARSE-BASED*1 rewriter, 1 runs in 2083 microsecs [2.083072]
> millis/run
> >> > >
> >> > > Parse/lex ratio: 3.518581081081081
> >> > >
> >> > > [PARSE OPS]
> >> > >
> >> > > Op[style-rewrite] min:0.082ms, max:0.082ms, avg:0.082ms
> >> > >
> >> > > Op[render] min:0.077ms, max:0.077ms, avg:0.077ms
> >> > >
> >> > > Op[js-rewrite] min:0.143ms, max:0.143ms, avg:0.143ms
> >> > >
> >> > > Op[link-rewrite] min:0.111ms, max:0.111ms, avg:0.111ms
> >> > >
> >> > > Op[build-nodes] min:0.043ms, max:0.043ms, avg:0.043ms
> >> > >
> >> > > Op[parse] min:1.437ms, max:1.437ms, avg:1.437ms
> >> > >
> >> > >
> >> > > 3. Drilling further, I wrote a separate test breaking out the
> >> > > performance
> >> > > components to parsing: calling the Caja DomParser.parseFragment(...)
> >> > > API,
> >> > > and subsequently wrapping the results of that call with
> ParsedHtmlNode
> >> > > objects to satisfy interface requirements:
> >> > >
> >> > > Typical run:
> >> > >
> >> > > Caja parser [size*1, runs:1] in 97538 microsecs [97.538048]
> millis/run
> >> > >
> >> > > [PARSER COMPONENTS]
> >> > >
> >> > > Op[raw-caja-parse] min:70.033ms, max:70.033ms, avg:70.033ms
> >> > >
> >> > > Op[build-parse-nodes] min:3.644ms, max:3.644ms, avg:3.644ms
> >> > >
> >> > >
> >> > > Caja parser [size*10, runs:1] in 42915 microsecs [42.915072]
> >> > > millis/run
> >> > >
> >> > > [PARSER COMPONENTS]
> >> > >
> >> > > Op[raw-caja-parse] min:34.676ms, max:34.676ms, avg:34.676ms
> >> > >
> >> > > Op[build-parse-nodes] min:7.148ms, max:7.148ms, avg:7.148ms
> >> > >
> >> > >
> >> > > Caja parser [size*50, runs:1] in 157048 microsecs [157.048064]
> >> > > millis/run
> >> > >
> >> > > [PARSER COMPONENTS]
> >> > >
> >> > > Op[raw-caja-parse] min:138.904ms, max:138.904ms, avg:138.904ms
> >> > >
> >> > > Op[build-parse-nodes] min:17.313ms, max:17.313ms, avg:17.313ms
> >> > >
> >> > >
> >> > > Caja parser [size*100, runs:1] in 236073 microsecs [236.07296]
> >> > > millis/run
> >> > >
> >> > > [PARSER COMPONENTS]
> >> > >
> >> > > Op[raw-caja-parse] min:173.743ms, max:173.743ms, avg:173.743ms
> >> > >
> >> > > Op[build-parse-nodes] min:43.295ms, max:43.295ms, avg:43.295ms
> >> > >
> >> > >
> >> > > Conclusions and Discussion:
> >> > >
> >> > > The purpose of this task was to prove that tree-based parsing is
> >> > > functionally viable, which has succeeded. Past that, it's a matter of
> >> > > choosing functionality vs. performance. Given that rewriting results are
> >> > > cached, perhaps even a ~3x increase in rewriting cost will be worth paying.
> >> > >
> >> > >
> >> > > That's particularly true given the new class of
> optimizations/rewrites
> >> > made
> >> > > possible with a parse tree, as well as some bugs that are more
> easily
> >> > fixed
> >> > > using it. For instance, I recently discovered a bug with the
> existing
> >> > > JS
> >> > > tag
> >> > > rewriter which ignores type="..." attributes and doesn't maintain
> "id"
> >> > > attributes in certain situations. These can be resolved in the lexer
> >> > case,
> >> > > but are clearer in the parser one.
> >> > >
> >> > >
> >> > > Lastly, as mentioned at the beginning of this thread, I plan to
> >> > > maintain
> >> > > the
> >> > > ability to manipulate a gadget by string, meaning a lexer-based
> >> > > approach
> >> > > can
> >> > > still be used where desired and parse-tree isn't required.
> >> > >
> >> > >
> >> > > Next steps:
> >> > >
> >> > > 1. My next step is to add modularity to content rewriting, but again
> >> > > without changing any caching semantics. Instead, rather than a single
> >> > > ContentRewriter being injected, a ContentRewriterRegistry will be. The
> >> > > default Registry will support injection of a single ContentRewriter to
> >> > > maintain backward compatibility for now.
> >> > >
> >> > > 2. GadgetSpec immutability restored, ensuring post-rewritten caching.
> >> > >
> >> > > 3. ContentRewriter API cleanup.
> >> > >
> >> > >
> >> > > --John
> >> > >
> >> > >
> >> > > On Tue, Aug 12, 2008 at 7:43 PM, John Hjelmstad <fa...@google.com>
> >> > wrote:
> >> > >
> >> > > > Interesting idea, and sounds fine to me. Concretely, this lets me
> >> > > sidestep
> >> > > > SHINDIG-500 for a little while, which is nice (though I'd _really_
> >> > > > like
> >> > > to
> >> > > > see the API cleanup go in! :)), in favor of migrating the existing
> >> > > rewriter
> >> > > > to a tree-based approach. Turns out I've been working on #1 and #2
> >> > > > independently anyway. I'll post a patch soon. Thanks!
> >> > > >
> >> > > > John
> >> > > >
> >> > > >
> >> > > > On Tue, Aug 12, 2008 at 7:14 PM, Louis Ryan <lr...@google.com>
> >> > > > wrote:
> >> > > >
> >> > > >> Can we prove this out incrementally, bottom-up? In general I think using
> >> > > >> DOM is the right thing to do from a rewriting standpoint. So here's how I
> >> > > >> propose we proceed:
> >> > > >>
> >> > > >> 1. If the Caja DOM is a little awkward, wrap it; if not, let's just use it
> >> > > >> as is. We can always resolve this later.
> >> > > >> 2. Change the existing content rewriters to use the DOM instead of a
> >> > > >> lexer; should be pretty easy. Maybe add some fancier rewriting like moving
> >> > > >> CSS into HEAD.
> >> > > >> 3. Do some perf testing, look into memory overhead of DOM transformation,
> >> > > >> etc.
> >> > > >> 4. Alter GadgetSpecs to retain the DOM when they are cached.
> >> > > >> 5. Alter the gadget rendering phase to serialize the content of the DOM to
> >> > > >> output.
> >> > > >> 6. Annotate the DOM at parse time to make render-time user-pref
> >> > > >> substitutions faster; this should be easy enough too...
> >> > > >>
> >> > > >> This should be enough to prove out the pipeline end-to-end and
> >> > identify
> >> > > >> any
> >> > > >> major perf niggles. Once this is done we can look into how to
> >> > > >> inject a
> >> > > >> rewriter pipeline into the parsing phase and the rendering phase.
> >> > > >>
> >> > > >> -Louis
> >> > > >>
> >> > > >>
> >> > > >>
> >> > > >> On Tue, Aug 12, 2008 at 5:57 PM, John Hjelmstad <
> fargo@google.com>
> >> > > wrote:
> >> > > >>
> >> > > >> > Re-responding in order to apply the last few exchanges to
> >> > > >> > google-caja-discuss@ (@gmail vs. @google membership issues).
> >> > > >> >
> >> > > >> > On Tue, Aug 12, 2008 at 4:48 PM, John Hjelmstad
> >> > > >> > <fa...@google.com>
> >> > > >> wrote:
> >> > > >> >
> >> > > >> > > Hello,
> >> > > >> > >
> >> > > >> > > While beginning to refactor the rewriter APIs I've discovered
> >> > > >> > > that
> >> > > >> there
> >> > > >> > > unfortunately is one semantic difference inherent to moving
> >> > > >> getContent()
> >> > > >> > and
> >> > > >> > > setContent() methods into the Gadget object (replacing
> >> > > >> > > View.get/setRewrittenContent()): BasicGadgetSpecFactory no
> >> > > >> > > longer
> >> > > >> caches
> >> > > >> > > rewritten content.
> >> > > >> > >
> >> > > >> > > I've written a discussion of this in issue SHINDIG-500, which
> >> > tracks
> >> > > >> this
> >> > > >> > > implementation sub-task:
> >> > > >> > https://issues.apache.org/jira/browse/SHINDIG-500
> >> > > >> > >
> >> > > >> > > To summarize:
> >> > > >> > > 1. Is this change acceptable for the time being?
> >> > > >> > > 2. I suggest that we can, at a later date, move fetching of
> >> > > >> > > gadget
> >> > > >> specs
> >> > > >> > > into GadgetServer while injecting a Gadget(Spec) cache there
> as
> >> > > well,
> >> > > >> > > offering finer-tuned control over caching characteristics.
> >> > > >> > >
> >> > > >> > > Thanks,
> >> > > >> > > John
> >> > > >> > >
> >> > > >> > >
> >> > > >> > > On Mon, Aug 11, 2008 at 2:20 PM, John Hjelmstad
> >> > > >> > > <fargo@google.com
> >> > >
> >> > > >> > wrote:
> >> > > >> > >
> >> > > >> > >> I understand these concerns, and should be clear that I
> don't
> >> > > >> (despite
> >> > > >> > my
> >> > > >> > >> personal interest in experimenting with the idea, agreed
> that
> >> > > >> > >> we
> >> > > >> don't
> >> > > >> > have
> >> > > >> > >> time for it at the moment) have any plans to introduce this
> >> > > >> > >> sort
> >> > of
> >> > > >> RPC
> >> > > >> > >> anywhere - certainly not in Shindig itself, as any such call
> >> > would
> >> > > be
> >> > > >> > hidden
> >> > > >> > >> behind an interface anyway.
> >> > > >> > >>
> >> > > >> > >> Putting the RPC hypothetical aside, I still feel that
> there's
> >> > value
> >> > > >> to
> >> > > >> > >> implementing HTML parsing in terms of an interface:
> >> > > >> > >> * Clearer separation of concerns/boundary between projects.
> >> > > >> > >>   - Corollary simplicity in testing.
> >> > > >> > >> * Clearer API for content manipulation (that doesn't require
> >> > > >> knowledge
> >> > > >> > of
> >> > > >> > >> Caja).
> >> > > >> > >>
> >> > > >> > >> I could be convinced otherwise, but at this point the code
> >> > involved
> >> > > >> > seems
> >> > > >> > >> of manageable size, so still worth doing. Thoughts?
> >> > > >> > >>
> >> > > >> > >> John
> >> > > >> > >>
> >> > > >> > >>
> >> > > >> > >>
> >> > > >> > >> On Mon, Aug 11, 2008 at 1:00 PM, Kevin Brown <
> etnu@google.com>
> >> > > >> wrote:
> >> > > >> > >>
> >> > > >> > >>> I agree with Louis -- that's just not practical. Every rewriting
> >> > > >> > >>> operation must work in real time. Caja's existing html parser is
> >> > > >> > >>> adequate for our needs, and we shouldn't go out of our way to tolerate
> >> > > >> > >>> every oddity of random web browsers (especially as it simply wouldn't
> >> > > >> > >>> work unless you farmed it out to *every* browser). Any new code needs to
> >> > > >> > >>> be grounded in practical, current needs, not theoretical options. We can
> >> > > >> > >>> always change code later if we find a real need for something like that.
> >> > > >> > >>> We have real work to do in the meantime.
> >> > > >> > >>>
> >> > > >> > >>> On Mon, Aug 11, 2008 at 12:06 PM, Louis Ryan
> >> > > >> > >>> <lr...@google.com>
> >> > > >> wrote:
> >> > > >> > >>>
> >> > > >> > >>> > John,
> >> > > >> > >>> >
> >> > > >> > >>> > From a practicality standpoint I'm a little nervous about this plan to
> >> > > >> > >>> > make RPC calls out of a Java process to a native process to fetch a
> >> > > >> > >>> > parse tree for transformations that have to occur in real time. I don't
> >> > > >> > >>> > think the motivating factor here is to accept all inputs that browsers
> >> > > >> > >>> > can. Gadget developers will tailor their markup to the platform as they
> >> > > >> > >>> > have done already. I would greatly prefer us to pick one 'good' parser
> >> > > >> > >>> > and stick with it for all the manageability and consumability benefits
> >> > > >> > >>> > that come with that decision. Perhaps I'm missing something here?
> >> > > >> > >>> >
> >> > > >> > >>> > -Louis
> >> > > >> > >>> >
> >> > > >> > >>> > On Mon, Aug 11, 2008 at 11:59 AM, John Hjelmstad <
> >> > > >> fargo@google.com>
> >> > > >> > >>> wrote:
> >> > > >> > >>> >
> >> > > >> > >>> > > On Fri, Aug 8, 2008 at 6:10 AM, Ben Laurie
> >> > > >> > >>> > > <benl@google.com
> >> > >
> >> > > >> > wrote:
> >> > > >> > >>> > >
> >> > > >> > >>> > > > [+google-caja-discuss]
> >> > > >> > >>> > > >
> >> > > >> > >>> > > > On Thu, Aug 7, 2008 at 9:27 PM, John Hjelmstad <
> >> > > >> fargo@google.com
> >> > > >> > >
> >> > > >> > >>> > wrote:
> >> > > >> > >>> > > > > On Thu, Aug 7, 2008 at 3:20 AM, Ben Laurie <
> >> > > benl@google.com
> >> > > >> >
> >> > > >> > >>> wrote:
> >> > > >> > >>> > > > >
> >> > > >> > >>> > > > >> On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <
> >> > > >> > >>> fargo@google.com>
> >> > > >> > >>> > > > wrote:
> >> > > >> > >>> > > > >> > This proposal effectively enables the renderer
> to
> >> > > become
> >> > > >> a
> >> > > >> > >>> > > multi-pass
> >> > > >> > >>> > > > >> > compiler for gadget content (essentially,
> >> > > >> > >>> > > > >> > arbitrary
> >> > web
> >> > > >> > >>> content).
> >> > > >> > >>> > > Such
> >> > > >> > >>> > > > a
> >> > > >> > >>> > > > >> > compiler can provide several benefits: static
> >> > > >> optimization
> >> > > >> > of
> >> > > >> > >>> > gadget
> >> > > >> > >>> > > > >> content
> >> > > >> > >>> > > > >> > (auto-proxying of images, whitespace/comment
> >> > > >> > >>> > > > >> > removal,
> >> > > >> > >>> > consolidation
> >> > > >> > >>> > > of
> >> > > >> > >>> > > > >> CSS
> >> > > >> > >>> > > > >> > blocks), security benefits (caja et al), new
> >> > > >> functionality
> >> > > >> > >>> > > (annotation
> >> > > >> > >>> > > > of
> >> > > >> > >>> > > > >> > content for stats, document analysis,
> >> > > container-specific
> >> > > >> > >>> > features),
> >> > > >> > >>> > > > etc.
> >> > > >> > >>> > > > >> To
> >> > > >> > >>> > > > >> > my knowledge no such infrastructure exists today
> >> > (with
> >> > > >> the
> >> > > >> > >>> > possible
> >> > > >> > >>> > > > >> > exception of Caja itself, which I'd like to
> >> > > >> > >>> > > > >> > dovetail
> >> > > with
> >> > > >> > this
> >> > > >> > >>> > > work).
> >> > > >> > >>> > > > >>
> >> > > >> > >>> > > > >> Caja clearly provides a large chunk of the code
> >> > > >> > >>> > > > >> you'd
> >> > > need
> >> > > >> for
> >> > > >> > >>> this.
> >> > > >> > >>> > > > >> I'd like to hear how we'd manage to avoid
> >> > > >> > >>> > > > >> duplication
> >> > > >> between
> >> > > >> > >>> the
> >> > > >> > >>> > two
> >> > > >> > >>> > > > >> projects.
> >> > > >> > >>> > > > >>
> >> > > >> > >>> > > > >> A generalised framework for manipulating content
> >> > > >> > >>> > > > >> sounds
> >> > > >> like a
> >> > > >> > >>> great
> >> > > >> > >>> > > > >> idea, but probably should not live in either of
> the
> >> > > >> > >>> > > > >> two
> >> > > >> > projects
> >> > > >> > >>> > (Caja
> >> > > >> > >>> > > > >> and Shindig) but rather should be shared by both
> of
> >> > them,
> >> > > I
> >> > > >> > >>> suspect.
> >> > > >> > >>> > > > >
> >> > > >> > >>> > > > >
> >> > > >> > >>> > > > > I agree on both counts. As I mentioned, the piece
> of
> >> > this
> >> > > >> idea
> >> > > >> > >>> that I
> >> > > >> > >>> > > > expect
> >> > > >> > >>> > > > > to change the most is the parse tree, and Caja's
> >> > > >> .parser.html
> >> > > >> > and
> >> > > >> > >>> > > > > .parser.css packages contain much of what I've
> thrown
> >> > > >> > >>> > > > > in
> >> > > >> here
> >> > > >> > as
> >> > > >> > >>> a
> >> > > >> > >>> > > base.
> >> > > >> > >>> > > > >
> >> > > >> > >>> > > > > My key requirements are:
> >> > > >> > >>> > > > > * Lightweight framework.
> >> > > >> > >>> > > > > * Parser modularity, mostly for HTML parsers (to
> >> > > >> > >>> > > > > re-use
> >> > > the
> >> > > >> > good
> >> > > >> > >>> work
> >> > > >> > >>> > > > done
> >> > > >> > >>> > > > > by WebKit or Gecko.. CSS/JS can come direct from
> Caja
> >> > I'd
> >> > > >> bet)
> >> > > >> > >>> > > > > * Automatic maintenance of DOM<->String conversion.
> >> > > >> > >>> > > > > * Easy to manipulate structure.
> >> > > >> > >>> > > >
> >> > > >> > >>> > > > I'm not sure what the value of parser modularity is?
> If
> >> > the
> >> > > >> > >>> resulting
> >> > > >> > >>> > > > tree is different, then that's a problem for people
> >> > > processing
> >> > > >> > the
> >> > > >> > >>> > > > tree. And if it is not, then why do we care?
> >> > > >> > >>> > >
> >> > > >> > >>> > >
> >> > > >> > >>> > > IMO the value of parser modularity is that the lenient
> >> > parsers
> >> > > >> > native
> >> > > >> > >>> to
> >> > > >> > >>> > > browsers can be used in place of those that might not
> >> > > >> > >>> > > accept
> >> > > all
> >> > > >> > >>> inputs.
> >> > > >> > >>> > > One
> >> > > >> > >>> > > could (and I'd like to) adapt WebKit or Gecko's parsing
> >> > > >> > >>> > > code
> >> > > >> into a
> >> > > >> > >>> > server
> >> > > >> > >>> > > that runs parallel to Shindig and provides a "local
> RPC"
> >> > > service
> >> > > >> > for
> >> > > >> > >>> > > parsing
> >> > > >> > >>> > > semi-structured HTML. The resulting tree for WebKit's
> >> > > >> > >>> > > parser
> >> > > >> might
> >> > > >> > be
> >> > > >> > >>> > > different than that for an XHTML parser, Gecko's
> parser,
> >> > etc,
> >> > > >> but
> >> > > >> > if
> >> > > >> > >>> the
> >> > > >> > >>> > > algorithm implemented atop it is rule-based rather than
> >> > > >> > >>> strict-structure
> >> > > >> > >>> > > based that should be fine, no?
> >> > > >> > >>> > >
> >> > > >> > >>> > >
> >> > > >> > >>> > > >
> >> > > >> > >>> > > >
> >> > > >> > >>> > > > >
> >> > > >> > >>> > > > > I'd love to see both projects share the same base
> >> > > >> > >>> > > > > syntax
> >> > > >> tree
> >> > > >> > >>> > > > > representations. I considered
> .parser.html(.DomTree)
> >> > > >> > >>> > > > > and
> >> > > >> > >>> .parser.css
> >> > > >> > >>> > > for
> >> > > >> > >>> > > > > these, but at the moment these appeared to be a
> >> > > >> > >>> > > > > little
> >> > > more
> >> > > >> > tied
> >> > > >> > >>> to
> >> > > >> > >>> > > > Caja's
> >> > > >> > >>> > > > > lexer/parser implementation than I preferred
> (though
> >> > > >> > >>> > > > > I
> >> > > admit
> >> > > >> > >>> > > > > AbstractParseTreeNode contains most of what's
> >> > > >> > >>> > > > > needed).
> >> > > >> > >>> > > > >
> >> > > >> > >>> > > > > To be sure, I don't see this as an end-all-be-all
> >> > > >> > transformation
> >> > > >> > >>> > system
> >> > > >> > >>> > > > in
> >> > > >> > >>> > > > > any way. I'd just like to put *something*
> reasonable
> >> > > >> > >>> > > > > in
> >> > > >> place
> >> > > >> > >>> that we
> >> > > >> > >>> > > can
> >> > > >> > >>> > > > > play with, provide some benefit, and enhance into a
> >> > truly
> >> > > >> > >>> > sophisticated
> >> > > >> > >>> > > > > vision of document rewriting.
> >> > > >> > >>> > > > >
> >> > > >> > >>> > > > >
> >> > > >> > >>> > > > >>
> >> > > >> > >>> > > > >>
> >> > > >> > >>> > > > >> >  c. Add Gadget.getParsedContent().
> >> > > >> > >>> > > > >> >    i. Returns a mutable GadgetContentParseTree
> >> > > >> > >>> > > > >> > used
> >> > to
> >> > > >> > >>> manipulate
> >> > > >> > >>> > > > Gadget
> >> > > >> > >>> > > > >> > Contents.
> >> > > >> > >>> > > > >> >    ii. Mutable tree calls back to the Gadget
> >> > > >> > >>> > > > >> > object
> >> > > >> > indicating
> >> > > >> > >>> > when
> >> > > >> > >>> > > > any
> >> > > >> > >>> > > > >> > change is made, and emits an error if
> setContent()
> >> > has
> >> > > >> been
> >> > > >> > >>> called
> >> > > >> > >>> > > in
> >> > > >> > >>> > > > the
> >> > > >> > >>> > > > >> > interim.
> >> > > >> > >>> > > > >>
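The callback behaviour described in item c above (a mutable tree that notifies its Gadget on every change and fails if setContent() was called in the interim) could be sketched roughly as follows. This is only an illustration under stated assumptions: StaleTreeSketch, its nested Gadget and Tree classes, the version counter, and the use of a plain text buffer in place of real parse nodes are all hypothetical and are not Shindig or Caja APIs.

```java
// All names here are hypothetical illustrations, not Shindig or Caja APIs.
final class StaleTreeSketch {

  /** Hypothetical mutable gadget: string content plus a version counter. */
  static final class Gadget {
    private String content;
    private int version; // bumped on every setContent() call

    Gadget(String content) {
      this.content = content;
    }

    String getContent() {
      return content;
    }

    void setContent(String newContent) {
      content = newContent;
      version++; // invalidates any tree handed out earlier
    }

    /** Returns a mutable "tree" tied to the current content version. */
    Tree getParsedContent() {
      return new Tree(this, version, content);
    }

    // Callback used by the tree to push its edits back into the string form.
    private void onTreeChanged(Tree tree) {
      content = tree.render();
    }
  }

  /** Stand-in for a parse tree: one text buffer instead of real nodes. */
  static final class Tree {
    private final Gadget owner;
    private final int versionAtParse;
    private final StringBuilder text;

    private Tree(Gadget owner, int versionAtParse, String content) {
      this.owner = owner;
      this.versionAtParse = versionAtParse;
      this.text = new StringBuilder(content);
    }

    /** Mutates the tree, then notifies the owning gadget of the change. */
    void append(String fragment) {
      if (owner.version != versionAtParse) {
        throw new IllegalStateException(
            "setContent() was called after this tree was created");
      }
      text.append(fragment);
      owner.onTreeChanged(this);
    }

    String render() {
      return text.toString();
    }
  }
}
```

In this sketch the string form is refreshed eagerly on every tree edit; a lazier variant could instead mark the string as dirty and re-render only when getContent() is called.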
> >> > > >> > >>> > > > >> In Caja we have been moving towards immutable
> >> > > >> > >>> > > > >> trees...
> >> > > >> > >>> > > > >
> >> > > >> > >>> > > > >
> >> > > >> > >>> > > > > Interested to hear more about this. The whole idea
> is
> >> > for
> >> > > >> the
> >> > > >> > >>> > gadget's
> >> > > >> > >>> > > > tree
> >> > > >> > >>> > > > > representation to be modifiable. Doing that with
> >> > immutable
> >> > > >> > trees
> >> > > >> > >>> to
> >> > > >> > >>> > me
> >> > > >> > >>> > > > > suggests that a rewriter would have to create a
> >> > completely
> >> > > >> new
> >> > > >> > >>> tree
> >> > > >> > >>> > and
> >> > > >> > >>> > > > set
> >> > > >> > >>> > > > > it as a representation of new content. That's
> >> > > >> > >>> > > > > convenient
> >> > > as
> >> > > >> far
> >> > > >> > >>> as
> >> > > >> > >>> > the
> >> > > >> > >>> > > > > Gadget's maintenance of String<->Tree
> representations
> >> > > >> > >>> > > > > is
> >> > > >> > >>> concerned...
> >> > > >> > >>> > > but
> >> > > >> > >>> > > > > seems pretty heavyweight for many types of edits:
> >> > in-situ
> >> > > >> > >>> > modifications
> >> > > >> > >>> > > > of
> >> > > >> > >>> > > > > text, content reordering, etc. That's particularly
> so
> >> > > >> > >>> > > > > in
> >> > a
> >> > > >> > >>> > > > single-threaded
> >> > > >> > >>> > > > > (viz rewriting) environment.
> >> > > >> > >>> > > >
> >> > > >> > >>> > > > Never having been entirely sold on the concept, I'll
> >> > > >> > >>> > > > let
> >> > > those
> >> > > >> on
> >> > > >> > >>> the
> >> > > >> > >>> > > > Caja team who advocate immutability explain why.
> >> > > >> > >>> > > >

Re: [Caja] Re: Content Rewriter Modularization: Design/Change

Posted by Ben Laurie <be...@google.com>.
On Wed, Aug 27, 2008 at 2:12 AM, John Hjelmstad <fa...@google.com> wrote:
> Agreed with these conclusions. There are a number of avenues open to us for
> rewriter optimization, and per stats, working with the Caja team on an
> optimized, ideally purpose-built-for-Shindig parser is tops on the list.

Caja also has optimisation on the road map. "Working" comes first, though :-)

Curious what "purpose-built-for-Shindig" would entail?

> Re: making ParseTreeDefaultContentRewriter the default... thanks! I plan to
> do this, but FYI I'll be making the actual change only after cleaning up
> plumbing a bit: making GadgetSpec immutable, then cleaning up the rewriter
> interface to consume a (mutable) Gadget, which maintains its own parse tree
> (per the original plan). This in turn will allow
> ParseTreeDefaultContentRewriter to be split into its constituent parts (link
> rewriter, js consolidator, style rewriter), and more importantly allow other
> rewriters to modularly take advantage of the parse tree. Until then, the
> parse tree is hidden from other rewriters.
> Expect these changes soon, along with resuscitation of the discussion on how
> best to handle caching of rewritten content.
> --John
>
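The split described above (separate link, JS and style rewriters applied in turn to one mutable Gadget) might look roughly like the following. This is a minimal sketch under stated assumptions: SketchGadget, SketchRewriter, SketchRewriterRegistry and RewriterChainDemo are hypothetical names, the gadget holds only a string (no parse tree), and the "/proxy?url=" path is invented for illustration. None of this is Shindig's actual interface.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a mutable Gadget; holds only string content here.
final class SketchGadget {
  private String content;

  SketchGadget(String content) {
    this.content = content;
  }

  String getContent() {
    return content;
  }

  void setContent(String content) {
    this.content = content;
  }
}

// Hypothetical single-method rewriter module.
interface SketchRewriter {
  void rewrite(SketchGadget gadget);
}

// Hypothetical registry that applies every registered rewriter in order.
final class SketchRewriterRegistry {
  private final List<SketchRewriter> rewriters = new ArrayList<>();

  SketchRewriterRegistry register(SketchRewriter rewriter) {
    rewriters.add(rewriter);
    return this;
  }

  void rewriteAll(SketchGadget gadget) {
    for (SketchRewriter rewriter : rewriters) {
      rewriter.rewrite(gadget);
    }
  }
}

final class RewriterChainDemo {
  static String run(String input) {
    SketchRewriterRegistry registry = new SketchRewriterRegistry()
        // Toy "link rewriter": routes src URLs through an assumed proxy path.
        .register(g -> g.setContent(
            g.getContent().replace("src=\"http://", "src=\"/proxy?url=http://")))
        // Toy "comment remover": strips HTML comments with a regex.
        .register(g -> g.setContent(
            g.getContent().replaceAll("(?s)<!--.*?-->", "")));
    SketchGadget gadget = new SketchGadget(input);
    registry.rewriteAll(gadget);
    return gadget.getContent();
  }

  public static void main(String[] args) {
    System.out.println(
        run("<!-- banner --><img src=\"http://example.com/a.png\">"));
  }
}
```

Because each module sees only the gadget, modules can be added, removed or reordered without touching one another; the string edits here stand in for what would be parse-tree manipulation in the design under discussion.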
> On Mon, Aug 25, 2008 at 5:21 PM, Kevin Brown <et...@google.com> wrote:
>>
>> I think any arguments one way or another contingent on caching should be
>> ignored, because caching isn't always going to be possible (many makeRequest
>> calls, proxied content, etc.)
>>
>> Instead, we should just make sure that the total time spent rewriting a
>> given piece of content is under some target -- say 50-100ms. We can optimize
>> aggressively to meet this goal, regardless of which approach we take. From
>> the initial results, I think we can simply work with the caja team to
>> improve the performance of the parser to bring it into that range. Having a
>> parse tree is more useful than not having one, so we should go ahead and use
>> the parser instead of the lexer and then just focus on making it better.
>>
>> On Mon, Aug 25, 2008 at 4:44 PM, Louis Ryan <lr...@google.com> wrote:
>>
>> > These numbers seem pretty solid, so I believe it's fine to switch the general
>> > rewriter over to this new code and ditch the lexer-based stuff. Nice work.
>> >
>> > On Mon, Aug 25, 2008 at 2:27 PM, John Hjelmstad <fa...@google.com>
>> > wrote:
>> >
>> > > I've implemented a parse-tree based ContentRewriter using the existing
>> > > plumbing (same caching semantics et al), as ParseTreeDefaultContentRewriter
>> > > and ParseTreeHtmlRewriter, respectively. The latter contains essentially all
>> > > rewriting functionality from the previous HtmlTagTransformer classes. The
>> > > parse-tree based rewriter is now functionally equivalent to the previous
>> > > rewriter. The new classes use a Caja-based HTML parser under the hood.
>> > > This proves out the functional viability of a tree-based rewriter, but
>> > > performance also needs to be assessed.
>> > >
>> > > I've gone ahead and profiled the comparative performance of each rewriter,
>> > > "Lexer" based and "ParseTree" based. To no surprise, "Lexer" wins out every
>> > > time essentially by definition, since obviously Caja's parser uses its own
>> > > lexer under the hood.
>> > >
>> > > Summary:
>> > > * The performance of each is fundamentally O(n), though...
>> > > * For any given input size, Lexer-based rewriting averages between 2.5 -
>> > > 3.5x faster than ParseTree-based (i.e. c =~ 3.5 at worst).
>> > > * By far, the majority of time involved in ParseTree-based optimization is
>> > > initial parsing: 75% of all processing.
>> > >
>> > > Details:
>> > > 1. I wrote a simple rewriter/parser profiler which rewrites (the sample
>> > > rewriter gadget's content * X repetitions) N times, recording the resulting
>> > > run time. The run time of parse-based rewriting degraded as N increased, in
>> > > all likelihood due to the additional cost of object management (lexer-based
>> > > rewriting involves few intermediate objects). Given that the results of
>> > > rewriting will be variously cached, it's very unlikely that rewriting will
>> > > happen in immediate succession hundreds or thousands of times. As such, I
>> > > fixed N = 1 to re-run the tests in relative isolation from one another.
>> > > Results from a given run:
>> > >
>> > > LEX-BASED*100 rewriter, 1 runs in 177047 microsecs [177.04704] millis/run
>> > >
>> > > PARSE-BASED*100 rewriter, 1 runs in 609136 microsecs [609.136128] millis/run
>> > >
>> > > Parse/lex ratio: 3.4405327398939263
>> > >
>> > > LEX-BASED*50 rewriter, 1 runs in 43936 microsecs [43.936] millis/run
>> > >
>> > > PARSE-BASED*50 rewriter, 1 runs in 148980 microsecs [148.979968] millis/run
>> > >
>> > > Parse/lex ratio: 3.3908412235979606
>> > >
>> > > LEX-BASED*10 rewriter, 1 runs in 3093 microsecs [3.092992] millis/run
>> > >
>> > > PARSE-BASED*10 rewriter, 1 runs in 11020 microsecs [11.020032] millis/run
>> > >
>> > > Parse/lex ratio: 3.5628839314581313
>> > >
>> > > LEX-BASED*1 rewriter, 1 runs in 600 microsecs [0.600064] millis/run
>> > >
>> > > PARSE-BASED*1 rewriter, 1 runs in 1819 microsecs [1.819136] millis/run
>> > >
>> > > Parse/lex ratio: 3.0316666666666667
>> > >
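A timing harness of the kind described in step 1 (content repeated X times, rewritten N times, parse/lex ratio reported) could be sketched as below. The actual profiler code is not shown in this thread, so everything here is an assumption: RewriteTimingSketch and its methods are hypothetical names, and the two lambdas in main are toy stand-ins rather than the real lexer-based and parse-tree rewriters.

```java
import java.util.function.UnaryOperator;

// Hypothetical harness; the real rewriters are replaced by toy lambdas.
final class RewriteTimingSketch {

  /** Repeats the sample content x times to scale the input size. */
  static String repeat(String sample, int x) {
    StringBuilder sb = new StringBuilder(sample.length() * x);
    for (int i = 0; i < x; i++) {
      sb.append(sample);
    }
    return sb.toString();
  }

  /** Runs the rewriter n times and returns total elapsed microseconds. */
  static long timeMicros(UnaryOperator<String> rewriter, String input, int n) {
    long start = System.nanoTime();
    for (int i = 0; i < n; i++) {
      rewriter.apply(input);
    }
    return (System.nanoTime() - start) / 1000L;
  }

  /** Parse/lex ratio as reported in the results above. */
  static double ratio(long parseMicros, long lexMicros) {
    return (double) parseMicros / lexMicros;
  }

  public static void main(String[] args) {
    String sample = "<img src=\"http://example.com/a.png\"><!-- note -->";
    // Toy stand-ins for the lexer-based and parse-tree rewriters.
    UnaryOperator<String> lexLike = s -> s.replace("http://", "/proxy?url=http://");
    UnaryOperator<String> parseLike = s -> s.replaceAll("(?s)<!--.*?-->", "");
    for (int x : new int[] {100, 50, 10, 1}) {
      String input = repeat(sample, x);
      long lex = timeMicros(lexLike, input, 1);   // N fixed to 1, as in the text
      long parse = timeMicros(parseLike, input, 1);
      System.out.println("size*" + x + ": lex=" + lex + "us, parse=" + parse + "us");
    }
  }
}
```

With N = 1 a single cold run includes JIT and class-loading effects, so absolute numbers from a harness like this are noisy; the ratio helper simply reproduces the arithmetic, e.g. 609136 / 177047 for the first pair of figures above.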
>> > >
>> > > 2. Drilling down, I added simple operation profiling to each component of
>> > > parse-tree rewriting: original parse (CajaHtmlParser); building mutable tree
>> > > nodes; rewriting links; concatenating JS nodes; rewriting style blocks;
>> > > rendering parse tree. I then reran the same tests.
>> > >
>> > > Results from subsequent run:
>> > >
>> > > LEX-BASED*100 rewriter, 1 runs in 165321 microsecs [165.32096] millis/run
>> > >
>> > > PARSE-BASED*100 rewriter, 1 runs in 646884 microsecs [646.88384] millis/run
>> > >
>> > > Parse/lex ratio: 3.912896728183352
>> > >
>> > > [PARSE OPS]
>> > >
>> > > Op[style-rewrite] min:25.419ms, max:25.419ms, avg:25.419ms
>> > >
>> > > Op[render] min:36.851ms, max:36.851ms, avg:36.851ms
>> > >
>> > > Op[js-rewrite] min:53.983ms, max:53.983ms, avg:53.983ms
>> > >
>> > > Op[link-rewrite] min:31.136ms, max:31.136ms, avg:31.136ms
>> > >
>> > > Op[build-nodes] min:32.929ms, max:32.929ms, avg:32.929ms
>> > >
>> > > Op[parse] min:464.211ms, max:464.211ms, avg:464.211ms
>> > >
>> > >
>> > > LEX-BASED*50 rewriter, 1 runs in 30684 microsecs [30.683904] millis/run
>> > >
>> > > PARSE-BASED*50 rewriter, 1 runs in 161132 microsecs [161.132032] millis/run
>> > >
>> > > Parse/lex ratio: 5.251336201277539
>> > >
>> > > [PARSE OPS]
>> > >
>> > > Op[style-rewrite] min:8.581ms, max:8.581ms, avg:8.581ms
>> > >
>> > > Op[render] min:5.184ms, max:5.184ms, avg:5.184ms
>> > >
>> > > Op[js-rewrite] min:11.606ms, max:11.606ms, avg:11.606ms
>> > >
>> > > Op[link-rewrite] min:7.533ms, max:7.533ms, avg:7.533ms
>> > >
>> > > Op[build-nodes] min:3.41ms, max:3.41ms, avg:3.41ms
>> > >
>> > > Op[parse] min:121.367ms, max:121.367ms, avg:121.367ms
>> > >
>> > >
>> > > LEX-BASED*10 rewriter, 1 runs in 3371 microsecs [3.371008] millis/run
>> > >
>> > > PARSE-BASED*10 rewriter, 1 runs in 10336 microsecs [10.336] millis/run
>> > >
>> > > Parse/lex ratio: 3.066152477009789
>> > >
>> > > [PARSE OPS]
>> > >
>> > > Op[style-rewrite] min:0.563ms, max:0.563ms, avg:0.563ms
>> > >
>> > > Op[render] min:0.678ms, max:0.678ms, avg:0.678ms
>> > >
>> > > Op[js-rewrite] min:1.374ms, max:1.374ms, avg:1.374ms
>> > >
>> > > Op[link-rewrite] min:0.718ms, max:0.718ms, avg:0.718ms
>> > >
>> > > Op[build-nodes] min:0.295ms, max:0.295ms, avg:0.295ms
>> > >
>> > > Op[parse] min:6.466ms, max:6.466ms, avg:6.466ms
>> > >
>> > >
>> > > LEX-BASED*1 rewriter, 1 runs in 592 microsecs [0.592128] millis/run
>> > >
>> > > PARSE-BASED*1 rewriter, 1 runs in 2083 microsecs [2.083072] millis/run
>> > >
>> > > Parse/lex ratio: 3.518581081081081
>> > >
>> > > [PARSE OPS]
>> > >
>> > > Op[style-rewrite] min:0.082ms, max:0.082ms, avg:0.082ms
>> > >
>> > > Op[render] min:0.077ms, max:0.077ms, avg:0.077ms
>> > >
>> > > Op[js-rewrite] min:0.143ms, max:0.143ms, avg:0.143ms
>> > >
>> > > Op[link-rewrite] min:0.111ms, max:0.111ms, avg:0.111ms
>> > >
>> > > Op[build-nodes] min:0.043ms, max:0.043ms, avg:0.043ms
>> > >
>> > > Op[parse] min:1.437ms, max:1.437ms, avg:1.437ms
>> > >
>> > >
>> > > 3. Drilling further, I wrote a separate test breaking out the
>> > > performance components of parsing: calling the Caja
>> > > DomParser.parseFragment(...) API, and subsequently wrapping the results
>> > > of that call with ParsedHtmlNode objects to satisfy interface
>> > > requirements:
>> > >
>> > > Typical run:
>> > >
>> > > Caja parser [size*1, runs:1] in 97538 microsecs [97.538048] millis/run
>> > >
>> > > [PARSER COMPONENTS]
>> > >
>> > > Op[raw-caja-parse] min:70.033ms, max:70.033ms, avg:70.033ms
>> > >
>> > > Op[build-parse-nodes] min:3.644ms, max:3.644ms, avg:3.644ms
>> > >
>> > >
>> > > Caja parser [size*10, runs:1] in 42915 microsecs [42.915072]
>> > > millis/run
>> > >
>> > > [PARSER COMPONENTS]
>> > >
>> > > Op[raw-caja-parse] min:34.676ms, max:34.676ms, avg:34.676ms
>> > >
>> > > Op[build-parse-nodes] min:7.148ms, max:7.148ms, avg:7.148ms
>> > >
>> > >
>> > > Caja parser [size*50, runs:1] in 157048 microsecs [157.048064]
>> > > millis/run
>> > >
>> > > [PARSER COMPONENTS]
>> > >
>> > > Op[raw-caja-parse] min:138.904ms, max:138.904ms, avg:138.904ms
>> > >
>> > > Op[build-parse-nodes] min:17.313ms, max:17.313ms, avg:17.313ms
>> > >
>> > >
>> > > Caja parser [size*100, runs:1] in 236073 microsecs [236.07296]
>> > > millis/run
>> > >
>> > > [PARSER COMPONENTS]
>> > >
>> > > Op[raw-caja-parse] min:173.743ms, max:173.743ms, avg:173.743ms
>> > >
>> > > Op[build-parse-nodes] min:43.295ms, max:43.295ms, avg:43.295ms
>> > >
>> > >
>> > > Conclusions and Discussion:
>> > >
>> > > The purpose of this task was to prove that tree-based parsing is
>> > > functionally viable, which has succeeded. Past that, it's a matter of
>> > > choosing functionality vs. performance. Given that rewriting results
>> > > are cached, perhaps even a ~3x increase in rewriting cost will be worth
>> > > paying.
>> > >
>> > >
>> > > That's particularly true given the new class of optimizations/rewrites
>> > > made possible with a parse tree, as well as some bugs that are more
>> > > easily fixed using it. For instance, I recently discovered a bug with
>> > > the existing JS tag rewriter, which ignores type="..." attributes and
>> > > doesn't maintain "id" attributes in certain situations. These can be
>> > > resolved in the lexer case, but are clearer in the parser one.
>> > >
>> > >
>> > > Lastly, as mentioned at the beginning of this thread, I plan to
>> > > maintain the ability to manipulate a gadget by string, meaning a
>> > > lexer-based approach can still be used where desired and a parse tree
>> > > isn't required.
>> > >
>> > >
>> > > Next steps:
>> > >
>> > > 1. My next step is to add modularity to content rewriting, but again
>> > > without changing any caching semantics. Rather than a single
>> > > ContentRewriter being injected, a ContentRewriterRegistry will be. The
>> > > default Registry will support injection of a single ContentRewriter to
>> > > maintain backward compatibility for now.
>> > >
>> > > 2. GadgetSpec immutability restored, ensuring post-rewritten caching.
>> > >
>> > > 3. ContentRewriter API cleanup.
>> > >
>> > >
>> > > --John
>> > >
>> > >
>> > > On Tue, Aug 12, 2008 at 7:43 PM, John Hjelmstad <fa...@google.com> wrote:
>> > >
>> > > > Interesting idea, and sounds fine to me. Concretely, this lets me
>> > > > sidestep SHINDIG-500 for a little while, which is nice (though I'd
>> > > > _really_ like to see the API cleanup go in! :)), in favor of migrating
>> > > > the existing rewriter to a tree-based approach. Turns out I've been
>> > > > working on #1 and #2 independently anyway. I'll post a patch soon.
>> > > > Thanks!
>> > > >
>> > > > John
>> > > >
>> > > >
>> > > > On Tue, Aug 12, 2008 at 7:14 PM, Louis Ryan <lr...@google.com> wrote:
>> > > >
>> > > >> Can we prove this out incrementally, bottom-up? In general I think
>> > > >> using DOM is the right thing to do from a rewriting standpoint. So
>> > > >> here's how I propose we proceed:
>> > > >>
>> > > >> 1. If the Caja dom is a little awkward, wrap it; if not, let's just
>> > > >> use it as is. We can always resolve this later.
>> > > >> 2. Change the existing content rewriters to use the DOM instead of a
>> > > >> lexer; this should be pretty easy. Maybe add some fancier rewriting
>> > > >> like moving CSS into HEAD.
>> > > >> 3. Do some perf testing, look into memory overhead of dom
>> > > >> transformation etc.
>> > > >> 4. Alter GadgetSpecs to retain the dom when they are cached.
>> > > >> 5. Alter the gadget rendering phase to serialize the content of the
>> > > >> dom to output.
>> > > >> 6. Annotate the dom at parse time to make render-time user-pref
>> > > >> substitutions faster; this should be easy enough too...
>> > > >>
>> > > >> This should be enough to prove out the pipeline end-to-end and
>> > > >> identify any major perf niggles. Once this is done we can look into
>> > > >> how to inject a rewriter pipeline into the parsing phase and the
>> > > >> rendering phase.
>> > > >>
>> > > >> -Louis
>> > > >>
>> > > >>
>> > > >>
>> > > >> On Tue, Aug 12, 2008 at 5:57 PM, John Hjelmstad <fa...@google.com>
>> > > wrote:
>> > > >>
>> > > >> > Re-responding in order to apply the last few exchanges to
>> > > >> > google-caja-discuss@ (@gmail vs. @google membership issues).
>> > > >> >
>> > > >> > On Tue, Aug 12, 2008 at 4:48 PM, John Hjelmstad
>> > > >> > <fa...@google.com>
>> > > >> wrote:
>> > > >> >
>> > > >> > > Hello,
>> > > >> > >
>> > > >> > > While beginning to refactor the rewriter APIs I've discovered
>> > > >> > > that
>> > > >> there
>> > > >> > > unfortunately is one semantic difference inherent to moving
>> > > >> getContent()
>> > > >> > and
>> > > >> > > setContent() methods into the Gadget object (replacing
>> > > >> > > View.get/setRewrittenContent()): BasicGadgetSpecFactory no
>> > > >> > > longer
>> > > >> caches
>> > > >> > > rewritten content.
>> > > >> > >
>> > > >> > > I've written a discussion of this in issue SHINDIG-500, which
>> > tracks
>> > > >> this
>> > > >> > > implementation sub-task:
>> > > >> > https://issues.apache.org/jira/browse/SHINDIG-500
>> > > >> > >
>> > > >> > > To summarize:
>> > > >> > > 1. Is this change acceptable for the time being?
>> > > >> > > 2. I suggest that we can, at a later date, move fetching of
>> > > >> > > gadget
>> > > >> specs
>> > > >> > > into GadgetServer while injecting a Gadget(Spec) cache there as
>> > > well,
>> > > >> > > offering finer-tuned control over caching characteristics.
>> > > >> > >
>> > > >> > > Thanks,
>> > > >> > > John
>> > > >> > >
>> > > >> > >
>> > > >> > > On Mon, Aug 11, 2008 at 2:20 PM, John Hjelmstad
>> > > >> > > <fargo@google.com
>> > >
>> > > >> > wrote:
>> > > >> > >
>> > > >> > >> I understand these concerns, and should be clear that I don't
>> > > >> (despite
>> > > >> > my
>> > > >> > >> personal interest in experimenting with the idea, agreed that
>> > > >> > >> we
>> > > >> don't
>> > > >> > have
>> > > >> > >> time for it at the moment) have any plans to introduce this
>> > > >> > >> sort
>> > of
>> > > >> RPC
>> > > >> > >> anywhere - certainly not in Shindig itself, as any such call
>> > would
>> > > be
>> > > >> > hidden
>> > > >> > >> behind an interface anyway.
>> > > >> > >>
>> > > >> > >> Putting the RPC hypothetical aside, I still feel that there's
>> > value
>> > > >> to
>> > > >> > >> implementing HTML parsing in terms of an interface:
>> > > >> > >> * Clearer separation of concerns/boundary between projects.
>> > > >> > >>   - Corollary simplicity in testing.
>> > > >> > >> * Clearer API for content manipulation (that doesn't require
>> > > >> knowledge
>> > > >> > of
>> > > >> > >> Caja).
>> > > >> > >>
>> > > >> > >> I could be convinced otherwise, but at this point the code
>> > involved
>> > > >> > seems
>> > > >> > >> of manageable size, so still worth doing. Thoughts?
>> > > >> > >>
>> > > >> > >> John
>> > > >> > >>
>> > > >> > >>
>> > > >> > >>
>> > > >> > >> On Mon, Aug 11, 2008 at 1:00 PM, Kevin Brown <et...@google.com>
>> > > >> wrote:
>> > > >> > >>
>> > > >> > >>> I agree with Louis -- that's just not practical. Every rewriting
>> > > >> > >>> operation must work in real time. Caja's existing html parser is
>> > > >> > >>> adequate for our needs, and we shouldn't go out of our way to
>> > > >> > >>> tolerate every oddity of random web browsers (especially as it
>> > > >> > >>> simply wouldn't work unless you farmed it out to *every*
>> > > >> > >>> browser). Any new code needs to be grounded in practical,
>> > > >> > >>> current needs, not theoretical options. We can always change
>> > > >> > >>> code later if we find a real need for something like that. We
>> > > >> > >>> have real work to do in the meantime.
>> > > >> > >>>
>> > > >> > >>> On Mon, Aug 11, 2008 at 12:06 PM, Louis Ryan
>> > > >> > >>> <lr...@google.com>
>> > > >> wrote:
>> > > >> > >>>
>> > > >> > >>> > John,
>> > > >> > >>> >
>> > > >> > >>> > From a practicality standpoint I'm a little nervous about this
>> > > >> > >>> > plan to make RPC calls out of a Java process to a native
>> > > >> > >>> > process to fetch a parse tree for transformations that have to
>> > > >> > >>> > occur in real time. I don't think the motivating factor here is
>> > > >> > >>> > to accept all inputs that browsers can. Gadget developers will
>> > > >> > >>> > tailor their markup to the platform as they have done already.
>> > > >> > >>> > I would greatly prefer us to pick one 'good' parser and stick
>> > > >> > >>> > with it for all the manageability and consumability benefits
>> > > >> > >>> > that come with that decision. Perhaps I'm missing something
>> > > >> > >>> > here?
>> > > >> > >>> >
>> > > >> > >>> > -Louis
>> > > >> > >>> >
>> > > >> > >>> > On Mon, Aug 11, 2008 at 11:59 AM, John Hjelmstad <
>> > > >> fargo@google.com>
>> > > >> > >>> wrote:
>> > > >> > >>> >
>> > > >> > >>> > > On Fri, Aug 8, 2008 at 6:10 AM, Ben Laurie
>> > > >> > >>> > > <benl@google.com
>> > >
>> > > >> > wrote:
>> > > >> > >>> > >
>> > > >> > >>> > > > [+google-caja-discuss]
>> > > >> > >>> > > >
>> > > >> > >>> > > > On Thu, Aug 7, 2008 at 9:27 PM, John Hjelmstad <
>> > > >> fargo@google.com
>> > > >> > >
>> > > >> > >>> > wrote:
>> > > >> > >>> > > > > On Thu, Aug 7, 2008 at 3:20 AM, Ben Laurie <
>> > > benl@google.com
>> > > >> >
>> > > >> > >>> wrote:
>> > > >> > >>> > > > >
>> > > >> > >>> > > > >> On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <
>> > > >> > >>> fargo@google.com>
>> > > >> > >>> > > > wrote:
>> > > >> > >>> > > > >> > This proposal effectively enables the renderer to
>> > > become
>> > > >> a
>> > > >> > >>> > > multi-pass
>> > > >> > >>> > > > >> > compiler for gadget content (essentially,
>> > > >> > >>> > > > >> > arbitrary
>> > web
>> > > >> > >>> content).
>> > > >> > >>> > > Such
>> > > >> > >>> > > > a
>> > > >> > >>> > > > >> > compiler can provide several benefits: static
>> > > >> optimization
>> > > >> > of
>> > > >> > >>> > gadget
>> > > >> > >>> > > > >> content
>> > > >> > >>> > > > >> > (auto-proxying of images, whitespace/comment
>> > > >> > >>> > > > >> > removal,
>> > > >> > >>> > consolidation
>> > > >> > >>> > > of
>> > > >> > >>> > > > >> CSS
>> > > >> > >>> > > > >> > blocks), security benefits (caja et al), new
>> > > >> functionality
>> > > >> > >>> > > (annotation
>> > > >> > >>> > > > of
>> > > >> > >>> > > > >> > content for stats, document analysis,
>> > > container-specific
>> > > >> > >>> > features),
>> > > >> > >>> > > > etc.
>> > > >> > >>> > > > >> To
>> > > >> > >>> > > > >> > my knowledge no such infrastructure exists today
>> > (with
>> > > >> the
>> > > >> > >>> > possible
>> > > >> > >>> > > > >> > exception of Caja itself, which I'd like to
>> > > >> > >>> > > > >> > dovetail
>> > > with
>> > > >> > this
>> > > >> > >>> > > work).
>> > > >> > >>> > > > >>
>> > > >> > >>> > > > >> Caja clearly provides a large chunk of the code
>> > > >> > >>> > > > >> you'd
>> > > need
>> > > >> for
>> > > >> > >>> this.
>> > > >> > >>> > > > >> I'd like to hear how we'd manage to avoid
>> > > >> > >>> > > > >> duplication
>> > > >> between
>> > > >> > >>> the
>> > > >> > >>> > two
>> > > >> > >>> > > > >> projects.
>> > > >> > >>> > > > >>
>> > > >> > >>> > > > >> A generalised framework for manipulating content
>> > > >> > >>> > > > >> sounds
>> > > >> like a
>> > > >> > >>> great
>> > > >> > >>> > > > >> idea, but probably should not live in either of the
>> > > >> > >>> > > > >> two
>> > > >> > projects
>> > > >> > >>> > (Caja
>> > > >> > >>> > > > >> and Shindig) but rather should be shared by both of
>> > them,
>> > > I
>> > > >> > >>> suspect.
>> > > >> > >>> > > > >
>> > > >> > >>> > > > >
>> > > >> > >>> > > > > I agree on both counts. As I mentioned, the piece of
>> > this
>> > > >> idea
>> > > >> > >>> that I
>> > > >> > >>> > > > expect
>> > > >> > >>> > > > > to change the most is the parse tree, and Caja's
>> > > >> .parser.html
>> > > >> > and
>> > > >> > >>> > > > > .parser.css packages contain much of what I've thrown
>> > > >> > >>> > > > > in
>> > > >> here
>> > > >> > as
>> > > >> > >>> a
>> > > >> > >>> > > base.
>> > > >> > >>> > > > >
>> > > >> > >>> > > > > My key requirements are:
>> > > >> > >>> > > > > * Lightweight framework.
>> > > >> > >>> > > > > * Parser modularity, mostly for HTML parsers (to
>> > > >> > >>> > > > > re-use
>> > > the
>> > > >> > good
>> > > >> > >>> work
>> > > >> > >>> > > > done
>> > > >> > >>> > > > > by WebKit or Gecko.. CSS/JS can come direct from Caja
>> > I'd
>> > > >> bet)
>> > > >> > >>> > > > > * Automatic maintenance of DOM<->String conversion.
>> > > >> > >>> > > > > * Easy to manipulate structure.
>> > > >> > >>> > > >
>> > > >> > >>> > > > I'm not sure what the value of parser modularity is? If
>> > the
>> > > >> > >>> resulting
>> > > >> > >>> > > > tree is different, then that's a problem for people
>> > > processing
>> > > >> > the
>> > > >> > >>> > > > tree. And if it is not, then why do we care?
>> > > >> > >>> > >
>> > > >> > >>> > >
>> > > >> > >>> > > IMO the value of parser modularity is that the lenient
>> > parsers
>> > > >> > native
>> > > >> > >>> to
>> > > >> > >>> > > browsers can be used in place of those that might not
>> > > >> > >>> > > accept
>> > > all
>> > > >> > >>> inputs.
>> > > >> > >>> > > One
>> > > >> > >>> > > could (and I'd like to) adapt WebKit or Gecko's parsing
>> > > >> > >>> > > code
>> > > >> into a
>> > > >> > >>> > server
>> > > >> > >>> > > that runs parallel to Shindig and provides a "local RPC"
>> > > service
>> > > >> > for
>> > > >> > >>> > > parsing
>> > > >> > >>> > > semi-structured HTML. The resulting tree for WebKit's
>> > > >> > >>> > > parser
>> > > >> might
>> > > >> > be
>> > > >> > >>> > > different than that for an XHTML parser, Gecko's parser,
>> > etc,
>> > > >> but
>> > > >> > if
>> > > >> > >>> the
>> > > >> > >>> > > algorithm implemented atop it is rule-based rather than
>> > > >> > >>> strict-structure
>> > > >> > >>> > > based that should be fine, no?
>> > > >> > >>> > >
>> > > >> > >>> > >
>> > > >> > >>> > > >
>> > > >> > >>> > > >
>> > > >> > >>> > > > >
>> > > >> > >>> > > > > I'd love to see both projects share the same base
>> > > >> > >>> > > > > syntax
>> > > >> tree
>> > > >> > >>> > > > > representations. I considered .parser.html(.DomTree)
>> > > >> > >>> > > > > and
>> > > >> > >>> .parser.css
>> > > >> > >>> > > for
>> > > >> > >>> > > > > these, but at the moment these appeared to be a
>> > > >> > >>> > > > > little
>> > > more
>> > > >> > tied
>> > > >> > >>> to
>> > > >> > >>> > > > Caja's
>> > > >> > >>> > > > > lexer/parser implementation than I preferred (though
>> > > >> > >>> > > > > I
>> > > admit
>> > > >> > >>> > > > > AbstractParseTreeNode contains most of what's
>> > > >> > >>> > > > > needed).
>> > > >> > >>> > > > >
>> > > >> > >>> > > > > To be sure, I don't see this as an end-all-be-all
>> > > >> > transformation
>> > > >> > >>> > system
>> > > >> > >>> > > > in
>> > > >> > >>> > > > > any way. I'd just like to put *something* reasonable
>> > > >> > >>> > > > > in
>> > > >> place
>> > > >> > >>> that we
>> > > >> > >>> > > can
>> > > >> > >>> > > > > play with, provide some benefit, and enhance into a
>> > truly
>> > > >> > >>> > sophisticated
>> > > >> > >>> > > > > vision of document rewriting.
>> > > >> > >>> > > > >
>> > > >> > >>> > > > >
>> > > >> > >>> > > > >>
>> > > >> > >>> > > > >>
>> > > >> > >>> > > > >> >  c. Add Gadget.getParsedContent().
>> > > >> > >>> > > > >> >    i. Returns a mutable GadgetContentParseTree
>> > > >> > >>> > > > >> > used
>> > to
>> > > >> > >>> manipulate
>> > > >> > >>> > > > Gadget
>> > > >> > >>> > > > >> > Contents.
>> > > >> > >>> > > > >> >    ii. Mutable tree calls back to the Gadget
>> > > >> > >>> > > > >> > object
>> > > >> > indicating
>> > > >> > >>> > when
>> > > >> > >>> > > > any
>> > > >> > >>> > > > >> > change is made, and emits an error if setContent()
>> > has
>> > > >> been
>> > > >> > >>> called
>> > > >> > >>> > > in
>> > > >> > >>> > > > the
>> > > >> > >>> > > > >> > interim.
>> > > >> > >>> > > > >>
>> > > >> > >>> > > > >> In Caja we have been moving towards immutable
>> > > >> > >>> > > > >> trees...
>> > > >> > >>> > > > >
>> > > >> > >>> > > > >
>> > > >> > >>> > > > > Interested to hear more about this. The whole idea is
>> > for
>> > > >> the
>> > > >> > >>> > gadget's
>> > > >> > >>> > > > tree
>> > > >> > >>> > > > > representation to be modifiable. Doing that with
>> > immutable
>> > > >> > trees
>> > > >> > >>> to
>> > > >> > >>> > me
>> > > >> > >>> > > > > suggests that a rewriter would have to create a
>> > completely
>> > > >> new
>> > > >> > >>> tree
>> > > >> > >>> > and
>> > > >> > >>> > > > set
>> > > >> > >>> > > > > it as a representation of new content. That's
>> > > >> > >>> > > > > convenient
>> > > as
>> > > >> far
>> > > >> > >>> as
>> > > >> > >>> > the
>> > > >> > >>> > > > > Gadget's maintenance of String<->Tree representations
>> > > >> > >>> > > > > is
>> > > >> > >>> concerned...
>> > > >> > >>> > > but
>> > > >> > >>> > > > > seems pretty heavyweight for many types of edits:
>> > in-situ
>> > > >> > >>> > modifications
>> > > >> > >>> > > > of
>> > > >> > >>> > > > > text, content reordering, etc. That's particularly so
>> > > >> > >>> > > > > in
>> > a
>> > > >> > >>> > > > single-threaded
>> > > >> > >>> > > > > (viz rewriting) environment.
>> > > >> > >>> > > >
>> > > >> > >>> > > > Never having been entirely sold on the concept, I'll
>> > > >> > >>> > > > let
>> > > those
>> > > >> on
>> > > >> > >>> the
>> > > >> > >>> > > > Caja team who advocate immutability explain why.
>> > > >> > >>> > > >
>> > > >> > >>> > >
>> > > >> > >>> >
>> > > >> > >>>
>> > > >> > >>
>> > > >> > >>
>> > > >> > >
>> > > >> >
>> > > >>
>> > > >
>> > > >
>> > >
>> >

Re: Content Rewriter Modularization: Design/Change

Posted by John Hjelmstad <fa...@google.com>.
Agreed with these conclusions. There are a number of avenues open to us for
rewriter optimization, and per the stats, working with the Caja team on an
optimized (ideally purpose-built-for-Shindig) parser is at the top of the
list.

Re: making ParseTreeDefaultContentRewriter the default... thanks! I plan to
do this, but FYI I'll make the actual change only after cleaning up the
plumbing a bit: first making GadgetSpec immutable, then cleaning up the
rewriter interface to consume a (mutable) Gadget, which maintains its own
parse tree (per the original plan). This in turn will allow
ParseTreeDefaultContentRewriter to be split into its constituent parts (link
rewriter, JS consolidator, style rewriter) and, more importantly, allow other
rewriters to take advantage of the parse tree in a modular way. Until then,
the parse tree is hidden from other rewriters.
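
To make the intended shape concrete, here is a minimal sketch of independent
rewriters sharing one mutable Gadget-owned parse tree. Everything in it is an
assumption for illustration: Node, Gadget, JsConsolidator and StyleRewriter
are hypothetical stand-ins, not the real Shindig or Caja classes, and the
rewrite(...) signature is simplified (the proposal has it return caching
metadata). Only the ContentRewriter and ContentRewriterRegistry names come
from this thread.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch only: none of these types are Shindig's or Caja's real classes.
final class RewriterSketch {

  /** Hypothetical parse-tree node: an element with children, or a text leaf (tag == null). */
  static final class Node {
    final String tag;
    String text;
    final List<Node> children = new ArrayList<>();

    Node(String tag, String text) {
      this.tag = tag;
      this.text = text;
    }

    void render(StringBuilder out) {
      if (tag == null) {
        out.append(text);
        return;
      }
      out.append('<').append(tag).append('>');
      for (Node child : children) {
        child.render(out);
      }
      out.append("</").append(tag).append('>');
    }
  }

  static Node text(String value) {
    return new Node(null, value);
  }

  static Node element(String tag, Node... kids) {
    Node n = new Node(tag, null);
    for (Node k : kids) {
      n.children.add(k);
    }
    return n;
  }

  /** Hypothetical mutable Gadget that owns its parse tree and renders it on demand. */
  static final class Gadget {
    private final Node root;
    Gadget(Node root) { this.root = root; }
    Node getParseTree() { return root; }
    String getContent() {
      StringBuilder out = new StringBuilder();
      root.render(out);
      return out.toString();
    }
  }

  /** Assumed single-method interface; the proposed one also returns caching metadata. */
  interface ContentRewriter {
    void rewrite(Gadget gadget);
  }

  /** Merges every top-level script element into the first one. */
  static final class JsConsolidator implements ContentRewriter {
    @Override public void rewrite(Gadget gadget) {
      Node first = null;
      Iterator<Node> it = gadget.getParseTree().children.iterator();
      while (it.hasNext()) {
        Node n = it.next();
        if (!"script".equals(n.tag)) continue;
        if (first == null) { first = n; continue; }
        first.children.addAll(n.children);
        it.remove();
      }
    }
  }

  /** Trims whitespace around the text of every top-level style block. */
  static final class StyleRewriter implements ContentRewriter {
    @Override public void rewrite(Gadget gadget) {
      for (Node n : gadget.getParseTree().children) {
        if (!"style".equals(n.tag)) continue;
        for (Node child : n.children) {
          if (child.tag == null) child.text = child.text.trim();
        }
      }
    }
  }

  /** Applies each registered rewriter, in registration order, to the same Gadget. */
  static final class ContentRewriterRegistry {
    private final List<ContentRewriter> rewriters = new ArrayList<>();
    ContentRewriterRegistry register(ContentRewriter r) { rewriters.add(r); return this; }
    void rewriteAll(Gadget gadget) {
      for (ContentRewriter r : rewriters) r.rewrite(gadget);
    }
  }

  static String demo() {
    Gadget gadget = new Gadget(element("div",
        element("script", text("a();")),
        element("style", text("  p{}  ")),
        element("script", text("b();"))));
    new ContentRewriterRegistry()
        .register(new JsConsolidator())
        .register(new StyleRewriter())
        .rewriteAll(gadget);
    return gadget.getContent();
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

The point of the sketch is that the tree is parsed once and each rewriter
only touches the nodes it cares about, so adding a rewriter does not add
another parse pass.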

Expect these changes soon, along with resuscitation of the discussion on how
best to handle caching of rewritten content.

--John

On Mon, Aug 25, 2008 at 5:21 PM, Kevin Brown <et...@google.com> wrote:

> I think any arguments one way or another contingent on caching should be
> ignored, because caching isn't always going to be possible (many
> makeRequest calls, proxied content, etc.)
>
> Instead, we should just make sure that the total time spent rewriting a
> given piece of content is under some target -- say 50-100ms. We can
> optimize aggressively to meet this goal, regardless of which approach we
> take. From the initial results, I think we can simply work with the caja
> team to improve the performance of the parser to bring it into that range.
> Having a parse tree is more useful than not having one, so we should go
> ahead and use the parser instead of the lexer and then just focus on making
> it better.
>
> On Mon, Aug 25, 2008 at 4:44 PM, Louis Ryan <lr...@google.com> wrote:
>
> > These numbers seem pretty solid, so I believe it's fine to switch the
> > general rewriter over to this new code and ditch the lexer-based stuff.
> > Nice work.
> >
> > On Mon, Aug 25, 2008 at 2:27 PM, John Hjelmstad <fa...@google.com> wrote:
> >
> > > I've implemented a parse-tree based ContentRewriter using the existing
> > > plumbing (same caching semantics et al), as
> > > ParseTreeDefaultContentRewriter and ParseTreeHtmlRewriter, respectively.
> > > The latter contains essentially all rewriting functionality from the
> > > previous HtmlTagTransformer classes. The parse-tree based rewriter is now
> > > functionally equivalent to the previous rewriter. The new classes use a
> > > Caja-based HTML parser under the hood. This proves out the functional
> > > viability of a tree-based rewriter, but performance also needs to be
> > > assessed.
> > >
> > > I've gone ahead and profiled the comparative performance of each
> > > rewriter, "Lexer" based and "ParseTree" based. To no surprise, "Lexer"
> > > wins out every time essentially by definition, since obviously Caja's
> > > parser uses its own lexer under the hood.
> > >
> > > Summary:
> > > * The performance of each is fundamentally O(n), though...
> > > * For any given input size, Lexer-based rewriting averages 2.5-3.5x
> > >   faster than ParseTree-based (i.e. c =~ 3.5 at worst).
> > > * By far, the majority of time involved in ParseTree-based optimization
> > >   is initial parsing: 75% of all processing.
> > >
> > > Details:
> > > 1. I wrote a simple rewriter/parser profiler which rewrites (the sample
> > > rewriter gadget's content * X repetitions) N times, recording the
> > > resulting run time. The run time of parse-based rewriting degraded as N
> > > increased, in all likelihood due to the additional cost of object
> > > management (lexer-based rewriting involves few intermediate objects).
> > > Given that the results of rewriting will be variously cached, it's very
> > > unlikely that rewriting will happen in immediate succession hundreds or
> > > thousands of times. As such, I fixed N = 1 to re-run the tests in
> > > relative isolation from one another. Results from a given run:
> > >
> > > LEX-BASED*100 rewriter, 1 runs in 177047 microsecs [177.04704] millis/run
> > >
> > > PARSE-BASED*100 rewriter, 1 runs in 609136 microsecs [609.136128] millis/run
> > >
> > > Parse/lex ratio: 3.4405327398939263
> > >
> > > LEX-BASED*50 rewriter, 1 runs in 43936 microsecs [43.936] millis/run
> > >
> > > PARSE-BASED*50 rewriter, 1 runs in 148980 microsecs [148.979968] millis/run
> > >
> > > Parse/lex ratio: 3.3908412235979606
> > >
> > > LEX-BASED*10 rewriter, 1 runs in 3093 microsecs [3.092992] millis/run
> > >
> > > PARSE-BASED*10 rewriter, 1 runs in 11020 microsecs [11.020032] millis/run
> > >
> > > Parse/lex ratio: 3.5628839314581313
> > >
> > > LEX-BASED*1 rewriter, 1 runs in 600 microsecs [0.600064] millis/run
> > >
> > > PARSE-BASED*1 rewriter, 1 runs in 1819 microsecs [1.819136] millis/run
> > >
> > > Parse/lex ratio: 3.0316666666666667
> > >
> > >
> > > 2. Drilling down, I added simple operation profiling to each component
> > > of parse-tree rewriting: original parse (CajaHtmlParser); building
> > > mutable tree nodes; rewriting links; concatenating JS nodes; rewriting
> > > style blocks; rendering parse tree. I then reran the same tests.
> > >
> > > Results from subsequent run:
> > >
> > > LEX-BASED*100 rewriter, 1 runs in 165321 microsecs [165.32096] millis/run
> > >
> > > PARSE-BASED*100 rewriter, 1 runs in 646884 microsecs [646.88384] millis/run
> > > Parse/lex ratio: 3.912896728183352
> > >
> > > [PARSE OPS]
> > >
> > > Op[style-rewrite] min:25.419ms, max:25.419ms, avg:25.419ms
> > >
> > > Op[render] min:36.851ms, max:36.851ms, avg:36.851ms
> > >
> > > Op[js-rewrite] min:53.983ms, max:53.983ms, avg:53.983ms
> > >
> > > Op[link-rewrite] min:31.136ms, max:31.136ms, avg:31.136ms
> > >
> > > Op[build-nodes] min:32.929ms, max:32.929ms, avg:32.929ms
> > >
> > > Op[parse] min:464.211ms, max:464.211ms, avg:464.211ms
> > >
> > >
> > > LEX-BASED*50 rewriter, 1 runs in 30684 microsecs [30.683904] millis/run
> > >
> > > PARSE-BASED*50 rewriter, 1 runs in 161132 microsecs [161.132032] millis/run
> > >
> > > Parse/lex ratio: 5.251336201277539
> > >
> > > [PARSE OPS]
> > >
> > > Op[style-rewrite] min:8.581ms, max:8.581ms, avg:8.581ms
> > >
> > > Op[render] min:5.184ms, max:5.184ms, avg:5.184ms
> > >
> > > Op[js-rewrite] min:11.606ms, max:11.606ms, avg:11.606ms
> > >
> > > Op[link-rewrite] min:7.533ms, max:7.533ms, avg:7.533ms
> > >
> > > Op[build-nodes] min:3.41ms, max:3.41ms, avg:3.41ms
> > >
> > > Op[parse] min:121.367ms, max:121.367ms, avg:121.367ms
> > >
> > >
> > > LEX-BASED*10 rewriter, 1 runs in 3371 microsecs [3.371008] millis/run
> > >
> > > PARSE-BASED*10 rewriter, 1 runs in 10336 microsecs [10.336] millis/run
> > >
> > > Parse/lex ratio: 3.066152477009789
> > >
> > > [PARSE OPS]
> > >
> > > Op[style-rewrite] min:0.563ms, max:0.563ms, avg:0.563ms
> > >
> > > Op[render] min:0.678ms, max:0.678ms, avg:0.678ms
> > >
> > > Op[js-rewrite] min:1.374ms, max:1.374ms, avg:1.374ms
> > >
> > > Op[link-rewrite] min:0.718ms, max:0.718ms, avg:0.718ms
> > >
> > > Op[build-nodes] min:0.295ms, max:0.295ms, avg:0.295ms
> > >
> > > Op[parse] min:6.466ms, max:6.466ms, avg:6.466ms
> > >
> > >
> > > LEX-BASED*1 rewriter, 1 runs in 592 microsecs [0.592128] millis/run
> > >
> > > PARSE-BASED*1 rewriter, 1 runs in 2083 microsecs [2.083072] millis/run
> > >
> > > Parse/lex ratio: 3.518581081081081
> > >
> > > [PARSE OPS]
> > >
> > > Op[style-rewrite] min:0.082ms, max:0.082ms, avg:0.082ms
> > >
> > > Op[render] min:0.077ms, max:0.077ms, avg:0.077ms
> > >
> > > Op[js-rewrite] min:0.143ms, max:0.143ms, avg:0.143ms
> > >
> > > Op[link-rewrite] min:0.111ms, max:0.111ms, avg:0.111ms
> > >
> > > Op[build-nodes] min:0.043ms, max:0.043ms, avg:0.043ms
> > >
> > > Op[parse] min:1.437ms, max:1.437ms, avg:1.437ms
> > >
> > >
> > > 3. Drilling further, I wrote a separate test breaking out the
> > > performance components of parsing: calling the Caja
> > > DomParser.parseFragment(...) API, and subsequently wrapping the results
> > > of that call with ParsedHtmlNode objects to satisfy interface
> > > requirements:
> > >
> > > Typical run:
> > >
> > > Caja parser [size*1, runs:1] in 97538 microsecs [97.538048] millis/run
> > >
> > > [PARSER COMPONENTS]
> > >
> > > Op[raw-caja-parse] min:70.033ms, max:70.033ms, avg:70.033ms
> > >
> > > Op[build-parse-nodes] min:3.644ms, max:3.644ms, avg:3.644ms
> > >
> > >
> > > Caja parser [size*10, runs:1] in 42915 microsecs [42.915072] millis/run
> > >
> > > [PARSER COMPONENTS]
> > >
> > > Op[raw-caja-parse] min:34.676ms, max:34.676ms, avg:34.676ms
> > >
> > > Op[build-parse-nodes] min:7.148ms, max:7.148ms, avg:7.148ms
> > >
> > >
> > > Caja parser [size*50, runs:1] in 157048 microsecs [157.048064] millis/run
> > >
> > > [PARSER COMPONENTS]
> > >
> > > Op[raw-caja-parse] min:138.904ms, max:138.904ms, avg:138.904ms
> > >
> > > Op[build-parse-nodes] min:17.313ms, max:17.313ms, avg:17.313ms
> > >
> > >
> > > Caja parser [size*100, runs:1] in 236073 microsecs [236.07296] millis/run
> > >
> > > [PARSER COMPONENTS]
> > >
> > > Op[raw-caja-parse] min:173.743ms, max:173.743ms, avg:173.743ms
> > >
> > > Op[build-parse-nodes] min:43.295ms, max:43.295ms, avg:43.295ms
> > >
> > >
> > > Conclusions and Discussion:
> > >
> > > The purpose of this task was to prove that tree-based parsing is
> > > functionally viable, which has succeeded. Past that, it's a matter of
> > > choosing functionality vs. performance. Given that rewriting results are
> > > cached, perhaps even a ~3x increase in rewriting cost will be worth
> > > paying.
> > >
> > >
> > > That's particularly true given the new class of optimizations/rewrites
> > > made possible with a parse tree, as well as some bugs that are more
> > > easily fixed using it. For instance, I recently discovered a bug with
> > > the existing JS tag rewriter, which ignores type="..." attributes and
> > > doesn't maintain "id" attributes in certain situations. These can be
> > > resolved in the lexer case, but are clearer in the parser one.
> > >
> > >
> > > Lastly, as mentioned at the beginning of this thread, I plan to maintain
> > > the ability to manipulate a gadget by string, meaning a lexer-based
> > > approach can still be used where desired and a parse tree isn't
> > > required.
> > >
> > >
> > > Next steps:
> > >
> > > 1. My next step is to add modularity to content rewriting, but again
> > > without changing any caching semantics. Rather than a single
> > > ContentRewriter being injected, a ContentRewriterRegistry will be. The
> > > default Registry will support injection of a single ContentRewriter to
> > > maintain backward compatibility for now.
> > >
> > > 2. GadgetSpec immutability restored, ensuring post-rewritten caching.
> > >
> > > 3. ContentRewriter API cleanup.
> > >
> > >
> > > --John
> > >
> > >
> > > On Tue, Aug 12, 2008 at 7:43 PM, John Hjelmstad <fa...@google.com> wrote:
> > >
> > > > Interesting idea, and sounds fine to me. Concretely, this lets me
> > > > sidestep SHINDIG-500 for a little while, which is nice (though I'd
> > > > _really_ like to see the API cleanup go in! :)), in favor of migrating
> > > > the existing rewriter to a tree-based approach. Turns out I've been
> > > > working on #1 and #2 independently anyway. I'll post a patch soon.
> > > > Thanks!
> > > >
> > > > John
> > > >
> > > >
> > > > On Tue, Aug 12, 2008 at 7:14 PM, Louis Ryan <lr...@google.com> wrote:
> > > >
> > > >> Can we prove this out incrementally, bottom-up? In general I think
> > > >> using DOM is the right thing to do from a rewriting standpoint. So
> > > >> here's how I propose we proceed:
> > > >>
> > > >> 1. If the Caja dom is a little awkward, wrap it; if not, let's just
> > > >> use it as is. We can always resolve this later.
> > > >> 2. Change the existing content rewriters to use the DOM instead of a
> > > >> lexer; this should be pretty easy. Maybe add some fancier rewriting
> > > >> like moving CSS into HEAD.
> > > >> 3. Do some perf testing, look into memory overhead of dom
> > > >> transformation etc.
> > > >> 4. Alter GadgetSpecs to retain the dom when they are cached.
> > > >> 5. Alter the gadget rendering phase to serialize the content of the
> > > >> dom to output.
> > > >> 6. Annotate the dom at parse time to make render-time user-pref
> > > >> substitutions faster; this should be easy enough too...
> > > >>
> > > >> This should be enough to prove out the pipeline end-to-end and
> > > >> identify any major perf niggles. Once this is done we can look into
> > > >> how to inject a rewriter pipeline into the parsing phase and the
> > > >> rendering phase.
> > > >>
> > > >> -Louis
> > > >>
> > > >>
> > > >>
> > > >> On Tue, Aug 12, 2008 at 5:57 PM, John Hjelmstad <fa...@google.com>
> > > wrote:
> > > >>
> > > >> > Re-responding in order to apply the last few exchanges to
> > > >> > google-caja-discuss@ (@gmail vs. @google membership issues).
> > > >> >
> > > >> > On Tue, Aug 12, 2008 at 4:48 PM, John Hjelmstad <fargo@google.com
> >
> > > >> wrote:
> > > >> >
> > > >> > > Hello,
> > > >> > >
> > > >> > > While beginning to refactor the rewriter APIs I've discovered
> that
> > > >> there
> > > >> > > unfortunately is one semantic difference inherent to moving
> > > >> getContent()
> > > >> > and
> > > >> > > setContent() methods into the Gadget object (replacing
> > > >> > > View.get/setRewrittenContent()): BasicGadgetSpecFactory no
> longer
> > > >> caches
> > > >> > > rewritten content.
> > > >> > >
> > > >> > > I've written a discussion of this in issue SHINDIG-500, which
> > tracks
> > > >> this
> > > >> > > implementation sub-task:
> > > >> > https://issues.apache.org/jira/browse/SHINDIG-500
> > > >> > >
> > > >> > > To summarize:
> > > >> > > 1. Is this change acceptable for the time being?
> > > >> > > 2. I suggest that we can, at a later date, move fetching of
> gadget
> > > >> specs
> > > >> > > into GadgetServer while injecting a Gadget(Spec) cache there as
> > > well,
> > > >> > > offering finer-tuned control over caching characteristics.
> > > >> > >
> > > >> > > Thanks,
> > > >> > > John
> > > >> > >
> > > >> > >
> > > >> > > On Mon, Aug 11, 2008 at 2:20 PM, John Hjelmstad <
> fargo@google.com
> > >
> > > >> > wrote:
> > > >> > >
> > > >> > >> I understand these concerns, and should be clear that I don't
> > > >> (despite
> > > >> > my
> > > >> > >> personal interest in experimenting with the idea, agreed that
> we
> > > >> don't
> > > >> > have
> > > >> > >> time for it at the moment) have any plans to introduce this
> sort
> > of
> > > >> RPC
> > > >> > >> anywhere - certainly not in Shindig itself, as any such call
> > would
> > > be
> > > >> > hidden
> > > >> > >> behind an interface anyway.
> > > >> > >>
> > > >> > >> Putting the RPC hypothetical aside, I still feel that there's
> > value
> > > >> to
> > > >> > >> implementing HTML parsing in terms of an interface:
> > > >> > >> * Clearer separation of concerns/boundary between projects.
> > > >> > >>   - Corollary simplicity in testing.
> > > >> > >> * Clearer API for content manipulation (that doesn't require
> > > >> knowledge
> > > >> > of
> > > >> > >> Caja).
> > > >> > >>
> > > >> > >> I could be convinced otherwise, but at this point the code
> > involved
> > > >> > seems
> > > >> > >> of manageable size, so still worth doing. Thoughts?
> > > >> > >>
> > > >> > >> John
> > > >> > >>
> > > >> > >>
> > > >> > >>
> > > >> > >> On Mon, Aug 11, 2008 at 1:00 PM, Kevin Brown <et...@google.com>
> > > >> wrote:
> > > >> > >>
> > > >> > >>> I agree with Louis -- that's just not practical. Every
> rewriting
> > > >> > >>> operation
> > > >> > >>> must work in real time. Caja's existing html parser is
> adequate
> > > for
> > > >> our
> > > >> > >>> needs, and we shouldn't go out of our way to tolerate every
> > oddity
> > > >> of
> > > >> > >>> random
> > > >> > >>> web browsers (especially as it simply wouldn't work unless you
> > > >> farmed
> > > >> > it
> > > >> > >>> out
> > > >> > >>> to *every* browser). Any new code needs to be grounded in
> > > practical,
> > > >> > >>> current
> > > >> > >>> needs, not theoretical options. We can always change code
> later
> > if
> > > >> we
> > > >> > >>> find a
> > > >> > >>> real need for something like that. We have real work to do in
> > the
> > > >> > >>> meantime.
> > > >> > >>>
> > > >> > >>> On Mon, Aug 11, 2008 at 12:06 PM, Louis Ryan <
> lryan@google.com>
> > > >> wrote:
> > > >> > >>>
> > > >> > >>> > John,
> > > >> > >>> >
> > > >> > >>> > From a practicality standpoint I'm a little nervous about
> this
> > > >> plan
> > > >> > to
> > > >> > >>> make
> > > >> > >>> > RPCs calls out of a Java process to a native process to
> fetch
> > a
> > > >> parse
> > > >> > >>> tree
> > > >> > >>> > for transformations that have to occur realtime. I don't
> think
> > > the
> > > >> > >>> > motivating factor here is to accept all inputs that browsers
> > > can.
> > > >> > >>> Gadget
> > > >> > >>> > developers will tailor their markup to the platform as they
> > have
> > > >> done
> > > >> > >>> > already. I would greatly prefer us to pick one 'good' parser
> > and
> > > >> > stick
> > > >> > >>> with
> > > >> > >>> > it for all the manageability and consumability benefits that
> > > come
> > > >> > with
> > > >> > >>> that
> > > >> > >>> > decision. Perhaps Im missing something here?
> > > >> > >>> >
> > > >> > >>> > -Louis
> > > >> > >>> >
> > > >> > >>> > On Mon, Aug 11, 2008 at 11:59 AM, John Hjelmstad <
> > > >> fargo@google.com>
> > > >> > >>> wrote:
> > > >> > >>> >
> > > >> > >>> > > On Fri, Aug 8, 2008 at 6:10 AM, Ben Laurie <
> benl@google.com
> > >
> > > >> > wrote:
> > > >> > >>> > >
> > > >> > >>> > > > [+google-caja-discuss]
> > > >> > >>> > > >
> > > >> > >>> > > > On Thu, Aug 7, 2008 at 9:27 PM, John Hjelmstad <
> > > >> fargo@google.com
> > > >> > >
> > > >> > >>> > wrote:
> > > >> > >>> > > > > On Thu, Aug 7, 2008 at 3:20 AM, Ben Laurie <
> > > benl@google.com
> > > >> >
> > > >> > >>> wrote:
> > > >> > >>> > > > >
> > > >> > >>> > > > >> On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <
> > > >> > >>> fargo@google.com>
> > > >> > >>> > > > wrote:
> > > >> > >>> > > > >> > This proposal effectively enables the renderer to
> > > become
> > > >> a
> > > >> > >>> > > multi-pass
> > > >> > >>> > > > >> > compiler for gadget content (essentially, arbitrary
> > web
> > > >> > >>> content).
> > > >> > >>> > > Such
> > > >> > >>> > > > a
> > > >> > >>> > > > >> > compiler can provide several benefits: static
> > > >> optimization
> > > >> > of
> > > >> > >>> > gadget
> > > >> > >>> > > > >> content
> > > >> > >>> > > > >> > (auto-proxying of images, whitespace/comment
> removal,
> > > >> > >>> > consolidation
> > > >> > >>> > > of
> > > >> > >>> > > > >> CSS
> > > >> > >>> > > > >> > blocks), security benefits (caja et al), new
> > > >> functionality
> > > >> > >>> > > (annotation
> > > >> > >>> > > > of
> > > >> > >>> > > > >> > content for stats, document analysis,
> > > container-specific
> > > >> > >>> > features),
> > > >> > >>> > > > etc.
> > > >> > >>> > > > >> To
> > > >> > >>> > > > >> > my knowledge no such infrastructure exists today
> > (with
> > > >> the
> > > >> > >>> > possible
> > > >> > >>> > > > >> > exception of Caja itself, which I'd like to
> dovetail
> > > with
> > > >> > this
> > > >> > >>> > > work).
> > > >> > >>> > > > >>
> > > >> > >>> > > > >> Caja clearly provides a large chunk of the code you'd
> > > need
> > > >> for
> > > >> > >>> this.
> > > >> > >>> > > > >> I'd like to hear how we'd manage to avoid duplication
> > > >> between
> > > >> > >>> the
> > > >> > >>> > two
> > > >> > >>> > > > >> projects.
> > > >> > >>> > > > >>
> > > >> > >>> > > > >> A generalised framework for manipulating content
> sounds
> > > >> like a
> > > >> > >>> great
> > > >> > >>> > > > >> idea, but probably should not live in either of the
> two
> > > >> > projects
> > > >> > >>> > (Caja
> > > >> > >>> > > > >> and Shindig) but rather should be shared by both of
> > them,
> > > I
> > > >> > >>> suspect.
> > > >> > >>> > > > >
> > > >> > >>> > > > >
> > > >> > >>> > > > > I agree on both counts. As I mentioned, the piece of
> > this
> > > >> idea
> > > >> > >>> that I
> > > >> > >>> > > > expect
> > > >> > >>> > > > > to change the most is the parse tree, and Caja's
> > > >> .parser.html
> > > >> > and
> > > >> > >>> > > > > .parser.css packages contain much of what I've thrown
> in
> > > >> here
> > > >> > as
> > > >> > >>> a
> > > >> > >>> > > base.
> > > >> > >>> > > > >
> > > >> > >>> > > > > My key requirements are:
> > > >> > >>> > > > > * Lightweight framework.
> > > >> > >>> > > > > * Parser modularity, mostly for HTML parsers (to
> re-use
> > > the
> > > >> > good
> > > >> > >>> work
> > > >> > >>> > > > done
> > > >> > >>> > > > > by WebKit or Gecko.. CSS/JS can come direct from Caja
> > I'd
> > > >> bet)
> > > >> > >>> > > > > * Automatic maintenance of DOM<->String conversion.
> > > >> > >>> > > > > * Easy to manipulate structure.
> > > >> > >>> > > >
> > > >> > >>> > > > I'm not sure what the value of parser modularity is? If
> > the
> > > >> > >>> resulting
> > > >> > >>> > > > tree is different, then that's a problem for people
> > > processing
> > > >> > the
> > > >> > >>> > > > tree. And if it is not, then why do we care?
> > > >> > >>> > >
> > > >> > >>> > >
> > > >> > >>> > > IMO the value of parser modularity is that the lenient
> > parsers
> > > >> > native
> > > >> > >>> to
> > > >> > >>> > > browsers can be used in place of those that might not
> accept
> > > all
> > > >> > >>> inputs.
> > > >> > >>> > > One
> > > >> > >>> > > could (and I'd like to) adapt WebKit or Gecko's parsing
> code
> > > >> into a
> > > >> > >>> > server
> > > >> > >>> > > that runs parallel to Shindig and provides a "local RPC"
> > > service
> > > >> > for
> > > >> > >>> > > parsing
> > > >> > >>> > > semi-structured HTML. The resulting tree for WebKit's
> parser
> > > >> might
> > > >> > be
> > > >> > >>> > > different than that for an XHTML parser, Gecko's parser,
> > etc,
> > > >> but
> > > >> > if
> > > >> > >>> the
> > > >> > >>> > > algorithm implemented atop it is rule-based rather than
> > > >> > >>> strict-structure
> > > >> > >>> > > based that should be fine, no?
> > > >> > >>> > >
> > > >> > >>> > >
> > > >> > >>> > > >
> > > >> > >>> > > >
> > > >> > >>> > > > >
> > > >> > >>> > > > > I'd love to see both projects share the same base
> syntax
> > > >> tree
> > > >> > >>> > > > > representations. I considered .parser.html(.DomTree)
> and
> > > >> > >>> .parser.css
> > > >> > >>> > > for
> > > >> > >>> > > > > these, but at the moment these appeared to be a little
> > > more
> > > >> > tied
> > > >> > >>> to
> > > >> > >>> > > > Caja's
> > > >> > >>> > > > > lexer/parser implementation than I preferred (though I
> > > admit
> > > >> > >>> > > > > AbstractParseTreeNode contains most of what's needed).
> > > >> > >>> > > > >
> > > >> > >>> > > > > To be sure, I don't see this as an end-all-be-all
> > > >> > transformation
> > > >> > >>> > system
> > > >> > >>> > > > in
> > > >> > >>> > > > > any way. I'd just like to put *something* reasonable
> in
> > > >> place
> > > >> > >>> that we
> > > >> > >>> > > can
> > > >> > >>> > > > > play with, provide some benefit, and enhance into a
> > truly
> > > >> > >>> > sophisticated
> > > >> > >>> > > > > vision of document rewriting.
> > > >> > >>> > > > >
> > > >> > >>> > > > >
> > > >> > >>> > > > >>
> > > >> > >>> > > > >>
> > > >> > >>> > > > >> >  c. Add Gadget.getParsedContent().
> > > >> > >>> > > > >> >    i. Returns a mutable GadgetContentParseTree used
> > to
> > > >> > >>> manipulate
> > > >> > >>> > > > Gadget
> > > >> > >>> > > > >> > Contents.
> > > >> > >>> > > > >> >    ii. Mutable tree calls back to the Gadget object
> > > >> > indicating
> > > >> > >>> > when
> > > >> > >>> > > > any
> > > >> > >>> > > > >> > change is made, and emits an error if setContent()
> > has
> > > >> been
> > > >> > >>> called
> > > >> > >>> > > in
> > > >> > >>> > > > the
> > > >> > >>> > > > >> > interim.
> > > >> > >>> > > > >>
> > > >> > >>> > > > >> In Caja we have been moving towards immutable
> trees...
> > > >> > >>> > > > >
> > > >> > >>> > > > >
> > > >> > >>> > > > > Interested to hear more about this. The whole idea is
> > for
> > > >> the
> > > >> > >>> > gadget's
> > > >> > >>> > > > tree
> > > >> > >>> > > > > representation to be modifiable. Doing that with
> > immutable
> > > >> > trees
> > > >> > >>> to
> > > >> > >>> > me
> > > >> > >>> > > > > suggests that a rewriter would have to create a
> > completely
> > > >> new
> > > >> > >>> tree
> > > >> > >>> > and
> > > >> > >>> > > > set
> > > >> > >>> > > > > it as a representation of new content. That's
> convenient
> > > as
> > > >> far
> > > >> > >>> as
> > > >> > >>> > the
> > > >> > >>> > > > > Gadget's maintenance of String<->Tree representations
> is
> > > >> > >>> concerned...
> > > >> > >>> > > but
> > > >> > >>> > > > > seems pretty heavyweight for many types of edits:
> > in-situ
> > > >> > >>> > modifications
> > > >> > >>> > > > of
> > > >> > >>> > > > > text, content reordering, etc. That's particularly so
> in
> > a
> > > >> > >>> > > > single-threaded
> > > >> > >>> > > > > (viz rewriting) environment.
> > > >> > >>> > > >
> > > >> > >>> > > > Never having been entirely sold on the concept, I'll let
> > > those
> > > >> on
> > > >> > >>> the
> > > >> > >>> > > > Caja team who advocate immutability explain why.
> > > >> > >>> > > >
> > > >> > >>> > >
> > > >> > >>> >
> > > >> > >>>
> > > >> > >>
> > > >> > >>
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>

Re: Content Rewriter Modularization: Design/Change

Posted by Kevin Brown <et...@google.com>.
I think any arguments one way or another contingent on caching should be
ignored, because caching isn't always going to be possible (many makeRequest
calls, proxied content, etc.)

Instead, we should just make sure that the total time spent rewriting a
given piece of content is under some target -- say 50-100ms. We can optimize
aggressively to meet this goal, regardless of which approach we take. From
the initial results, I think we can simply work with the Caja team to
improve the performance of the parser to bring it into that range. Having a
parse tree is more useful than not having one, so we should go ahead and use
the parser instead of the lexer and then just focus on making it better.
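
A minimal sketch of how a per-request time target like the one above could be checked, assuming a rewriter can be modeled as a String-to-String function. RewriteBudget and Timed are hypothetical names for illustration only, not Shindig classes, and the 100ms figure is just the upper end of the range suggested above.

```java
import java.util.function.UnaryOperator;

/** Hypothetical helper: times one rewrite pass and compares it to a budget. */
final class RewriteBudget {
  /** Result of a timed rewrite: the output plus elapsed wall-clock time. */
  static final class Timed {
    final String output;
    final long elapsedMicros;

    Timed(String output, long elapsedMicros) {
      this.output = output;
      this.elapsedMicros = elapsedMicros;
    }

    /** True when the pass finished within the given budget in milliseconds. */
    boolean withinBudget(long budgetMillis) {
      return elapsedMicros <= budgetMillis * 1000L;
    }
  }

  /** Runs the rewriter once on the content and measures it with System.nanoTime(). */
  static Timed time(UnaryOperator<String> rewriter, String content) {
    long start = System.nanoTime();
    String out = rewriter.apply(content);
    long elapsedMicros = (System.nanoTime() - start) / 1000L;
    return new Timed(out, elapsedMicros);
  }
}
```

A caller would wrap the real rewriter in time(...) and log or alert when withinBudget(100) is false. A single wall-clock sample is noisy, so in practice the check would be applied to a percentile over many requests rather than to one run.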

On Mon, Aug 25, 2008 at 4:44 PM, Louis Ryan <lr...@google.com> wrote:

> These numbers seem pretty solid so I believe it's fine to switch the general
> rewriter over to this new code and ditch the lexer-based stuff. Nice work.
>
> On Mon, Aug 25, 2008 at 2:27 PM, John Hjelmstad <fa...@google.com> wrote:
>
> > I've implemented a parse-tree based ContentRewriter using the existing
> > plumbing (same caching semantics et al), as ParseTreeDefaultContentRewriter
> > and ParseTreeHtmlRewriter, respectively. The latter contains essentially
> > all rewriting functionality from the previous HtmlTagTransformer classes.
> > The parse-tree based rewriter is now functionally equivalent to the
> > previous rewriter. The new classes use a Caja-based HTML parser under the
> > hood. This proves out the functional viability of a tree-based rewriter,
> > but performance also needs to be assessed.
> >
> > I've gone ahead and profiled the comparative performance of each rewriter,
> > "Lexer" based and "ParseTree" based. To no surprise, "Lexer" wins out
> > every time essentially by definition, since obviously Caja's parser uses
> > its own lexer under the hood.
> >
> > Summary:
> > * The performance of each is fundamentally O(n), though...
> > * For any given input size, Lexer-based rewriting averages between 2.5 -
> > 3.5x faster than ParseTree-based (i.e. a constant factor c =~ 3.5 at worst).
> > * By far, the majority of time involved in ParseTree-based optimization is
> > initial parsing: 75% of all processing.
> >
> > Details:
> > 1. I wrote a simple rewriter/parser profiler which rewrites (the sample
> > rewriter gadget's content * X repetitions) N times, recording the
> > resulting run time. The run time of parse-based rewriting degraded as N
> > increased, in all likelihood due to the additional cost of object
> > management (lexer-based rewriting involves few intermediate objects).
> > Given that the results of rewriting will be variously cached, it's very
> > unlikely that rewriting will happen in immediate succession hundreds or
> > thousands of times. As such, I fixed N = 1 to re-run the tests in relative
> > isolation from one another. Results from a given run:
> >
> > LEX-BASED*100 rewriter, 1 runs in 177047 microsecs [177.04704] millis/run
> >
> > PARSE-BASED*100 rewriter, 1 runs in 609136 microsecs [609.136128] millis/run
> >
> > Parse/lex ratio: 3.4405327398939263
> >
> > LEX-BASED*50 rewriter, 1 runs in 43936 microsecs [43.936] millis/run
> >
> > PARSE-BASED*50 rewriter, 1 runs in 148980 microsecs [148.979968] millis/run
> >
> > Parse/lex ratio: 3.3908412235979606
> >
> > LEX-BASED*10 rewriter, 1 runs in 3093 microsecs [3.092992] millis/run
> >
> > PARSE-BASED*10 rewriter, 1 runs in 11020 microsecs [11.020032] millis/run
> >
> > Parse/lex ratio: 3.5628839314581313
> >
> > LEX-BASED*1 rewriter, 1 runs in 600 microsecs [0.600064] millis/run
> >
> > PARSE-BASED*1 rewriter, 1 runs in 1819 microsecs [1.819136] millis/run
> >
> > Parse/lex ratio: 3.0316666666666667
> >
> >
> > 2. Drilling down, I added simple operation profiling to each component of
> > parse-tree rewriting: original parse (CajaHtmlParser); building mutable
> > tree nodes; rewriting links; concatenating JS nodes; rewriting style
> > blocks; rendering parse tree. I then reran the same tests.
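
A minimal sketch of the kind of per-operation profiling described in the quoted step above, assuming each rewriting stage can be wrapped in a Runnable. OpProfiler and its methods are hypothetical names for illustration; the profiler actually used for these numbers is not shown in the thread.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical per-operation profiler tracking min, max and average durations. */
final class OpProfiler {
  private static final class Stats {
    long count;
    double minMs = Double.MAX_VALUE;
    double maxMs = 0;
    double totalMs;
  }

  // Insertion-ordered so the report lists operations in the order first seen.
  private final Map<String, Stats> ops = new LinkedHashMap<>();

  /** Records one measured duration, in milliseconds, for the named operation. */
  void record(String op, double millis) {
    Stats s = ops.computeIfAbsent(op, k -> new Stats());
    s.count++;
    s.minMs = Math.min(s.minMs, millis);
    s.maxMs = Math.max(s.maxMs, millis);
    s.totalMs += millis;
  }

  /** Times a unit of work with System.nanoTime() and records the result. */
  void time(String op, Runnable work) {
    long start = System.nanoTime();
    work.run();
    record(op, (System.nanoTime() - start) / 1_000_000.0);
  }

  /** Average duration in milliseconds for an operation that has been recorded. */
  double avg(String op) {
    Stats s = ops.get(op);
    return s.totalMs / s.count;
  }

  /** Formats one line per operation, similar in shape to the Op[...] lines in the thread. */
  String report() {
    StringBuilder sb = new StringBuilder();
    for (Map.Entry<String, Stats> e : ops.entrySet()) {
      Stats s = e.getValue();
      sb.append(String.format(Locale.ROOT, "Op[%s] min:%.3fms, max:%.3fms, avg:%.3fms%n",
          e.getKey(), s.minMs, s.maxMs, s.totalMs / s.count));
    }
    return sb.toString();
  }
}
```

With N = 1 runs, as in the results quoted here, min, max and avg are necessarily identical for each operation, which is why every Op[...] line shows the same value three times.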
> >
> > Results from subsequent run:
> >
> > LEX-BASED*100 rewriter, 1 runs in 165321 microsecs [165.32096] millis/run
> >
> > PARSE-BASED*100 rewriter, 1 runs in 646884 microsecs [646.88384] millis/run
> >
> > Parse/lex ratio: 3.912896728183352
> >
> > [PARSE OPS]
> >
> > Op[style-rewrite] min:25.419ms, max:25.419ms, avg:25.419ms
> >
> > Op[render] min:36.851ms, max:36.851ms, avg:36.851ms
> >
> > Op[js-rewrite] min:53.983ms, max:53.983ms, avg:53.983ms
> >
> > Op[link-rewrite] min:31.136ms, max:31.136ms, avg:31.136ms
> >
> > Op[build-nodes] min:32.929ms, max:32.929ms, avg:32.929ms
> >
> > Op[parse] min:464.211ms, max:464.211ms, avg:464.211ms
> >
> >
> > LEX-BASED*50 rewriter, 1 runs in 30684 microsecs [30.683904] millis/run
> >
> > PARSE-BASED*50 rewriter, 1 runs in 161132 microsecs [161.132032] millis/run
> >
> > Parse/lex ratio: 5.251336201277539
> >
> > [PARSE OPS]
> >
> > Op[style-rewrite] min:8.581ms, max:8.581ms, avg:8.581ms
> >
> > Op[render] min:5.184ms, max:5.184ms, avg:5.184ms
> >
> > Op[js-rewrite] min:11.606ms, max:11.606ms, avg:11.606ms
> >
> > Op[link-rewrite] min:7.533ms, max:7.533ms, avg:7.533ms
> >
> > Op[build-nodes] min:3.41ms, max:3.41ms, avg:3.41ms
> >
> > Op[parse] min:121.367ms, max:121.367ms, avg:121.367ms
> >
> >
> > LEX-BASED*10 rewriter, 1 runs in 3371 microsecs [3.371008] millis/run
> >
> > PARSE-BASED*10 rewriter, 1 runs in 10336 microsecs [10.336] millis/run
> >
> > Parse/lex ratio: 3.066152477009789
> >
> > [PARSE OPS]
> >
> > Op[style-rewrite] min:0.563ms, max:0.563ms, avg:0.563ms
> >
> > Op[render] min:0.678ms, max:0.678ms, avg:0.678ms
> >
> > Op[js-rewrite] min:1.374ms, max:1.374ms, avg:1.374ms
> >
> > Op[link-rewrite] min:0.718ms, max:0.718ms, avg:0.718ms
> >
> > Op[build-nodes] min:0.295ms, max:0.295ms, avg:0.295ms
> >
> > Op[parse] min:6.466ms, max:6.466ms, avg:6.466ms
> >
> >
> > LEX-BASED*1 rewriter, 1 runs in 592 microsecs [0.592128] millis/run
> >
> > PARSE-BASED*1 rewriter, 1 runs in 2083 microsecs [2.083072] millis/run
> >
> > Parse/lex ratio: 3.518581081081081
> >
> > [PARSE OPS]
> >
> > Op[style-rewrite] min:0.082ms, max:0.082ms, avg:0.082ms
> >
> > Op[render] min:0.077ms, max:0.077ms, avg:0.077ms
> >
> > Op[js-rewrite] min:0.143ms, max:0.143ms, avg:0.143ms
> >
> > Op[link-rewrite] min:0.111ms, max:0.111ms, avg:0.111ms
> >
> > Op[build-nodes] min:0.043ms, max:0.043ms, avg:0.043ms
> >
> > Op[parse] min:1.437ms, max:1.437ms, avg:1.437ms
> >
> >
> > 3. Drilling further, I wrote a separate test breaking out the performance
> > components of parsing: calling the Caja DomParser.parseFragment(...) API,
> > and subsequently wrapping the results of that call with ParsedHtmlNode
> > objects to satisfy interface requirements:
> >
> > Typical run:
> >
> > Caja parser [size*1, runs:1] in 97538 microsecs [97.538048] millis/run
> >
> > [PARSER COMPONENTS]
> >
> > Op[raw-caja-parse] min:70.033ms, max:70.033ms, avg:70.033ms
> >
> > Op[build-parse-nodes] min:3.644ms, max:3.644ms, avg:3.644ms
> >
> >
> > Caja parser [size*10, runs:1] in 42915 microsecs [42.915072] millis/run
> >
> > [PARSER COMPONENTS]
> >
> > Op[raw-caja-parse] min:34.676ms, max:34.676ms, avg:34.676ms
> >
> > Op[build-parse-nodes] min:7.148ms, max:7.148ms, avg:7.148ms
> >
> >
> > Caja parser [size*50, runs:1] in 157048 microsecs [157.048064] millis/run
> >
> > [PARSER COMPONENTS]
> >
> > Op[raw-caja-parse] min:138.904ms, max:138.904ms, avg:138.904ms
> >
> > Op[build-parse-nodes] min:17.313ms, max:17.313ms, avg:17.313ms
> >
> >
> > Caja parser [size*100, runs:1] in 236073 microsecs [236.07296] millis/run
> >
> > [PARSER COMPONENTS]
> >
> > Op[raw-caja-parse] min:173.743ms, max:173.743ms, avg:173.743ms
> >
> > Op[build-parse-nodes] min:43.295ms, max:43.295ms, avg:43.295ms
> >
> >
> > Conclusions and Discussion:
> >
> > The purpose of this task was to prove that tree-based parsing is
> > functionally viable, which has succeeded. Past that, it's a matter of
> > choosing functionality vs. performance. Given that rewriting results are
> > cached, perhaps even a ~3x increase in rewriting cost will be worth paying.
> >
> >
> > That's particularly true given the new class of optimizations/rewrites
> > made possible with a parse tree, as well as some bugs that are more easily
> > fixed using it. For instance, I recently discovered a bug with the
> > existing JS tag rewriter which ignores type="..." attributes and doesn't
> > maintain "id" attributes in certain situations. These can be resolved in
> > the lexer case, but are clearer in the parser one.
> >
> >
> > Lastly, as mentioned at the beginning of this thread, I plan to maintain
> > the ability to manipulate a gadget by string, meaning a lexer-based
> > approach can still be used where desired and parse-tree isn't required.
> >
> >
> > Next steps:
> >
> > 1. My next step is to add modularity to content rewriting, but again
> > without changing any caching semantics. Instead, rather than a single
> > ContentRewriter being injected, a ContentRewriterRegistry will be. The
> > default Registry will support injection of a single ContentRewriter to
> > maintain backward compatibility for now.
> >
> > 2. GadgetSpec immutability restored, ensuring post-rewritten caching.
> >
> > 3. ContentRewriter API cleanup.
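
A minimal sketch of the registry idea in step 1 of the quoted list, assuming a much simplified ContentRewriter that maps a String to a String (the thread describes the real rewrite(...) as taking full request context and returning caching metadata). All type and method names below are hypothetical stand-ins, not the Shindig API.

```java
import java.util.Collections;
import java.util.List;

final class RegistrySketch {
  /** Hypothetical, simplified stand-in for a content rewriter. */
  interface ContentRewriter {
    String rewrite(String content);
  }

  /** Hypothetical registry: exposes the rewriters to apply, in order. */
  interface ContentRewriterRegistry {
    List<ContentRewriter> getRewriters();
  }

  /** Default registry wrapping a single injected rewriter, for backward compatibility. */
  static final class SingleRewriterRegistry implements ContentRewriterRegistry {
    private final List<ContentRewriter> rewriters;

    SingleRewriterRegistry(ContentRewriter rewriter) {
      this.rewriters = Collections.singletonList(rewriter);
    }

    @Override
    public List<ContentRewriter> getRewriters() {
      return rewriters;
    }
  }

  /** Applies every registered rewriter in sequence, feeding each the previous output. */
  static String rewriteAll(ContentRewriterRegistry registry, String content) {
    String current = content;
    for (ContentRewriter rewriter : registry.getRewriters()) {
      current = rewriter.rewrite(current);
    }
    return current;
  }
}
```

Because callers depend only on the registry, existing code that injects one rewriter keeps working through SingleRewriterRegistry, while a later registry implementation can return several rewriters without changing the call site.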
> >
> >
> > --John
> >
> >
> > On Tue, Aug 12, 2008 at 7:43 PM, John Hjelmstad <fa...@google.com> wrote:
> >
> > > Interesting idea, and sounds fine to me. Concretely, this lets me
> > > sidestep SHINDIG-500 for a little while, which is nice (though I'd
> > > _really_ like to see the API cleanup go in! :)), in favor of migrating
> > > the existing rewriter to a tree-based approach. Turns out I've been
> > > working on #1 and #2 independently anyway. I'll post a patch soon. Thanks!
> > >
> > > John
> > >
> > >
> > > On Tue, Aug 12, 2008 at 7:14 PM, Louis Ryan <lr...@google.com> wrote:
> > >
> > >> Can we prove this out incrementally, bottom-up? In general I think using
> > >> the DOM is the right thing to do from a rewriting standpoint. So here's
> > >> how I propose we proceed:
> > >>
> > >> 1. If the Caja DOM is a little awkward, wrap it; if not, let's just use
> > >> it as is. We can always resolve this later.
> > >> 2. Change the existing content rewriters to use the DOM instead of a
> > >> lexer; this should be pretty easy. Maybe add some fancier rewriting, like
> > >> moving CSS into HEAD.
> > >> 3. Do some perf testing, look into memory overhead of DOM transformation,
> > >> etc.
> > >> 4. Alter GadgetSpecs to retain the DOM when they are cached.
> > >> 5. Alter the gadget rendering phase to serialize the content of the DOM
> > >> to output.
> > >> 6. Annotate the DOM at parse time to make render-time user-pref
> > >> substitutions faster; this should be easy enough too...
> > >>
> > >> This should be enough to prove out the pipeline end-to-end and identify
> > >> any major perf niggles. Once this is done we can look into how to inject a
> > >> rewriter pipeline into the parsing phase and the rendering phase.
> > >>
> > >> -Louis
> > >>
> > >>
> > >>
> > >> On Tue, Aug 12, 2008 at 5:57 PM, John Hjelmstad <fa...@google.com> wrote:
> > >>
> > >> > Re-responding in order to apply the last few exchanges to
> > >> > google-caja-discuss@ (@gmail vs. @google membership issues).
> > >> >
> > >> > On Tue, Aug 12, 2008 at 4:48 PM, John Hjelmstad <fa...@google.com> wrote:
> > >> >
> > >> > > Hello,
> > >> > >
> > >> > > While beginning to refactor the rewriter APIs I've discovered that
> > >> > > there unfortunately is one semantic difference inherent to moving
> > >> > > getContent() and setContent() methods into the Gadget object
> > >> > > (replacing View.get/setRewrittenContent()): BasicGadgetSpecFactory no
> > >> > > longer caches rewritten content.
> > >> > >
> > >> > > I've written a discussion of this in issue SHINDIG-500, which tracks
> > >> > > this implementation sub-task:
> > >> > > https://issues.apache.org/jira/browse/SHINDIG-500
> > >> > >
> > >> > > To summarize:
> > >> > > 1. Is this change acceptable for the time being?
> > >> > > 2. I suggest that we can, at a later date, move fetching of gadget
> > >> > > specs into GadgetServer while injecting a Gadget(Spec) cache there as
> > >> > > well, offering finer-tuned control over caching characteristics.
> > >> > >
> > >> > > Thanks,
> > >> > > John
> > >> > >
> > >> > >
> > >> > > On Mon, Aug 11, 2008 at 2:20 PM, John Hjelmstad <fargo@google.com> wrote:
> > >> > >
> > >> > >> I understand these concerns, and should be clear that I don't (despite
> > >> > >> my personal interest in experimenting with the idea, agreed that we
> > >> > >> don't have time for it at the moment) have any plans to introduce this
> > >> > >> sort of RPC anywhere - certainly not in Shindig itself, as any such
> > >> > >> call would be hidden behind an interface anyway.
> > >> > >>
> > >> > >> Putting the RPC hypothetical aside, I still feel that there's value to
> > >> > >> implementing HTML parsing in terms of an interface:
> > >> > >> * Clearer separation of concerns/boundary between projects.
> > >> > >>   - Corollary simplicity in testing.
> > >> > >> * Clearer API for content manipulation (that doesn't require knowledge
> > >> > >> of Caja).
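
A minimal sketch of what "HTML parsing in terms of an interface" could look like, with a trivial test double to illustrate the testing benefit listed above. HtmlNode, HtmlParser and TextOnlyParser are hypothetical names chosen for illustration; they are not the Shindig or Caja types, and a Caja-backed parser would simply be another implementation of the same interface.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ParserInterfaceSketch {
  /** Hypothetical parser-neutral node; a null tag name marks a text node. */
  interface HtmlNode {
    String getTagName();
    String getText();
    List<HtmlNode> getChildren();
  }

  /** Hypothetical parsing interface that hides the concrete parser from rewriters. */
  interface HtmlParser {
    List<HtmlNode> parse(String source);
  }

  /** Trivial test double: returns the whole input as a single text node. */
  static final class TextOnlyParser implements HtmlParser {
    @Override
    public List<HtmlNode> parse(final String source) {
      HtmlNode text = new HtmlNode() {
        @Override public String getTagName() { return null; }
        @Override public String getText() { return source; }
        @Override public List<HtmlNode> getChildren() { return Collections.emptyList(); }
      };
      return Collections.singletonList(text);
    }
  }

  /** Example consumer written only against the interface: counts text characters. */
  static int textLength(HtmlParser parser, String source) {
    int total = 0;
    List<HtmlNode> pending = new ArrayList<>(parser.parse(source));
    while (!pending.isEmpty()) {
      HtmlNode node = pending.remove(pending.size() - 1);
      if (node.getTagName() == null) {
        total += node.getText().length();
      }
      pending.addAll(node.getChildren());
    }
    return total;
  }
}
```

The consumer never references a parser implementation, so it can be unit-tested with the stub alone and would not need to change if the underlying parser were swapped.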
> > >> > >>
> > >> > >> I could be convinced otherwise, but at this point the code involved
> > >> > >> seems of manageable size, so still worth doing. Thoughts?
> > >> > >>
> > >> > >> John
> > >> > >>
> > >> > >>
> > >> > >>
> > >> > >> On Mon, Aug 11, 2008 at 1:00 PM, Kevin Brown <et...@google.com> wrote:
> > >> > >>
> > >> > >>> I agree with Louis -- that's just not practical. Every rewriting
> > >> > >>> operation must work in real time. Caja's existing html parser is
> > >> > >>> adequate for our needs, and we shouldn't go out of our way to tolerate
> > >> > >>> every oddity of random web browsers (especially as it simply wouldn't
> > >> > >>> work unless you farmed it out to *every* browser). Any new code needs
> > >> > >>> to be grounded in practical, current needs, not theoretical options.
> > >> > >>> We can always change code later if we find a real need for something
> > >> > >>> like that. We have real work to do in the meantime.
> > >> > >>>
> > >> > >>> On Mon, Aug 11, 2008 at 12:06 PM, Louis Ryan <lr...@google.com> wrote:
> > >> > >>>
> > >> > >>> > John,
> > >> > >>> >
> > >> > >>> > From a practicality standpoint I'm a little nervous about this plan
> > >> > >>> > to make RPC calls out of a Java process to a native process to fetch
> > >> > >>> > a parse tree for transformations that have to occur in real time. I
> > >> > >>> > don't think the motivating factor here is to accept all inputs that
> > >> > >>> > browsers can. Gadget developers will tailor their markup to the
> > >> > >>> > platform as they have done already. I would greatly prefer us to pick
> > >> > >>> > one 'good' parser and stick with it for all the manageability and
> > >> > >>> > consumability benefits that come with that decision. Perhaps I'm
> > >> > >>> > missing something here?
> > >> > >>> >
> > >> > >>> > -Louis
> > >> > >>> >
> > >> > >>> > On Mon, Aug 11, 2008 at 11:59 AM, John Hjelmstad <fargo@google.com> wrote:
> > >> > >>> >
> > >> > >>> > > On Fri, Aug 8, 2008 at 6:10 AM, Ben Laurie <benl@google.com> wrote:
> > >> > >>> > >
> > >> > >>> > > > [+google-caja-discuss]
> > >> > >>> > > >
> > >> > >>> > > > On Thu, Aug 7, 2008 at 9:27 PM, John Hjelmstad <fargo@google.com> wrote:
> > >> > >>> > > > > On Thu, Aug 7, 2008 at 3:20 AM, Ben Laurie <benl@google.com> wrote:
> > >> > >>> > > > >
> > >> > >>> > > > >> On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <fargo@google.com> wrote:
> > >> > >>> > > > >> > This proposal effectively enables the renderer to become a
> > >> > >>> > > > >> > multi-pass compiler for gadget content (essentially, arbitrary
> > >> > >>> > > > >> > web content). Such a compiler can provide several benefits:
> > >> > >>> > > > >> > static optimization of gadget content (auto-proxying of images,
> > >> > >>> > > > >> > whitespace/comment removal, consolidation of CSS blocks),
> > >> > >>> > > > >> > security benefits (caja et al), new functionality (annotation of
> > >> > >>> > > > >> > content for stats, document analysis, container-specific
> > >> > >>> > > > >> > features), etc. To my knowledge no such infrastructure exists
> > >> > >>> > > > >> > today (with the possible exception of Caja itself, which I'd
> > >> > >>> > > > >> > like to dovetail with this work).
> > >> > >>> > > > >>
> > >> > >>> > > > >> Caja clearly provides a large chunk of the code you'd need for
> > >> > >>> > > > >> this. I'd like to hear how we'd manage to avoid duplication
> > >> > >>> > > > >> between the two projects.
> > >> > >>> > > > >>
> > >> > >>> > > > >> A generalised framework for manipulating content sounds like a
> > >> > >>> > > > >> great idea, but probably should not live in either of the two
> > >> > >>> > > > >> projects (Caja and Shindig) but rather should be shared by both
> > >> > >>> > > > >> of them, I suspect.
> > >> > >>> > > > >
> > >> > >>> > > > >
> > >> > >>> > > > > I agree on both counts. As I mentioned, the piece of this idea
> > >> > >>> > > > > that I expect to change the most is the parse tree, and Caja's
> > >> > >>> > > > > .parser.html and .parser.css packages contain much of what I've
> > >> > >>> > > > > thrown in here as a base.
> > >> > >>> > > > >
> > >> > >>> > > > > My key requirements are:
> > >> > >>> > > > > * Lightweight framework.
> > >> > >>> > > > > * Parser modularity, mostly for HTML parsers (to re-use the good
> > >> > >>> > > > > work done by WebKit or Gecko.. CSS/JS can come direct from Caja
> > >> > >>> > > > > I'd bet)
> > >> > >>> > > > > * Automatic maintenance of DOM<->String conversion.
> > >> > >>> > > > > * Easy to manipulate structure.
> > >> > >>> > > >
> > >> > >>> > > > I'm not sure what the value of parser modularity is? If the
> > >> > >>> > > > resulting tree is different, then that's a problem for people
> > >> > >>> > > > processing the tree. And if it is not, then why do we care?
> > >> > >>> > >
> > >> > >>> > >
> > >> > >>> > > IMO the value of parser modularity is that the lenient parsers
> > >> > >>> > > native to browsers can be used in place of those that might not
> > >> > >>> > > accept all inputs. One could (and I'd like to) adapt WebKit or
> > >> > >>> > > Gecko's parsing code into a server that runs parallel to Shindig
> > >> > >>> > > and provides a "local RPC" service for parsing semi-structured
> > >> > >>> > > HTML. The resulting tree for WebKit's parser might be different
> > >> > >>> > > than that for an XHTML parser, Gecko's parser, etc, but if the
> > >> > >>> > > algorithm implemented atop it is rule-based rather than
> > >> > >>> > > strict-structure based that should be fine, no?
> > >> > >>> > >
> > >> > >>> > >
> > >> > >>> > > >
> > >> > >>> > > >
> > >> > >>> > > > >
> > >> > >>> > > > > I'd love to see both projects share the same base syntax
> > >> tree
> > >> > >>> > > > > representations. I considered .parser.html(.DomTree) and
> > >> > >>> .parser.css
> > >> > >>> > > for
> > >> > >>> > > > > these, but at the moment these appeared to be a little
> > more
> > >> > tied
> > >> > >>> to
> > >> > >>> > > > Caja's
> > >> > >>> > > > > lexer/parser implementation than I preferred (though I
> > admit
> > >> > >>> > > > > AbstractParseTreeNode contains most of what's needed).
> > >> > >>> > > > >
> > >> > >>> > > > > To be sure, I don't see this as an end-all-be-all
> > >> > transformation
> > >> > >>> > system
> > >> > >>> > > > in
> > >> > >>> > > > > any way. I'd just like to put *something* reasonable in
> > >> place
> > >> > >>> that we
> > >> > >>> > > can
> > >> > >>> > > > > play with, provide some benefit, and enhance into a
> truly
> > >> > >>> > sophisticated
> > >> > >>> > > > > vision of document rewriting.
> > >> > >>> > > > >
> > >> > >>> > > > >
> > >> > >>> > > > >>
> > >> > >>> > > > >>
> > >> > >>> > > > >> >  c. Add Gadget.getParsedContent().
> > >> > >>> > > > >> >    i. Returns a mutable GadgetContentParseTree used
> to
> > >> > >>> manipulate
> > >> > >>> > > > Gadget
> > >> > >>> > > > >> > Contents.
> > >> > >>> > > > >> >    ii. Mutable tree calls back to the Gadget object
> > >> > indicating
> > >> > >>> > when
> > >> > >>> > > > any
> > >> > >>> > > > >> > change is made, and emits an error if setContent()
> has
> > >> been
> > >> > >>> called
> > >> > >>> > > in
> > >> > >>> > > > the
> > >> > >>> > > > >> > interim.
> > >> > >>> > > > >>
> > >> > >>> > > > >> In Caja we have been moving towards immutable trees...
> > >> > >>> > > > >
> > >> > >>> > > > >
> > >> > >>> > > > > Interested to hear more about this. The whole idea is
> for
> > >> the
> > >> > >>> > gadget's
> > >> > >>> > > > tree
> > >> > >>> > > > > representation to be modifiable. Doing that with
> immutable
> > >> > trees
> > >> > >>> to
> > >> > >>> > me
> > >> > >>> > > > > suggests that a rewriter would have to create a
> completely
> > >> new
> > >> > >>> tree
> > >> > >>> > and
> > >> > >>> > > > set
> > >> > >>> > > > > it as a representation of new content. That's convenient
> > as
> > >> far
> > >> > >>> as
> > >> > >>> > the
> > >> > >>> > > > > Gadget's maintenance of String<->Tree representations is
> > >> > >>> concerned...
> > >> > >>> > > but
> > >> > >>> > > > > seems pretty heavyweight for many types of edits:
> in-situ
> > >> > >>> > modifications
> > >> > >>> > > > of
> > >> > >>> > > > > text, content reordering, etc. That's particularly so in
> a
> > >> > >>> > > > single-threaded
> > >> > >>> > > > > (viz rewriting) environment.
> > >> > >>> > > >
> > >> > >>> > > > Never having been entirely sold on the concept, I'll let
> > those
> > >> on
> > >> > >>> the
> > >> > >>> > > > Caja team who advocate immutability explain why.
> > >> > >>> > > >
> > >> > >>> > >
> > >> > >>> >
> > >> > >>>
> > >> > >>
> > >> > >>
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: Content Rewriter Modularization: Design/Change

Posted by Louis Ryan <lr...@google.com>.
These numbers seem pretty solid, so I believe it's fine to switch the general
rewriter over to this new code and ditch the lexer-based stuff. Nice work.

On Mon, Aug 25, 2008 at 2:27 PM, John Hjelmstad <fa...@google.com> wrote:

> I've implemented a parse-tree based ContentRewriter using the existing
> plumbing (same caching semantics et al), as ParseTreeDefaultContentRewriter
> and ParseTreeHtmlRewriter, respectively. The latter contains essentially
> all
> rewriting functionality from the previous HtmlTagTransformer classes. The
> parse-tree based rewriter is now functionally equivalent to the previous
> rewriter. The new classes use a Caja-based HTML parser under the hood.
> This proves out the functional viability of a tree-based rewriter, but
> performance also needs to be assessed.
>
> I've gone ahead and profiled the comparative performance of each rewriter,
> "Lexer" based and "ParseTree" based. To no surprise, "Lexer" wins out every
> time essentially by definition, since obviously Caja's parser uses its own
> lexer under the hood.
>
> Summary:
> * The performance of each is fundamentally O(n), though...
> * For any given input size, Lexer-based rewriting averages between 2.5 -
> 3.5x faster than ParseTree-based (ie. c =~ 3.5 at worst).
> * By far, the majority of time involved in ParseTree-based optimization is
> initial parsing: 75% of all processing.
>
> Details:
> 1. I wrote a simple rewriter/parser profiler which rewrites (the sample
> rewriter gadget's content * X repetitions) N times, recording the resulting
> run time. The run time of parse-based rewriting degraded as N increased, in
> all likelihood due to the additional cost of object management (lexer-based
> rewriting involves few intermediate objects). Given that the results of
> rewriting will be variously cached, it's very unlikely that rewriting will
> happen in immediate succession hundreds or thousands of times. As such, I
> fixed N = 1 to re-run the tests in relative isolation from one another.
> Results from a given run:
>
> LEX-BASED*100 rewriter, 1 runs in 177047 microsecs [177.04704] millis/run
>
> PARSE-BASED*100 rewriter, 1 runs in 609136 microsecs [609.136128]
> millis/run
>
> Parse/lex ratio: 3.4405327398939263
>
> LEX-BASED*50 rewriter, 1 runs in 43936 microsecs [43.936] millis/run
>
> PARSE-BASED*50 rewriter, 1 runs in 148980 microsecs [148.979968] millis/run
>
> Parse/lex ratio: 3.3908412235979606
>
> LEX-BASED*10 rewriter, 1 runs in 3093 microsecs [3.092992] millis/run
>
> PARSE-BASED*10 rewriter, 1 runs in 11020 microsecs [11.020032] millis/run
>
> Parse/lex ratio: 3.5628839314581313
>
> LEX-BASED*1 rewriter, 1 runs in 600 microsecs [0.600064] millis/run
>
> PARSE-BASED*1 rewriter, 1 runs in 1819 microsecs [1.819136] millis/run
>
> Parse/lex ratio: 3.0316666666666667
>
>
> 2. Drilling down, I added simple operation profiling to each component of
> parse-tree rewriting: original parse (CajaHtmlParser); building mutable
> tree
> nodes; rewriting links; concatenating JS nodes; rewriting style blocks;
> rendering parse tree. I then reran the same tests.
>
> Results from subsequent run:
>
> LEX-BASED*100 rewriter, 1 runs in 165321 microsecs [165.32096] millis/run
>
> PARSE-BASED*100 rewriter, 1 runs in 646884 microsecs [646.88384] millis/run
>
> Parse/lex ratio: 3.912896728183352
>
> [PARSE OPS]
>
> Op[style-rewrite] min:25.419ms, max:25.419ms, avg:25.419ms
>
> Op[render] min:36.851ms, max:36.851ms, avg:36.851ms
>
> Op[js-rewrite] min:53.983ms, max:53.983ms, avg:53.983ms
>
> Op[link-rewrite] min:31.136ms, max:31.136ms, avg:31.136ms
>
> Op[build-nodes] min:32.929ms, max:32.929ms, avg:32.929ms
>
> Op[parse] min:464.211ms, max:464.211ms, avg:464.211ms
>
>
> LEX-BASED*50 rewriter, 1 runs in 30684 microsecs [30.683904] millis/run
>
> PARSE-BASED*50 rewriter, 1 runs in 161132 microsecs [161.132032] millis/run
>
> Parse/lex ratio: 5.251336201277539
>
> [PARSE OPS]
>
> Op[style-rewrite] min:8.581ms, max:8.581ms, avg:8.581ms
>
> Op[render] min:5.184ms, max:5.184ms, avg:5.184ms
>
> Op[js-rewrite] min:11.606ms, max:11.606ms, avg:11.606ms
>
> Op[link-rewrite] min:7.533ms, max:7.533ms, avg:7.533ms
>
> Op[build-nodes] min:3.41ms, max:3.41ms, avg:3.41ms
>
> Op[parse] min:121.367ms, max:121.367ms, avg:121.367ms
>
>
> LEX-BASED*10 rewriter, 1 runs in 3371 microsecs [3.371008] millis/run
>
> PARSE-BASED*10 rewriter, 1 runs in 10336 microsecs [10.336] millis/run
>
> Parse/lex ratio: 3.066152477009789
>
> [PARSE OPS]
>
> Op[style-rewrite] min:0.563ms, max:0.563ms, avg:0.563ms
>
> Op[render] min:0.678ms, max:0.678ms, avg:0.678ms
>
> Op[js-rewrite] min:1.374ms, max:1.374ms, avg:1.374ms
>
> Op[link-rewrite] min:0.718ms, max:0.718ms, avg:0.718ms
>
> Op[build-nodes] min:0.295ms, max:0.295ms, avg:0.295ms
>
> Op[parse] min:6.466ms, max:6.466ms, avg:6.466ms
>
>
> LEX-BASED*1 rewriter, 1 runs in 592 microsecs [0.592128] millis/run
>
> PARSE-BASED*1 rewriter, 1 runs in 2083 microsecs [2.083072] millis/run
>
> Parse/lex ratio: 3.518581081081081
>
> [PARSE OPS]
>
> Op[style-rewrite] min:0.082ms, max:0.082ms, avg:0.082ms
>
> Op[render] min:0.077ms, max:0.077ms, avg:0.077ms
>
> Op[js-rewrite] min:0.143ms, max:0.143ms, avg:0.143ms
>
> Op[link-rewrite] min:0.111ms, max:0.111ms, avg:0.111ms
>
> Op[build-nodes] min:0.043ms, max:0.043ms, avg:0.043ms
>
> Op[parse] min:1.437ms, max:1.437ms, avg:1.437ms
>
>
> 3. Drilling further, I wrote a separate test breaking out the performance
> components to parsing: calling the Caja DomParser.parseFragment(...) API,
> and subsequently wrapping the results of that call with ParsedHtmlNode
> objects to satisfy interface requirements:
>
> Typical run:
>
> Caja parser [size*1, runs:1] in 97538 microsecs [97.538048] millis/run
>
> [PARSER COMPONENTS]
>
> Op[raw-caja-parse] min:70.033ms, max:70.033ms, avg:70.033ms
>
> Op[build-parse-nodes] min:3.644ms, max:3.644ms, avg:3.644ms
>
>
> Caja parser [size*10, runs:1] in 42915 microsecs [42.915072] millis/run
>
> [PARSER COMPONENTS]
>
> Op[raw-caja-parse] min:34.676ms, max:34.676ms, avg:34.676ms
>
> Op[build-parse-nodes] min:7.148ms, max:7.148ms, avg:7.148ms
>
>
> Caja parser [size*50, runs:1] in 157048 microsecs [157.048064] millis/run
>
> [PARSER COMPONENTS]
>
> Op[raw-caja-parse] min:138.904ms, max:138.904ms, avg:138.904ms
>
> Op[build-parse-nodes] min:17.313ms, max:17.313ms, avg:17.313ms
>
>
> Caja parser [size*100, runs:1] in 236073 microsecs [236.07296] millis/run
>
> [PARSER COMPONENTS]
>
> Op[raw-caja-parse] min:173.743ms, max:173.743ms, avg:173.743ms
>
> Op[build-parse-nodes] min:43.295ms, max:43.295ms, avg:43.295ms
>
>
> Conclusions and Discussion:
>
> The purpose of this task was to prove that tree-based parsing is
> functionally viable, which has succeeded. Past that, it's a matter of
> choosing functionality vs. performance. Given that rewriting results are
> cached, perhaps even ~3x increase in rewriting cost will be worth paying.
>
>
> That's particularly true given the new class of optimizations/rewrites made
> possible with a parse tree, as well as some bugs that are more easily fixed
> using it. For instance, I recently discovered a bug with the existing JS
> tag
> rewriter which ignores type="..." attributes and doesn't maintain "id"
> attributes in certain situations. These can be resolved in the lexer case,
> but are clearer in the parser one.
>
>
> Lastly, as mentioned at the beginning of this thread, I plan to maintain
> the
> ability to manipulate a gadget by string, meaning a lexer-based approach
> can
> still be used where desired and parse-tree isn't required.
>
>
> Next steps:
>
> 1. My next step is to add modularity to content rewriting, but again
> without
> changing any caching semantics. Instead, rather than a single
> ContentRewriter being injected, a ContentRewriterRegistry will be. The
> default Registry will support injection of a single ContentRewriter to
> maintain backward compatibility for now.
>
> 2. GadgetSpec immutability restored, ensuring post-rewritten caching.
>
> 3. ContentRewriter API cleanup.
>
>
> --John
>
>
> On Tue, Aug 12, 2008 at 7:43 PM, John Hjelmstad <fa...@google.com> wrote:
>
> > Interesting idea, and sounds fine to me. Concretely, this lets me sidestep
> > SHINDIG-500 for a little while, which is nice (though I'd _really_ like to
> > see the API cleanup go in! :)), in favor of migrating the existing rewriter
> > to a tree-based approach. Turns out I've been working on #1 and #2
> > independently anyway. I'll post a patch soon. Thanks!
> >
> > John
> >
> >
> > On Tue, Aug 12, 2008 at 7:14 PM, Louis Ryan <lr...@google.com> wrote:
> >
> >> Can we prove this out incrementally bottom-up? In general I think using
> >> DOM is the right thing to do from a rewriting standpoint. So here's how I
> >> propose we proceed:
> >>
> >> 1. If the Caja dom is a little awkward, wrap it; if not, let's just use it
> >> as is. We can always resolve this later.
> >> 2. Change the existing content rewriters to use the DOM instead of a
> >> lexer; should be pretty easy. Maybe add some fancier rewriting like
> >> moving CSS into HEAD.
> >> 3. Do some perf testing, look into memory overhead of dom transformation
> >> etc.
> >> 4. Alter GadgetSpecs to retain the dom when they are cached.
> >> 5. Alter the gadget rendering phase to serialize the content of the dom
> >> to output.
> >> 6. Annotate the dom at parse time to make render-time user-pref
> >> substitutions faster; this should be easy enough too...
> >>
> >> This should be enough to prove out the pipeline end-to-end and identify
> >> any major perf niggles. Once this is done we can look into how to inject a
> >> rewriter pipeline into the parsing phase and the rendering phase.
> >>
> >> -Louis
> >>
> >>
> >>
> >> On Tue, Aug 12, 2008 at 5:57 PM, John Hjelmstad <fa...@google.com>
> wrote:
> >>
> >> > Re-responding in order to apply the last few exchanges to
> >> > google-caja-discuss@ (@gmail vs. @google membership issues).
> >> >
> >> > On Tue, Aug 12, 2008 at 4:48 PM, John Hjelmstad <fa...@google.com>
> >> wrote:
> >> >
> >> > > Hello,
> >> > >
> >> > > While beginning to refactor the rewriter APIs I've discovered that
> >> > > there unfortunately is one semantic difference inherent to moving
> >> > > getContent() and setContent() methods into the Gadget object
> >> > > (replacing View.get/setRewrittenContent()): BasicGadgetSpecFactory no
> >> > > longer caches rewritten content.
> >> > >
> >> > > I've written a discussion of this in issue SHINDIG-500, which tracks
> >> > > this implementation sub-task:
> >> > > https://issues.apache.org/jira/browse/SHINDIG-500
> >> > >
> >> > > To summarize:
> >> > > 1. Is this change acceptable for the time being?
> >> > > 2. I suggest that we can, at a later date, move fetching of gadget
> >> > > specs into GadgetServer while injecting a Gadget(Spec) cache there as
> >> > > well, offering finer-tuned control over caching characteristics.
> >> > >
> >> > > Thanks,
> >> > > John
> >> > >
> >> > >
> >> > > On Mon, Aug 11, 2008 at 2:20 PM, John Hjelmstad <fa...@google.com>
> >> > wrote:
> >> > >
> >> > >> I understand these concerns, and should be clear that I don't (despite
> >> > >> my personal interest in experimenting with the idea, agreed that we
> >> > >> don't have time for it at the moment) have any plans to introduce this
> >> > >> sort of RPC anywhere - certainly not in Shindig itself, as any such
> >> > >> call would be hidden behind an interface anyway.
> >> > >>
> >> > >> Putting the RPC hypothetical aside, I still feel that there's value to
> >> > >> implementing HTML parsing in terms of an interface:
> >> > >> * Clearer separation of concerns/boundary between projects.
> >> > >>   - Corollary simplicity in testing.
> >> > >> * Clearer API for content manipulation (that doesn't require knowledge
> >> > >> of Caja).
> >> > >>
> >> > >> I could be convinced otherwise, but at this point the code involved
> >> > >> seems of manageable size, so still worth doing. Thoughts?
> >> > >>
> >> > >> John
> >> > >>
> >> > >>
> >> > >>
> >> > >> On Mon, Aug 11, 2008 at 1:00 PM, Kevin Brown <et...@google.com>
> >> wrote:
> >> > >>
> >> > >>> I agree with Louis -- that's just not practical. Every rewriting
> >> > >>> operation must work in real time. Caja's existing html parser is
> >> > >>> adequate for our needs, and we shouldn't go out of our way to
> >> > >>> tolerate every oddity of random web browsers (especially as it
> >> > >>> simply wouldn't work unless you farmed it out to *every* browser).
> >> > >>> Any new code needs to be grounded in practical, current needs, not
> >> > >>> theoretical options. We can always change code later if we find a
> >> > >>> real need for something like that. We have real work to do in the
> >> > >>> meantime.
> >> > >>>
> >> > >>> On Mon, Aug 11, 2008 at 12:06 PM, Louis Ryan <lr...@google.com>
> >> wrote:
> >> > >>>
> >> > >>> > John,
> >> > >>> >
> >> > >>> > From a practicality standpoint I'm a little nervous about this plan
> >> > >>> > to make RPC calls out of a Java process to a native process to
> >> > >>> > fetch a parse tree for transformations that have to occur in real
> >> > >>> > time. I don't think the motivating factor here is to accept all
> >> > >>> > inputs that browsers can. Gadget developers will tailor their
> >> > >>> > markup to the platform as they have done already. I would greatly
> >> > >>> > prefer us to pick one 'good' parser and stick with it for all the
> >> > >>> > manageability and consumability benefits that come with that
> >> > >>> > decision. Perhaps I'm missing something here?
> >> > >>> >
> >> > >>> > -Louis
> >> > >>> >
> >> > >>> > On Mon, Aug 11, 2008 at 11:59 AM, John Hjelmstad <
> >> fargo@google.com>
> >> > >>> wrote:
> >> > >>> >
> >> > >>> > > On Fri, Aug 8, 2008 at 6:10 AM, Ben Laurie <be...@google.com>
> >> > wrote:
> >> > >>> > >
> >> > >>> > > > [+google-caja-discuss]
> >> > >>> > > >
> >> > >>> > > > On Thu, Aug 7, 2008 at 9:27 PM, John Hjelmstad <
> >> fargo@google.com
> >> > >
> >> > >>> > wrote:
> >> > >>> > > > > On Thu, Aug 7, 2008 at 3:20 AM, Ben Laurie <
> benl@google.com
> >> >
> >> > >>> wrote:
> >> > >>> > > > >
> >> > >>> > > > >> On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <
> >> > >>> fargo@google.com>
> >> > >>> > > > wrote:
> >> > >>> > > > >> > This proposal effectively enables the renderer to
> become
> >> a
> >> > >>> > > multi-pass
> >> > >>> > > > >> > compiler for gadget content (essentially, arbitrary web
> >> > >>> content).
> >> > >>> > > Such
> >> > >>> > > > a
> >> > >>> > > > >> > compiler can provide several benefits: static
> >> optimization
> >> > of
> >> > >>> > gadget
> >> > >>> > > > >> content
> >> > >>> > > > >> > (auto-proxying of images, whitespace/comment removal,
> >> > >>> > consolidation
> >> > >>> > > of
> >> > >>> > > > >> CSS
> >> > >>> > > > >> > blocks), security benefits (caja et al), new
> >> functionality
> >> > >>> > > (annotation
> >> > >>> > > > of
> >> > >>> > > > >> > content for stats, document analysis,
> container-specific
> >> > >>> > features),
> >> > >>> > > > etc.
> >> > >>> > > > >> To
> >> > >>> > > > >> > my knowledge no such infrastructure exists today (with
> >> the
> >> > >>> > possible
> >> > >>> > > > >> > exception of Caja itself, which I'd like to dovetail
> with
> >> > this
> >> > >>> > > work).
> >> > >>> > > > >>
> >> > >>> > > > >> Caja clearly provides a large chunk of the code you'd
> need
> >> for
> >> > >>> this.
> >> > >>> > > > >> I'd like to hear how we'd manage to avoid duplication
> >> between
> >> > >>> the
> >> > >>> > two
> >> > >>> > > > >> projects.
> >> > >>> > > > >>
> >> > >>> > > > >> A generalised framework for manipulating content sounds
> >> like a
> >> > >>> great
> >> > >>> > > > >> idea, but probably should not live in either of the two
> >> > projects
> >> > >>> > (Caja
> >> > >>> > > > >> and Shindig) but rather should be shared by both of them,
> I
> >> > >>> suspect.
> >> > >>> > > > >
> >> > >>> > > > >
> >> > >>> > > > > I agree on both counts. As I mentioned, the piece of this
> >> idea
> >> > >>> that I
> >> > >>> > > > expect
> >> > >>> > > > > to change the most is the parse tree, and Caja's
> >> .parser.html
> >> > and
> >> > >>> > > > > .parser.css packages contain much of what I've thrown in
> >> here
> >> > as
> >> > >>> a
> >> > >>> > > base.
> >> > >>> > > > >
> >> > >>> > > > > My key requirements are:
> >> > >>> > > > > * Lightweight framework.
> >> > >>> > > > > * Parser modularity, mostly for HTML parsers (to re-use
> the
> >> > good
> >> > >>> work
> >> > >>> > > > done
> >> > >>> > > > > by WebKit or Gecko.. CSS/JS can come direct from Caja I'd
> >> bet)
> >> > >>> > > > > * Automatic maintenance of DOM<->String conversion.
> >> > >>> > > > > * Easy to manipulate structure.
> >> > >>> > > >
> >> > >>> > > > I'm not sure what the value of parser modularity is? If the
> >> > >>> resulting
> >> > >>> > > > tree is different, then that's a problem for people
> processing
> >> > the
> >> > >>> > > > tree. And if it is not, then why do we care?
> >> > >>> > >
> >> > >>> > >
> >> > >>> > > IMO the value of parser modularity is that the lenient parsers
> >> > native
> >> > >>> to
> >> > >>> > > browsers can be used in place of those that might not accept
> all
> >> > >>> inputs.
> >> > >>> > > One
> >> > >>> > > could (and I'd like to) adapt WebKit or Gecko's parsing code
> >> into a
> >> > >>> > server
> >> > >>> > > that runs parallel to Shindig and provides a "local RPC"
> service
> >> > for
> >> > >>> > > parsing
> >> > >>> > > semi-structured HTML. The resulting tree for WebKit's parser
> >> might
> >> > be
> >> > >>> > > different than that for an XHTML parser, Gecko's parser, etc,
> >> but
> >> > if
> >> > >>> the
> >> > >>> > > algorithm implemented atop it is rule-based rather than
> >> > >>> strict-structure
> >> > >>> > > based that should be fine, no?
> >> > >>> > >
> >> > >>> > >
> >> > >>> > > >
> >> > >>> > > >
> >> > >>> > > > >
> >> > >>> > > > > I'd love to see both projects share the same base syntax
> >> tree
> >> > >>> > > > > representations. I considered .parser.html(.DomTree) and
> >> > >>> .parser.css
> >> > >>> > > for
> >> > >>> > > > > these, but at the moment these appeared to be a little
> more
> >> > tied
> >> > >>> to
> >> > >>> > > > Caja's
> >> > >>> > > > > lexer/parser implementation than I preferred (though I
> admit
> >> > >>> > > > > AbstractParseTreeNode contains most of what's needed).
> >> > >>> > > > >
> >> > >>> > > > > To be sure, I don't see this as an end-all-be-all
> >> > transformation
> >> > >>> > system
> >> > >>> > > > in
> >> > >>> > > > > any way. I'd just like to put *something* reasonable in
> >> place
> >> > >>> that we
> >> > >>> > > can
> >> > >>> > > > > play with, provide some benefit, and enhance into a truly
> >> > >>> > sophisticated
> >> > >>> > > > > vision of document rewriting.
> >> > >>> > > > >
> >> > >>> > > > >
> >> > >>> > > > >>
> >> > >>> > > > >>
> >> > >>> > > > >> >  c. Add Gadget.getParsedContent().
> >> > >>> > > > >> >    i. Returns a mutable GadgetContentParseTree used to
> >> > >>> manipulate
> >> > >>> > > > Gadget
> >> > >>> > > > >> > Contents.
> >> > >>> > > > >> >    ii. Mutable tree calls back to the Gadget object
> >> > indicating
> >> > >>> > when
> >> > >>> > > > any
> >> > >>> > > > >> > change is made, and emits an error if setContent() has
> >> been
> >> > >>> called
> >> > >>> > > in
> >> > >>> > > > the
> >> > >>> > > > >> > interim.
> >> > >>> > > > >>
> >> > >>> > > > >> In Caja we have been moving towards immutable trees...
> >> > >>> > > > >
> >> > >>> > > > >
> >> > >>> > > > > Interested to hear more about this. The whole idea is for
> >> the
> >> > >>> > gadget's
> >> > >>> > > > tree
> >> > >>> > > > > representation to be modifiable. Doing that with immutable
> >> > trees
> >> > >>> to
> >> > >>> > me
> >> > >>> > > > > suggests that a rewriter would have to create a completely
> >> new
> >> > >>> tree
> >> > >>> > and
> >> > >>> > > > set
> >> > >>> > > > > it as a representation of new content. That's convenient
> as
> >> far
> >> > >>> as
> >> > >>> > the
> >> > >>> > > > > Gadget's maintenance of String<->Tree representations is
> >> > >>> concerned...
> >> > >>> > > but
> >> > >>> > > > > seems pretty heavyweight for many types of edits: in-situ
> >> > >>> > modifications
> >> > >>> > > > of
> >> > >>> > > > > text, content reordering, etc. That's particularly so in a
> >> > >>> > > > single-threaded
> >> > >>> > > > > (viz rewriting) environment.
> >> > >>> > > >
> >> > >>> > > > Never having been entirely sold on the concept, I'll let
> those
> >> on
> >> > >>> the
> >> > >>> > > > Caja team who advocate immutability explain why.
> >> > >>> > > >
> >> > >>> > >
> >> > >>> >
> >> > >>>
> >> > >>
> >> > >>
> >> > >
> >> >
> >>
> >
> >
>

Re: Content Rewriter Modularization: Design/Change

Posted by John Hjelmstad <fa...@google.com>.
I've implemented a parse-tree based ContentRewriter using the existing
plumbing (same caching semantics et al), as ParseTreeDefaultContentRewriter
and ParseTreeHtmlRewriter, respectively. The latter contains essentially all
rewriting functionality from the previous HtmlTagTransformer classes. The
parse-tree based rewriter is now functionally equivalent to the previous
rewriter. The new classes use a Caja-based HTML parser under the hood.
This proves out the functional viability of a tree-based rewriter, but
performance also needs to be assessed.

I've gone ahead and profiled the comparative performance of each rewriter,
"Lexer" based and "ParseTree" based. Unsurprisingly, "Lexer" wins every time,
essentially by definition: Caja's parser runs its own lexer under the hood
and then builds a tree on top of it.

Summary:
* The performance of each is fundamentally O(n), though...
* For any given input size, Lexer-based rewriting averages between 2.5x and
3.5x faster than ParseTree-based (i.e. c =~ 3.5), though individual runs
below show ratios as high as ~3.9x and ~5.25x.
* By far, the majority of time involved in ParseTree-based optimization is
initial parsing: up to ~75% of all processing (roughly 63% to 75% in the
breakdowns below).

Details:
1. I wrote a simple rewriter/parser profiler which rewrites (the sample
rewriter gadget's content * X repetitions) N times, recording the resulting
run time. The run time of parse-based rewriting degraded as N increased, in
all likelihood due to the additional cost of object management (lexer-based
rewriting involves few intermediate objects). Given that the results of
rewriting will be variously cached, it's very unlikely that rewriting will
happen in immediate succession hundreds or thousands of times. As such, I
fixed N = 1 to re-run the tests in relative isolation from one another.
Results from a given run:

LEX-BASED*100 rewriter, 1 runs in 177047 microsecs [177.04704] millis/run

PARSE-BASED*100 rewriter, 1 runs in 609136 microsecs [609.136128] millis/run

Parse/lex ratio: 3.4405327398939263

LEX-BASED*50 rewriter, 1 runs in 43936 microsecs [43.936] millis/run

PARSE-BASED*50 rewriter, 1 runs in 148980 microsecs [148.979968] millis/run

Parse/lex ratio: 3.3908412235979606

LEX-BASED*10 rewriter, 1 runs in 3093 microsecs [3.092992] millis/run

PARSE-BASED*10 rewriter, 1 runs in 11020 microsecs [11.020032] millis/run

Parse/lex ratio: 3.5628839314581313

LEX-BASED*1 rewriter, 1 runs in 600 microsecs [0.600064] millis/run

PARSE-BASED*1 rewriter, 1 runs in 1819 microsecs [1.819136] millis/run

Parse/lex ratio: 3.0316666666666667


2. Drilling down, I added simple operation profiling to each component of
parse-tree rewriting: original parse (CajaHtmlParser); building mutable tree
nodes; rewriting links; concatenating JS nodes; rewriting style blocks;
rendering parse tree. I then reran the same tests.

Results from subsequent run:

LEX-BASED*100 rewriter, 1 runs in 165321 microsecs [165.32096] millis/run

PARSE-BASED*100 rewriter, 1 runs in 646884 microsecs [646.88384] millis/run

Parse/lex ratio: 3.912896728183352

[PARSE OPS]

Op[style-rewrite] min:25.419ms, max:25.419ms, avg:25.419ms

Op[render] min:36.851ms, max:36.851ms, avg:36.851ms

Op[js-rewrite] min:53.983ms, max:53.983ms, avg:53.983ms

Op[link-rewrite] min:31.136ms, max:31.136ms, avg:31.136ms

Op[build-nodes] min:32.929ms, max:32.929ms, avg:32.929ms

Op[parse] min:464.211ms, max:464.211ms, avg:464.211ms


LEX-BASED*50 rewriter, 1 runs in 30684 microsecs [30.683904] millis/run

PARSE-BASED*50 rewriter, 1 runs in 161132 microsecs [161.132032] millis/run

Parse/lex ratio: 5.251336201277539

[PARSE OPS]

Op[style-rewrite] min:8.581ms, max:8.581ms, avg:8.581ms

Op[render] min:5.184ms, max:5.184ms, avg:5.184ms

Op[js-rewrite] min:11.606ms, max:11.606ms, avg:11.606ms

Op[link-rewrite] min:7.533ms, max:7.533ms, avg:7.533ms

Op[build-nodes] min:3.41ms, max:3.41ms, avg:3.41ms

Op[parse] min:121.367ms, max:121.367ms, avg:121.367ms


LEX-BASED*10 rewriter, 1 runs in 3371 microsecs [3.371008] millis/run

PARSE-BASED*10 rewriter, 1 runs in 10336 microsecs [10.336] millis/run

Parse/lex ratio: 3.066152477009789

[PARSE OPS]

Op[style-rewrite] min:0.563ms, max:0.563ms, avg:0.563ms

Op[render] min:0.678ms, max:0.678ms, avg:0.678ms

Op[js-rewrite] min:1.374ms, max:1.374ms, avg:1.374ms

Op[link-rewrite] min:0.718ms, max:0.718ms, avg:0.718ms

Op[build-nodes] min:0.295ms, max:0.295ms, avg:0.295ms

Op[parse] min:6.466ms, max:6.466ms, avg:6.466ms


LEX-BASED*1 rewriter, 1 runs in 592 microsecs [0.592128] millis/run

PARSE-BASED*1 rewriter, 1 runs in 2083 microsecs [2.083072] millis/run

Parse/lex ratio: 3.518581081081081

[PARSE OPS]

Op[style-rewrite] min:0.082ms, max:0.082ms, avg:0.082ms

Op[render] min:0.077ms, max:0.077ms, avg:0.077ms

Op[js-rewrite] min:0.143ms, max:0.143ms, avg:0.143ms

Op[link-rewrite] min:0.111ms, max:0.111ms, avg:0.111ms

Op[build-nodes] min:0.043ms, max:0.043ms, avg:0.043ms

Op[parse] min:1.437ms, max:1.437ms, avg:1.437ms


3. Drilling further, I wrote a separate test breaking parsing performance
into its two components: calling the Caja DomParser.parseFragment(...) API,
and subsequently wrapping the results of that call with ParsedHtmlNode
objects to satisfy interface requirements:

Typical run:

Caja parser [size*1, runs:1] in 97538 microsecs [97.538048] millis/run

[PARSER COMPONENTS]

Op[raw-caja-parse] min:70.033ms, max:70.033ms, avg:70.033ms

Op[build-parse-nodes] min:3.644ms, max:3.644ms, avg:3.644ms


Caja parser [size*10, runs:1] in 42915 microsecs [42.915072] millis/run

[PARSER COMPONENTS]

Op[raw-caja-parse] min:34.676ms, max:34.676ms, avg:34.676ms

Op[build-parse-nodes] min:7.148ms, max:7.148ms, avg:7.148ms


Caja parser [size*50, runs:1] in 157048 microsecs [157.048064] millis/run

[PARSER COMPONENTS]

Op[raw-caja-parse] min:138.904ms, max:138.904ms, avg:138.904ms

Op[build-parse-nodes] min:17.313ms, max:17.313ms, avg:17.313ms


Caja parser [size*100, runs:1] in 236073 microsecs [236.07296] millis/run

[PARSER COMPONENTS]

Op[raw-caja-parse] min:173.743ms, max:173.743ms, avg:173.743ms

Op[build-parse-nodes] min:43.295ms, max:43.295ms, avg:43.295ms
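
For anyone wanting to reproduce per-operation figures like those above, here
is a minimal sketch of a timer that tracks min/max/avg per named operation.
OpTimer and its methods are hypothetical names for illustration, not the
harness actually used for these runs, and the output format only approximates
the one shown (it always prints three decimals).

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: accumulates elapsed-time samples per operation name.
final class OpTimer {
  private static final class Stats {
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    long total = 0;
    int count = 0;
  }

  private final Map<String, Stats> ops = new LinkedHashMap<>();

  // Record one sample, in nanoseconds, for the given operation.
  void record(String op, long nanos) {
    Stats s = ops.computeIfAbsent(op, k -> new Stats());
    s.min = Math.min(s.min, nanos);
    s.max = Math.max(s.max, nanos);
    s.total += nanos;
    s.count++;
  }

  // Run the work once and record how long it took.
  void time(String op, Runnable work) {
    long start = System.nanoTime();
    work.run();
    record(op, System.nanoTime() - start);
  }

  // Render one line, e.g. "Op[parse] min:1.000ms, max:3.000ms, avg:2.000ms".
  String report(String op) {
    Stats s = ops.get(op);
    if (s == null) {
      throw new IllegalArgumentException("unknown op: " + op);
    }
    double avg = (double) s.total / s.count;
    return String.format(Locale.ROOT, "Op[%s] min:%sms, max:%sms, avg:%sms",
        op, ms(s.min), ms(s.max), ms(avg));
  }

  private static String ms(double nanos) {
    return String.format(Locale.ROOT, "%.3f", nanos / 1_000_000.0);
  }
}
```

With runs:1, as in the output above, min, max and avg are necessarily equal;
several runs per size would be needed before the spread says anything.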


Conclusions and Discussion:

The purpose of this task was to prove that tree-based parsing is
functionally viable, and it has succeeded. Past that, it's a matter of
choosing functionality vs. performance. Given that rewriting results are
cached, perhaps even a ~3x increase in rewriting cost will be worth paying.
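
As a quick sanity check on the ~3x figure, the ratios can be recomputed from
the microsecond totals reported above. This is only a sketch of the arithmetic;
ParseLexRatio is a hypothetical name, not part of the benchmark code.

```java
// Hypothetical helper: recomputes the parse/lex ratio from reported timings.
final class ParseLexRatio {
  // Ratio of parse-based to lex-based rewrite time, both in microseconds.
  static double ratio(long parseMicros, long lexMicros) {
    if (lexMicros <= 0) {
      throw new IllegalArgumentException("lexMicros must be positive");
    }
    return (double) parseMicros / lexMicros;
  }

  public static void main(String[] args) {
    // Timings taken from the size*10 and size*1 runs reported above.
    System.out.printf("size*10: %.3f%n", ratio(10336, 3371));
    System.out.printf("size*1:  %.3f%n", ratio(2083, 592));
  }
}
```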


That's particularly true given the new class of optimizations/rewrites made
possible with a parse tree, as well as some bugs that are more easily fixed
using it. For instance, I recently discovered a bug with the existing JS tag
rewriter which ignores type="..." attributes and doesn't maintain "id"
attributes in certain situations. These can be resolved in the lexer case,
but are clearer in the parser one.


Lastly, as mentioned at the beginning of this thread, I plan to maintain the
ability to manipulate a gadget by string, meaning a lexer-based approach can
still be used where desired and where a parse tree isn't required.


Next steps:

1. My next step is to add modularity to content rewriting, but again without
changing any caching semantics. Instead, rather than a single
ContentRewriter being injected, a ContentRewriterRegistry will be. The
default Registry will support injection of a single ContentRewriter to
maintain backward compatibility for now.

2. Restore GadgetSpec immutability, ensuring rewritten content is cached.

3. ContentRewriter API cleanup.
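
To make step 1 concrete, here is a rough sketch of the shape I have in mind.
The signatures are assumptions for illustration only: the real
ContentRewriter.rewrite(...) takes full request context and returns caching
metadata, whereas this simplified stand-in maps String to String, and
SingleRewriterRegistry and RewriterPipeline are hypothetical names.

```java
import java.util.Collections;
import java.util.List;

// Simplified stand-in for the rewriter API (assumed signature).
interface ContentRewriter {
  String rewrite(String content);
}

// The registry is what gets injected, instead of a single rewriter.
interface ContentRewriterRegistry {
  List<ContentRewriter> getRewriters();
}

// Default registry: wraps exactly one injected rewriter, so existing
// single-rewriter configurations keep working unchanged.
final class SingleRewriterRegistry implements ContentRewriterRegistry {
  private final List<ContentRewriter> rewriters;

  SingleRewriterRegistry(ContentRewriter rewriter) {
    this.rewriters = Collections.singletonList(rewriter);
  }

  @Override
  public List<ContentRewriter> getRewriters() {
    return rewriters;
  }
}

// Hypothetical caller: applies every registered rewriter in order.
final class RewriterPipeline {
  static String apply(ContentRewriterRegistry registry, String content) {
    String result = content;
    for (ContentRewriter rewriter : registry.getRewriters()) {
      result = rewriter.rewrite(result);
    }
    return result;
  }
}
```

A registry holding several rewriters would only need a different
getRewriters() implementation; the calling code would not change.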


--John


On Tue, Aug 12, 2008 at 7:43 PM, John Hjelmstad <fa...@google.com> wrote:

> Interesting idea, and sounds fine to me. Concretely, this lets me sidestep
> SHINDIG-500 for a little while, which is nice (though I'd _really_ like to
> see the API cleanup go in! :)), in favor of migrating the existing rewriter
> to a tree-based approach. Turns out I've been working on #1 and #2
> independently anyway. I'll post a patch soon. Thanks!
>
> John

Re: Content Rewriter Modularization: Design/Change

Posted by John Hjelmstad <fa...@google.com>.
Interesting idea, and sounds fine to me. Concretely, this lets me sidestep
SHINDIG-500 for a little while, which is nice (though I'd _really_ like to
see the API cleanup go in! :)), in favor of migrating the existing rewriter
to a tree-based approach. Turns out I've been working on #1 and #2
independently anyway. I'll post a patch soon. Thanks!

John

On Tue, Aug 12, 2008 at 7:14 PM, Louis Ryan <lr...@google.com> wrote:

> Can we prove this out incrementally, bottom-up? In general I think using DOM
> is the right thing to do from a rewriting standpoint. So here's how I
> propose we proceed:
>
> 1. If the Caja DOM is a little awkward, wrap it; if not, let's just use it as
> is. We can always resolve this later.
> 2. Change the existing content rewriters to use the DOM instead of a lexer;
> this should be pretty easy. Maybe add some fancier rewriting like moving CSS
> into HEAD.
> 3. Do some perf testing, look into memory overhead of DOM transformation,
> etc.
> 4. Alter GadgetSpecs to retain the DOM when they are cached.
> 5. Alter the gadget rendering phase to serialize the content of the DOM to
> output.
> 6. Annotate the DOM at parse time to make render-time user-pref
> substitutions faster; this should be easy enough too...
>
> This should be enough to prove out the pipeline end-to-end and identify any
> major perf niggles. Once this is done we can look into how to inject a
> rewriter pipeline into the parsing phase and the rendering phase.
>
> -Louis
>
>
>
> On Tue, Aug 12, 2008 at 5:57 PM, John Hjelmstad <fa...@google.com> wrote:
>
> > Re-responding in order to apply the last few exchanges to
> > google-caja-discuss@ (@gmail vs. @google membership issues).
> >
> > On Tue, Aug 12, 2008 at 4:48 PM, John Hjelmstad <fa...@google.com>
> wrote:
> >
> > > Hello,
> > >
> > > While beginning to refactor the rewriter APIs I've discovered that
> there
> > > unfortunately is one semantic difference inherent to moving
> getContent()
> > and
> > > setContent() methods into the Gadget object (replacing
> > > View.get/setRewrittenContent()): BasicGadgetSpecFactory no longer
> caches
> > > rewritten content.
> > >
> > > I've written a discussion of this in issue SHINDIG-500, which tracks
> this
> > > implementation sub-task:
> > https://issues.apache.org/jira/browse/SHINDIG-500
> > >
> > > To summarize:
> > > 1. Is this change acceptable for the time being?
> > > 2. I suggest that we can, at a later date, move fetching of gadget
> specs
> > > into GadgetServer while injecting a Gadget(Spec) cache there as well,
> > > offering finer-tuned control over caching characteristics.
> > >
> > > Thanks,
> > > John
> > >
> > >
> > > On Mon, Aug 11, 2008 at 2:20 PM, John Hjelmstad <fa...@google.com>
> > wrote:
> > >
> > >> I understand these concerns, and should be clear that I don't (despite
> > my
> > >> personal interest in experimenting with the idea, agreed that we don't
> > have
> > >> time for it at the moment) have any plans to introduce this sort of
> RPC
> > >> anywhere - certainly not in Shindig itself, as any such call would be
> > hidden
> > >> behind an interface anyway.
> > >>
> > >> Putting the RPC hypothetical aside, I still feel that there's value to
> > >> implementing HTML parsing in terms of an interface:
> > >> * Clearer separation of concerns/boundary between projects.
> > >>   - Corollary simplicity in testing.
> > >> * Clearer API for content manipulation (that doesn't require knowledge
> > of
> > >> Caja).
> > >>
> > >> I could be convinced otherwise, but at this point the code involved
> > seems
> > >> of manageable size, so still worth doing. Thoughts?
> > >>
> > >> John
> > >>
> > >>
> > >>
> > >> On Mon, Aug 11, 2008 at 1:00 PM, Kevin Brown <et...@google.com> wrote:
> > >>
> > >>> I agree with Louis -- that's just not practical. Every rewriting
> > >>> operation
> > >>> must work in real time. Caja's existing html parser is adequate for
> our
> > >>> needs, and we shouldn't go out of our way to tolerate every oddity of
> > >>> random
> > >>> web browsers (especially as it simply wouldn't work unless you farmed
> > it
> > >>> out
> > >>> to *every* browser). Any new code needs to be grounded in practical,
> > >>> current
> > >>> needs, not theoretical options. We can always change code later if we
> > >>> find a
> > >>> real need for something like that. We have real work to do in the
> > >>> meantime.
> > >>>
> > >>> On Mon, Aug 11, 2008 at 12:06 PM, Louis Ryan <lr...@google.com>
> wrote:
> > >>>
> > >>> > John,
> > >>> >
> > >>> > From a practicality standpoint I'm a little nervous about this plan
> > to
> > >>> make
> > >>> > RPCs calls out of a Java process to a native process to fetch a
> parse
> > >>> tree
> > >>> > for transformations that have to occur realtime. I don't think the
> > >>> > motivating factor here is to accept all inputs that browsers can.
> > >>> Gadget
> > >>> > developers will tailor their markup to the platform as they have
> done
> > >>> > already. I would greatly prefer us to pick one 'good' parser and
> > stick
> > >>> with
> > >>> > it for all the manageability and consumability benefits that come
> > with
> > >>> that
> > >>> > decision. Perhaps Im missing something here?
> > >>> >
> > >>> > -Louis
> > >>> >
> > >>> > On Mon, Aug 11, 2008 at 11:59 AM, John Hjelmstad <fargo@google.com
> >
> > >>> wrote:
> > >>> >
> > >>> > > On Fri, Aug 8, 2008 at 6:10 AM, Ben Laurie <be...@google.com>
> > wrote:
> > >>> > >
> > >>> > > > [+google-caja-discuss]
> > >>> > > >
> > >>> > > > On Thu, Aug 7, 2008 at 9:27 PM, John Hjelmstad <
> fargo@google.com
> > >
> > >>> > wrote:
> > >>> > > > > On Thu, Aug 7, 2008 at 3:20 AM, Ben Laurie <be...@google.com>
> > >>> wrote:
> > >>> > > > >
> > >>> > > > >> On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <
> > >>> fargo@google.com>
> > >>> > > > wrote:
> > >>> > > > >> > This proposal effectively enables the renderer to become a
> > >>> > > multi-pass
> > >>> > > > >> > compiler for gadget content (essentially, arbitrary web
> > >>> content).
> > >>> > > Such
> > >>> > > > a
> > >>> > > > >> > compiler can provide several benefits: static optimization
> > of
> > >>> > gadget
> > >>> > > > >> content
> > >>> > > > >> > (auto-proxying of images, whitespace/comment removal,
> > >>> > consolidation
> > >>> > > of
> > >>> > > > >> CSS
> > >>> > > > >> > blocks), security benefits (caja et al), new functionality
> > >>> > > (annotation
> > >>> > > > of
> > >>> > > > >> > content for stats, document analysis, container-specific
> > >>> > features),
> > >>> > > > etc.
> > >>> > > > >> To
> > >>> > > > >> > my knowledge no such infrastructure exists today (with the
> > >>> > possible
> > >>> > > > >> > exception of Caja itself, which I'd like to dovetail with
> > this
> > >>> > > work).
> > >>> > > > >>
> > >>> > > > >> Caja clearly provides a large chunk of the code you'd need
> for
> > >>> this.
> > >>> > > > >> I'd like to hear how we'd manage to avoid duplication
> between
> > >>> the
> > >>> > two
> > >>> > > > >> projects.
> > >>> > > > >>
> > >>> > > > >> A generalised framework for manipulating content sounds like
> a
> > >>> great
> > >>> > > > >> idea, but probably should not live in either of the two
> > projects
> > >>> > (Caja
> > >>> > > > >> and Shindig) but rather should be shared by both of them, I
> > >>> suspect.
> > >>> > > > >
> > >>> > > > >
> > >>> > > > > I agree on both counts. As I mentioned, the piece of this
> idea
> > >>> that I
> > >>> > > > expect
> > >>> > > > > to change the most is the parse tree, and Caja's .parser.html
> > and
> > >>> > > > > .parser.css packages contain much of what I've thrown in here
> > as
> > >>> a
> > >>> > > base.
> > >>> > > > >
> > >>> > > > > My key requirements are:
> > >>> > > > > * Lightweight framework.
> > >>> > > > > * Parser modularity, mostly for HTML parsers (to re-use the
> > good
> > >>> work
> > >>> > > > done
> > >>> > > > > by WebKit or Gecko.. CSS/JS can come direct from Caja I'd
> bet)
> > >>> > > > > * Automatic maintenance of DOM<->String conversion.
> > >>> > > > > * Easy to manipulate structure.
> > >>> > > >
> > >>> > > > I'm not sure what the value of parser modularity is? If the
> > >>> resulting
> > >>> > > > tree is different, then that's a problem for people processing
> > the
> > >>> > > > tree. And if it is not, then why do we care?
> > >>> > >
> > >>> > >
> > >>> > > IMO the value of parser modularity is that the lenient parsers
> > native
> > >>> to
> > >>> > > browsers can be used in place of those that might not accept all
> > >>> inputs.
> > >>> > > One
> > >>> > > could (and I'd like to) adapt WebKit or Gecko's parsing code into
> a
> > >>> > server
> > >>> > > that runs parallel to Shindig and provides a "local RPC" service
> > for
> > >>> > > parsing
> > >>> > > semi-structured HTML. The resulting tree for WebKit's parser
> might
> > be
> > >>> > > different than that for an XHTML parser, Gecko's parser, etc, but
> > if
> > >>> the
> > >>> > > algorithm implemented atop it is rule-based rather than
> > >>> strict-structure
> > >>> > > based that should be fine, no?
> > >>> > >
> > >>> > >
> > >>> > > >
> > >>> > > >
> > >>> > > > >
> > >>> > > > > I'd love to see both projects share the same base syntax tree
> > >>> > > > > representations. I considered .parser.html(.DomTree) and
> > >>> .parser.css
> > >>> > > for
> > >>> > > > > these, but at the moment these appeared to be a little more
> > tied
> > >>> to
> > >>> > > > Caja's
> > >>> > > > > lexer/parser implementation than I preferred (though I admit
> > >>> > > > > AbstractParseTreeNode contains most of what's needed).
> > >>> > > > >
> > >>> > > > > To be sure, I don't see this as an end-all-be-all
> > transformation
> > >>> > system
> > >>> > > > in
> > >>> > > > > any way. I'd just like to put *something* reasonable in place
> > >>> that we
> > >>> > > can
> > >>> > > > > play with, provide some benefit, and enhance into a truly
> > >>> > sophisticated
> > >>> > > > > vision of document rewriting.
> > >>> > > > >
> > >>> > > > >
> > >>> > > > >>
> > >>> > > > >>
> > >>> > > > >> >  c. Add Gadget.getParsedContent().
> > >>> > > > >> >    i. Returns a mutable GadgetContentParseTree used to
> > >>> manipulate
> > >>> > > > Gadget
> > >>> > > > >> > Contents.
> > >>> > > > >> >    ii. Mutable tree calls back to the Gadget object
> > indicating
> > >>> > when
> > >>> > > > any
> > >>> > > > >> > change is made, and emits an error if setContent() has
> been
> > >>> called
> > >>> > > in
> > >>> > > > the
> > >>> > > > >> > interim.
> > >>> > > > >>
> > >>> > > > >> In Caja we have been moving towards immutable trees...
> > >>> > > > >
> > >>> > > > >
> > >>> > > > > Interested to hear more about this. The whole idea is for the
> > >>> > gadget's
> > >>> > > > tree
> > >>> > > > > representation to be modifiable. Doing that with immutable
> > trees
> > >>> to
> > >>> > me
> > >>> > > > > suggests that a rewriter would have to create a completely
> new
> > >>> tree
> > >>> > and
> > >>> > > > set
> > >>> > > > > it as a representation of new content. That's convenient as
> far
> > >>> as
> > >>> > the
> > >>> > > > > Gadget's maintenance of String<->Tree representations is
> > >>> concerned...
> > >>> > > but
> > >>> > > > > seems pretty heavyweight for many types of edits: in-situ
> > >>> > modifications
> > >>> > > > of
> > >>> > > > > text, content reordering, etc. That's particularly so in a
> > >>> > > > single-threaded
> > >>> > > > > (viz rewriting) environment.
> > >>> > > >
> > >>> > > > Never having been entirely sold on the concept, I'll let those
> on
> > >>> the
> > >>> > > > Caja team who advocate immutability explain why.
> > >>> > > >
> > >>> > >
> > >>> >
> > >>>
> > >>
> > >>
> > >
> >
>

Re: Content Rewriter Modularization: Design/Change

Posted by Louis Ryan <lr...@google.com>.
Can we prove this out incrementally, bottom-up? In general I think using DOM
is the right thing to do from a rewriting standpoint. So here's how I
propose we proceed:

1. If the Caja DOM is a little awkward, wrap it; if not, let's just use it as
is. We can always resolve this later.
2. Change the existing content rewriters to use the DOM instead of a lexer;
this should be pretty easy. Maybe add some fancier rewriting like moving CSS
into HEAD.
3. Do some perf testing, look into memory overhead of DOM transformation,
etc.
4. Alter GadgetSpecs to retain the DOM when they are cached.
5. Alter the gadget rendering phase to serialize the content of the DOM to
output.
6. Annotate the DOM at parse time to make render-time user-pref substitutions
faster; this should be easy enough too...

This should be enough to prove out the pipeline end-to-end and identify any
major perf niggles. Once this is done we can look into how to inject a
rewriter pipeline into the parsing phase and the rendering phase.
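
As a rough illustration of the "moving CSS into HEAD" rewrite from step 2,
here is a sketch against the standard org.w3c.dom API. It assumes well-formed
XHTML input and uses the JDK's XML parser purely as a stand-in for the Caja
DOM; StyleHoister and its methods are hypothetical names, and element names
are matched case-sensitively in lowercase.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical DOM-based rewriter: moves <style> elements into <head>.
final class StyleHoister {
  // Parse well-formed XHTML; a lenient HTML parser would replace this.
  static Document parse(String xhtml) {
    try {
      return DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xhtml)));
    } catch (Exception e) {
      throw new IllegalArgumentException("input is not well-formed", e);
    }
  }

  // Returns the number of <style> elements moved into the first <head>.
  static int hoistStyles(Document doc) {
    Node head = doc.getElementsByTagName("head").item(0);
    if (head == null) {
      return 0;
    }
    // Snapshot first: the NodeList is live and changes as nodes move.
    NodeList live = doc.getElementsByTagName("style");
    List<Node> styles = new ArrayList<>();
    for (int i = 0; i < live.getLength(); i++) {
      styles.add(live.item(i));
    }
    int moved = 0;
    for (Node style : styles) {
      if (style.getParentNode() != head) {
        head.appendChild(style); // appendChild detaches from the old parent
        moved++;
      }
    }
    return moved;
  }
}
```

Doing the same with a lexer means tracking where HEAD ends while scanning the
token stream, which is the kind of bookkeeping the tree removes.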

-Louis



On Tue, Aug 12, 2008 at 5:57 PM, John Hjelmstad <fa...@google.com> wrote:

> Re-responding in order to apply the last few exchanges to
> google-caja-discuss@ (@gmail vs. @google membership issues).
>
> On Tue, Aug 12, 2008 at 4:48 PM, John Hjelmstad <fa...@google.com> wrote:
>
> > Hello,
> >
> > While beginning to refactor the rewriter APIs I've discovered that there
> > unfortunately is one semantic difference inherent to moving getContent()
> and
> > setContent() methods into the Gadget object (replacing
> > View.get/setRewrittenContent()): BasicGadgetSpecFactory no longer caches
> > rewritten content.
> >
> > I've written a discussion of this in issue SHINDIG-500, which tracks this
> > implementation sub-task:
> https://issues.apache.org/jira/browse/SHINDIG-500
> >
> > To summarize:
> > 1. Is this change acceptable for the time being?
> > 2. I suggest that we can, at a later date, move fetching of gadget specs
> > into GadgetServer while injecting a Gadget(Spec) cache there as well,
> > offering finer-tuned control over caching characteristics.
> >
> > Thanks,
> > John
> >
> >
> > On Mon, Aug 11, 2008 at 2:20 PM, John Hjelmstad <fa...@google.com>
> wrote:
> >
> >> I understand these concerns, and should be clear that I don't (despite
> my
> >> personal interest in experimenting with the idea, agreed that we don't
> have
> >> time for it at the moment) have any plans to introduce this sort of RPC
> >> anywhere - certainly not in Shindig itself, as any such call would be
> hidden
> >> behind an interface anyway.
> >>
> >> Putting the RPC hypothetical aside, I still feel that there's value to
> >> implementing HTML parsing in terms of an interface:
> >> * Clearer separation of concerns/boundary between projects.
> >>   - Corollary simplicity in testing.
> >> * Clearer API for content manipulation (that doesn't require knowledge
> of
> >> Caja).
> >>
> >> I could be convinced otherwise, but at this point the code involved
> seems
> >> of manageable size, so still worth doing. Thoughts?
> >>
> >> John
> >>
> >>
> >>
> >> On Mon, Aug 11, 2008 at 1:00 PM, Kevin Brown <et...@google.com> wrote:
> >>
> >>> I agree with Louis -- that's just not practical. Every rewriting
> >>> operation
> >>> must work in real time. Caja's existing html parser is adequate for our
> >>> needs, and we shouldn't go out of our way to tolerate every oddity of
> >>> random
> >>> web browsers (especially as it simply wouldn't work unless you farmed
> it
> >>> out
> >>> to *every* browser). Any new code needs to be grounded in practical,
> >>> current
> >>> needs, not theoretical options. We can always change code later if we
> >>> find a
> >>> real need for something like that. We have real work to do in the
> >>> meantime.
> >>>
> >>> On Mon, Aug 11, 2008 at 12:06 PM, Louis Ryan <lr...@google.com> wrote:
> >>>
> >>> > John,
> >>> >
> >>> > From a practicality standpoint I'm a little nervous about this plan
> to
> >>> make
> >>> > RPCs calls out of a Java process to a native process to fetch a parse
> >>> tree
> >>> > for transformations that have to occur realtime. I don't think the
> >>> > motivating factor here is to accept all inputs that browsers can.
> >>> Gadget
> >>> > developers will tailor their markup to the platform as they have done
> >>> > already. I would greatly prefer us to pick one 'good' parser and
> stick
> >>> with
> >>> > it for all the manageability and consumability benefits that come
> with
> >>> that
> >>> > decision. Perhaps Im missing something here?
> >>> >
> >>> > -Louis
> >>> >
> >>> > On Mon, Aug 11, 2008 at 11:59 AM, John Hjelmstad <fa...@google.com>
> >>> wrote:
> >>> >
> >>> > > On Fri, Aug 8, 2008 at 6:10 AM, Ben Laurie <be...@google.com>
> wrote:
> >>> > >
> >>> > > > [+google-caja-discuss]
> >>> > > >
> >>> > > > On Thu, Aug 7, 2008 at 9:27 PM, John Hjelmstad <fargo@google.com
> >
> >>> > wrote:
> >>> > > > > On Thu, Aug 7, 2008 at 3:20 AM, Ben Laurie <be...@google.com>
> >>> wrote:
> >>> > > > >
> >>> > > > >> On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <
> >>> fargo@google.com>
> >>> > > > wrote:
> >>> > > > >> > This proposal effectively enables the renderer to become a
> >>> > > multi-pass
> >>> > > > >> > compiler for gadget content (essentially, arbitrary web
> >>> content).
> >>> > > Such
> >>> > > > a
> >>> > > > >> > compiler can provide several benefits: static optimization
> of
> >>> > gadget
> >>> > > > >> content
> >>> > > > >> > (auto-proxying of images, whitespace/comment removal,
> >>> > consolidation
> >>> > > of
> >>> > > > >> CSS
> >>> > > > >> > blocks), security benefits (caja et al), new functionality
> >>> > > (annotation
> >>> > > > of
> >>> > > > >> > content for stats, document analysis, container-specific
> >>> > features),
> >>> > > > etc.
> >>> > > > >> To
> >>> > > > >> > my knowledge no such infrastructure exists today (with the
> >>> > possible
> >>> > > > >> > exception of Caja itself, which I'd like to dovetail with
> this
> >>> > > work).
> >>> > > > >>
> >>> > > > >> Caja clearly provides a large chunk of the code you'd need for
> >>> this.
> >>> > > > >> I'd like to hear how we'd manage to avoid duplication between
> >>> the
> >>> > two
> >>> > > > >> projects.
> >>> > > > >>
> >>> > > > >> A generalised framework for manipulating content sounds like a
> >>> great
> >>> > > > >> idea, but probably should not live in either of the two
> projects
> >>> > (Caja
> >>> > > > >> and Shindig) but rather should be shared by both of them, I
> >>> suspect.
> >>> > > > >
> >>> > > > >
> >>> > > > > I agree on both counts. As I mentioned, the piece of this idea
> >>> that I
> >>> > > > expect
> >>> > > > > to change the most is the parse tree, and Caja's .parser.html
> and
> >>> > > > > .parser.css packages contain much of what I've thrown in here
> as
> >>> a
> >>> > > base.
> >>> > > > >
> >>> > > > > My key requirements are:
> >>> > > > > * Lightweight framework.
> >>> > > > > * Parser modularity, mostly for HTML parsers (to re-use the
> good
> >>> work
> >>> > > > done
> >>> > > > > by WebKit or Gecko.. CSS/JS can come direct from Caja I'd bet)
> >>> > > > > * Automatic maintenance of DOM<->String conversion.
> >>> > > > > * Easy to manipulate structure.
> >>> > > >
> >>> > > > I'm not sure what the value of parser modularity is? If the
> >>> resulting
> >>> > > > tree is different, then that's a problem for people processing
> the
> >>> > > > tree. And if it is not, then why do we care?
> >>> > >
> >>> > >
> >>> > > IMO the value of parser modularity is that the lenient parsers
> native
> >>> to
> >>> > > browsers can be used in place of those that might not accept all
> >>> inputs.
> >>> > > One
> >>> > > could (and I'd like to) adapt WebKit or Gecko's parsing code into a
> >>> > server
> >>> > > that runs parallel to Shindig and provides a "local RPC" service
> for
> >>> > > parsing
> >>> > > semi-structured HTML. The resulting tree for WebKit's parser might
> be
> >>> > > different than that for an XHTML parser, Gecko's parser, etc, but
> if
> >>> the
> >>> > > algorithm implemented atop it is rule-based rather than
> >>> strict-structure
> >>> > > based that should be fine, no?
> >>> > >
> >>> > >
> >>> > > >
> >>> > > >
> >>> > > > >
> >>> > > > > I'd love to see both projects share the same base syntax tree
> >>> > > > > representations. I considered .parser.html(.DomTree) and
> >>> .parser.css
> >>> > > for
> >>> > > > > these, but at the moment these appeared to be a little more
> tied
> >>> to
> >>> > > > Caja's
> >>> > > > > lexer/parser implementation than I preferred (though I admit
> >>> > > > > AbstractParseTreeNode contains most of what's needed).
> >>> > > > >
> >>> > > > > To be sure, I don't see this as an end-all-be-all
> transformation
> >>> > system
> >>> > > > in
> >>> > > > > any way. I'd just like to put *something* reasonable in place
> >>> that we
> >>> > > can
> >>> > > > > play with, provide some benefit, and enhance into a truly
> >>> > sophisticated
> >>> > > > > vision of document rewriting.
> >>> > > > >
> >>> > > > >
> >>> > > > >>
> >>> > > > >>
> >>> > > > >> >  c. Add Gadget.getParsedContent().
> >>> > > > >> >    i. Returns a mutable GadgetContentParseTree used to
> >>> manipulate
> >>> > > > Gadget
> >>> > > > >> > Contents.
> >>> > > > >> >    ii. Mutable tree calls back to the Gadget object
> indicating
> >>> > when
> >>> > > > any
> >>> > > > >> > change is made, and emits an error if setContent() has been
> >>> called
> >>> > > in
> >>> > > > the
> >>> > > > >> > interim.
> >>> > > > >>
> >>> > > > >> In Caja we have been moving towards immutable trees...
> >>> > > > >
> >>> > > > >
> >>> > > > > Interested to hear more about this. The whole idea is for the
> >>> > gadget's
> >>> > > > tree
> >>> > > > > representation to be modifiable. Doing that with immutable
> trees
> >>> to
> >>> > me
> >>> > > > > suggests that a rewriter would have to create a completely new
> >>> tree
> >>> > and
> >>> > > > set
> >>> > > > > it as a representation of new content. That's convenient as far
> >>> as
> >>> > the
> >>> > > > > Gadget's maintenance of String<->Tree representations is
> >>> concerned...
> >>> > > but
> >>> > > > > seems pretty heavyweight for many types of edits: in-situ
> >>> > modifications
> >>> > > > of
> >>> > > > > text, content reordering, etc. That's particularly so in a
> >>> > > > single-threaded
> >>> > > > > (viz rewriting) environment.
> >>> > > >
> >>> > > > Never having been entirely sold on the concept, I'll let those on
> >>> the
> >>> > > > Caja team who advocate immutability explain why.
> >>> > > >
> >>> > >
> >>> >
> >>>
> >>
> >>
> >
>

Re: Content Rewriter Modularization: Design/Change

Posted by John Hjelmstad <fa...@google.com>.
Re-responding in order to apply the last few exchanges to
google-caja-discuss@ (@gmail vs. @google membership issues).

On Tue, Aug 12, 2008 at 4:48 PM, John Hjelmstad <fa...@google.com> wrote:

> Hello,
>
> While beginning to refactor the rewriter APIs I've discovered that there
> unfortunately is one semantic difference inherent to moving getContent() and
> setContent() methods into the Gadget object (replacing
> View.get/setRewrittenContent()): BasicGadgetSpecFactory no longer caches
> rewritten content.
>
> I've written a discussion of this in issue SHINDIG-500, which tracks this
> implementation sub-task: https://issues.apache.org/jira/browse/SHINDIG-500
>
> To summarize:
> 1. Is this change acceptable for the time being?
> 2. I suggest that we can, at a later date, move fetching of gadget specs
> into GadgetServer while injecting a Gadget(Spec) cache there as well,
> offering finer-tuned control over caching characteristics.
>
> Thanks,
> John
>
>
> On Mon, Aug 11, 2008 at 2:20 PM, John Hjelmstad <fa...@google.com> wrote:
>
>> I understand these concerns, and should be clear that I don't (despite my
>> personal interest in experimenting with the idea, agreed that we don't have
>> time for it at the moment) have any plans to introduce this sort of RPC
>> anywhere - certainly not in Shindig itself, as any such call would be hidden
>> behind an interface anyway.
>>
>> Putting the RPC hypothetical aside, I still feel that there's value to
>> implementing HTML parsing in terms of an interface:
>> * Clearer separation of concerns/boundary between projects.
>>   - Corollary simplicity in testing.
>> * Clearer API for content manipulation (that doesn't require knowledge of
>> Caja).
>>
>> I could be convinced otherwise, but at this point the code involved seems
>> of manageable size, so still worth doing. Thoughts?
>>
>> John
>>
>>
>>
>> On Mon, Aug 11, 2008 at 1:00 PM, Kevin Brown <et...@google.com> wrote:
>>
>>> I agree with Louis -- that's just not practical. Every rewriting
>>> operation
>>> must work in real time. Caja's existing html parser is adequate for our
>>> needs, and we shouldn't go out of our way to tolerate every oddity of
>>> random
>>> web browsers (especially as it simply wouldn't work unless you farmed it
>>> out
>>> to *every* browser). Any new code needs to be grounded in practical,
>>> current
>>> needs, not theoretical options. We can always change code later if we
>>> find a
>>> real need for something like that. We have real work to do in the
>>> meantime.
>>>
>>> On Mon, Aug 11, 2008 at 12:06 PM, Louis Ryan <lr...@google.com> wrote:
>>>
>>> > John,
>>> >
>>> > From a practicality standpoint I'm a little nervous about this plan to
>>> make
>>> > RPCs calls out of a Java process to a native process to fetch a parse
>>> tree
>>> > for transformations that have to occur realtime. I don't think the
>>> > motivating factor here is to accept all inputs that browsers can.
>>> Gadget
>>> > developers will tailor their markup to the platform as they have done
>>> > already. I would greatly prefer us to pick one 'good' parser and stick
>>> with
>>> > it for all the manageability and consumability benefits that come with
>>> that
>>> > decision. Perhaps Im missing something here?
>>> >
>>> > -Louis
>>> >
>>> > On Mon, Aug 11, 2008 at 11:59 AM, John Hjelmstad <fa...@google.com>
>>> wrote:
>>> >
>>> > > On Fri, Aug 8, 2008 at 6:10 AM, Ben Laurie <be...@google.com> wrote:
>>> > >
>>> > > > [+google-caja-discuss]
>>> > > >
>>> > > > On Thu, Aug 7, 2008 at 9:27 PM, John Hjelmstad <fa...@google.com>
>>> > wrote:
>>> > > > > On Thu, Aug 7, 2008 at 3:20 AM, Ben Laurie <be...@google.com>
>>> wrote:
>>> > > > >
>>> > > > >> On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <
>>> fargo@google.com>
>>> > > > wrote:
>>> > > > >> > This proposal effectively enables the renderer to become a
>>> > > multi-pass
>>> > > > >> > compiler for gadget content (essentially, arbitrary web
>>> content).
>>> > > Such
>>> > > > a
>>> > > > >> > compiler can provide several benefits: static optimization of
>>> > gadget
>>> > > > >> content
>>> > > > >> > (auto-proxying of images, whitespace/comment removal,
>>> > consolidation
>>> > > of
>>> > > > >> CSS
>>> > > > >> > blocks), security benefits (caja et al), new functionality
>>> > > (annotation
>>> > > > of
>>> > > > >> > content for stats, document analysis, container-specific
>>> > features),
>>> > > > etc.
>>> > > > >> To
>>> > > > >> > my knowledge no such infrastructure exists today (with the
>>> > possible
>>> > > > >> > exception of Caja itself, which I'd like to dovetail with this
>>> > > work).
>>> > > > >>
>>> > > > >> Caja clearly provides a large chunk of the code you'd need for
>>> this.
>>> > > > >> I'd like to hear how we'd manage to avoid duplication between
>>> the
>>> > two
>>> > > > >> projects.
>>> > > > >>
>>> > > > >> A generalised framework for manipulating content sounds like a
>>> great
>>> > > > >> idea, but probably should not live in either of the two projects
>>> > (Caja
>>> > > > >> and Shindig) but rather should be shared by both of them, I
>>> suspect.
>>> > > > >
>>> > > > >
>>> > > > > I agree on both counts. As I mentioned, the piece of this idea
>>> that I
>>> > > > expect
>>> > > > > to change the most is the parse tree, and Caja's .parser.html and
>>> > > > > .parser.css packages contain much of what I've thrown in here as
>>> a
>>> > > base.
>>> > > > >
>>> > > > > My key requirements are:
>>> > > > > * Lightweight framework.
>>> > > > > * Parser modularity, mostly for HTML parsers (to re-use the good
>>> work
>>> > > > done
>>> > > > > by WebKit or Gecko.. CSS/JS can come direct from Caja I'd bet)
>>> > > > > * Automatic maintenance of DOM<->String conversion.
>>> > > > > * Easy to manipulate structure.
>>> > > >
>>> > > > I'm not sure what the value of parser modularity is? If the
>>> resulting
>>> > > > tree is different, then that's a problem for people processing the
>>> > > > tree. And if it is not, then why do we care?
>>> > >
>>> > >
>>> > > IMO the value of parser modularity is that the lenient parsers native
>>> to
>>> > > browsers can be used in place of those that might not accept all
>>> inputs.
>>> > > One
>>> > > could (and I'd like to) adapt WebKit or Gecko's parsing code into a
>>> > server
>>> > > that runs parallel to Shindig and provides a "local RPC" service for
>>> > > parsing
>>> > > semi-structured HTML. The resulting tree for WebKit's parser might be
>>> > > different than that for an XHTML parser, Gecko's parser, etc, but if
>>> the
>>> > > algorithm implemented atop it is rule-based rather than
>>> strict-structure
>>> > > based that should be fine, no?
>>> > >
>>> > >
>>> > > >
>>> > > >
>>> > > > >
>>> > > > > I'd love to see both projects share the same base syntax tree
>>> > > > > representations. I considered .parser.html(.DomTree) and
>>> .parser.css
>>> > > for
>>> > > > > these, but at the moment these appeared to be a little more tied
>>> to
>>> > > > Caja's
>>> > > > > lexer/parser implementation than I preferred (though I admit
>>> > > > > AbstractParseTreeNode contains most of what's needed).
>>> > > > >
>>> > > > > To be sure, I don't see this as an end-all-be-all transformation
>>> > system
>>> > > > in
>>> > > > > any way. I'd just like to put *something* reasonable in place
>>> that we
>>> > > can
>>> > > > > play with, provide some benefit, and enhance into a truly
>>> > sophisticated
>>> > > > > vision of document rewriting.
>>> > > > >
>>> > > > >
>>> > > > >>
>>> > > > >>
>>> > > > >> >  c. Add Gadget.getParsedContent().
>>> > > > >> >    i. Returns a mutable GadgetContentParseTree used to
>>> manipulate
>>> > > > Gadget
>>> > > > >> > Contents.
>>> > > > >> >    ii. Mutable tree calls back to the Gadget object indicating
>>> > when
>>> > > > any
>>> > > > >> > change is made, and emits an error if setContent() has been
>>> called
>>> > > in
>>> > > > the
>>> > > > >> > interim.
>>> > > > >>
>>> > > > >> In Caja we have been moving towards immutable trees...
>>> > > > >
>>> > > > >
>>> > > > > Interested to hear more about this. The whole idea is for the
>>> > gadget's
>>> > > > tree
>>> > > > > representation to be modifiable. Doing that with immutable trees
>>> to
>>> > me
>>> > > > > suggests that a rewriter would have to create a completely new
>>> tree
>>> > and
>>> > > > set
>>> > > > > it as a representation of new content. That's convenient as far
>>> as
>>> > the
>>> > > > > Gadget's maintenance of String<->Tree representations is
>>> concerned...
>>> > > but
>>> > > > > seems pretty heavyweight for many types of edits: in-situ
>>> > modifications
>>> > > > of
>>> > > > > text, content reordering, etc. That's particularly so in a
>>> > > > single-threaded
>>> > > > > (viz rewriting) environment.
>>> > > >
>>> > > > Never having been entirely sold on the concept, I'll let those on
>>> the
>>> > > > Caja team who advocate immutability explain why.
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>

Re: Content Rewriter Modularization: Design/Change

Posted by John Hjelmstad <fa...@google.com>.
Hello,

While beginning to refactor the rewriter APIs I've discovered that there
unfortunately is one semantic difference inherent to moving getContent() and
setContent() methods into the Gadget object (replacing
View.get/setRewrittenContent()): BasicGadgetSpecFactory no longer caches
rewritten content.

I've written a discussion of this in issue SHINDIG-500, which tracks this
implementation sub-task: https://issues.apache.org/jira/browse/SHINDIG-500

To summarize:
1. Is this change acceptable for the time being?
2. I suggest that we can, at a later date, move fetching of gadget specs
into GadgetServer while injecting a Gadget(Spec) cache there as well,
offering finer-tuned control over caching characteristics.
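
As a rough illustration of point 2, here is a minimal sketch of a server that owns spec fetching and consults an injected cache. Every name in it (SpecFetcher, SpecCache, InMemorySpecCache, GadgetServerSketch) is hypothetical and chosen for this example; none of it is Shindig's actual API, and a real cache would presumably also need expiry and size limits.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical names throughout; these are not Shindig's real types.

/** Fetches the raw gadget spec XML for a URL (stands in for the HTTP fetch). */
interface SpecFetcher {
  String fetch(String specUrl);
}

/** Pluggable cache, so each deployment can pick its own caching behaviour. */
interface SpecCache {
  String get(String specUrl);                 // returns null on a miss
  void put(String specUrl, String specXml);
}

/** Simplest possible cache: unbounded and never expires (assumption for brevity). */
class InMemorySpecCache implements SpecCache {
  private final Map<String, String> entries = new HashMap<>();

  public String get(String specUrl) {
    return entries.get(specUrl);
  }

  public void put(String specUrl, String specXml) {
    entries.put(specUrl, specXml);
  }
}

/** Server-side entry point that fetches specs and consults the injected cache. */
class GadgetServerSketch {
  private final SpecFetcher fetcher;
  private final SpecCache cache;

  GadgetServerSketch(SpecFetcher fetcher, SpecCache cache) {
    this.fetcher = fetcher;
    this.cache = cache;
  }

  String getSpec(String specUrl) {
    String cached = cache.get(specUrl);
    if (cached != null) {
      return cached;                          // cache hit: no fetch needed
    }
    String fetched = fetcher.fetch(specUrl);  // cache miss: fetch, then remember
    cache.put(specUrl, fetched);
    return fetched;
  }
}

class SpecCacheDemo {
  public static void main(String[] args) {
    int[] fetchCount = {0};
    SpecFetcher countingFetcher = url -> {
      fetchCount[0]++;
      return "<Module/>";
    };
    GadgetServerSketch server =
        new GadgetServerSketch(countingFetcher, new InMemorySpecCache());
    server.getSpec("http://example.org/gadget.xml");
    server.getSpec("http://example.org/gadget.xml");
    System.out.println("fetches: " + fetchCount[0]);
  }
}
```

In the demo the second getSpec() call is served from the cache, so the fetcher runs only once. Swapping InMemorySpecCache for another SpecCache implementation is where the finer-tuned control would come from.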

Thanks,
John

On Mon, Aug 11, 2008 at 2:20 PM, John Hjelmstad <fa...@google.com> wrote:

> I understand these concerns, and should be clear that I don't (despite my
> personal interest in experimenting with the idea, agreed that we don't have
> time for it at the moment) have any plans to introduce this sort of RPC
> anywhere - certainly not in Shindig itself, as any such call would be hidden
> behind an interface anyway.
>
> Putting the RPC hypothetical aside, I still feel that there's value to
> implementing HTML parsing in terms of an interface:
> * Clearer separation of concerns/boundary between projects.
>   - Corollary simplicity in testing.
> * Clearer API for content manipulation (that doesn't require knowledge of
> Caja).
>
> I could be convinced otherwise, but at this point the code involved seems
> of manageable size, so still worth doing. Thoughts?
>
> John
>
>
>
> On Mon, Aug 11, 2008 at 1:00 PM, Kevin Brown <et...@google.com> wrote:
>
>> I agree with Louis -- that's just not practical. Every rewriting operation
>> must work in real time. Caja's existing html parser is adequate for our
>> needs, and we shouldn't go out of our way to tolerate every oddity of
>> random
>> web browsers (especially as it simply wouldn't work unless you farmed it
>> out
>> to *every* browser). Any new code needs to be grounded in practical,
>> current
>> needs, not theoretical options. We can always change code later if we find
>> a
>> real need for something like that. We have real work to do in the
>> meantime.
>>
>> On Mon, Aug 11, 2008 at 12:06 PM, Louis Ryan <lr...@google.com> wrote:
>>
>> > John,
>> >
>> > From a practicality standpoint I'm a little nervous about this plan to
>> make
>> > RPCs calls out of a Java process to a native process to fetch a parse
>> tree
>> > for transformations that have to occur realtime. I don't think the
>> > motivating factor here is to accept all inputs that browsers can. Gadget
>> > developers will tailor their markup to the platform as they have done
>> > already. I would greatly prefer us to pick one 'good' parser and stick
>> with
>> > it for all the manageability and consumability benefits that come with
>> that
>> > decision. Perhaps Im missing something here?
>> >
>> > -Louis
>> >
>> > On Mon, Aug 11, 2008 at 11:59 AM, John Hjelmstad <fa...@google.com>
>> wrote:
>> >
>> > > On Fri, Aug 8, 2008 at 6:10 AM, Ben Laurie <be...@google.com> wrote:
>> > >
>> > > > [+google-caja-discuss]
>> > > >
>> > > > On Thu, Aug 7, 2008 at 9:27 PM, John Hjelmstad <fa...@google.com>
>> > wrote:
>> > > > > On Thu, Aug 7, 2008 at 3:20 AM, Ben Laurie <be...@google.com>
>> wrote:
>> > > > >
>> > > > >> On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <
>> fargo@google.com>
>> > > > wrote:
>> > > > >> > This proposal effectively enables the renderer to become a
>> > > multi-pass
>> > > > >> > compiler for gadget content (essentially, arbitrary web
>> content).
>> > > Such
>> > > > a
>> > > > >> > compiler can provide several benefits: static optimization of
>> > gadget
>> > > > >> content
>> > > > >> > (auto-proxying of images, whitespace/comment removal,
>> > consolidation
>> > > of
>> > > > >> CSS
>> > > > >> > blocks), security benefits (caja et al), new functionality
>> > > (annotation
>> > > > of
>> > > > >> > content for stats, document analysis, container-specific
>> > features),
>> > > > etc.
>> > > > >> To
>> > > > >> > my knowledge no such infrastructure exists today (with the
>> > possible
>> > > > >> > exception of Caja itself, which I'd like to dovetail with this
>> > > work).
>> > > > >>
>> > > > >> Caja clearly provides a large chunk of the code you'd need for
>> this.
>> > > > >> I'd like to hear how we'd manage to avoid duplication between the
>> > two
>> > > > >> projects.
>> > > > >>
>> > > > >> A generalised framework for manipulating content sounds like a
>> great
>> > > > >> idea, but probably should not live in either of the two projects
>> > (Caja
>> > > > >> and Shindig) but rather should be shared by both of them, I
>> suspect.
>> > > > >
>> > > > >
>> > > > > I agree on both counts. As I mentioned, the piece of this idea
>> that I
>> > > > expect
>> > > > > to change the most is the parse tree, and Caja's .parser.html and
>> > > > > .parser.css packages contain much of what I've thrown in here as a
>> > > base.
>> > > > >
>> > > > > My key requirements are:
>> > > > > * Lightweight framework.
>> > > > > * Parser modularity, mostly for HTML parsers (to re-use the good
>> work
>> > > > done
>> > > > > by WebKit or Gecko.. CSS/JS can come direct from Caja I'd bet)
>> > > > > * Automatic maintenance of DOM<->String conversion.
>> > > > > * Easy to manipulate structure.
>> > > >
>> > > > I'm not sure what the value of parser modularity is? If the
>> resulting
>> > > > tree is different, then that's a problem for people processing the
>> > > > tree. And if it is not, then why do we care?
>> > >
>> > >
>> > > IMO the value of parser modularity is that the lenient parsers native
>> to
>> > > browsers can be used in place of those that might not accept all
>> inputs.
>> > > One
>> > > could (and I'd like to) adapt WebKit or Gecko's parsing code into a
>> > server
>> > > that runs parallel to Shindig and provides a "local RPC" service for
>> > > parsing
>> > > semi-structured HTML. The resulting tree for WebKit's parser might be
>> > > different than that for an XHTML parser, Gecko's parser, etc, but if
>> the
>> > > algorithm implemented atop it is rule-based rather than
>> strict-structure
>> > > based that should be fine, no?
>> > >
>> > >
>> > > >
>> > > >
>> > > > >
>> > > > > I'd love to see both projects share the same base syntax tree
>> > > > > representations. I considered .parser.html(.DomTree) and
>> .parser.css
>> > > for
>> > > > > these, but at the moment these appeared to be a little more tied
>> to
>> > > > Caja's
>> > > > > lexer/parser implementation than I preferred (though I admit
>> > > > > AbstractParseTreeNode contains most of what's needed).
>> > > > >
>> > > > > To be sure, I don't see this as an end-all-be-all transformation
>> > system
>> > > > in
>> > > > > any way. I'd just like to put *something* reasonable in place that
>> we
>> > > can
>> > > > > play with, provide some benefit, and enhance into a truly
>> > sophisticated
>> > > > > vision of document rewriting.
>> > > > >
>> > > > >
>> > > > >>
>> > > > >>
>> > > > >> >  c. Add Gadget.getParsedContent().
>> > > > >> >    i. Returns a mutable GadgetContentParseTree used to
>> manipulate
>> > > > Gadget
>> > > > >> > Contents.
>> > > > >> >    ii. Mutable tree calls back to the Gadget object indicating
>> > when
>> > > > any
>> > > > >> > change is made, and emits an error if setContent() has been
>> called
>> > > in
>> > > > the
>> > > > >> > interim.
>> > > > >>
>> > > > >> In Caja we have been moving towards immutable trees...
>> > > > >
>> > > > >
>> > > > > Interested to hear more about this. The whole idea is for the
>> > gadget's
>> > > > tree
>> > > > > representation to be modifiable. Doing that with immutable trees
>> to
>> > me
>> > > > > suggests that a rewriter would have to create a completely new
>> tree
>> > and
>> > > > set
>> > > > > it as a representation of new content. That's convenient as far as
>> > the
>> > > > > Gadget's maintenance of String<->Tree representations is
>> concerned...
>> > > but
>> > > > > seems pretty heavyweight for many types of edits: in-situ
>> > modifications
>> > > > of
>> > > > > text, content reordering, etc. That's particularly so in a
>> > > > single-threaded
>> > > > > (viz rewriting) environment.
>> > > >
>> > > > Never having been entirely sold on the concept, I'll let those on
>> the
>> > > > Caja team who advocate immutability explain why.
>> > > >
>> > >
>> >
>>
>
>

Re: Content Rewriter Modularization: Design/Change

Posted by John Hjelmstad <fa...@google.com>.
I understand these concerns, and should be clear that I have no plans to
introduce this sort of RPC anywhere, and certainly not in Shindig itself.
Despite my personal interest in experimenting with the idea, I agree that we
don't have time for it at the moment. Any such call would be hidden behind an
interface anyway.

Putting the RPC hypothetical aside, I still feel that there's value to
implementing HTML parsing in terms of an interface:
* Clearer separation of concerns/boundary between projects.
  - Corollary simplicity in testing.
* Clearer API for content manipulation (that doesn't require knowledge of
Caja).

I could be convinced otherwise, but at this point the code involved seems of
manageable size, so still worth doing. Thoughts?
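
To make the testing point concrete, here is a minimal sketch of what I have in mind. The names (ParsedNode, HtmlTreeParser, ImageCounter) are hypothetical and invented for this example; they are not Caja's or Shindig's real types. The idea is only that code written against the interface can be exercised with a canned tree, without pulling in any particular parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names; not Caja's or Shindig's real types.

/** Minimal tree node exposed to rewriters; carries no parser-specific types. */
class ParsedNode {
  final String tagName;
  final List<ParsedNode> children = new ArrayList<>();

  ParsedNode(String tagName) {
    this.tagName = tagName;
  }

  /** Appends a child and returns this node, so trees can be built fluently. */
  ParsedNode add(ParsedNode child) {
    children.add(child);
    return this;
  }
}

/** The boundary between projects: any HTML parser can sit behind this. */
interface HtmlTreeParser {
  ParsedNode parse(String html);
}

/** Example consumer that depends only on the interface, not on a parser. */
class ImageCounter {
  private final HtmlTreeParser parser;

  ImageCounter(HtmlTreeParser parser) {
    this.parser = parser;
  }

  int countImages(String html) {
    return count(parser.parse(html));
  }

  private static int count(ParsedNode node) {
    int total = "img".equals(node.tagName) ? 1 : 0;
    for (ParsedNode child : node.children) {
      total += count(child);
    }
    return total;
  }
}

class ParserInterfaceDemo {
  public static void main(String[] args) {
    // Stub parser for tests: ignores its input and returns a fixed tree.
    HtmlTreeParser stub = html ->
        new ParsedNode("div")
            .add(new ParsedNode("img"))
            .add(new ParsedNode("p").add(new ParsedNode("img")));
    System.out.println(new ImageCounter(stub).countImages("ignored"));
  }
}
```

The stub tree holds two img nodes, so the demo prints 2. A Caja-backed implementation would be one more class implementing HtmlTreeParser, with no change to ImageCounter.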

John


On Mon, Aug 11, 2008 at 1:00 PM, Kevin Brown <et...@google.com> wrote:

> I agree with Louis -- that's just not practical. Every rewriting operation
> must work in real time. Caja's existing html parser is adequate for our
> needs, and we shouldn't go out of our way to tolerate every oddity of
> random
> web browsers (especially as it simply wouldn't work unless you farmed it
> out
> to *every* browser). Any new code needs to be grounded in practical,
> current
> needs, not theoretical options. We can always change code later if we find
> a
> real need for something like that. We have real work to do in the meantime.
>
> On Mon, Aug 11, 2008 at 12:06 PM, Louis Ryan <lr...@google.com> wrote:
>
> > John,
> >
> > From a practicality standpoint I'm a little nervous about this plan to
> make
> > RPCs calls out of a Java process to a native process to fetch a parse
> tree
> > for transformations that have to occur realtime. I don't think the
> > motivating factor here is to accept all inputs that browsers can. Gadget
> > developers will tailor their markup to the platform as they have done
> > already. I would greatly prefer us to pick one 'good' parser and stick
> with
> > it for all the manageability and consumability benefits that come with
> that
> > decision. Perhaps Im missing something here?
> >
> > -Louis
> >
> > On Mon, Aug 11, 2008 at 11:59 AM, John Hjelmstad <fa...@google.com>
> wrote:
> >
> > > On Fri, Aug 8, 2008 at 6:10 AM, Ben Laurie <be...@google.com> wrote:
> > >
> > > > [+google-caja-discuss]
> > > >
> > > > On Thu, Aug 7, 2008 at 9:27 PM, John Hjelmstad <fa...@google.com>
> > wrote:
> > > > > On Thu, Aug 7, 2008 at 3:20 AM, Ben Laurie <be...@google.com>
> wrote:
> > > > >
> > > > >> On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <fargo@google.com
> >
> > > > wrote:
> > > > >> > This proposal effectively enables the renderer to become a
> > > multi-pass
> > > > >> > compiler for gadget content (essentially, arbitrary web
> content).
> > > Such
> > > > a
> > > > >> > compiler can provide several benefits: static optimization of
> > gadget
> > > > >> content
> > > > >> > (auto-proxying of images, whitespace/comment removal,
> > consolidation
> > > of
> > > > >> CSS
> > > > >> > blocks), security benefits (caja et al), new functionality
> > > (annotation
> > > > of
> > > > >> > content for stats, document analysis, container-specific
> > features),
> > > > etc.
> > > > >> To
> > > > >> > my knowledge no such infrastructure exists today (with the
> > possible
> > > > >> > exception of Caja itself, which I'd like to dovetail with this
> > > work).
> > > > >>
> > > > >> Caja clearly provides a large chunk of the code you'd need for
> this.
> > > > >> I'd like to hear how we'd manage to avoid duplication between the
> > two
> > > > >> projects.
> > > > >>
> > > > >> A generalised framework for manipulating content sounds like a
> great
> > > > >> idea, but probably should not live in either of the two projects
> > (Caja
> > > > >> and Shindig) but rather should be shared by both of them, I
> suspect.
> > > > >
> > > > >
> > > > > I agree on both counts. As I mentioned, the piece of this idea that
> I
> > > > expect
> > > > > to change the most is the parse tree, and Caja's .parser.html and
> > > > > .parser.css packages contain much of what I've thrown in here as a
> > > base.
> > > > >
> > > > > My key requirements are:
> > > > > * Lightweight framework.
> > > > > * Parser modularity, mostly for HTML parsers (to re-use the good
> work
> > > > done
> > > > > by WebKit or Gecko.. CSS/JS can come direct from Caja I'd bet)
> > > > > * Automatic maintenance of DOM<->String conversion.
> > > > > * Easy to manipulate structure.
> > > >
> > > > I'm not sure what the value of parser modularity is? If the resulting
> > > > tree is different, then that's a problem for people processing the
> > > > tree. And if it is not, then why do we care?
> > >
> > >
> > > IMO the value of parser modularity is that the lenient parsers native
> to
> > > browsers can be used in place of those that might not accept all
> inputs.
> > > One
> > > could (and I'd like to) adapt WebKit or Gecko's parsing code into a
> > server
> > > that runs parallel to Shindig and provides a "local RPC" service for
> > > parsing
> > > semi-structured HTML. The resulting tree for WebKit's parser might be
> > > different than that for an XHTML parser, Gecko's parser, etc, but if
> the
> > > algorithm implemented atop it is rule-based rather than
> strict-structure
> > > based that should be fine, no?
> > >
> > >
> > > >
> > > >
> > > > >
> > > > > I'd love to see both projects share the same base syntax tree
> > > > > representations. I considered .parser.html(.DomTree) and
> .parser.css
> > > for
> > > > > these, but at the moment these appeared to be a little more tied to
> > > > Caja's
> > > > > lexer/parser implementation than I preferred (though I admit
> > > > > AbstractParseTreeNode contains most of what's needed).
> > > > >
> > > > > To be sure, I don't see this as an end-all-be-all transformation
> > system
> > > > in
> > > > > any way. I'd just like to put *something* reasonable in place that
> we
> > > can
> > > > > play with, provide some benefit, and enhance into a truly
> > sophisticated
> > > > > vision of document rewriting.
> > > > >
> > > > >
> > > > >>
> > > > >>
> > > > >> >  c. Add Gadget.getParsedContent().
> > > > >> >    i. Returns a mutable GadgetContentParseTree used to
> manipulate
> > > > Gadget
> > > > >> > Contents.
> > > > >> >    ii. Mutable tree calls back to the Gadget object indicating
> > when
> > > > any
> > > > >> > change is made, and emits an error if setContent() has been
> called
> > > in
> > > > the
> > > > >> > interim.
> > > > >>
> > > > >> In Caja we have been moving towards immutable trees...
> > > > >
> > > > >
> > > > > Interested to hear more about this. The whole idea is for the
> > gadget's
> > > > tree
> > > > > representation to be modifiable. Doing that with immutable trees to
> > me
> > > > > suggests that a rewriter would have to create a completely new tree
> > and
> > > > set
> > > > > it as a representation of new content. That's convenient as far as
> > the
> > > > > Gadget's maintenance of String<->Tree representations is
> concerned...
> > > but
> > > > > seems pretty heavyweight for many types of edits: in-situ
> > modifications
> > > > of
> > > > > text, content reordering, etc. That's particularly so in a
> > > > single-threaded
> > > > > (viz rewriting) environment.
> > > >
> > > > Never having been entirely sold on the concept, I'll let those on the
> > > > Caja team who advocate immutability explain why.
> > > >
> > >
> >
>

Re: Content Rewriter Modularization: Design/Change

Posted by Kevin Brown <et...@google.com>.
I agree with Louis -- that's just not practical. Every rewriting operation
must work in real time. Caja's existing HTML parser is adequate for our
needs, and we shouldn't go out of our way to tolerate every oddity of random
web browsers (especially as it simply wouldn't work unless you farmed it out
to *every* browser). Any new code needs to be grounded in practical, current
needs, not theoretical options. We can always change code later if we find a
real need for something like that. We have real work to do in the meantime.

On Mon, Aug 11, 2008 at 12:06 PM, Louis Ryan <lr...@google.com> wrote:

> John,
>
> From a practicality standpoint I'm a little nervous about this plan to make
> RPC calls out of a Java process to a native process to fetch a parse tree
> for transformations that have to occur in real time. I don't think the
> motivating factor here is to accept all inputs that browsers can. Gadget
> developers will tailor their markup to the platform as they have done
> already. I would greatly prefer us to pick one 'good' parser and stick with
> it for all the manageability and consumability benefits that come with that
> decision. Perhaps I'm missing something here?
>
> -Louis
>
> On Mon, Aug 11, 2008 at 11:59 AM, John Hjelmstad <fa...@google.com> wrote:
>
> > On Fri, Aug 8, 2008 at 6:10 AM, Ben Laurie <be...@google.com> wrote:
> >
> > > [+google-caja-discuss]
> > >
> > > On Thu, Aug 7, 2008 at 9:27 PM, John Hjelmstad <fa...@google.com> wrote:
> > > > On Thu, Aug 7, 2008 at 3:20 AM, Ben Laurie <be...@google.com> wrote:
> > > >
> > > >> On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <fa...@google.com> wrote:
> > > >> > This proposal effectively enables the renderer to become a multi-pass
> > > >> > compiler for gadget content (essentially, arbitrary web content). Such a
> > > >> > compiler can provide several benefits: static optimization of gadget content
> > > >> > (auto-proxying of images, whitespace/comment removal, consolidation of CSS
> > > >> > blocks), security benefits (caja et al), new functionality (annotation of
> > > >> > content for stats, document analysis, container-specific features), etc. To
> > > >> > my knowledge no such infrastructure exists today (with the possible
> > > >> > exception of Caja itself, which I'd like to dovetail with this work).
> > > >>
> > > >> Caja clearly provides a large chunk of the code you'd need for this.
> > > >> I'd like to hear how we'd manage to avoid duplication between the
> two
> > > >> projects.
> > > >>
> > > >> A generalised framework for manipulating content sounds like a great
> > > >> idea, but probably should not live in either of the two projects
> (Caja
> > > >> and Shindig) but rather should be shared by both of them, I suspect.
> > > >
> > > >
> > > > I agree on both counts. As I mentioned, the piece of this idea that I
> > > expect
> > > > to change the most is the parse tree, and Caja's .parser.html and
> > > > .parser.css packages contain much of what I've thrown in here as a
> > base.
> > > >
> > > > My key requirements are:
> > > > * Lightweight framework.
> > > > * Parser modularity, mostly for HTML parsers (to re-use the good work
> > > done
> > > > by WebKit or Gecko.. CSS/JS can come direct from Caja I'd bet)
> > > > * Automatic maintenance of DOM<->String conversion.
> > > > * Easy to manipulate structure.
> > >
> > > I'm not sure what the value of parser modularity is? If the resulting
> > > tree is different, then that's a problem for people processing the
> > > tree. And if it is not, then why do we care?
> >
> >
> > IMO the value of parser modularity is that the lenient parsers native to
> > browsers can be used in place of those that might not accept all inputs.
> > One
> > could (and I'd like to) adapt WebKit or Gecko's parsing code into a
> server
> > that runs parallel to Shindig and provides a "local RPC" service for
> > parsing
> > semi-structured HTML. The resulting tree for WebKit's parser might be
> > different than that for an XHTML parser, Gecko's parser, etc, but if the
> > algorithm implemented atop it is rule-based rather than strict-structure
> > based that should be fine, no?
> >
> >
> > >
> > >
> > > >
> > > > I'd love to see both projects share the same base syntax tree
> > > > representations. I considered .parser.html(.DomTree) and .parser.css
> > for
> > > > these, but at the moment these appeared to be a little more tied to
> > > Caja's
> > > > lexer/parser implementation than I preferred (though I admit
> > > > AbstractParseTreeNode contains most of what's needed).
> > > >
> > > > To be sure, I don't see this as an end-all-be-all transformation
> system
> > > in
> > > > any way. I'd just like to put *something* reasonable in place that we
> > can
> > > > play with, provide some benefit, and enhance into a truly
> sophisticated
> > > > vision of document rewriting.
> > > >
> > > >
> > > >>
> > > >>
> > > >> >  c. Add Gadget.getParsedContent().
> > > >> >    i. Returns a mutable GadgetContentParseTree used to manipulate
> > > Gadget
> > > >> > Contents.
> > > >> >    ii. Mutable tree calls back to the Gadget object indicating
> when
> > > any
> > > >> > change is made, and emits an error if setContent() has been called
> > in
> > > the
> > > >> > interim.
> > > >>
> > > >> In Caja we have been moving towards immutable trees...
> > > >
> > > >
> > > > Interested to hear more about this. The whole idea is for the
> gadget's
> > > tree
> > > > representation to be modifiable. Doing that with immutable trees to
> me
> > > > suggests that a rewriter would have to create a completely new tree
> and
> > > set
> > > > it as a representation of new content. That's convenient as far as
> the
> > > > Gadget's maintenance of String<->Tree representations is concerned...
> > but
> > > > seems pretty heavyweight for many types of edits: in-situ
> modifications
> > > of
> > > > text, content reordering, etc. That's particularly so in a
> > > single-threaded
> > > > (viz rewriting) environment.
> > >
> > > Never having been entirely sold on the concept, I'll let those on the
> > > Caja team who advocate immutability explain why.
> > >
> >
>

Re: Content Rewriter Modularization: Design/Change

Posted by Louis Ryan <lr...@google.com>.
John,

From a practicality standpoint I'm a little nervous about this plan to make
RPC calls out of a Java process to a native process to fetch a parse tree
for transformations that have to occur in real time. I don't think the
motivating factor here is to accept all inputs that browsers can. Gadget
developers will tailor their markup to the platform as they have done
already. I would greatly prefer us to pick one 'good' parser and stick with
it for all the manageability and consumability benefits that come with that
decision. Perhaps I'm missing something here?

-Louis

On Mon, Aug 11, 2008 at 11:59 AM, John Hjelmstad <fa...@google.com> wrote:

> On Fri, Aug 8, 2008 at 6:10 AM, Ben Laurie <be...@google.com> wrote:
>
> > [+google-caja-discuss]
> >
> > On Thu, Aug 7, 2008 at 9:27 PM, John Hjelmstad <fa...@google.com> wrote:
> > > On Thu, Aug 7, 2008 at 3:20 AM, Ben Laurie <be...@google.com> wrote:
> > >
> > >> On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <fa...@google.com>
> > wrote:
> > >> > This proposal effectively enables the renderer to become a
> multi-pass
> > >> > compiler for gadget content (essentially, arbitrary web content).
> Such
> > a
> > >> > compiler can provide several benefits: static optimization of gadget
> > >> content
> > >> > (auto-proxying of images, whitespace/comment removal, consolidation
> of
> > >> CSS
> > >> > blocks), security benefits (caja et al), new functionality
> (annotation
> > of
> > >> > content for stats, document analysis, container-specific features),
> > etc.
> > >> To
> > >> > my knowledge no such infrastructure exists today (with the possible
> > >> > exception of Caja itself, which I'd like to dovetail with this
> work).
> > >>
> > >> Caja clearly provides a large chunk of the code you'd need for this.
> > >> I'd like to hear how we'd manage to avoid duplication between the two
> > >> projects.
> > >>
> > >> A generalised framework for manipulating content sounds like a great
> > >> idea, but probably should not live in either of the two projects (Caja
> > >> and Shindig) but rather should be shared by both of them, I suspect.
> > >
> > >
> > > I agree on both counts. As I mentioned, the piece of this idea that I
> > expect
> > > to change the most is the parse tree, and Caja's .parser.html and
> > > .parser.css packages contain much of what I've thrown in here as a
> base.
> > >
> > > My key requirements are:
> > > * Lightweight framework.
> > > * Parser modularity, mostly for HTML parsers (to re-use the good work
> > done
> > > by WebKit or Gecko.. CSS/JS can come direct from Caja I'd bet)
> > > * Automatic maintenance of DOM<->String conversion.
> > > * Easy to manipulate structure.
> >
> > I'm not sure what the value of parser modularity is? If the resulting
> > tree is different, then that's a problem for people processing the
> > tree. And if it is not, then why do we care?
>
>
> IMO the value of parser modularity is that the lenient parsers native to
> browsers can be used in place of those that might not accept all inputs.
> One
> could (and I'd like to) adapt WebKit or Gecko's parsing code into a server
> that runs parallel to Shindig and provides a "local RPC" service for
> parsing
> semi-structured HTML. The resulting tree for WebKit's parser might be
> different than that for an XHTML parser, Gecko's parser, etc, but if the
> algorithm implemented atop it is rule-based rather than strict-structure
> based that should be fine, no?
>
>
> >
> >
> > >
> > > I'd love to see both projects share the same base syntax tree
> > > representations. I considered .parser.html(.DomTree) and .parser.css
> for
> > > these, but at the moment these appeared to be a little more tied to
> > Caja's
> > > lexer/parser implementation than I preferred (though I admit
> > > AbstractParseTreeNode contains most of what's needed).
> > >
> > > To be sure, I don't see this as an end-all-be-all transformation system
> > in
> > > any way. I'd just like to put *something* reasonable in place that we
> can
> > > play with, provide some benefit, and enhance into a truly sophisticated
> > > vision of document rewriting.
> > >
> > >
> > >>
> > >>
> > >> >  c. Add Gadget.getParsedContent().
> > >> >    i. Returns a mutable GadgetContentParseTree used to manipulate
> > Gadget
> > >> > Contents.
> > >> >    ii. Mutable tree calls back to the Gadget object indicating when
> > any
> > >> > change is made, and emits an error if setContent() has been called
> in
> > the
> > >> > interim.
> > >>
> > >> In Caja we have been moving towards immutable trees...
> > >
> > >
> > > Interested to hear more about this. The whole idea is for the gadget's
> > tree
> > > representation to be modifiable. Doing that with immutable trees to me
> > > suggests that a rewriter would have to create a completely new tree and
> > set
> > > it as a representation of new content. That's convenient as far as the
> > > Gadget's maintenance of String<->Tree representations is concerned...
> but
> > > seems pretty heavyweight for many types of edits: in-situ modifications
> > of
> > > text, content reordering, etc. That's particularly so in a
> > single-threaded
> > > (viz rewriting) environment.
> >
> > Never having been entirely sold on the concept, I'll let those on the
> > Caja team who advocate immutability explain why.
> >
>

Re: Content Rewriter Modularization: Design/Change

Posted by John Hjelmstad <fa...@google.com>.
On Fri, Aug 8, 2008 at 6:10 AM, Ben Laurie <be...@google.com> wrote:

> [+google-caja-discuss]
>
> On Thu, Aug 7, 2008 at 9:27 PM, John Hjelmstad <fa...@google.com> wrote:
> > On Thu, Aug 7, 2008 at 3:20 AM, Ben Laurie <be...@google.com> wrote:
> >
> >> On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <fa...@google.com>
> wrote:
> >> > This proposal effectively enables the renderer to become a multi-pass
> >> > compiler for gadget content (essentially, arbitrary web content). Such
> a
> >> > compiler can provide several benefits: static optimization of gadget
> >> content
> >> > (auto-proxying of images, whitespace/comment removal, consolidation of
> >> CSS
> >> > blocks), security benefits (caja et al), new functionality (annotation
> of
> >> > content for stats, document analysis, container-specific features),
> etc.
> >> To
> >> > my knowledge no such infrastructure exists today (with the possible
> >> > exception of Caja itself, which I'd like to dovetail with this work).
> >>
> >> Caja clearly provides a large chunk of the code you'd need for this.
> >> I'd like to hear how we'd manage to avoid duplication between the two
> >> projects.
> >>
> >> A generalised framework for manipulating content sounds like a great
> >> idea, but probably should not live in either of the two projects (Caja
> >> and Shindig) but rather should be shared by both of them, I suspect.
> >
> >
> > I agree on both counts. As I mentioned, the piece of this idea that I
> expect
> > to change the most is the parse tree, and Caja's .parser.html and
> > .parser.css packages contain much of what I've thrown in here as a base.
> >
> > My key requirements are:
> > * Lightweight framework.
> > * Parser modularity, mostly for HTML parsers (to re-use the good work
> done
> > by WebKit or Gecko.. CSS/JS can come direct from Caja I'd bet)
> > * Automatic maintenance of DOM<->String conversion.
> > * Easy to manipulate structure.
>
> I'm not sure what the value of parser modularity is? If the resulting
> tree is different, then that's a problem for people processing the
> tree. And if it is not, then why do we care?


IMO the value of parser modularity is that the lenient parsers native to
browsers can be used in place of those that might not accept all inputs. One
could (and I'd like to) adapt WebKit or Gecko's parsing code into a server
that runs parallel to Shindig and provides a "local RPC" service for parsing
semi-structured HTML. The resulting tree for WebKit's parser might be
different than that for an XHTML parser, Gecko's parser, etc., but if the
algorithm implemented atop it is rule-based rather than
strict-structure-based, that should be fine, no?
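
A minimal sketch of the idea above, assuming a hypothetical HtmlParser
interface and Node type (neither is Shindig's nor Caja's actual API). Two
stand-in parsers return differently shaped trees for the same input, and a
rule-based pass that keys only on tag names gives the same answer for both.
The stub parsers ignore their input and return canned trees purely for
illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal tree node, invented for this sketch.
final class Node {
  final String tag;                              // null for text nodes
  final String text;                             // non-null only for text nodes
  final List<Node> children = new ArrayList<>();

  private Node(String tag, String text) {
    this.tag = tag;
    this.text = text;
  }

  static Node element(String tag, Node... kids) {
    Node n = new Node(tag, null);
    for (Node k : kids) {
      n.children.add(k);
    }
    return n;
  }

  static Node text(String t) {
    return new Node(null, t);
  }
}

// Hypothetical seam: a strict in-process parser, or a client for an
// out-of-process lenient parser, could both sit behind this interface.
interface HtmlParser {
  Node parse(String html);
}

// Rule-based pass: visits every node and matches on tag name only, so it
// makes no assumption about where in the tree an <img> appears.
final class ImgCounter {
  static int countImages(Node root) {
    int count = "img".equals(root.tag) ? 1 : 0;
    for (Node child : root.children) {
      count += countImages(child);
    }
    return count;
  }
}

final class ParserModularityDemo {
  public static void main(String[] args) {
    // Stub "strict" parser: returns the fragment as written.
    HtmlParser strict = html -> Node.element("div", Node.element("img"));
    // Stub "lenient" parser: wraps the fragment in html/body, as a
    // browser-style parser might.
    HtmlParser lenient = html -> Node.element("html",
        Node.element("body", Node.element("div", Node.element("img"))));

    String input = "<div><img></div>";
    System.out.println(ImgCounter.countImages(strict.parse(input)));
    System.out.println(ImgCounter.countImages(lenient.parse(input)));
  }
}
```

A pass that instead indexed into fixed positions (say, "the first child of
the root is the div") would break when the parser is swapped, which is the
concern raised earlier in the thread.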


>
>
> >
> > I'd love to see both projects share the same base syntax tree
> > representations. I considered .parser.html(.DomTree) and .parser.css for
> > these, but at the moment these appeared to be a little more tied to
> Caja's
> > lexer/parser implementation than I preferred (though I admit
> > AbstractParseTreeNode contains most of what's needed).
> >
> > To be sure, I don't see this as an end-all-be-all transformation system
> in
> > any way. I'd just like to put *something* reasonable in place that we can
> > play with, provide some benefit, and enhance into a truly sophisticated
> > vision of document rewriting.
> >
> >
> >>
> >>
> >> >  c. Add Gadget.getParsedContent().
> >> >    i. Returns a mutable GadgetContentParseTree used to manipulate
> Gadget
> >> > Contents.
> >> >    ii. Mutable tree calls back to the Gadget object indicating when
> any
> >> > change is made, and emits an error if setContent() has been called in
> the
> >> > interim.
> >>
> >> In Caja we have been moving towards immutable trees...
> >
> >
> > Interested to hear more about this. The whole idea is for the gadget's
> tree
> > representation to be modifiable. Doing that with immutable trees to me
> > suggests that a rewriter would have to create a completely new tree and
> set
> > it as a representation of new content. That's convenient as far as the
> > Gadget's maintenance of String<->Tree representations is concerned... but
> > seems pretty heavyweight for many types of edits: in-situ modifications
> of
> > text, content reordering, etc. That's particularly so in a
> single-threaded
> > (viz rewriting) environment.
>
> Never having been entirely sold on the concept, I'll let those on the
> Caja team who advocate immutability explain why.
>

Re: [Caja] Re: Content Rewriter Modularization: Design/Change

Posted by ih...@gmail.com.
On Fri, Aug 8, 2008 at 7:58 AM, Mark S. Miller <er...@google.com> wrote:
> Much easier understandability & enforcement of invariants.

And, that being the case, it is possible to alias an immutable node
from several trees without fear of inconsistent mutations later on.
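
As a rough illustration of the aliasing point, here is a small sketch in
which one immutable subtree is shared by two parent trees. ImmutableNode is
a hypothetical class written for this example, not Caja's actual parse tree
type.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical immutable node: fields are final and the child list cannot
// be modified after construction.
final class ImmutableNode {
  private final String label;
  private final List<ImmutableNode> children;

  ImmutableNode(String label, ImmutableNode... children) {
    this.label = label;
    // Copy the array so later changes by the caller cannot leak in, then
    // wrap it so the list itself rejects modification.
    this.children =
        Collections.unmodifiableList(Arrays.asList(children.clone()));
  }

  String label() {
    return label;
  }

  List<ImmutableNode> children() {
    return children;
  }

  String render() {
    StringBuilder sb = new StringBuilder(label);
    if (!children.isEmpty()) {
      sb.append('(');
      for (int i = 0; i < children.size(); i++) {
        if (i > 0) {
          sb.append(',');
        }
        sb.append(children.get(i).render());
      }
      sb.append(')');
    }
    return sb.toString();
  }
}

final class AliasDemo {
  public static void main(String[] args) {
    ImmutableNode shared = new ImmutableNode("p", new ImmutableNode("text"));
    // The same subtree instance is a child of two different trees.
    ImmutableNode treeA = new ImmutableNode("div", shared);
    ImmutableNode treeB = new ImmutableNode("span", shared);

    System.out.println(treeA.render()); // prints "div(p(text))"
    System.out.println(treeB.render()); // prints "span(p(text))"
    // Both trees hold the identical object; neither can change it.
    System.out.println(treeA.children().get(0) == treeB.children().get(0));
  }
}
```

With a mutable node, the same sharing would let an edit made through one
tree silently show up in the other.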

Ihab

-- 
Ihab A.B. Awad, Palo Alto, CA

Re: Content Rewriter Modularization: Design/Change

Posted by Ben Laurie <be...@google.com>.
[+google-caja-discuss]

On Thu, Aug 7, 2008 at 9:27 PM, John Hjelmstad <fa...@google.com> wrote:
> On Thu, Aug 7, 2008 at 3:20 AM, Ben Laurie <be...@google.com> wrote:
>
>> On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <fa...@google.com> wrote:
>> > This proposal effectively enables the renderer to become a multi-pass
>> > compiler for gadget content (essentially, arbitrary web content). Such a
>> > compiler can provide several benefits: static optimization of gadget
>> content
>> > (auto-proxying of images, whitespace/comment removal, consolidation of
>> CSS
>> > blocks), security benefits (caja et al), new functionality (annotation of
>> > content for stats, document analysis, container-specific features), etc.
>> To
>> > my knowledge no such infrastructure exists today (with the possible
>> > exception of Caja itself, which I'd like to dovetail with this work).
>>
>> Caja clearly provides a large chunk of the code you'd need for this.
>> I'd like to hear how we'd manage to avoid duplication between the two
>> projects.
>>
>> A generalised framework for manipulating content sounds like a great
>> idea, but probably should not live in either of the two projects (Caja
>> and Shindig) but rather should be shared by both of them, I suspect.
>
>
> I agree on both counts. As I mentioned, the piece of this idea that I expect
> to change the most is the parse tree, and Caja's .parser.html and
> .parser.css packages contain much of what I've thrown in here as a base.
>
> My key requirements are:
> * Lightweight framework.
> * Parser modularity, mostly for HTML parsers (to re-use the good work done
> by WebKit or Gecko.. CSS/JS can come direct from Caja I'd bet)
> * Automatic maintenance of DOM<->String conversion.
> * Easy to manipulate structure.

I'm not sure what the value of parser modularity is? If the resulting
tree is different, then that's a problem for people processing the
tree. And if it is not, then why do we care?

>
> I'd love to see both projects share the same base syntax tree
> representations. I considered .parser.html(.DomTree) and .parser.css for
> these, but at the moment these appeared to be a little more tied to Caja's
> lexer/parser implementation than I preferred (though I admit
> AbstractParseTreeNode contains most of what's needed).
>
> To be sure, I don't see this as an end-all-be-all transformation system in
> any way. I'd just like to put *something* reasonable in place that we can
> play with, provide some benefit, and enhance into a truly sophisticated
> vision of document rewriting.
>
>
>>
>>
>> >  c. Add Gadget.getParsedContent().
>> >    i. Returns a mutable GadgetContentParseTree used to manipulate Gadget
>> > Contents.
>> >    ii. Mutable tree calls back to the Gadget object indicating when any
>> > change is made, and emits an error if setContent() has been called in the
>> > interim.
>>
>> In Caja we have been moving towards immutable trees...
>
>
> Interested to hear more about this. The whole idea is for the gadget's tree
> representation to be modifiable. Doing that with immutable trees to me
> suggests that a rewriter would have to create a completely new tree and set
> it as a representation of new content. That's convenient as far as the
> Gadget's maintenance of String<->Tree representations is concerned... but
> seems pretty heavyweight for many types of edits: in-situ modifications of
> text, content reordering, etc. That's particularly so in a single-threaded
> (viz rewriting) environment.

Never having been entirely sold on the concept, I'll let those on the
Caja team who advocate immutability explain why.

Re: Content Rewriter Modularization: Design/Change

Posted by John Hjelmstad <fa...@google.com>.
FYI I've created JIRA issues for each of the steps involved in this change:
https://issues.apache.org/jira/browse/SHINDIG-499 (with various attached
sub-tasks).

I'll try to keep these up to date if/when the overall structure of this
mechanism changes per conversation. In the meantime, the sub-tasks will be
useful for attaching CLs, code review commentary, and the like.

--John

On Thu, Aug 7, 2008 at 1:27 PM, John Hjelmstad <fa...@google.com> wrote:

> On Thu, Aug 7, 2008 at 3:20 AM, Ben Laurie <be...@google.com> wrote:
>
>> On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <fa...@google.com> wrote:
>> > This proposal effectively enables the renderer to become a multi-pass
>> > compiler for gadget content (essentially, arbitrary web content). Such a
>> > compiler can provide several benefits: static optimization of gadget
>> content
>> > (auto-proxying of images, whitespace/comment removal, consolidation of
>> CSS
>> > blocks), security benefits (caja et al), new functionality (annotation
>> of
>> > content for stats, document analysis, container-specific features), etc.
>> To
>> > my knowledge no such infrastructure exists today (with the possible
>> > exception of Caja itself, which I'd like to dovetail with this work).
>>
>> Caja clearly provides a large chunk of the code you'd need for this.
>> I'd like to hear how we'd manage to avoid duplication between the two
>> projects.
>>
>> A generalised framework for manipulating content sounds like a great
>> idea, but probably should not live in either of the two projects (Caja
>> and Shindig) but rather should be shared by both of them, I suspect.
>
>
> I agree on both counts. As I mentioned, the piece of this idea that I
> expect to change the most is the parse tree, and Caja's .parser.html and
> .parser.css packages contain much of what I've thrown in here as a base.
>
> My key requirements are:
> * Lightweight framework.
> * Parser modularity, mostly for HTML parsers (to re-use the good work done
> by WebKit or Gecko.. CSS/JS can come direct from Caja I'd bet)
> * Automatic maintenance of DOM<->String conversion.
> * Easy to manipulate structure.
>
> I'd love to see both projects share the same base syntax tree
> representations. I considered .parser.html(.DomTree) and .parser.css for
> these, but at the moment these appeared to be a little more tied to Caja's
> lexer/parser implementation than I preferred (though I admit
> AbstractParseTreeNode contains most of what's needed).
>
> To be sure, I don't see this as an end-all-be-all transformation system in
> any way. I'd just like to put *something* reasonable in place that we can
> play with, provide some benefit, and enhance into a truly sophisticated
> vision of document rewriting.
>
>
>>
>>
>> >  c. Add Gadget.getParsedContent().
>> >    i. Returns a mutable GadgetContentParseTree used to manipulate Gadget
>> > Contents.
>> >    ii. Mutable tree calls back to the Gadget object indicating when any
>> > change is made, and emits an error if setContent() has been called in
>> the
>> > interim.
>>
>> In Caja we have been moving towards immutable trees...
>
>
> Interested to hear more about this. The whole idea is for the gadget's tree
> representation to be modifiable. Doing that with immutable trees to me
> suggests that a rewriter would have to create a completely new tree and set
> it as a representation of new content. That's convenient as far as the
> Gadget's maintenance of String<->Tree representations is concerned... but
> seems pretty heavyweight for many types of edits: in-situ modifications of
> text, content reordering, etc. That's particularly so in a single-threaded
> (viz rewriting) environment.
>
> --John
>

Re: Content Rewriter Modularization: Design/Change

Posted by John Hjelmstad <fa...@google.com>.
On Thu, Aug 7, 2008 at 3:20 AM, Ben Laurie <be...@google.com> wrote:

> On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <fa...@google.com> wrote:
> > This proposal effectively enables the renderer to become a multi-pass
> > compiler for gadget content (essentially, arbitrary web content). Such a
> > compiler can provide several benefits: static optimization of gadget
> content
> > (auto-proxying of images, whitespace/comment removal, consolidation of
> CSS
> > blocks), security benefits (caja et al), new functionality (annotation of
> > content for stats, document analysis, container-specific features), etc.
> To
> > my knowledge no such infrastructure exists today (with the possible
> > exception of Caja itself, which I'd like to dovetail with this work).
>
> Caja clearly provides a large chunk of the code you'd need for this.
> I'd like to hear how we'd manage to avoid duplication between the two
> projects.
>
> A generalised framework for manipulating content sounds like a great
> idea, but probably should not live in either of the two projects (Caja
> and Shindig) but rather should be shared by both of them, I suspect.


I agree on both counts. As I mentioned, the piece of this idea that I expect
to change the most is the parse tree, and Caja's .parser.html and
.parser.css packages contain much of what I've thrown in here as a base.

My key requirements are:
* Lightweight framework.
* Parser modularity, mostly for HTML parsers (to re-use the good work done
by WebKit or Gecko; CSS/JS can come directly from Caja, I'd bet)
* Automatic maintenance of DOM<->String conversion.
* Easy to manipulate structure.

I'd love to see both projects share the same base syntax tree
representations. I considered .parser.html(.DomTree) and .parser.css for
these, but at the moment these appeared to be a little more tied to Caja's
lexer/parser implementation than I preferred (though I admit
AbstractParseTreeNode contains most of what's needed).

To be sure, I don't see this as an end-all-be-all transformation system in
any way. I'd just like to put *something* reasonable in place that we can
play with, provide some benefit, and enhance into a truly sophisticated
vision of document rewriting.


>
>
> >  c. Add Gadget.getParsedContent().
> >    i. Returns a mutable GadgetContentParseTree used to manipulate Gadget
> > Contents.
> >    ii. Mutable tree calls back to the Gadget object indicating when any
> > change is made, and emits an error if setContent() has been called in the
> > interim.
>
> In Caja we have been moving towards immutable trees...


Interested to hear more about this. The whole idea is for the gadget's tree
representation to be modifiable. Doing that with immutable trees to me
suggests that a rewriter would have to create a completely new tree and set
it as a representation of new content. That's convenient as far as the
Gadget's maintenance of String<->Tree representations is concerned... but
seems pretty heavyweight for many types of edits: in-situ modifications of
text, content reordering, etc. That's particularly so in a single-threaded
(viz rewriting) environment.
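
For concreteness, the mutable variant quoted above (the tree notifies the
Gadget on every change, and an edit fails if setContent() was called in the
interim) could look roughly like the sketch below. SketchGadget and
SketchTree are hypothetical names, and splitting on spaces stands in for
real HTML parsing; none of this is existing Shindig code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SketchGadget {
  private String content;
  private SketchTree tree;    // null until requested, or after setContent()
  private boolean treeDirty;  // true when the tree has unrendered edits
  private int generation;     // bumped by every setContent() call

  SketchGadget(String content) {
    this.content = content;
  }

  void setContent(String newContent) {
    content = newContent;
    tree = null;              // any previously returned tree is now stale
    treeDirty = false;
    generation++;
  }

  String getContent() {
    if (treeDirty) {
      // Tree -> String conversion happens lazily, only when needed.
      content = tree.render();
      treeDirty = false;
    }
    return content;
  }

  SketchTree getParsedContent() {
    if (tree == null) {
      tree = new SketchTree(this, generation, content);
    }
    return tree;
  }

  // Callback from the tree. Rejects edits made through a tree that was
  // obtained before the most recent setContent() call.
  void onTreeEdit(int treeGeneration) {
    if (treeGeneration != generation) {
      throw new IllegalStateException(
          "setContent() was called after this tree was obtained");
    }
    treeDirty = true;
  }
}

final class SketchTree {
  private final SketchGadget owner;
  private final int generation;
  private final List<String> tokens;

  SketchTree(SketchGadget owner, int generation, String content) {
    this.owner = owner;
    this.generation = generation;
    // Stand-in for parsing: one "node" per space-separated token.
    this.tokens = new ArrayList<>(Arrays.asList(content.split(" ")));
  }

  void setToken(int index, String value) {
    owner.onTreeEdit(generation); // validate and notify before mutating
    tokens.set(index, value);
  }

  String render() {
    return String.join(" ", tokens);
  }
}

final class MutableTreeDemo {
  public static void main(String[] args) {
    SketchGadget gadget = new SketchGadget("hello big world");
    SketchTree tree = gadget.getParsedContent();
    tree.setToken(1, "small");               // in-situ edit
    System.out.println(gadget.getContent()); // prints "hello small world"

    gadget.setContent("fresh content");
    try {
      tree.setToken(0, "x");                 // stale tree: rejected
    } catch (IllegalStateException e) {
      System.out.println("stale edit rejected");
    }
  }
}
```

The generation counter is just one way to detect the "setContent() in the
interim" case; an immutable design would avoid the check entirely at the
cost of building a new tree per edit.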

--John

Re: Content Rewriter Modularization: Design/Change

Posted by Ben Laurie <be...@google.com>.
On Wed, Aug 6, 2008 at 11:34 PM, John Hjelmstad <fa...@google.com> wrote:
> This proposal effectively enables the renderer to become a multi-pass
> compiler for gadget content (essentially, arbitrary web content). Such a
> compiler can provide several benefits: static optimization of gadget content
> (auto-proxying of images, whitespace/comment removal, consolidation of CSS
> blocks), security benefits (caja et al), new functionality (annotation of
> content for stats, document analysis, container-specific features), etc. To
> my knowledge no such infrastructure exists today (with the possible
> exception of Caja itself, which I'd like to dovetail with this work).

Caja clearly provides a large chunk of the code you'd need for this.
I'd like to hear how we'd manage to avoid duplication between the two
projects.

A generalised framework for manipulating content sounds like a great
idea, but probably should not live in either of the two projects (Caja
and Shindig) but rather should be shared by both of them, I suspect.

>  c. Add Gadget.getParsedContent().
>    i. Returns a mutable GadgetContentParseTree used to manipulate Gadget
> Contents.
>    ii. Mutable tree calls back to the Gadget object indicating when any
> change is made, and emits an error if setContent() has been called in the
> interim.

In Caja we have been moving towards immutable trees...