Posted to dev@cocoon.apache.org by Berin Loritsch <bl...@apache.org> on 2003/04/03 15:09:22 UTC

Compiling XML, and its replacement (was Re: [RT] the quest for the perfect template language)

Bertrand Delacretaz wrote:
> Le Mercredi, 2 avr 2003, à 20:25 Europe/Zurich, Stefano Mazzocchi a écrit :
> 
> <snips cause="commenting on some specifics only"/>
> 
>> ...IMHO, the template language which is closer to the optimum is XSLT 
>> but only with one change:
>>
>>  FORGET THE XML SYNTAX!
> 

Imagine using YAML (with a YAML to XML converter), or anything else
you may want.


BTW, my Binary XML project (http://d-haven.org/bxml) has the ability
to compile an XML document into a Java class (I know this is nothing
extraordinary, as XSP has been doing it for years).  However, what is
very different from XSP is the following:

1) No Java file is ever written (it uses BCEL)
2) No Class file *needs* to be written (although it is an option)
3) The original document name and the line numbers are part of the
    generated source code.
    * That means the Stack trace has debug information you can
      use.

As long as your solution can be exposed as an InputSource that is
interpreted as a SAX stream, the BCEL compiler will still work with it.
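For concreteness, here is roughly what the generated replay code amounts
to when written out by hand (the class and method names are my own
illustration, not the actual bxml output): a run of unrolled SAX calls
against a ContentHandler, with a Locator primed from the original
document's name and line numbers.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.LocatorImpl;

// Hand-written equivalent of the bytecode the compiler would emit for:
//   <root><item id="1">hi</item></root>
// The Locator carries the source file name and line numbers, which is
// what lets stack traces point back at the XML document.
class CompiledDoc {
    static void play(ContentHandler h) throws SAXException {
        LocatorImpl loc = new LocatorImpl();
        loc.setSystemId("root.xml");          // original document name
        h.setDocumentLocator(loc);
        h.startDocument();
        loc.setLineNumber(1);
        h.startElement("", "root", "root", new AttributesImpl());
        loc.setLineNumber(2);
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "id", "id", "CDATA", "1");
        h.startElement("", "item", "item", atts);
        char[] text = "hi".toCharArray();
        h.characters(text, 0, text.length);
        h.endElement("", "item", "item");
        loc.setLineNumber(3);
        h.endElement("", "root", "root");
        h.endDocument();
    }
}
```

Wrapping this in an org.xml.sax.XMLReader implementation is mostly
boilerplate around parse().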

There is still a lot to do on the project; the TODO list includes:
* Enable callbacks
   - I am still struggling with how to recognize them in the source
     document.
   - Trying to decide if I want to limit its support as an XInclude
     only approach, or more generic.
* Finish the XMLRepository
   - Uses weak references, so garbage collection is friendly
   - Need to add support for monitoring the source file as an option;
     that way the XMLRepository can update the class file behind the
     scenes.
* Make the compiler extensible so that things like XSP can be included
   seamlessly.
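The weak-reference idea in the XMLRepository item can be sketched like
this (all names here are illustrative, not the actual bxml API): cached
compiled documents survive only while something else still references
them, so the cache never pins memory against the garbage collector.

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

// Illustrative weak-reference cache for compiled documents. A real
// repository would key on the resolved system id and compile on miss.
class XMLRepository {
    private final Map<String, WeakReference<Object>> cache =
            new HashMap<String, WeakReference<Object>>();

    Object get(String systemId) {
        WeakReference<Object> ref = cache.get(systemId);
        return (ref == null) ? null : ref.get();  // null if collected
    }

    void put(String systemId, Object compiledDoc) {
        cache.put(systemId, new WeakReference<Object>(compiledDoc));
    }
}
```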

There are a few good points about Binary XML as it stands right now:
every XML document is defined as an org.xml.sax.XMLReader, so it will
incorporate itself seamlessly with any XML-enabled application (Cocoon
included).  It is not dependent on any outside library, with the one
obvious exception of BCEL.


Re: Compiling XML, and its replacement

Posted by Berin Loritsch <bl...@apache.org>.
Stefano Mazzocchi wrote:
> Berin Loritsch wrote:
> 
>> BTW, My Binary XML project (http://d-haven.org/bxml) has the ability
>> to compile an XML document into a Java Class (I know this is nothing
>> extraordinary as XSP has been doing it for years).  However what is
>> very different from XSP is the following:
>>
>> 1) No Java file is ever written (it uses BCEL)
>> 2) No Class file *needs* to be written (although it is an option)
>> 3) The original document name and the line numbers are part of the
>>    generated source code.
>>    * That means the Stack trace has debug information you can
>>      use.
> 
> 
> This has nothing to do with my rants about how stupid the xslt syntax 
> is, but I would be interested to see actual performances between your 
> approach to compiled xml and mine (the one that currently cocoon uses 
> for the cache).

I understand.  However, if the language generates SAX events, it works
pretty well to compile the non-XML representation.

> This is in light of the compiled/interpreted question. I think it would 
> be kind of cool to have a serious benchmark between the two approaches, 
> because this would be very helpful on the recent discussion on XSP.

It would be.  However, the real performance gain that I am seeking
is in the CallBack facility, which has only been partially defined
and implemented.

> I mean, the fact that you aren't generating the source code, well, 
> that's nice from an implementation perspective (I thought about making a 
> BCEL java assembler for XSP but then stopped because it was simply too 
> complex and not really needed now that we have the superb eclipse 
> compiler) but it's nothing different from XSP.

The major difference between what I have and XSP is that we actually
have line numbers that correspond to the original XML document :)
That makes debugging 100% easier.

> So, you are, in fact, unrolling a big loop of sax events, while my xml 
> compilation approach works at serializing the sax events in an 
> easy-to-be-parsed-later binary format.

For now, yes.  Eventually everything gets called the same way--so it
would be interesting to see authoritatively whether HotSpot would have
any reason to perform some optimizations.

> I would expect my approach to be faster than yours on modern virtual 
> machines because hotspot can optimize the tight binary SAX reparsing 
> code, while your approach will never reveal any hotspots.

There's only one way to find out.  To keep things as similar as
possible, both sources need to be compiled and in memory.  Then we
need to read the documents.

The interesting benchmarks are server load and scalability.  Just
because something is technically faster doesn't mean it scales well.

> I'll also be interested to see how different the performance gets on 
> hotspot server/client and how much it changes with several subsequent runs

Hmm. Anyone know of an XML parser benchmark out there?

> Too bad I don't have any time for this right now... but if you want to 
> do it, it would be *extremely* useful not only for you but also for the 
> future of compiled/interpreted stuff in java on the server side.

The problem is time.  It took me a year to find time for Binary XML,
and I only got it this far.  Who knows, I will probably get to it
eventually, but if someone already knows of a benchmark, they could
run it too.


Re: [RT:Long] Initial Results and comments (was Re: Compiling XML, and its replacement)

Posted by Berin Loritsch <bl...@apache.org>.
Stefano Mazzocchi wrote:
>> Considering we have a 5:1 size to time scaling ratio, it would be
>> interesting to see if it carries out to a much larger XML file--
>> if only I had one.  If scalability was linear, then a 1,580,000
>> byte file should only take .23 ms to parse.
> 
> 
> Are you aware of the fact that no Java method can be greater than 
> 64KB of bytecode? And I'm also sure there is a limit on how many methods 
> a Java class can have.
> 
> So, at the very end, you have a top-size limit on how big your 
> compiled-in-memory object can be.

Absolutely.  However, this is a stepping stone.  I haven't begun to look
at compiler optimizations yet.  I am trying to get the interface the way
I like it, and the thing to merely function (which I did last night!).

>> In this instance though, I believe that we are dealing with more than
>> just "unrolled loops."  We are dealing with file reading overhead, and
>> interpretation overhead.  Your *compressed* XML addresses the second
>> issue, but in the end I believe it will behave very similarly to my
>> solution.
> 
> 
> Good point. But you are ignoring the fact that all modern operating 
> systems have cached file systems. And, if this were not the case, it 
> would be fairly trivial to implement one underneath a source resolver.

:) And yet certain operations touch the file and incorporate a call to
blocking filesystem code.  Seriously though, once a file is read into
memory, it's all about the parsing and processing.  With my solution
there is nothing to process--it's all been done.

>> Also keep in mind that improvements in the compiler design (far future)
>> can allow for repetitive constructs to be moved into a separate method.
>> For instance, the following XML is highly repetitive:

<snip/>

>>
>> Still allowing for some level of hotspot action.
> 
> 
> I see, also to overcome the 64KB method limitation.

:)  Yep.

>> However, I believe the true power of Binary XML will be with its
>> support for XMLCallBacks and (in the mid term future) decorators.
> 
> 
> Can you elaborate more on this?

I just got this "working" in the sense that it is operational, not
in the sense that it is elegant, or where it needs to be.  Presently
I am using Processing Instructions to represent when a callback is
required.  What I want to do is allow actual XMLFragments to be
converted to callbacks in the compiler.  That would allow direct
support for standards such as XInclude.  Unfortunately, it proved
too difficult for the short term.

For now, what I have working is this:

<test>
   <element withAttribute="true"/>
   <document>Add some text here</document>

   <?include-xml ../../build.xml?>
</test>

When this document is compiled you get the standard SAX events that
you expect, but the processing instruction is compiled as an
XMLCallBack.  This proved to be the easiest thing from an implementation
perspective--but I am open to alternatives.
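The callback contract could look something like this (the interface and
method names are my guesses, not the actual bxml API): on reaching the
processing instruction, the compiled class fires the callback instead of
reporting the PI, and the callback's SAX events are spliced into the
ongoing stream.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Illustrative callback contract: the PI data (here, the include path)
// is handed to the callback along with the live ContentHandler.
interface XMLCallBack {
    void emit(String piData, ContentHandler handler) throws SAXException;
}

class CallBackSite {
    // The generated class would invoke something like this where
    // <?include-xml ../../build.xml?> appeared in the document.
    static void fire(XMLCallBack cb, ContentHandler h) throws SAXException {
        cb.emit("../../build.xml", h);
    }
}
```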

The beauty of this approach is that CallBacks are much easier to
develop than something that works with SAX events on the fly.  I
have to add some more helper classes to make that statement true,
and your compressed XML would most likely be a key element of that.

However the concept is simple.  A document can be boiled down to
the parts that *never* change, and the elements that do change
are represented by easily developed code.  I'm thinking like a
developer, not a script kiddie.

A consequence of the design decisions is that we can never have
[AJX]SP abuses like the following:

<xsp:logic>
   for (int i = 0; i < 10; i++)
   {
</xsp:logic>

   <element/>

<xsp:logic>
   }
</xsp:logic>


That is valid (but *very* poorly written) XSP.  The XML can be boiled
down to things like this:

<html>
   <head><title><?doc-title theme="coco"?></title></head>
   <body>
     <table>
       <tr><td><img src="logo.png"/></td>
           <td><?doc-title theme="coco"?></td></tr>
       <tr rowspan="2"><td><?site-tabs theme="coco"?></td></tr>
     </table>
     <table>
       <tr>
         <td><?site-menu theme="coco"?></td>
         <td><?doc-content theme="coco"?></td>
         <td><?site-tools theme="coco"?></td>
       </tr>
     </table>
   </body>
</html>

Notice the embedded processing instructions?  They would be set
to call certain callback methods which could be used to provide
a common look and feel to all the docs.

The processing instruction would have the callback name (which
will be accessible via the JAR Services mechanism), and the
proper theme is preserved throughout the document.
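The JAR Services mechanism here is the META-INF/services convention: a
jar ships a resource named after the callback interface, listing the
implementation classes one per line.  A minimal resolver might look like
this (the resolver class and the example implementation name are made
up for illustration):

```java
import java.io.BufferedReader;
import java.io.Reader;

// Reads a META-INF/services-style file: blank lines and '#' comments
// are ignored, and the first concrete class name wins.
class CallBackResolver {
    static String firstImpl(Reader serviceFile) throws Exception {
        BufferedReader in = new BufferedReader(serviceFile);
        String line;
        while ((line = in.readLine()) != null) {
            int comment = line.indexOf('#');       // strip comments
            if (comment >= 0) line = line.substring(0, comment);
            line = line.trim();
            if (!line.isEmpty()) return line;      // first entry wins
        }
        return null;
    }
}
```

In practice the file would be loaded via
ClassLoader.getResourceAsStream() and the returned name instantiated
reflectively.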

It also means that certain things like the menu, tabs, content,
and tools can have the same logic but apply the specified
decorator (could be XSLTC, or could be something else).

The pipeline for this would be very simple:

<site:match pattern="*.html">
   <site:act name="choose-doc" source="{1}"/>
   <site:generate type="bxml" source="coco.xml"/>
   <site:serialize/>
</site:match>

It's been a while, so I apologize if my sitemap logic is off.

But notice that there is no need for a transformer?

There is a lot of work to make my vision happen, but it should be
much more natural for developers to work with than trying to write
a transformer to intercept certain logic.  The code that the developer
would have to write would be much more compact and readable as well.

To make this a reality, the XMLRepository needs to be modified to
allow temporary storage of XMLFragments, and the compiler needs to
be altered to allow for different compilation strategies (e.g.
optimizing for fragments).

Anyway, hopefully you will see some advantages in the approach.


>> The decorator concept will allow us to set a series of SAX events
>> for a common object.  This will render the XSLT stage a moot point
>> as we can apply pre-styled decorators to the same set of objects.
> 
> 
> Isn't this what a translet (an xsltc-compiled XSLT stylesheet) was 
> supposed to be?

You would know better.  However, what I was thinking of is something
more along these lines:

interface XMLDecorator
{
     void transform( Object o, ContentHandler handler );
}

In a directory renderer callback I might have code like this:

class DirectoryCallBack
{
     // exclude all the init code

     XMLFragment process( Properties props )
     {
         File dir = new File( props.getProperty("dir") );
         CompressedFragment xml = new CompressedFragment();

         m_fileDecorator.transform( dir, xml.contentHandler() );

         return xml;
     }
}

The callback code is pretty simple.  I can easily create the callback
and delegate the actual representation of the object to the decorator.
The object can be represented as XHTML directly, and it would be
embedded in the proper location.
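A hypothetical decorator for that sketch might render a File directly as
an XHTML fragment.  The interface is repeated here so the example is
self-contained; none of these names are the actual bxml API.

```java
import java.io.File;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Illustrative decorator contract, as sketched in the text above.
interface XMLDecorator {
    void transform(Object o, ContentHandler handler) throws SAXException;
}

// Renders a File as an XHTML <li> carrying its name; the events go
// straight to whatever ContentHandler the callback hands over.
class FileNameDecorator implements XMLDecorator {
    public void transform(Object o, ContentHandler h) throws SAXException {
        File f = (File) o;
        h.startElement("", "li", "li", new AttributesImpl());
        char[] name = f.getName().toCharArray();
        h.characters(name, 0, name.length);
        h.endElement("", "li", "li");
    }
}
```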

> Anyway, I'm happy to see new approaches to xml generation being researched.

I had the concept a long time ago, and I think it could fit quite well
into Cocoon.  My goal is to replace XSP with a more programmer-friendly
alternative--not to make Cocoon obsolete.


Re: [RT:Long] Initial Results and comments (was Re: Compiling XML, and its replacement)

Posted by Stefano Mazzocchi <st...@apache.org>.
Berin Loritsch wrote:
> Stefano Mazzocchi wrote:
> 
>> I'll also be interested to see how different the performance gets on 
>> hotspot server/client and how much it changes with several subsequent 
>> runs.
> 
> 
> Well, with HotSpot client and a 15.4 KB (15,798 bytes) test document
> (my build.xml file), I got the following results:
> 
>      [junit] Parsed 873557 times in 10005ms
>      [junit] Average of 0.011453173633775472ms per parse
> 
> Compare that to a much smaller 170 byte test document:
> 
>      [junit] Parsed 16064210 times in 10004ms
>      [junit] Average of 6.227508231030347E-4ms per parse
> 
> 
> The two documents are at completely different complexities,
> but the ratio of results is:
> 
>      170b      .000623ms
>   --------- = -----------
>    15,800b      .0115ms
> 
> That's a size increase of 92.9 times
> 
> compared to a time increase of 18.5 times
> 
> 
> Times were comparable to Server Hotspot for this solution--although it
> was only run for 10 seconds.
> 
> Considering we have a 5:1 size to time scaling ratio, it would be
> interesting to see if it carries out to a much larger XML file--
> if only I had one.  If scalability was linear, then a 1,580,000
> byte file should only take .23 ms to parse.

Are you aware of the fact that no Java method can be greater than 
64KB of bytecode? And I'm also sure there is a limit on how many methods 
a Java class can have.

So, at the very end, you have a top-size limit on how big your 
compiled-in-memory object can be.

> I also tried the test with the -Xint (interpreted mode only) option
> set, and there was no appreciable difference.  As best I can tell,
> this is largely because the code is already as optimized as it
> possibly can be.  This fits in line with your observations of unrolled
> "loops".

Yep.

> In this instance though, I believe that we are dealing with more than
> just "unrolled loops."  We are dealing with file reading overhead, and
> interpretation overhead.  Your *compressed* XML addresses the second
> issue, but in the end I believe it will behave very similarly to my
> solution.

Good point. But you are ignoring the fact that all modern operating 
systems have cached file systems. And, if this were not the case, it 
would be fairly trivial to implement one underneath a source resolver.

> Also keep in mind that improvements in the compiler design (far future)
> can allow for repetitive constructs to be moved into a separate method.
> For instance, the following XML is highly repetitive:
> 
> <demo>
>    <entry name="foo">
>      bar
>    </entry>
>    <entry name="foo">
>      bar
>    </entry>
>    <entry name="foo">
>      bar
>    </entry>
>    <entry name="foo">
>      bar
>    </entry>
>    <entry name="foo">
>      bar
>    </entry>
>    <entry name="foo">
>      bar
>    </entry>
> </demo>
> 
> As documents become very large it becomes critical to do something
> other than my very simplistic compilation.  However there are plenty
> of opportunities to optimize the XML compiler.  For example, we could
> easily reduce the above XML to something along the lines of:
> 
> startElement("demo")
> 
> for (int i = 0; i < 6; i++)
> {
>      outputEntry()
> }
> 
> endElement("demo")
> 
> Even if the attribute values and element values were different,
> but the same structure remained, the compiler would be able
> to (theoretically) reduce it to a method with parameters:
> 
> startElement("demo")
> 
> outputEntry("foo", "bar");
> outputEntry("ego", "centric");
> outputEntry("gas", "bag");
> outputEntry("I", "am");
> outputEntry("just", "kidding");
> outputEntry("my", "peeps");
> 
> endElement("demo")
> 
> Still allowing for some level of hotspot action.

I see, also to overcome the 64KB method limitation.

> However, I believe the true power of Binary XML will be with its
> support for XMLCallBacks and (in the mid term future) decorators.

Can you elaborate more on this?

> The decorator concept will allow us to set a series of SAX events
> for a common object.  This will render the XSLT stage a moot point
> as we can apply pre-styled decorators to the same set of objects.

Isn't this what a translet (an xsltc-compiled XSLT stylesheet) was 
supposed to be?

Anyway, I'm happy to see new approaches to xml generation being researched.

Stefano.


[RT:Long] Initial Results and comments (was Re: Compiling XML, and its replacement)

Posted by Berin Loritsch <bl...@apache.org>.
Stefano Mazzocchi wrote:
> I'll also be interested to see how different the performance gets on 
> hotspot server/client and how much it changes with several subsequent runs.

Well, with HotSpot client and a 15.4 KB (15,798 bytes) test document
(my build.xml file), I got the following results:

      [junit] Parsed 873557 times in 10005ms
      [junit] Average of 0.011453173633775472ms per parse

Compare that to a much smaller 170 byte test document:

      [junit] Parsed 16064210 times in 10004ms
      [junit] Average of 6.227508231030347E-4ms per parse


The two documents are of completely different complexities,
but the ratio of results is:

      170b      .000623ms
   --------- = -----------
    15,800b      .0115ms

That's a size increase of 92.9 times

compared to a time increase of 18.5 times


Times were comparable under Server HotSpot for this solution--although
it was only run for 10 seconds.

Considering we have a 5:1 size to time scaling ratio, it would be
interesting to see if it carries out to a much larger XML file--
if only I had one.  If scalability was linear, then a 1,580,000
byte file should only take .23 ms to parse.
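For reference, numbers of this shape come out of a loop like the
following; this harness is my own reconstruction of the idea (the real
test ran under JUnit), not the actual test code:

```java
// Replay the compiled document against a handler for a fixed
// wall-clock budget, then report the average time per replay --
// the same "parsed N times in M ms" shape as the figures above.
class ParseBench {
    static double averageMillis(Runnable parse, long budgetMillis) {
        long start = System.currentTimeMillis();
        long count = 0;
        while (System.currentTimeMillis() - start < budgetMillis) {
            parse.run();   // one full replay of the compiled document
            count++;
        }
        long elapsed = System.currentTimeMillis() - start;
        return (double) elapsed / count;
    }
}
```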

I also tried the test with the -Xint (interpreted mode only) option
set, and there was no appreciable difference.  As best I can tell,
this is largely because the code is already as optimized as it
possibly can be.  This is in line with your observations about unrolled
"loops".

In this instance though, I believe that we are dealing with more than
just "unrolled loops."  We are dealing with file reading overhead, and
interpretation overhead.  Your *compressed* XML addresses the second
issue, but in the end I believe it will behave very similarly to my
solution.

Also keep in mind that improvements in the compiler design (far future)
can allow for repetitive constructs to be moved into a separate method.
For instance, the following XML is highly repetitive:

<demo>
    <entry name="foo">
      bar
    </entry>
    <entry name="foo">
      bar
    </entry>
    <entry name="foo">
      bar
    </entry>
    <entry name="foo">
      bar
    </entry>
    <entry name="foo">
      bar
    </entry>
    <entry name="foo">
      bar
    </entry>
</demo>

As documents become very large it becomes critical to do something
other than my very simplistic compilation.  However there are plenty
of opportunities to optimize the XML compiler.  For example, we could
easily reduce the above XML to something along the lines of:

startElement("demo")

for (int i = 0; i < 6; i++)
{
      outputEntry()
}

endElement("demo")

Even if the attribute values and element values were different,
but the same structure remained, the compiler would be able
to (theoretically) reduce it to a method with parameters:

startElement("demo")

outputEntry("foo", "bar");
outputEntry("ego", "centric");
outputEntry("gas", "bag");
outputEntry("I", "am");
outputEntry("just", "kidding");
outputEntry("my", "peeps");

endElement("demo")

Still allowing for some level of hotspot action.
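Written out as real SAX calls, that factored form would look roughly
like this (a sketch of the optimization, not actual compiler output;
only the first three entries are shown).  Each repeated structure
becomes one parameterized method instead of six unrolled event runs,
which also keeps every method well under the 64KB bytecode cap:

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Factored emitter for the repetitive <demo> document: the shared
// <entry name="...">...</entry> structure lives in one method.
class DemoEmitter {
    static void emit(ContentHandler h) throws SAXException {
        h.startElement("", "demo", "demo", new AttributesImpl());
        outputEntry(h, "foo", "bar");
        outputEntry(h, "ego", "centric");
        outputEntry(h, "gas", "bag");
        h.endElement("", "demo", "demo");
    }

    static void outputEntry(ContentHandler h, String name, String text)
            throws SAXException {
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "name", "name", "CDATA", name);
        h.startElement("", "entry", "entry", atts);
        char[] c = text.toCharArray();
        h.characters(c, 0, c.length);
        h.endElement("", "entry", "entry");
    }
}
```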

However, I believe the true power of Binary XML will be with its
support for XMLCallBacks and (in the mid term future) decorators.
The decorator concept will allow us to set a series of SAX events
for a common object.  This will render the XSLT stage a moot point
as we can apply pre-styled decorators to the same set of objects.
These will call for some alterations of the compiler as it stands
now, and will be required before a 1.0 release.

I am trying to keep the library lean and mean.



Re: Compiling XML, and its replacement

Posted by Stefano Mazzocchi <st...@apache.org>.
Berin Loritsch wrote:

> BTW, My Binary XML project (http://d-haven.org/bxml) has the ability
> to compile an XML document into a Java Class (I know this is nothing
> extraordinary as XSP has been doing it for years).  However what is
> very different from XSP is the following:
> 
> 1) No Java file is ever written (it uses BCEL)
> 2) No Class file *needs* to be written (although it is an option)
> 3) The original document name and the line numbers are part of the
>    generated source code.
>    * That means the Stack trace has debug information you can
>      use.

This has nothing to do with my rants about how stupid the xslt syntax 
is, but I would be interested to see actual performances between your 
approach to compiled xml and mine (the one that currently cocoon uses 
for the cache).

This is in light of the compiled/interpreted question. I think it would 
be kind of cool to have a serious benchmark between the two approaches, 
because this would be very helpful on the recent discussion on XSP.

I mean, the fact that you aren't generating the source code, well, 
that's nice from an implementation perspective (I thought about making a 
BCEL java assembler for XSP but then stopped because it was simply too 
complex and not really needed now that we have the superb eclipse 
compiler) but it's nothing different from XSP.

So, you are, in fact, unrolling a big loop of sax events, while my xml 
compilation approach works at serializing the sax events in an 
easy-to-be-parsed-later binary format.

I would expect my approach to be faster than yours on modern virtual 
machines because hotspot can optimize the tight binary SAX reparsing 
code, while your approach will never reveal any hotspots.

I'll also be interested to see how different the performance gets on 
hotspot server/client and how much it changes with several subsequent runs.

Too bad I don't have any time for this right now... but if you want to 
do it, it would be *extremely* useful not only for you but also for the 
future of compiled/interpreted stuff in java on the server side.

Stefano.