Posted to dev@cocoon.apache.org by Bruno Dumon <br...@outerthought.org> on 2003/02/22 16:43:09 UTC

XML/HTML serializers buffering everything and using threads.

Just found out about some suboptimal serializer things in the current
Cocoon CVS, and since I didn't find anything relevant in the mail
archives, I thought I'd just come in and complain.

There are actually 2 things:

* the current XML/HTML serializers are performing identity XSL
transforms (with a real stylesheet!), instead of just serializing. This
means that they require the building of a complete DTM-tree, effectively
creating a buffer of all SAX-events entering the serializer. From the
CVS logs, I see this has been introduced to work around a bug occurring
in (of all things) the SourceWritingTransformer, though I did not find
what exactly.

* by default, cocoon is configured with "incremental-processing" for
XSL's set to true (this only applies when using xalan, not when using
xsltc). Since Xalan manages this setting in a static variable, this
property is shared for all XSL transformers in Cocoon. Since the
serializer is also performing an XSL transform, it also applies to the
serializer. Incremental processing is achieved by performing the
transform on a separate thread. This means that a simple pipeline
containing a (xalan) XSLT transform and a HTML serializer will already
use 3 threads (the original request-dispatching thread, and an
additional thread for each transform). (and the threads created by xalan
are not pooled).
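
To make the contrast with an identity stylesheet concrete, here is a
minimal sketch of "just serializing" using only standard JAXP calls. It
assumes that a TransformerHandler created without a stylesheet passes
SAX events through to its Result as they arrive; the class name, method
name and sample events are invented for illustration, and this is not
the code found in Cocoon's serializers.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical illustration class, not part of Cocoon.
class StreamingSerializerSketch {

    // Feeds a few SAX events into a stylesheet-less handler and returns the text.
    static String serializeSample() {
        try {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            // No Source/Templates argument: an identity handler, no stylesheet involved.
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            StringWriter out = new StringWriter();
            handler.setResult(new StreamResult(out));

            // In a pipeline these events would come from the previous component.
            handler.startDocument();
            handler.startElement("", "page", "page", new AttributesImpl());
            char[] text = "hello".toCharArray();
            handler.characters(text, 0, text.length);
            handler.endElement("", "page", "page");
            handler.endDocument();
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serializeSample());
    }
}
```

Whether a given TrAX implementation really streams in this mode is
something to verify per processor; the sketch only shows the API shape.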

So what I would propose is:

* if the xsl-ing serializer workaround is only required by the SWT, let's
make this behaviour configurable and make the default serializers not
use it.

* and possibly, let's set incremental-processing to false by
default. It has an advantage if and only if the processed XML is rather
large, and the stylesheet is written in such a way that it can actually
be performed incrementally (for most XSL's I've seen in practical use,
this is not the case).

* finally (or better, first of all), let's see whether the serializer
problems can be solved in xalan.
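
For reference, a hedged sketch of switching the incremental feature off
programmatically. The attribute URI is the one interpretive Xalan
exposes as FEATURE_INCREMENTAL; that Cocoon's "incremental-processing"
setting ends up as exactly this call is an assumption, and the class
and method names are invented. Other TrAX implementations (xsltc, the
JDK's built-in processor) are expected to reject the attribute, which
the sketch treats as "nothing to disable".

```java
import javax.xml.transform.TransformerFactory;

// Hypothetical helper, not part of Cocoon or Xalan.
class IncrementalSettingSketch {

    // Feature URI used by interpretive Xalan for incremental transforms.
    static final String XALAN_INCREMENTAL =
            "http://xml.apache.org/xalan/features/incremental";

    // Returns true if the factory accepted the attribute, false if it does not know it.
    static boolean disableIncremental(TransformerFactory factory) {
        try {
            factory.setAttribute(XALAN_INCREMENTAL, Boolean.FALSE);
            return true;
        } catch (IllegalArgumentException e) {
            // Not interpretive Xalan: the feature does not exist for this processor.
            return false;
        }
    }

    public static void main(String[] args) {
        TransformerFactory factory = TransformerFactory.newInstance();
        System.out.println("incremental disabled: " + disableIncremental(factory));
    }
}
```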

Thoughts?

-- 
Bruno Dumon                             http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
bruno@outerthought.org


Re: XML/HTML serializers buffering everything and using threads.

Posted by Stefano Mazzocchi <st...@apache.org>.
Bruno Dumon wrote:
> Just found out about some suboptimal serializer things in the current
> Cocoon CVS, and since I didn't find anything relevant in the mail
> archives, I thought I'd just come in and complain.

Pier was telling me something equivalent yesterday. Good timing!

> There are actually 2 things:
> 
> * the current XML/HTML serializers are performing identity XSL
> transforms (with a real stylesheet!), instead of just serializing. This
> means that they require the building of a complete DTM-tree, effectively
> creating a buffer of all SAX-events entering the serializer. From the
> CVS logs, I see this has been introduced to work around a bug occuring
> in (of all things) the SourceWritingTransformer, though I did not find
> what exactly.

Holy shit! Didn't know that!

> * by default, cocoon is configured with "incremental-processing" for
> XSL's set to true (this only applies when using xalan, not when using
> xsltc). Since Xalan manages this setting in a static variable, this
> property is shared for all XSL transformers in Cocoon. Since the
> serializer is also performing an XSL transform, it also applies to the
> serializer. Incremental processing is achieved by performing the
> transform on a separate thread. This means that a simple pipeline
> containing a (xalan) XSLT transform and a HTML serializer will already
> use 3 threads (the original request-dispatching thread, and an
> additional thread for each transform). (and the threads created by xalan
> are not pooled).

Oh dear. Talking about hotspots.

> So what I would propose is:
> 
> * if the xsl-ing serializer workaround is only required by the SWT, lets
> make this behaviour configurable and make the default serializers not
> use it.

Totally +1. Even better, identify what's wrong with the 
SourceWritingTransformer and fix that without expensive workarounds!

> * and possibly, lets set the incremental-processing to false per
> default. It has an advantage if and only if the processed XML is rather
> large, and the stylesheet is written is such a way that it can actually
> be performed incrementally (for most XSL's I've seen in practical use,
> this is not the case).

Agreed. +1

> * finally (or better first of all), lets look if the serializer problems
> cannot be solved in xalan.

+10

Thanks for keeping us sane on this!

-- 
Stefano Mazzocchi                               <st...@apache.org>
    Pluralitas non est ponenda sine necessitate [William of Ockham]
--------------------------------------------------------------------



Re: XML/HTML serializers buffering everything and using threads.

Posted by Bruno Dumon <br...@outerthought.org>.
On Mon, 2003-02-24 at 18:06, Sylvain Wallez wrote:
[...]
> 
> Guys,
> 
> I added a fix in AbstractTextSerializer ages ago in this area : it adds 
> in front of the IdentityTransform (it was before the current workaround) 
> a "NamespaceAsAttributes" XMLPipe that. adds namespace declarations as 
> attributes on the fly, without requiring building the full DOM tree.
> 
> Note also that this XMLPipe is added only if needed (there are some 
> init-time checks) since some other XSLT processors handle this correctly 
> (namely Saxon).

the logs say:

DEBUG   (2003-02-22) 14:44.42:438   [sitemap.serializer.xml]
(/cocoon/threadtest/test2) Thread-11/AbstractTextSerializer: Trax
handler org.apache.xalan.transformer.TransformerHandlerImpl handles
correctly namespaces.

so apparently that specific thing is fixed in the current Xalan.

The problem we're facing here though originates with dom-trees: suppose
there is a dom element with a prefix and a namespace, but no xmlns
attribute that declares the namespace. The dom->sax code only generates
start/endPrefixMappings for the explicitly declared xmlns attributes.
So in this case there are neither start/endPrefixMappings nor xmlns
attributes, and the NamespaceAsAttributes doesn't seem to fix this.

Currently I'm again more inclined to fix this either on the dom tree
itself (before doing dom->sax), or when doing dom->sax (in the
DOMStreamer, though that code is currently also based on the identity
transformer). Otherwise we would have to derive the namespace prefixes
from the qNames in startElement.
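
A rough sketch of the "fix it on the dom tree itself" option, using
only DOM Level 2 calls. It is deliberately simplified: it looks at
element names only (not at namespaced attributes, and not at
un-namespaced elements below a default namespace), and the class and
method names are made up for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical pre-pass to run on a DOM tree before dom->sax conversion.
class NamespaceFixupSketch {

    static final String XMLNS_URI = "http://www.w3.org/2000/xmlns/";

    // Adds an xmlns attribute wherever an element's namespace is not declared in scope.
    static void declareMissingNamespaces(Element el) {
        String uri = el.getNamespaceURI();
        if (uri != null) {
            String prefix = el.getPrefix();
            String attrName = (prefix == null) ? "xmlns" : "xmlns:" + prefix;
            if (!uri.equals(inScopeDeclaration(el, attrName))) {
                el.setAttributeNS(XMLNS_URI, attrName, uri);
            }
        }
        for (Node c = el.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c instanceof Element) {
                declareMissingNamespaces((Element) c);
            }
        }
    }

    // Walks up the ancestors and returns the nearest declaration for attrName, or null.
    static String inScopeDeclaration(Node n, String attrName) {
        for (; n instanceof Element; n = n.getParentNode()) {
            Element e = (Element) n;
            if (e.hasAttribute(attrName)) {
                return e.getAttribute(attrName);
            }
        }
        return null;
    }

    // Builds a two-element tree without any xmlns attributes and fixes it up.
    static Element buildSample() {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().newDocument();
            Element root = doc.createElementNS("urn:example", "ex:root");
            root.appendChild(doc.createElementNS("urn:example", "ex:child"));
            doc.appendChild(root);
            declareMissingNamespaces(root);
            return root;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("xmlns:ex = " + buildSample().getAttribute("xmlns:ex"));
    }
}
```

The declaration is added once on the outermost element that needs it;
descendants using the same prefix and URI are left alone.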

> 
> Sylvain (on ski vacation ;-)

leave some snow for me (it's my turn next week) ;-)

-- 
Bruno Dumon                             http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
bruno@outerthought.org


Re: XML/HTML serializers buffering everything and using threads.

Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Bruno Dumon wrote:

>On Mon, 2003-02-24 at 13:01, Carsten Ziegeler wrote:
>>>-----Original Message-----
>>>From: Bruno Dumon [mailto:bruno@outerthought.org]
>>>Sent: Monday, February 24, 2003 12:47 PM
>>>To: cocoon-dev@xml.apache.org
>>>Subject: RE: XML/HTML serializers buffering everything and using
>>>threads.
>>>
>>>On Mon, 2003-02-24 at 12:24, Carsten Ziegeler wrote:
>>>[...]
>>>>I personally see this as the right solution... :) I tried to fix this
>>>>bug in Xalan several month ago but failed :( (Ok, I only had three hours
>>>>time to understand whats going on inside Xalan).
>>>>So, best thing in my eyes is to build up some pressure on the xalan team,
>>>>so that they will fix this annoying bug.
>>>
>>>I assume it is this bug you're referring to?
>>>http://nagoya.apache.org/bugzilla/show_bug.cgi?id=5779
>>>
>>>I think that one is a bit in the grey zone: the DOM spec puts the
>>>responsibility for creating correct namespace declaration attributes to
>>>the creator of the document.
>>>
>>The DOM has this extra information
>
>I know, but the point is that the creator of a dom-tree is responsible
>for adding namespace declaration attributes themselves. Though I agree
>that this is something that is impractical and that the serializer
>should be able to add correct namespace declarations.
>
>I've added a comment to bug
>http://nagoya.apache.org/bugzilla/show_bug.cgi?id=1831
>in that regard.

Guys,

I added a fix in AbstractTextSerializer ages ago in this area: it adds,
in front of the IdentityTransform (this was before the current workaround),
a "NamespaceAsAttributes" XMLPipe that adds namespace declarations as
attributes on the fly, without requiring building the full DOM tree.

Note also that this XMLPipe is added only if needed (there are some 
init-time checks) since some other XSLT processors handle this correctly 
(namely Saxon).

Shouldn't that be an answer to this problem?

Sylvain (on ski vacation ;-)

-- 
Sylvain Wallez                                  Anyware Technologies
http://www.apache.org/~sylvain           http://www.anyware-tech.com
{ XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects }



RE: XML/HTML serializers buffering everything and using threads.

Posted by Bruno Dumon <br...@outerthought.org>.
On Mon, 2003-02-24 at 13:01, Carsten Ziegeler wrote:
> > -----Original Message-----
> > From: Bruno Dumon [mailto:bruno@outerthought.org]
> > Sent: Monday, February 24, 2003 12:47 PM
> > To: cocoon-dev@xml.apache.org
> > Subject: RE: XML/HTML serializers buffering everything and using
> > threads.
> >
> >
> > On Mon, 2003-02-24 at 12:24, Carsten Ziegeler wrote:
> > [...]
> > > >
> > > I personally see this as the right solution... :) I tried to fix this
> > > bug in Xalan several month ago but failed :( (Ok, I only had three hours
> > > time to understand whats going on inside Xalan).
> > > So, best thing in my eyes is to build up some pressure on the xalan team,
> > > so that they will fix this annoying bug.
> >
> > I assume it is this bug you're referring to?
> > http://nagoya.apache.org/bugzilla/show_bug.cgi?id=5779
> >
> > I think that one is a bit in the grey zone: the DOM spec puts the
> > responsibility for creating correct namespace declaration attributes to
> > the creator of the document.
> >
> The DOM has this extra information

I know, but the point is that the creator of a dom-tree is responsible
for adding namespace declaration attributes themselves. Though I agree
that this is something that is impractical and that the serializer
should be able to add correct namespace declarations.

I've added a comment to bug
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=1831
in that regard.

>  - and Joe has already identified where
> the problem is (something is not initialized properly I think).


-- 
Bruno Dumon                             http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
bruno@outerthought.org


RE: XML/HTML serializers buffering everything and using threads.

Posted by Carsten Ziegeler <cz...@s-und-n.de>.

> -----Original Message-----
> From: Bruno Dumon [mailto:bruno@outerthought.org]
> Sent: Monday, February 24, 2003 12:47 PM
> To: cocoon-dev@xml.apache.org
> Subject: RE: XML/HTML serializers buffering everything and using
> threads.
>
>
> On Mon, 2003-02-24 at 12:24, Carsten Ziegeler wrote:
> [...]
> > >
> > I personally see this as the right solution... :) I tried to fix this
> > bug in Xalan several month ago but failed :( (Ok, I only had three hours
> > time to understand whats going on inside Xalan).
> > So, best thing in my eyes is to build up some pressure on the xalan team,
> > so that they will fix this annoying bug.
>
> I assume it is this bug you're referring to?
> http://nagoya.apache.org/bugzilla/show_bug.cgi?id=5779
>
> I think that one is a bit in the grey zone: the DOM spec puts the
> responsibility for creating correct namespace declaration attributes to
> the creator of the document.
>
The DOM has this extra information - and Joe has already identified where
the problem is (something is not initialized properly I think).

Carsten


RE: XML/HTML serializers buffering everything and using threads.

Posted by Bruno Dumon <br...@outerthought.org>.
On Mon, 2003-02-24 at 12:24, Carsten Ziegeler wrote:
[...]
> > 
> I personally see this as the right solution... :) I tried to fix this
> bug in Xalan several month ago but failed :( (Ok, I only had three hours
> time to understand whats going on inside Xalan).
> So, best thing in my eyes is to build up some pressure on the xalan team,
> so that they will fix this annoying bug.

I assume it is this bug you're referring to?
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=5779

I think that one is a bit in the grey zone: the DOM spec puts the
responsibility for creating correct namespace declaration attributes to
the creator of the document.

In DOM level 3 there's a method normalizeDocument() which can cleanup
namespace declarations, the algorithm is described here:
http://www.w3.org/TR/2002/WD-DOM-Level-3-Core-20021022/namespaces-algorithms.html#normalizeDocumentAlgo

Putting this extra logic in Xalan's serializer would be unnecessary
overhead for cases where it's not needed. Maybe we could make an
equivalent of the normalizeDocument method that works on a dom level 2
document and call that before serializing the document?
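
For comparison, this is roughly what the DOM Level 3 route looks like
with the org.w3c.dom API as it later shipped in the JDK. It is only a
sketch: the class name and sample tree are invented, and it assumes a
DOM implementation that supports normalizeDocument() with the
"namespaces" parameter enabled, in which case the missing declaration
is expected to be added by the normalization pass.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demonstration class; requires a DOM Level 3 capable implementation.
class NormalizeDocumentSketch {

    static final String XMLNS_URI = "http://www.w3.org/2000/xmlns/";

    // Creates a prefixed element without an xmlns attribute, then normalizes the document.
    static Element buildAndNormalize() {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().newDocument();

            // The element knows its namespace, but nothing declares the prefix.
            Element root = doc.createElementNS("urn:example", "ex:root");
            doc.appendChild(root);

            // "namespaces" asks the normalization pass to do namespace fix-up.
            doc.getDomConfig().setParameter("namespaces", Boolean.TRUE);
            doc.normalizeDocument();
            return root;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element root = buildAndNormalize();
        System.out.println("declared: " + root.hasAttributeNS(XMLNS_URI, "ex"));
    }
}
```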

-- 
Bruno Dumon                             http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
bruno@outerthought.org


RE: XML/HTML serializers buffering everything and using threads.

Posted by Carsten Ziegeler <cz...@s-und-n.de>.

> -----Original Message-----
> From: Bruno Dumon [mailto:bruno@outerthought.org]
> 
> Just found out about some suboptimal serializer things in the current
> Cocoon CVS, and since I didn't find anything relevant in the mail
> archives, I thought I'd just come in and complain.
> 
> There are actually 2 things:
> 
> * the current XML/HTML serializers are performing identity XSL
> transforms (with a real stylesheet!), instead of just serializing. This
> means that they require the building of a complete DTM-tree, effectively
> creating a buffer of all SAX-events entering the serializer. From the
> CVS logs, I see this has been introduced to work around a bug occuring
> in (of all things) the SourceWritingTransformer, though I did not find
> what exactly.
> 
> * by default, cocoon is configured with "incremental-processing" for
> XSL's set to true (this only applies when using xalan, not when using
> xsltc). Since Xalan manages this setting in a static variable, this
> property is shared for all XSL transformers in Cocoon. Since the
> serializer is also performing an XSL transform, it also applies to the
> serializer. Incremental processing is achieved by performing the
> transform on a separate thread. This means that a simple pipeline
> containing a (xalan) XSLT transform and a HTML serializer will already
> use 3 threads (the original request-dispatching thread, and an
> additional thread for each transform). (and the threads created by xalan
> are not pooled).
> 
> So what I would propose is:
> 
> * if the xsl-ing serializer workaround is only required by the SWT, lets
> make this behaviour configurable and make the default serializers not
> use it.
> 
> * and possibly, lets set the incremental-processing to false per
> default. It has an advantage if and only if the processed XML is rather
> large, and the stylesheet is written is such a way that it can actually
> be performed incrementally (for most XSL's I've seen in practical use,
> this is not the case).
> 
> * finally (or better first of all), lets look if the serializer problems
> cannot be solved in xalan.
> 
I personally see this as the right solution... :) I tried to fix this
bug in Xalan several months ago but failed :( (Ok, I only had three hours
to understand what's going on inside Xalan).
So, the best thing in my eyes is to build up some pressure on the xalan team,
so that they will fix this annoying bug.

Carsten