Posted to dev@cocoon.apache.org by Reinhard Pötz <re...@apache.org> on 2008/12/02 12:54:48 UTC
[cocoon3] Stax Pipelines
I've had Stax pipelines on my radar for a rather long time because I
think that Stax can simplify the writing of transformers a lot.
I proposed this idea to Alexander Schatten, an assistant professor at
the Vienna University of Technology and he then proposed it to his
students.
A group of four students agreed to work on this as part of their
studies. Steven and I are coaching this group from October to January
and the goal is to support Stax pipeline components in Cocoon 3.
So far the students learned more about Cocoon 3, Sax, Stax and did some
performance comparisons. This week we've entered the phase where the
students have to work on the actual Stax pipeline implementation.
I asked the students to introduce themselves and also to present the
current ideas of how to implement Stax pipelines. So Andreas, Killian,
Michael and Jakob, the floor is yours!
--
Reinhard Pötz Managing Director, {Indoqa} GmbH
http://www.indoqa.com/en/people/reinhard.poetz/
Member of the Apache Software Foundation
Apache Cocoon Committer, PMC member reinhard@apache.org
________________________________________________________________________
Re: [cocoon3] Stax Pipelines
Posted by Simone Tripodi <si...@gmail.com>.
Hi All,
very nice and interesting thread, congratulations!!! I have exactly the
same feelings: I can't wait to see the first components, but my
curiosity is focused more on seeing how the Stax and Sax components
will be integrated :)
I'm not able to contribute in this area, since I'm not an expert on
Stax, so I can just support you!
Best regards!!!
Simone
2008/12/3 Sylvain Wallez <sy...@apache.org>:
> Steven Dolg wrote:
>>
>> Thorsten Scherler wrote:
>>>
>>> On Wed, 03-12-2008 at 08:56 +0100, Sylvain Wallez wrote:
>>>
>>>>
>>>> Andreas Pieber wrote:
>>>>
>>>
>>> ...
>>>
>>>>>
>>>>> One big "problem" in this approach is that the "flow direction of
>>>>> events" is completely inverted. This means that StAX and SAX components
>>>>> would not be able to work "directly" together. But also in a push-pull
>>>>> approach a conversion between StAX and SAX events have to be done and
>>>>> further more this problem could be tackled by writing a wrapper or adapters
>>>>> around the SAX components and add them to an StAX pipe.
>>>>>
>>>>
>>>> Absolutely. Converting Stax to SAX is fairly trivial, but the other way
>>>> around requires buffering or multithreading. Have you looked at Stax-Utils
>>>> [1]? It contains many classes to ease the SAX <-> Stax translation.
>>>>
>>>>
>>>
>>> I lately played around (and still do) with such approach in the forrest
>>> dispatcher rewrite [3]. I am using Axiom which is a quite interesting
>>> approach and maybe worth looking into [4]. However I did some profiling
>>> and for the dispatcher the old SAX approach had been ways faster.
>>>
>>
>> We started with an evaluation of some StAX implementations including
>> Axiom, WoodStox and the reference implementation.
>> However quite early we felt that the DOM-like approach of Axiom is not
>> ideally suited for our current phase.
>> I'm quite sure that there are occasions where Axiom can be really
>> charming, but I believe there are too many premises required to efficiently
>> use it (e.g. you will want to be sure that the XML data is not too large).
>> But if you have some complex transformation that appear to difficult to be
>> implemented in a one-pass approach Axiom could probably do the trick.
>> I'm sure we will explore this idea at a later time, though...
>
> Axiom is very interesting as the DOM-like structure it provides is the
> easiest way to traverse a document while avoiding full parsing of the
> document. But this comes with a price, since any traversal on a list of
> elements requires to parse all of these elements. So without very careful
> use, it can quickly degenerate into a classical DOM with the associated
> problems, or even worse because of the additional complexity required by
> deferred parsing.
>
> Not to say that Axiom is bad, rather the contrary: it's a powerful weapon
> which you can easily shoot yourself in the foot with :-)
>
>>> However this is due to the buffering issue pointed out by Sylvain which
>>> [5] is not solving at all. Brings me back to do a sax (+stax) approach
>>> again (the other class in the package).
>>>
>>> I am really excited about this thread. :)
>>>
>>
>> I must admit that it got me all excited by now, too.
>
> Should I say me too? ;-)
>
>> Yesterday, I did a very minimalistic POC, just to make sure our current
>> approach is not missing any major point.
>> I have to say I was simply amazed how easy state handling can be when
>> using StAX compared to SAX and I'm very confident that we came up with a
>> pretty thorough concept.
>
> I'm eager to see what it looks like!
>
>> After all, we tortured our poor students more than a month with evaluating
>> implementations, writing use cases - outside Cocoon! - using both SAX and
>> StAX, before even "allowing" them to think about how to integrate this into
>> Cocoon.
>> I believe this was necessary to fully understand the differences between
>> StAX and SAX and - even more important - the different usage patterns
>> associated with them.
>> And I'm sure this allows us now to fully reap the benefits of this API.
>>
>> Well, I can't wait to see the first components ...
>
> Same here! But don't forget in the torture program the important XSLT
> transformer, since I don't know of any implementation that would support
> pull callbacks and thus avoid buffering the output in a Stax pipeline.
>
> Sylvain
>
> --
> Sylvain Wallez - http://bluxte.net
>
>
--
My LinkedIn profile: http://www.linkedin.com/in/simonetripodi
My GoogleCode profile: http://code.google.com/u/simone.tripodi/
My Picasa: http://picasaweb.google.com/simone.tripodi/
My Tube: http://www.youtube.com/user/stripodi
My Del.icio.us: http://del.icio.us/simone.tripodi
Re: [cocoon3] Stax Pipelines
Posted by Sylvain Wallez <sy...@apache.org>.
Steven Dolg wrote:
> Thorsten Scherler wrote:
>> On Wed, 03-12-2008 at 08:56 +0100, Sylvain Wallez wrote:
>>
>>> Andreas Pieber wrote:
>>>
>> ...
>>
>>>> One big "problem" in this approach is that the "flow direction of
>>>> events" is completely inverted. This means that StAX and SAX
>>>> components would not be able to work "directly" together. But also
>>>> in a push-pull approach a conversion between StAX and SAX events
>>>> have to be done and further more this problem could be tackled by
>>>> writing a wrapper or adapters around the SAX components and add
>>>> them to an StAX pipe.
>>>>
>>> Absolutely. Converting Stax to SAX is fairly trivial, but the other
>>> way around requires buffering or multithreading. Have you looked at
>>> Stax-Utils [1]? It contains many classes to ease the SAX <-> Stax
>>> translation.
>>>
>>>
>>
>> I lately played around (and still do) with such approach in the forrest
>> dispatcher rewrite [3]. I am using Axiom which is a quite interesting
>> approach and maybe worth looking into [4]. However I did some profiling
>> and for the dispatcher the old SAX approach had been ways faster.
>>
> We started with an evaluation of some StAX implementations including
> Axiom, WoodStox and the reference implementation.
> However quite early we felt that the DOM-like approach of Axiom is not
> ideally suited for our current phase.
> I'm quite sure that there are occasions where Axiom can be really
> charming, but I believe there are too many premises required to
> efficiently use it (e.g. you will want to be sure that the XML data is
> not too large). But if you have some complex transformation that
> appear to difficult to be implemented in a one-pass approach Axiom
> could probably do the trick.
> I'm sure we will explore this idea at a later time, though...
Axiom is very interesting as the DOM-like structure it provides is the
easiest way to traverse a document while avoiding full parsing of the
document. But this comes at a price, since any traversal over a list of
elements requires parsing all of these elements. So without very
careful use, it can quickly degenerate into a classical DOM with the
associated problems, or even worse because of the additional complexity
required by deferred parsing.
Not to say that Axiom is bad, rather the contrary: it's a powerful
weapon which you can easily shoot yourself in the foot with :-)
>> However this is due to the buffering issue pointed out by Sylvain which
>> [5] is not solving at all. Brings me back to do a sax (+stax) approach
>> again (the other class in the package).
>>
>> I am really excited about this thread. :)
>>
> I must admit that it got me all excited by now, too.
Should I say me too? ;-)
> Yesterday, I did a very minimalistic POC, just to make sure our
> current approach is not missing any major point.
> I have to say I was simply amazed how easy state handling can be when
> using StAX compared to SAX and I'm very confident that we came up with
> a pretty thorough concept.
I'm eager to see what it looks like!
> After all, we tortured our poor students more than a month with
> evaluating implementations, writing use cases - outside Cocoon! -
> using both SAX and StAX, before even "allowing" them to think about
> how to integrate this into Cocoon.
> I believe this was necessary to fully understand the differences
> between StAX and SAX and - even more important - the different usage
> patterns associated with them.
> And I'm sure this allows us now to fully reap the benefits of this API.
>
> Well, I can't wait to see the first components ...
Same here! But don't forget the important XSLT transformer in the
torture program, since I don't know of any implementation that would
support pull callbacks and thus avoid buffering the output in a Stax
pipeline.
Sylvain
--
Sylvain Wallez - http://bluxte.net
Re: [cocoon3] Stax Pipelines
Posted by Steven Dolg <st...@indoqa.com>.
Thorsten Scherler wrote:
> On Wed, 03-12-2008 at 08:56 +0100, Sylvain Wallez wrote:
>
>> Andreas Pieber wrote:
>>
> ...
>
>>> One big "problem" in this approach is that the "flow direction of events" is
>>> completely inverted. This means that StAX and SAX components would not be able
>>> to work "directly" together. But also in a push-pull approach a conversion
>>> between StAX and SAX events have to be done and further more this problem could
>>> be tackled by writing a wrapper or adapters around the SAX components and add
>>> them to an StAX pipe.
>>>
>>>
>> Absolutely. Converting Stax to SAX is fairly trivial, but the other way
>> around requires buffering or multithreading. Have you looked at
>> Stax-Utils [1]? It contains many classes to ease the SAX <-> Stax
>> translation.
>>
>>
>
> I lately played around (and still do) with such approach in the forrest
> dispatcher rewrite [3]. I am using Axiom which is a quite interesting
> approach and maybe worth looking into [4]. However I did some profiling
> and for the dispatcher the old SAX approach had been ways faster.
>
We started with an evaluation of some StAX implementations including
Axiom, WoodStox and the reference implementation.
However quite early we felt that the DOM-like approach of Axiom is not
ideally suited for our current phase.
I'm quite sure that there are occasions where Axiom can be really
charming, but I believe there are too many premises required to use it
efficiently (e.g. you will want to be sure that the XML data is not
too large). But if you have some complex transformation that appears
too difficult to implement in a one-pass approach, Axiom could
probably do the trick.
I'm sure we will explore this idea at a later time, though...
> However this is due to the buffering issue pointed out by Sylvain which
> [5] is not solving at all. Brings me back to do a sax (+stax) approach
> again (the other class in the package).
>
> I am really excited about this thread. :)
>
I must admit that it got me all excited by now, too.
Yesterday, I did a very minimalistic POC, just to make sure our current
approach is not missing any major point.
I have to say I was simply amazed how easy state handling can be when
using StAX compared to SAX and I'm very confident that we came up with a
pretty thorough concept.
After all, we tortured our poor students for more than a month with
evaluating implementations, writing use cases - outside Cocoon! - using
both SAX and StAX, before even "allowing" them to think about how to
integrate this into Cocoon.
I believe this was necessary to fully understand the differences between
StAX and SAX and - even more important - the different usage patterns
associated with them.
And I'm sure this allows us now to fully reap the benefits of this API.
Well, I can't wait to see the first components ...
> salu2
>
> ...
>
>> [1] http://stax-utils.dev.java.net/
>> [2] http://www.flickr.com/services/api/response.rest.html
>>
>
> [3]
> https://svn.apache.org/repos/asf/forrest/branches/dispatcher_rewrite/plugins/org.apache.forrest.plugin.internal.dispatcher
> [4] http://ws.apache.org/commons/axiom/OMTutorial.html
> [5]
> https://svn.apache.org/repos/asf/forrest/branches/dispatcher_rewrite/plugins/org.apache.forrest.plugin.internal.dispatcher/src/java/org/apache/forrest/dispatcher/transformation/DispatcherWrapperTransformer.java
>
Re: [cocoon3] Stax Pipelines
Posted by Thorsten Scherler <th...@juntadeandalucia.es>.
On Wed, 03-12-2008 at 08:56 +0100, Sylvain Wallez wrote:
> Andreas Pieber wrote:
...
> > One big "problem" in this approach is that the "flow direction of events" is
> > completely inverted. This means that StAX and SAX components would not be able
> > to work "directly" together. But also in a push-pull approach a conversion
> > between StAX and SAX events have to be done and further more this problem could
> > be tackled by writing a wrapper or adapters around the SAX components and add
> > them to an StAX pipe.
> >
>
> Absolutely. Converting Stax to SAX is fairly trivial, but the other way
> around requires buffering or multithreading. Have you looked at
> Stax-Utils [1]? It contains many classes to ease the SAX <-> Stax
> translation.
>
I lately played around (and still do) with such an approach in the
forrest dispatcher rewrite [3]. I am using Axiom, which is a quite
interesting approach and maybe worth looking into [4]. However I did
some profiling, and for the dispatcher the old SAX approach was way
faster.
However this is due to the buffering issue pointed out by Sylvain,
which [5] does not solve at all. That brings me back to a sax (+stax)
approach again (the other class in the package).
I am really excited about this thread. :)
salu2
...
> [1] http://stax-utils.dev.java.net/
> [2] http://www.flickr.com/services/api/response.rest.html
[3]
https://svn.apache.org/repos/asf/forrest/branches/dispatcher_rewrite/plugins/org.apache.forrest.plugin.internal.dispatcher
[4] http://ws.apache.org/commons/axiom/OMTutorial.html
[5]
https://svn.apache.org/repos/asf/forrest/branches/dispatcher_rewrite/plugins/org.apache.forrest.plugin.internal.dispatcher/src/java/org/apache/forrest/dispatcher/transformation/DispatcherWrapperTransformer.java
--
Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>
Sociedad Andaluza para el Desarrollo de la Sociedad
de la Información, S.A.U. (SADESI)
Re: [cocoon3] Stax Pipelines
Posted by Sylvain Wallez <sy...@apache.org>.
Andreas Pieber wrote:
> First of all, my name is Andreas and I'm one of the students working on the StAX
> implementation for cocoon. Therefore hello from my colleagues and me.
>
Hi Andreas and colleagues!
> Secondly, this is my first post ever to the mailing list of an open source project and
> such a long post to answer. Thank you Sylvain ;) Nevertheless I'm going to try
> my best.
>
Doh, sorry for that. But at least this brought some material for the
discussion :-P
> We (if i say we, I mean us students strongly influenced by Reinhard and Steven
> :)) also thought about the problems described by you and came to the same
> conclusion.
Good to hear!
> Therefore we're trying another approach. Pulling StAX-XmlEvents
> through the entire pipeline from the end.
>
> In other words, if we have a simple pipe of the following form:
>
> Producer - Transformer - Serializer
>
> the Serializer would have in its start method some code like:
>
> while(parent.hasNext()){
> xmlOutputWriter.add(parent.getNext());
> }
>
> retrieving the next event on the Transformer in this case and writing it into an
> XmlOutputWriter. The transformer itself calls the getNext method on the
> Starter (in this case) which retrieves the XmlEvents directly from the
> XmlInputReader.
>
> In this approach the Transformer needs (of course) some kind of buffer since in
> response to one sibling from the parent much new content could be produced by
> the transformer. This content is only retrieved one by one while the next
> pipeline component calls getNext which explains the need for some kind of
> buffer.
>
> Of course this buffer and some more helper code have to be produced to avoid
> code duplication and helping the developer.
>
I thought about that approach as well, but it doesn't avoid state
management, which is the main complexity that Stax is supposed to solve.
This is still a callback-based processing, although we have here pull
callbacks rather than push callbacks.
Now you're right: a single pull callback can consume several input
events that are related, making it thus easy to process a subtree of
several closely related elements from the input. It would for example
radically simplify the implementation of the I18nTransformer, where
<i18n:translate> and <i18n:choose> have a nested structure.
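To make this concrete, here is a minimal sketch (my own illustration, not actual Cocoon or I18nTransformer code) of how a pull API lets a recursive method carry the state of a nested structure on the call stack instead of in handler fields, using only the standard javax.xml.stream API:

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

public class PullNesting {

    // Recursively consume a subtree: the nesting state lives entirely on
    // the call stack, not in instance fields as a SAX handler would need.
    private static String flatten(XMLEventReader in) throws XMLStreamException {
        StringBuilder sb = new StringBuilder();
        while (in.hasNext()) {
            XMLEvent e = in.nextEvent();
            if (e.isStartElement()) {
                // Recurse into the child subtree; the nested call consumes
                // everything up to the matching end element.
                sb.append('(').append(flatten(in)).append(')');
            } else if (e.isEndElement()) {
                return sb.toString(); // this subtree is finished
            } else if (e.isCharacters()) {
                sb.append(e.asCharacters().getData().trim());
            }
            // StartDocument/EndDocument events simply fall through.
        }
        return sb.toString();
    }

    public static String run(String xml) {
        try {
            XMLEventReader in = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            return flatten(in);
        } catch (XMLStreamException e) {
            throw new RuntimeException(e);
        }
    }
}
```

The same bracketing logic as a SAX ContentHandler would need an explicit depth counter or stack across callbacks.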
But in many situations the elements of interest to a transformer enclose
large document sections that are to be propagated without modification.
Examples are JXTemplateTransformer or FormsTransformer (but does anybody
still use these instead of their generator replacements?),
RoleFilterTransformer, SQLTransformer, LuceneIndexTransformer,
MailTransformer, etc.
In that case, if we want to avoid processing the full input when
reacting to a start element in order to keep the benefits of streaming,
we have to use state management very similar to what would be needed for
a SAX implementation.
I also have the feeling that, because of the need for state management,
we'll end up with quite complex structures, mixing a callback and state
automata approach with the pull approach where state is kept in the
method call stack and local variables.
Now I'd love to be proven wrong, since after considering these issues
I've never actually experimented with this approach.
> One big "problem" in this approach is that the "flow direction of events" is
> completely inverted. This means that StAX and SAX components would not be able
> to work "directly" together. But also in a push-pull approach a conversion
> between StAX and SAX events have to be done and further more this problem could
> be tackled by writing a wrapper or adapters around the SAX components and add
> them to an StAX pipe.
>
Absolutely. Converting Stax to SAX is fairly trivial, but the other way
around requires buffering or multithreading. Have you looked at
Stax-Utils [1]? It contains many classes to ease the SAX <-> Stax
translation.
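For what it's worth, the "trivial" direction can be sketched in a few lines (my own rough illustration, not Stax-Utils code): pull StAX events and push the corresponding SAX callbacks, here with a small collecting handler to show it running.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class StaxToSax {

    // Pull StAX events and push them as SAX callbacks. Only elements,
    // attributes and text are handled; namespaces, PIs and comments are
    // left out to keep the sketch short.
    public static void pump(XMLEventReader in, ContentHandler out) {
        try {
            out.startDocument();
            while (in.hasNext()) {
                XMLEvent e = in.nextEvent();
                if (e.isStartElement()) {
                    StartElement s = e.asStartElement();
                    AttributesImpl atts = new AttributesImpl();
                    for (Iterator<?> it = s.getAttributes(); it.hasNext();) {
                        Attribute a = (Attribute) it.next();
                        String an = a.getName().getLocalPart();
                        atts.addAttribute("", an, an, "CDATA", a.getValue());
                    }
                    String n = s.getName().getLocalPart();
                    out.startElement("", n, n, atts);
                } else if (e.isEndElement()) {
                    String n = e.asEndElement().getName().getLocalPart();
                    out.endElement("", n, n);
                } else if (e.isCharacters()) {
                    char[] c = e.asCharacters().getData().toCharArray();
                    out.characters(c, 0, c.length);
                }
            }
            out.endDocument();
        } catch (Exception ex) {
            throw new RuntimeException(ex);
        }
    }

    // Small demo: parse a string and trace it through SAX callbacks.
    public static String demo(String xml) {
        StringBuilder sb = new StringBuilder();
        ContentHandler collector = new DefaultHandler() {
            @Override public void startElement(String u, String l, String q,
                    org.xml.sax.Attributes a) { sb.append('<').append(l).append('>'); }
            @Override public void endElement(String u, String l, String q) {
                sb.append("</").append(l).append('>');
            }
            @Override public void characters(char[] c, int s, int len) {
                sb.append(c, s, len);
            }
        };
        try {
            pump(XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml)), collector);
        } catch (javax.xml.stream.XMLStreamException ex) {
            throw new RuntimeException(ex);
        }
        return sb.toString();
    }
}
```

The inverse (SAX to StAX) has no such simple loop, which is exactly the buffering/multithreading problem discussed in this thread.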
> At the moment we're developing a prototype for such a "pull only pipe" to get
> some experience with it.
>
Even if I may seem a bit negative above, keep up this work. As I said,
I haven't actually experimented with Stax-based state management, so
maybe my feelings are wrong, and I'm very interested in seeing what you
can come up with.
Now there's one very interesting use case for Stax we should not forget:
communication with remote APIs in an xmlrpc style, where the response body
contains both status information useful to a controller, and actual data
that can be used by a pipeline. In that case, the application controller
should be able to pull a few events from the request until it has all
the necessary information to decide what to do next, and then replay the
full request event stream into a pipeline.
A typical example is the Flickr "REST" response [2], which BTW is
actually not REST at all since the status code is in the response body
rather than in the HTTP status. A typical controller for this API would be:
InputStream flickrResponse = callFlickrAPI("foo");
PushBackStreamReader input = new PushBackStreamReader(flickrResponse);
input.nextTag();
if ("ok".equals(input.getAttributeValue(null, "status"))) {
    // go back to the first event in the stream
    input.reset();
    Pipeline pipe = new Pipeline();
    pipe.setGenerator(input);
    ... build the pipeline and run it ...
} else {
    sendErrorResponse("Flickr failed");
}
(note that in "pipe.setGenerator(input)" I don't care if the pipeline is
Stax-based or SAX-based with a Stax to SAX converter)
> I hope I was able to point out the nub of our thoughts. So, what do you think?
>
Yes, you got it! And sorry for throwing such a large email at you for
your first participation :-)
But you'll quickly learn that cocoon-dev is a friendly place where
everybody can voice his opinions... and have them challenged :-P
Sylvain
[1] http://stax-utils.dev.java.net/
[2] http://www.flickr.com/services/api/response.rest.html
--
Sylvain Wallez - http://bluxte.net
Re: [cocoon3] Stax Pipelines
Posted by Andreas Pieber <an...@schmutterer-partner.at>.
First of all, my name is Andreas and I'm one of the students working on the StAX
implementation for Cocoon. Therefore hello from my colleagues and me.
Secondly, this is my first post ever to the mailing list of an open source
project, and such a long post to answer. Thank you Sylvain ;) Nevertheless I'm
going to try my best.
We (if I say we, I mean us students strongly influenced by Reinhard and Steven
:)) also thought about the problems described by you and came to the same
conclusion. Therefore we're trying another approach: pulling StAX XmlEvents
through the entire pipeline from the end.
In other words, if we have a simple pipe of the following form:
Producer - Transformer - Serializer
the Serializer would have in its start method some code like:
while (parent.hasNext()) {
    xmlOutputWriter.add(parent.getNext());
}
retrieving the next event from the Transformer in this case and writing it into
an XmlOutputWriter. The transformer itself calls the getNext method on the
Starter (in this case), which retrieves the XmlEvents directly from the
XmlInputReader.
In this approach the Transformer needs (of course) some kind of buffer, since
in response to a single event pulled from the parent a lot of new content could
be produced by the transformer. This content is only retrieved one event at a
time as the next pipeline component calls getNext, which explains the need for
some kind of buffer.
Of course this buffer and some more helper code have to be provided to avoid
code duplication and to help the developer.
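For illustration, the control flow described above could look roughly like this (the StaxProducer interface and all names are made up for the sketch, with plain strings standing in for XmlEvents; the actual prototype will certainly differ):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class PullPipeDemo {

    // Hypothetical pull interface: each component asks its parent for events.
    public interface StaxProducer {
        boolean hasNext();
        String getNext();
    }

    // Starter: hands out events from the underlying input.
    public static StaxProducer starter(String... events) {
        final Deque<String> q = new ArrayDeque<>();
        for (String e : events) q.add(e);
        return new StaxProducer() {
            public boolean hasNext() { return !q.isEmpty(); }
            public String getNext() { return q.poll(); }
        };
    }

    // Transformer: one pulled event may expand into several, hence the buffer.
    public static StaxProducer duplicator(final StaxProducer parent) {
        final Deque<String> buf = new ArrayDeque<>();
        return new StaxProducer() {
            public boolean hasNext() { return !buf.isEmpty() || parent.hasNext(); }
            public String getNext() {
                if (buf.isEmpty()) {           // pull one event from the parent...
                    String e = parent.getNext();
                    buf.add(e);
                    buf.add(e);                // ...and produce two in response
                }
                return buf.poll();             // hand buffered events out one by one
            }
        };
    }

    // Serializer: the while loop from the mail, driving the pipe from the end.
    public static String serialize(StaxProducer parent) {
        StringBuilder out = new StringBuilder();
        while (parent.hasNext()) out.append(parent.getNext());
        return out.toString();
    }
}
```

The serializer's loop drives everything: each getNext() call propagates back through the transformer to the starter.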
One big "problem" in this approach is that the "flow direction of events" is
completely inverted. This means that StAX and SAX components would not be able
to work "directly" together. But also in a push-pull approach a conversion
between StAX and SAX events has to be done, and furthermore this problem could
be tackled by writing wrappers or adapters around the SAX components and
adding them to a StAX pipe.
At the moment we're developing a prototype for such a "pull only pipe" to get
some experience with it.
I hope I was able to point out the nub of our thoughts. So, what do you think?
Andreas
On Tuesday 02 December 2008 17:16:25 Sylvain Wallez wrote:
> Reinhard Pötz wrote:
> > I've had Stax pipelines on my radar for a rather long time because I
> > think that Stax can simplify the writing of transformers a lot.
> > I proposed this idea to Alexander Schatten, an assistant professor at
> > the Vienna University of Technology and he then proposed it to his
> > students.
> >
> > A group of four students accepted to work on this as part of their
> > studies. Steven and I are coaching this group from October to January
> > and the goal is to support Stax pipeline components in Cocoon 3.
> >
> > So far the students learned more about Cocoon 3, Sax, Stax and did some
> > performance comparisons. This week we've entered the phase where the
> > students have to work on the actual Stax pipeline implementation.
> >
> > I asked the students to introduce themselves and also to present the
> > current ideas of how to implement Stax pipelines. So Andreas, Killian,
> > Michael and Jakob, the floor is yours!
>
> I have spent some cycles on this subject and came to the surprising
> conclusion that writing Stax _pipelines_ is actually rather complex.
>
> A Stax transformer pulls events from the previous component in the
> pipeline, which removes the need for the complex state machinery often
> needed for SAX (push) transformers by transforming it into a simple
> function call stack and local variables. This is the main interest of
> Stax vs SAX.
>
> But how does a transformer expose its result to the next component in
> the chain so that this next component can also pull events in the Stax
> style?
>
> When it produces an event, a Stax transformer should put this event
> somewhere so that it can be pulled and processed by the next component.
> But pulling also means the transformer does not suspend its execution
> since it continues pulling events from the previous component. This is
> actually reflected in the Stax API which provides a pull-based
> XMLStreamReader, but only a very SAX-like XMLStreamWriter.
>
> So a Stax transformer is actually a pull input / push output component.
>
> To allow the next component in the pipeline to be also pull-based, there
> are 3 solutions (at least this is what I came up with):
>
> Buffering
> ---------
> The XMLStreamWriter where the transformer writes to buffers all events
> in a data structure similar to our XMLByteStreamCompiler, that can be
> used as a XMLStreamReader by the next component in the chain. The
> pipeline object then has to call some execute() method on every
> component in the pipeline in sequence, after having provided them with
> the proper buffer-based reader and writer.
>
> Execution is single-threaded, which fits well with all the
> non-threadsafe classes and threadlocals we usually have in web
> applications, but requires buffering and thus somewhat defeats the
> purpose of stream-based processing, and can make it simply impossible
> to process large documents.
>
> Note however that because it is single-threaded, we can work with two
> buffers (one for input, one for output) that are reused whatever the
> number of components in the pipeline.
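A sketch of what such a single-threaded driver with two reused buffers could look like (hypothetical Stage interface, strings instead of XML events; my illustration, not Cocoon code):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class BufferedPipeline {

    // Hypothetical component contract: drain the input buffer, fill the output.
    public interface Stage {
        void execute(Deque<String> in, Deque<String> out);
    }

    // Run all stages in sequence, reusing the same two buffers throughout:
    // one for input, one for output, swapped after each stage regardless of
    // how many components the pipeline has.
    public static Deque<String> run(List<Stage> stages, Deque<String> input) {
        Deque<String> in = input;
        Deque<String> out = new ArrayDeque<>();
        for (Stage s : stages) {
            s.execute(in, out);
            Deque<String> tmp = in;  // swap: this stage's output becomes
            in = out;                // the next stage's input
            out = tmp;
            out.clear();             // reuse the drained buffer
        }
        return in; // holds the last stage's output after the final swap
    }
}
```

The whole input is materialized between stages, which is exactly the "we're no longer streaming" drawback described above.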
>
> Multithreading
> --------------
> Each component of the pipeline runs in a separate thread, and writes its
> output into an event queue that is consumed asynchronously by the next
> component in the pipeline. The event queue is presented as an
> XMLStreamReader to the next component.
>
> This approach requires very little buffering (and we can even have an
> upper bound on the event queue size). It also makes nice use of the
> parallel processing capabilities of multi-core CPUs, although in web apps the
> parallelism is also handled by concurrent http requests. This is
> typically the approach that would be used with Erlang or Scala actors.
>
> Multithreading has some issues though, since the servlet API more or
> less implies that a single thread processes the request and we may have
> some concurrency issues. Web app developers also take single threading
> as a basic assumption and use threadlocals here and there.
>
> This approach also prevents the reuse of char[] buffers as is usually
> done by XML parsers, since events are processed asynchronously. All
> char[] have to be copied, but this is a minor issue.
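The event-queue idea can be illustrated with a bounded java.util.concurrent queue (strings stand in for XML events, and the names are made up; a real implementation would present the queue to the next component as an XMLStreamReader):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueuePipeline {

    private static final String EOF = "\u0000EOF"; // sentinel: end of stream

    // Producer stage runs in its own thread and feeds a bounded queue;
    // the bound is the upper limit on buffering mentioned above.
    public static BlockingQueue<String> produce(String... events) {
        final BlockingQueue<String> q = new ArrayBlockingQueue<>(4);
        Thread t = new Thread(() -> {
            try {
                for (String e : events) q.put(e); // blocks when the queue is full
                q.put(EOF);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return q;
    }

    // Consumer stage pulls from the queue in the caller's thread,
    // concurrently with the producer.
    public static String consume(BlockingQueue<String> q) {
        StringBuilder out = new StringBuilder();
        try {
            for (String e = q.take(); !e.equals(EOF); e = q.take()) out.append(e);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(ie);
        }
        return out.toString();
    }
}
```

Each additional pipeline component would be another thread sitting between two such queues, which is where the threadlocal and servlet-API concerns above come in.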
>
> Continuations
> -------------
> When a transformer sends an event to the next component in the chain,
> its execution is suspended and captured in a continuation. The
> continuation of the next pipeline component is resumed until it has
> consumed the event. We then switch back to the current component until
> it produces an event, etc, etc.
>
> This approach is single-threaded and so avoids the concurrency issues
> mentioned above, and also avoids buffering. But there is certainly a
> high overhead with the large number of continuation capturing/resuming.
> This number can be reduced though if we have some level of buffering to
> allow processing of several events in one capture/resume cycle.
>
> It also requires all the bytecode of transformers to be instrumented for
> continuations, which in itself adds quite some memory and processing
> overhead. Torsten also posted on this subject quite long ago [1].
>
>
> Conclusion
> ----------
> All things considered, I came to the conclusion that a full Stax
> pipeline either requires buffering to be reliable (but then we're no
> longer streaming), or requires very careful inspection of all components for
> multi-threading issues.
>
> So in the end, Stax probably has to be considered as a helper _inside_ a
> component to ease processing: buffer all SAX input, then pull the
> received events to avoid complex state automata.
>
> Looks like I'm in a "long mail" period and I hope I haven't lost anybody
> here :-)
>
> So, what do you think?
>
> Sylvain
>
> [1] http://vafer.org/blog/20060807003609
--
SCHMUTTERER+PARTNER Information Technology GmbH
Hiessbergergasse 1
A-3002 Purkersdorf
T +43 (0) 69911127344
F +43 (2231) 61899-99
mail to: andreas.pieber@schmutterer-partner.at
Re: [cocoon3] Stax Pipelines
Posted by Simone Gianni <si...@semeru.it>.
Hi all,
since Stax is an inversion of the call flow, what we have is an inversion of the advantages and disadvantages we had with SAX.
I'll try to explain it better. Suppose we have two schemas, one contains "LONG" elements, with lots of children and stuff inside, the other contains "SHORT" elements, with just as attribute "id". Now suppose it is possible to translate from one to the other, for example it could be that LONG stuff is stored on the database, and SHORT is a placeholder pointing to LONGs on the database.
Now, we want to write two transformers. One is SHORT to LONG, which will perform some selects on the database and expand those SHORTs into LONGs. The other one stores stuff on the database, and converts LONG to SHORT.
As we all know (the i18n transformer is a good example), in SAX, transforming from LONG to SHORT is a pain, cause we need to keep the state between multiple calls. In our example, if the LONG to SHORT transformer is a SAX based one, we would need to buffer all the LONG content, then store it on the DB and then emit a single SHORT. That buffering is our state.
Instead, this kind of transformation is quite easy in a Stax transformer, cause when we encounter a LONG we can just fetch all the data we need, and perform everything we need to do in a single method, without having to preserve the state across different calls. Such a transformer in Stax could be nearly stateless/threadsafe from an XML point of view (the database connection would be state, but that's just for the sake of the example).
However suppose we are doing the SHORT to LONG translation. In this case, using SAX is by far simpler than Stax. In fact, when we encounter a SHORT, we can fetch stuff from the DB and start bombing the next handler in the pipeline with elements as soon as they arrive from the DB. Doing it in Stax instead would require us to have a state, cause we would need to buffer data from the DB, and serve that data to the subsequent calls from our Stax consumer until the buffer is empty. Exactly the opposite problem of a SAX pipeline.
The SAX part of these example is nothing new to Cocoon. We already have an infrastructure for buffering SAX events when we need to in our transformers, in extremis even building a DOM out of it (which we could consider the most versatile and expensive form of buffering). Couldn't we just provide such a buffer for those Stax based transformers when they need it?
This would be an intermediate solution: there would be an easy way to keep state during Stax calls (as there was for SAX, but the other way around), it would still be a pure Stax-based pipeline, buffering would be limited to the bare minimum required by the transformer, and it could be avoided altogether by reimplementing the transformer with more complex state logic if needed for performance reasons.
This is not a solution to the SAX<->Stax cooperation problem, just my two cents on the "Is implementing a Stax-based transformer easier or more complicated than a SAX-based one?" discussion :)
Simone
Re: [cocoon3] Stax Pipelines
Posted by Sylvain Wallez <sy...@apache.org>.
Reinhard Pötz wrote:
> I've had Stax pipelines on my radar for a rather long time because I
> think that Stax can simplify the writing of transformers a lot.
> I proposed this idea to Alexander Schatten, an assistant professor at
> the Vienna University of Technology and he then proposed it to his
> students.
>
> A group of four students accepted to work on this as part of their
> studies. Steven and I are coaching this group from October to January
> and the goal is to support Stax pipeline components in Cocoon 3.
>
> So far the students learned more about Cocoon 3, Sax, Stax and did some
> performance comparisons. This week we've entered the phase where the
> students have to work on the actual Stax pipeline implementation.
>
> I asked the students to introduce themselves and also to present the
> current ideas of how to implement Stax pipelines. So Andreas, Killian,
> Michael and Jakob, the floor is yours!
>
I have spent some cycles on this subject and came to the surprising
conclusion that writing Stax _pipelines_ is actually rather complex.
A Stax transformer pulls events from the previous component in the
pipeline, which removes the need for the complex state machinery often
needed for SAX (push) transformers by turning it into a simple
function call stack and local variables. This is the main advantage of
Stax over SAX.
But how does a transformer expose its result to the next component in
the chain so that this next component can also pull events in the Stax
style?
When it produces an event, a Stax transformer should put this event
somewhere so that it can be pulled and processed by the next component.
But pulling also means the transformer does not suspend its execution
since it continues pulling events from the previous component. This is
actually reflected in the Stax API which provides a pull-based
XMLStreamReader, but only a very SAX-like XMLStreamWriter.
So a Stax transformer is actually a pull-input / push-output component.
To allow the next component in the pipeline to also pull its input, there
are 3 solutions (at least this is what I came up with):
Buffering
---------
The XMLStreamWriter that the transformer writes to buffers all events
in a data structure similar to our XMLByteStreamCompiler, which can then
be used as an XMLStreamReader by the next component in the chain. The
pipeline object then has to call some execute() method on every
component in the pipeline in sequence, after having provided them with
the proper buffer-based reader and writer.
Execution is single-threaded, which fits well with all the
non-threadsafe classes and threadlocals we usually have in web
applications, but it requires buffering and thus somewhat defeats the
purpose of stream-based processing, and it can simply make large
documents impossible to process.
Note however that because it is single-threaded, we can work with two
buffers (one for input, one for output) that are reused whatever the
number of components in the pipeline.
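The two-buffer execution described above can be sketched as follows (a hypothetical driver, with plain strings standing in for StAX events and a BiConsumer standing in for the real component interface):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

// Single-threaded, fully buffered execution model: each component drains its
// input buffer and fills its output buffer; the two buffers are then swapped,
// so only two exist no matter how many components the pipeline has.
class BufferedPipeline {

    static <E> List<E> run(List<E> source, List<BiConsumer<List<E>, List<E>>> components) {
        List<E> in = new ArrayList<>(source);
        List<E> out = new ArrayList<>();
        for (BiConsumer<List<E>, List<E>> component : components) {
            out.clear();                       // reuse the buffer emptied two stages ago
            component.accept(in, out);         // component "pulls" from in, "pushes" to out
            List<E> tmp = in; in = out; out = tmp;
        }
        return in;                             // output of the last component
    }
}
```

A real implementation would buffer StAX events (or a compiled byte stream) instead of strings, but the control flow is the same.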
Multithreading
--------------
Each component of the pipeline runs in a separate thread, and writes its
output into an event queue that is consumed asynchronously by the next
component in the pipeline. The event queue is presented as an
XMLStreamReader to the next component.
This approach requires very little buffering (and we can even put an
upper bound on the event queue size). It also makes good use of the
parallel processing capabilities of multi-core CPUs, although in web
apps parallelism is also handled by concurrent HTTP requests. This is
typically the approach that would be used with Erlang or Scala actors.
Multithreading has some issues though, since the servlet API more or
less implies that a single thread processes the request and we may have
some concurrency issues. Web app developers also take single threading
as a basic assumption and use threadlocals here and there.
This approach also prevents the reuse of char[] buffers as is usually
done by XML parsers, since events are processed asynchronously. All
char[] arrays have to be copied, but this is a minor issue.
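A hedged sketch of that inter-thread bridge: a bounded BlockingQueue that the producing component pushes into and the consuming component pulls from, with a sentinel object marking end-of-stream (strings again stand in for XML events).

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Bounded event queue between two pipeline threads: the producer blocks when
// the consumer lags behind (this is the upper bound on queue size mentioned
// above); the consumer pulls events StAX-style, one call at a time.
class EventBridge {
    // Distinct identity used as an end-of-stream sentinel ("poison pill").
    private static final String END = new String("END");
    private final BlockingQueue<String> queue = new ArrayBlockingQueue<>(16);

    void write(String event) throws InterruptedException { queue.put(event); }

    void close() throws InterruptedException { queue.put(END); }

    // Pull side: returns null once the producer has closed the stream.
    String next() throws InterruptedException {
        String event = queue.take();
        return event == END ? null : event;   // identity check on the sentinel
    }
}
```

A real version would wrap the pull side in an XMLStreamReader facade; the queue discipline stays the same.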
Continuations
-------------
When a transformer sends an event to the next component in the chain,
its execution is suspended and captured in a continuation. The
continuation of the next pipeline component is resumed until it has
consumed the event. We then switch back to the current component until
it produces an event, etc, etc.
This approach is single-threaded and so avoids the concurrency issues
mentioned above, and it also avoids buffering. But there is certainly a
high overhead from the large number of continuation captures and resumes.
This number can be reduced, though, if we have some level of buffering to
allow processing of several events in one capture/resume cycle.
It also requires all the bytecode of transformers to be instrumented for
continuations, which in itself adds quite some memory and processing
overhead. Torsten also posted on this subject quite a long time ago [1].
Conclusion
----------
All things considered, I came to the conclusion that a full Stax
pipeline either requires buffering to be reliable (but then we're no
longer streaming), or requires very careful inspection of all components
for multi-threading issues.
So in the end, Stax probably has to be considered as a helper _inside_ a
component to ease processing: buffer all SAX input, then pull the
received events to avoid complex state automata.
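One possible shape for that helper, sketched under the assumption that serializing the SAX input to an in-memory buffer and re-reading it with an XMLStreamReader is acceptable (buffering the event objects themselves would avoid the re-parse):

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;

// Buffer all incoming SAX events (via an identity TransformerHandler writing
// to an in-memory buffer), then hand the component an XMLStreamReader over
// that buffer so it can pull the events Stax-style, without state automata.
class SaxToStaxBuffer {
    private final StringWriter buffer = new StringWriter();
    private final TransformerHandler handler;

    SaxToStaxBuffer() throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        handler = factory.newTransformerHandler();   // identity transform
        handler.setResult(new StreamResult(buffer));
    }

    // Push side: feed SAX events here (e.g. set as an XMLReader's handler).
    ContentHandler saxSide() { return handler; }

    // Pull side: call after endDocument() has been received.
    XMLStreamReader pullSide() throws XMLStreamException {
        return XMLInputFactory.newFactory()
                .createXMLStreamReader(new StringReader(buffer.toString()));
    }
}
```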
Looks like I'm in a "long mail" period and I hope I haven't lost anybody
here :-)
So, what do you think?
Sylvain
[1] http://vafer.org/blog/20060807003609
--
Sylvain Wallez - http://bluxte.net
Re: [cocoon3] Stax Pipelines
Posted by Steven Dolg <st...@indoqa.com>.
David Crossley wrote:
> Reinhard Pötz wrote:
>
>> I've had Stax pipelines on my radar for a rather long time because I
>> think that Stax can simplify the writing of transformers a lot.
>> I proposed this idea to Alexander Schatten, an assistant professor at
>> the Vienna University of Technology and he then proposed it to his
>> students.
>>
>
> This is a fantastic moment for open source.
> Thanks to all involved. More of this style of
> collaboration please.
>
> -David
>
>
Thank you, David.
I believe it was Werner Guttmann from Castor who started doing this.
Since he's a good friend and colleague, we gained some insight into how
it worked out and lots of details about the promising results they produced.
He suggested we should do the same with Cocoon.
I'm really glad we took the time to prepare two proposals and I'm pretty
sure we will continue to do this - provided the feedback is good and
there are students willing to work with us, of course... ;-)
Steven
Re: [cocoon3] Stax Pipelines
Posted by David Crossley <cr...@apache.org>.
Reinhard Pötz wrote:
>
> I've had Stax pipelines on my radar for a rather long time because I
> think that Stax can simplify the writing of transformers a lot.
> I proposed this idea to Alexander Schatten, an assistant professor at
> the Vienna University of Technology and he then proposed it to his
> students.
This is a fantastic moment for open source.
Thanks to all involved. More of this style of
collaboration please.
-David
Re: [cocoon3] Stax Pipelines
Posted by Grzegorz Kossakowski <gr...@tuffmail.com>.
Reinhard Pötz wrote:
> I've had Stax pipelines on my radar for a rather long time because I
> think that Stax can simplify the writing of transformers a lot.
> I proposed this idea to Alexander Schatten, an assistant professor at
> the Vienna University of Technology and he then proposed it to his
> students.
>
> A group of four students accepted to work on this as part of their
> studies. Steven and I are coaching this group from October to January
> and the goal is to support Stax pipeline components in Cocoon 3.
>
> So far the students learned more about Cocoon 3, Sax, Stax and did some
> performance comparisons. This week we've entered the phase where the
> students have to work on the actual Stax pipeline implementation.
>
> I asked the students to introduce themselves and also to present the
> current ideas of how to implement Stax pipelines. So Andreas, Killian,
> Michael and Jakob, the floor is yours!
Wow, what a surprise, Reinhard!
I won't comment on the actual proposal until I hear some details about the research the
students have done, but I would like to say "thank you" to all the people involved in this effort.
It's nice to hear that there will be more students involved in Cocoon!
--
Best regards,
Grzegorz Kossakowski