Posted to dev@cocoon.apache.org by Reinhard Pötz <re...@apache.org> on 2008/12/02 12:54:48 UTC

[cocoon3] Stax Pipelines

I've had Stax pipelines on my radar for a rather long time because I
think that Stax can simplify the writing of transformers a lot.
I proposed this idea to Alexander Schatten, an assistant professor at
the Vienna University of Technology and he then proposed it to his
students.

A group of four students agreed to work on this as part of their
studies. Steven and I are coaching this group from October to January
and the goal is to support Stax pipeline components in Cocoon 3.

So far the students have learned more about Cocoon 3, SAX and Stax and
have done some performance comparisons. This week we've entered the
phase where the students have to work on the actual Stax pipeline
implementation.

I asked the students to introduce themselves and also to present the
current ideas of how to implement Stax pipelines. So Andreas, Killian,
Michael and Jakob, the floor is yours!

-- 
Reinhard Pötz                           Managing Director, {Indoqa} GmbH
                         http://www.indoqa.com/en/people/reinhard.poetz/

Member of the Apache Software Foundation
Apache Cocoon Committer, PMC member                  reinhard@apache.org
________________________________________________________________________

Re: [cocoon3] Stax Pipelines

Posted by Simone Tripodi <si...@gmail.com>.
Hi all,
this is a very nice and interesting thread, congratulations! I share
exactly the same feeling: I can't wait to see the first components, but
my curiosity is focused more on seeing how the Stax and SAX components
will be integrated :)
I can't contribute much in this area since I'm not a Stax expert, so I
can only offer my support!
Best regards,
Simone

-- 
My LinkedIn profile: http://www.linkedin.com/in/simonetripodi
My GoogleCode profile: http://code.google.com/u/simone.tripodi/
My Picasa: http://picasaweb.google.com/simone.tripodi/
My Tube: http://www.youtube.com/user/stripodi
My Del.icio.us: http://del.icio.us/simone.tripodi

Re: [cocoon3] Stax Pipelines

Posted by Sylvain Wallez <sy...@apache.org>.
Steven Dolg wrote:
> Thorsten Scherler wrote:
>> On Wed, 03-12-2008 at 08:56 +0100, Sylvain Wallez wrote:
>>  
>>> Andreas Pieber wrote:
>>>     
>> ...
>>  
>>>> One big "problem" in this approach is that the "flow direction of 
>>>> events" is completely inverted. This means that StAX and SAX 
>>>> components would not be able to work "directly" together. But also 
>>>> in a push-pull approach a conversion between StAX and SAX events 
>>>> have to be done and further more this problem could be tackled by 
>>>> writing a wrapper or adapters around the SAX components and add 
>>>> them to an StAX pipe.
>>>>         
>>> Absolutely. Converting Stax to SAX is fairly trivial, but the other 
>>> way around requires buffering or multithreading. Have you looked at 
>>> Stax-Utils [1]? It contains many classes to ease the SAX <-> Stax 
>>> translation.
>>>
>>>     
>>
>> I lately played around (and still do) with such approach in the forrest
>> dispatcher rewrite [3]. I am using Axiom which is a quite interesting
>> approach and maybe worth looking into [4]. However I did some profiling
>> and for the dispatcher the old SAX approach was way faster.
>>   
> We started with an evaluation of some StAX implementations including 
> Axiom, Woodstox and the reference implementation.
> However quite early we felt that the DOM-like approach of Axiom is not 
> ideally suited for our current phase.
> I'm quite sure that there are occasions where Axiom can be really 
> charming, but I believe there are too many premises required to 
> efficiently use it (e.g. you will want to be sure that the XML data is 
> not too large). But if you have some complex transformation that 
> appears too difficult to be implemented in a one-pass approach Axiom 
> could probably do the trick.
> I'm sure we will explore this idea at a later time, though...

Axiom is very interesting as the DOM-like structure it provides is the 
easiest way to traverse a document while avoiding full parsing of the 
document. But this comes at a price, since any traversal of a list of 
elements requires parsing all of these elements. So without very 
careful use, it can quickly degenerate into a classical DOM with the 
associated problems, or even worse because of the additional complexity 
required by deferred parsing.

Not to say that Axiom is bad, rather the contrary: it's a powerful 
weapon which you can easily shoot yourself in the foot with :-)

>> However this is due to the buffering issue pointed out by Sylvain which
>> [5] is not solving at all. Brings me back to do a sax (+stax) approach
>> again (the other class in the package).
>>
>> I am really excited about this thread. :)
>>   
> I must admit that it got me all excited by now, too.

Should I say me too? ;-)

> Yesterday, I did a very minimalistic POC, just to make sure our 
> current approach is not missing any major point.
> I have to say I was simply amazed how easy state handling can be when 
> using StAX compared to SAX and I'm very confident that we came up with 
> a pretty thorough concept.

I'm eager to see what it looks like!

> After all, we tortured our poor students more than a month with 
> evaluating implementations, writing use cases - outside Cocoon! - 
> using both SAX and StAX, before even "allowing" them to think about 
> how to integrate this into Cocoon.
> I believe this was necessary to fully understand the differences 
> between StAX and SAX and - even more important - the different usage 
> patterns associated with them.
> And I'm sure this allows us now to fully reap the benefits of this API.
>
> Well, I can't wait to see the first components ...

Same here! But don't forget to include the important XSLT transformer in
the torture program, since I don't know of any implementation that would
support pull callbacks and thus avoid buffering the output in a Stax
pipeline.

Sylvain

-- 
Sylvain Wallez - http://bluxte.net


Re: [cocoon3] Stax Pipelines

Posted by Steven Dolg <st...@indoqa.com>.
Thorsten Scherler wrote:
> On Wed, 03-12-2008 at 08:56 +0100, Sylvain Wallez wrote:
>   
>> Andreas Pieber wrote:
>>     
> ...
>   
>>> One big "problem" in this approach is that the "flow direction of events" is 
>>> completely inverted. This means that StAX and SAX components would not be able 
>>> to work "directly" together. But also in a push-pull approach a conversion 
>>> between StAX and SAX events have to be done and further more this problem could 
>>> be tackled by writing a wrapper or adapters around the SAX components and add 
>>> them to an StAX pipe.
>>>   
>>>       
>> Absolutely. Converting Stax to SAX is fairly trivial, but the other way 
>> around requires buffering or multithreading. Have you looked at 
>> Stax-Utils [1]? It contains many classes to ease the SAX <-> Stax 
>> translation.
>>
>>     
>
> I lately played around (and still do) with such approach in the forrest
> dispatcher rewrite [3]. I am using Axiom which is a quite interesting
> approach and maybe worth looking into [4]. However I did some profiling
> and for the dispatcher the old SAX approach was way faster.
>   
We started with an evaluation of some StAX implementations including 
Axiom, Woodstox and the reference implementation.
However, quite early on we felt that the DOM-like approach of Axiom is
not ideally suited for our current phase.
I'm quite sure that there are occasions where Axiom can be really 
charming, but I believe too many preconditions have to be met to use it
efficiently (e.g. you will want to be sure that the XML data is not too
large). But if you have some complex transformation that appears too
difficult to implement in a one-pass approach, Axiom could probably do
the trick.
I'm sure we will explore this idea at a later time, though...

> However this is due to the buffering issue pointed out by Sylvain which
> [5] is not solving at all. Brings me back to do a sax (+stax) approach
> again (the other class in the package).
>
> I am really excited about this thread. :)
>   
I must admit that it got me all excited by now, too.

Yesterday, I did a very minimalistic POC, just to make sure our current 
approach is not missing any major point.
I have to say I was simply amazed how easy state handling can be when 
using StAX compared to SAX and I'm very confident that we came up with a 
pretty thorough concept.

After all, we tortured our poor students for more than a month with 
evaluating implementations and writing use cases - outside Cocoon! -
using both SAX and StAX, before even "allowing" them to think about how
to integrate this into Cocoon.
I believe this was necessary to fully understand the differences between 
StAX and SAX and - even more important - the different usage patterns 
associated with them.
And I'm sure this allows us now to fully reap the benefits of this API.

Well, I can't wait to see the first components ...
> Regards
>
> ...
>   
>> [1] http://stax-utils.dev.java.net/
>> [2] http://www.flickr.com/services/api/response.rest.html
>>     
>
> [3]
> https://svn.apache.org/repos/asf/forrest/branches/dispatcher_rewrite/plugins/org.apache.forrest.plugin.internal.dispatcher
> [4] http://ws.apache.org/commons/axiom/OMTutorial.html
> [5]
> https://svn.apache.org/repos/asf/forrest/branches/dispatcher_rewrite/plugins/org.apache.forrest.plugin.internal.dispatcher/src/java/org/apache/forrest/dispatcher/transformation/DispatcherWrapperTransformer.java
>   


Re: [cocoon3] Stax Pipelines

Posted by Thorsten Scherler <th...@juntadeandalucia.es>.
On Wed, 03-12-2008 at 08:56 +0100, Sylvain Wallez wrote:
> Andreas Pieber wrote:
...
> > One big "problem" in this approach is that the "flow direction of events" is 
> > completely inverted. This means that StAX and SAX components would not be able 
> > to work "directly" together. But also in a push-pull approach a conversion 
> > between StAX and SAX events have to be done and further more this problem could 
> > be tackled by writing a wrapper or adapters around the SAX components and add 
> > them to an StAX pipe.
> >   
> 
> Absolutely. Converting Stax to SAX is fairly trivial, but the other way 
> around requires buffering or multithreading. Have you looked at 
> Stax-Utils [1]? It contains many classes to ease the SAX <-> Stax 
> translation.
> 

I have recently been playing around (and still am) with such an approach
in the Forrest dispatcher rewrite [3]. I am using Axiom, which is quite
an interesting approach and may be worth looking into [4]. However, I
did some profiling and for the dispatcher the old SAX approach was way
faster.

This is due to the buffering issue pointed out by Sylvain, which [5]
does not solve at all. That brings me back to a SAX (+Stax) approach
again (the other class in the package).

I am really excited about this thread. :)

Regards

...
> [1] http://stax-utils.dev.java.net/
> [2] http://www.flickr.com/services/api/response.rest.html

[3]
https://svn.apache.org/repos/asf/forrest/branches/dispatcher_rewrite/plugins/org.apache.forrest.plugin.internal.dispatcher
[4] http://ws.apache.org/commons/axiom/OMTutorial.html
[5]
https://svn.apache.org/repos/asf/forrest/branches/dispatcher_rewrite/plugins/org.apache.forrest.plugin.internal.dispatcher/src/java/org/apache/forrest/dispatcher/transformation/DispatcherWrapperTransformer.java
-- 
Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>

Sociedad Andaluza para el Desarrollo de la Sociedad 
de la Información, S.A.U. (SADESI)





Re: [cocoon3] Stax Pipelines

Posted by Sylvain Wallez <sy...@apache.org>.
Andreas Pieber wrote:
> First of all, my name is Andreas and I'm one of the students working on the StAX 
> implementation for cocoon. Therfore hello from my colleagues and me.
>   

Hi Andreas and colleagues!

> Secondly me first post ever to the mailing list of an open source project and 
> such a long post to answer. Thank you Sylvain ;) Nevertheless I'm going to try 
> my best.
>   

Doh, sorry for that. But at least this brought some material for the 
discussion :-P

> We (if i say we, I mean us students strongly influenced by Reinhard and Steven 
> :)) also thought about the problems described by you and came to the same 
> conclusion.

Good to hear!

> Therefore we're trying another approach. Pulling StAX-XmlEvents 
> through the entire pipeline from the end. 
>
> In other words, if we have a simple pipe of the following form:
>
> Producer - Transformer - Serializer
>
> the Serializer would have in its start method some code like:
>
> while(parent.hasNext()){
> 	xmlOutputWriter.add(parent.getNext());
> }
>
> retrieving the next event on the Transformer in this case and writing it into an 
> XmlOutputWriter. The transformer on his self calls the getNext method on the 
> Starter (in this case) which retrieves the XmlEvents directly from the 
> XmlInputReader.
>
> In this approach the Transformer needs (of course) some kind of buffer since in 
> response to one sibling from the parent much new content could be produced by 
> the transformer. This content is only retrieved one by one while the next 
> pipeline component calls getNext which explains the need for some kind of 
> buffer.
>
> Of course this buffer and some more helper code have to be produced to avoid 
> code duplication and helping the developer.
>   

I thought about that approach as well, but it doesn't avoid state 
management, which is the main complexity that Stax is supposed to solve. 
This is still a callback-based processing, although we have here pull 
callbacks rather than push callbacks.

Now you're right: a single pull callback can consume several input 
events that are related, making it thus easy to process a subtree of 
several closely related elements from the input. It would, for example, 
radically simplify the implementation of the I18nTransformer where 
<i18n:translate> and <i18n:choose> have a nested structure.

But in many situations the elements of interest to a transformer enclose 
large document sections that are to be propagated without modification. 
Examples are JXTemplateTransformer or FormsTransformer (but does anybody 
still use these instead of their generator replacements?), 
RoleFilterTransformer, SQLTransformer, LuceneIndexTransformer, 
MailTransformer, etc.

In that case, if we want to avoid processing the full input when 
reacting to a start element in order to keep the benefits of streaming, 
we have to use state management very similar to what would be needed for 
a SAX implementation.

I also have the feeling that because of the need for state management, 
we'll end up with quite complex structures, because of the mix of a 
callback and state automata approach with the pull approach where state 
is kept in the method calls stack and local variables.

Now I'd love to be proven wrong, since after considering these issues 
I've never actually experimented with this approach.

> One big "problem" in this approach is that the "flow direction of events" is 
> completely inverted. This means that StAX and SAX components would not be able 
> to work "directly" together. But also in a push-pull approach a conversion 
> between StAX and SAX events have to be done and further more this problem could 
> be tackled by writing a wrapper or adapters around the SAX components and add 
> them to an StAX pipe.
>   

Absolutely. Converting Stax to SAX is fairly trivial, but the other way 
around requires buffering or multithreading. Have you looked at 
Stax-Utils [1]? It contains many classes to ease the SAX <-> Stax 
translation.
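
As an illustration of why this direction is the easy one, a minimal
sketch of a Stax to SAX bridge could look like the following. It only
uses standard javax.xml.stream and org.xml.sax classes; the class and
method names (StaxToSaxSketch, pump, trace) are made up for the example
and are not taken from Stax-Utils, and prefix mappings, comments and
processing instructions are deliberately left out.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class, not part of Cocoon or Stax-Utils.
class StaxToSaxSketch {

    /** Pulls every event from the Stax reader and pushes it to the SAX handler. */
    static void pump(XMLStreamReader in, ContentHandler out)
            throws XMLStreamException, SAXException {
        out.startDocument();
        while (in.hasNext()) {
            switch (in.next()) {
                case XMLStreamConstants.START_ELEMENT: {
                    AttributesImpl atts = new AttributesImpl();
                    for (int i = 0; i < in.getAttributeCount(); i++) {
                        atts.addAttribute(uri(in.getAttributeNamespace(i)),
                                in.getAttributeLocalName(i),
                                qName(in.getAttributePrefix(i), in.getAttributeLocalName(i)),
                                "CDATA", in.getAttributeValue(i));
                    }
                    out.startElement(uri(in.getNamespaceURI()), in.getLocalName(),
                            qName(in.getPrefix(), in.getLocalName()), atts);
                    break;
                }
                case XMLStreamConstants.END_ELEMENT:
                    out.endElement(uri(in.getNamespaceURI()), in.getLocalName(),
                            qName(in.getPrefix(), in.getLocalName()));
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                case XMLStreamConstants.SPACE:
                    out.characters(in.getTextCharacters(), in.getTextStart(),
                            in.getTextLength());
                    break;
                default:
                    break; // comments, PIs and prefix mappings are skipped in this sketch
            }
        }
        out.endDocument();
    }

    /** SAX expects an empty string where Stax may report null. */
    private static String uri(String namespace) {
        return namespace == null ? "" : namespace;
    }

    private static String qName(String prefix, String local) {
        return prefix == null || prefix.isEmpty() ? local : prefix + ":" + local;
    }

    /** Small demo handler that records the SAX callbacks it receives. */
    static String trace(String xml) throws XMLStreamException, SAXException {
        final StringBuilder log = new StringBuilder();
        ContentHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                log.append('<').append(qName);
                for (int i = 0; i < atts.getLength(); i++) {
                    log.append(' ').append(atts.getQName(i)).append('=').append(atts.getValue(i));
                }
                log.append('>');
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                log.append("</").append(qName).append('>');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                log.append(ch, start, length);
            }
        };
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        pump(reader, handler);
        return log.toString();
    }
}
```

The loop never has to wait for anybody: the puller drives, and the SAX
handler is simply called. The reverse direction has no such loop to hang
on to, which is where the buffering or the second thread comes from.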

> At the moment we're developing a prototype for such a "pull only pipe" to get 
> some experience with it.
>   

Even if I may seem rather negative above, keep up this work. As I 
said, I haven't actually experimented with Stax-based state management,
so maybe my feelings are wrong and I'm very interested in seeing what
you can come up with.

Now there's one very interesting use case for Stax we should not forget: 
communication with remote APIs in an XML-RPC style, where the response
body contains both status information useful to a controller and actual
data that can be used by a pipeline. In that case, the application
controller should be able to pull a few events from the response until
it has all the necessary information to decide what to do next, and then
replay the full response event stream into a pipeline.

A typical example is the Flickr "REST" response [2], which BTW is 
actually not REST at all since the status code is in the response body 
rather than in the HTTP status. A typical controller for this API would be:

  // callFlickrAPI, PushBackStreamReader and sendErrorResponse are
  // illustrative helpers, not part of the StAX API
  InputStream flickrResponse = callFlickrAPI("foo");
  PushBackStreamReader input = new PushBackStreamReader(flickrResponse);
  // advance to the root element of the response
  input.nextTag();
  if ("ok".equals(input.getAttributeValue(null, "status"))) {
      // go back to the first event in the stream
      input.reset();
      Pipeline pipe = new Pipeline();
      pipe.setGenerator(input);
      ... build the pipeline and run it ...
  } else {
      sendErrorResponse("Flickr failed");
  }

(note that in "pipe.setGenerator(input)" I don't care if the pipeline is 
Stax-based or SAX-based with a Stax to SAX converter)
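
A self-contained variant of the same idea, assuming only the standard
javax.xml.stream API: instead of a resettable reader, the few events
consumed by the controller are remembered in a list and replayed in
front of the untouched remainder of the stream. The names
PeekAndReplaySketch, peekRootAttribute, replay and handle are
hypothetical, the "status" attribute simply follows the snippet above,
and an XMLEventWriter stands in for the pipeline.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.XMLEvent;

// Hypothetical illustration class, not Cocoon code.
class PeekAndReplaySketch {

    /** Pulls events up to the root start tag, remembering everything consumed. */
    static String peekRootAttribute(XMLEventReader in, String name, List<XMLEvent> consumed)
            throws XMLStreamException {
        while (in.hasNext()) {
            XMLEvent event = in.nextEvent();
            consumed.add(event);
            if (event.isStartElement()) {
                Attribute attribute = event.asStartElement().getAttributeByName(new QName(name));
                return attribute == null ? null : attribute.getValue();
            }
        }
        return null;
    }

    /** Replays the remembered events, then streams the rest without buffering it. */
    static void replay(List<XMLEvent> consumed, XMLEventReader rest, XMLEventWriter out)
            throws XMLStreamException {
        for (XMLEvent event : consumed) {
            out.add(event);
        }
        out.add(rest); // copies all remaining events from the reader
        out.flush();
    }

    /** Controller: decide on the status first, then hand the full stream on. */
    static String handle(String response) throws XMLStreamException {
        XMLEventReader in = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(response));
        List<XMLEvent> consumed = new ArrayList<>();
        if (!"ok".equals(peekRootAttribute(in, "status", consumed))) {
            return "error";
        }
        StringWriter result = new StringWriter();
        XMLEventWriter out = XMLOutputFactory.newInstance().createXMLEventWriter(result);
        replay(consumed, in, out);
        out.close();
        return result.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(handle("<rsp status=\"ok\"><photos>1</photos></rsp>"));
        System.out.println(handle("<rsp status=\"fail\"><err/></rsp>"));
    }
}
```

Only the events up to the root start tag are held in memory, so the
body of the response is still streamed.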

> I hope i was able to point out the nub of our thoughts. So, what do you think?
>   

Yes, you got it! And sorry for throwing at you a large email for your 
first participation :-)

But you'll quickly learn that cocoon-dev is a friendly place where 
everybody can voice their opinions... and have them challenged :-P

Sylvain

[1] http://stax-utils.dev.java.net/
[2] http://www.flickr.com/services/api/response.rest.html

-- 
Sylvain Wallez - http://bluxte.net


Re: [cocoon3] Stax Pipelines

Posted by Andreas Pieber <an...@schmutterer-partner.at>.
First of all, my name is Andreas and I'm one of the students working on the StAX 
implementation for Cocoon. So hello from my colleagues and me.

Secondly, this is my first post ever to the mailing list of an open source project,
and it is such a long post to answer. Thank you Sylvain ;) Nevertheless I'm going to
try my best.

We (when I say we, I mean us students, strongly influenced by Reinhard and Steven 
:)) also thought about the problems you described and came to the same 
conclusion. Therefore we're trying another approach: pulling StAX XMLEvents 
through the entire pipeline from the end. 

In other words, if we have a simple pipe of the following form:

Producer - Transformer - Serializer

the Serializer would have in its start method some code like:

while(parent.hasNext()){
	xmlOutputWriter.add(parent.getNext());
}

retrieving the next event from the Transformer in this case and writing it into an 
XmlOutputWriter. The Transformer in turn calls the getNext method on the 
Starter (in this case), which retrieves the XmlEvents directly from the 
XmlInputReader.

In this approach the Transformer needs (of course) some kind of buffer, since in 
response to a single event from the parent the transformer could produce a lot of 
new content. This content is only retrieved one event at a time, whenever the next 
pipeline component calls getNext, which explains the need for some kind of 
buffer.

Of course this buffer and some more helper code have to be provided to avoid 
code duplication and to help the developer.
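
To make the pull-only idea a bit more concrete, here is a minimal sketch
using the standard javax.xml.stream event API. The names PullComponent,
Starter, MarkingTransformer and serialize are assumptions made up for
this illustration and are not Cocoon 3 code; the transformer simply
queues one extra comment event per start element to show why the buffer
is needed.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical illustration class, not the actual prototype.
class PullPipelineSketch {

    /** Assumed component contract: every stage is pulled by its successor. */
    interface PullComponent {
        boolean hasNext() throws XMLStreamException;
        XMLEvent getNext() throws XMLStreamException;
    }

    /** Starter: hands out events straight from a StAX reader. */
    static class Starter implements PullComponent {
        private final XMLEventReader reader;

        Starter(XMLEventReader reader) {
            this.reader = reader;
        }

        public boolean hasNext() {
            return reader.hasNext();
        }

        public XMLEvent getNext() throws XMLStreamException {
            return reader.nextEvent();
        }
    }

    /** Transformer: adds a comment after each start element; extras wait in the buffer. */
    static class MarkingTransformer implements PullComponent {
        private final PullComponent parent;
        private final Deque<XMLEvent> buffer = new ArrayDeque<>();
        private final XMLEventFactory events = XMLEventFactory.newInstance();

        MarkingTransformer(PullComponent parent) {
            this.parent = parent;
        }

        public boolean hasNext() throws XMLStreamException {
            return !buffer.isEmpty() || parent.hasNext();
        }

        public XMLEvent getNext() throws XMLStreamException {
            if (buffer.isEmpty()) {
                // One parent event may turn into several output events.
                XMLEvent event = parent.getNext();
                buffer.add(event);
                if (event.isStartElement()) {
                    String name = event.asStartElement().getName().getLocalPart();
                    buffer.add(events.createComment("seen " + name));
                }
            }
            return buffer.poll();
        }
    }

    /** Serializer: drives the whole pipeline by pulling until the parent is exhausted. */
    static String serialize(PullComponent parent) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        while (parent.hasNext()) {
            writer.add(parent.getNext());
        }
        writer.close();
        return out.toString();
    }

    static String run(String xml) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        return serialize(new MarkingTransformer(new Starter(reader)));
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(run("<doc><item>text</item></doc>"));
    }
}
```

Nothing happens in this sketch until the serializer starts pulling, and
the buffer never holds more than the events produced for one parent
event.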

One big "problem" with this approach is that the "flow direction of events" is 
completely inverted. This means that StAX and SAX components would not be able 
to work "directly" together. But a push-pull approach would also need a 
conversion between StAX and SAX events, and furthermore this problem could 
be tackled by writing wrappers or adapters around the SAX components and adding 
them to a StAX pipe.

At the moment we're developing a prototype for such a "pull only pipe" to get 
some experience with it.

I hope I was able to point out the nub of our thoughts. So, what do you think?

Andreas

On Tuesday 02 December 2008 17:16:25 Sylvain Wallez wrote:
> Reinhard Pötz wrote:
> > I've had Stax pipelines on my radar for a rather long time because I
> > think that Stax can simplify the writing of transformers a lot.
> > I proposed this idea to Alexander Schatten, an assistant professor at
> > the Vienna University of Technology and he then proposed it to his
> > students.
> >
> > A group of four students accepted to work on this as part of their
> > studies. Steven and I are coaching this group from October to January
> > and the goal is to support Stax pipeline components in Cocoon 3.
> >
> > So far the students learned more about Cocoon 3, Sax, Stax and did some
> > performance comparisons. This week we've entered the phase where the
> > students have to work on the actual Stax pipeline implementation.
> >
> > I asked the students to introduce themselves and also to present the
> > current ideas of how to implement Stax pipelines. So Andreas, Killian,
> > Michael and Jakob, the floor is yours!
>
> I have spent some cycles on this subject and came to the surprising
> conclusion that writing Stax _pipelines_ is actually rather complex.
>
> A Stax transformer pulls events from the previous component in the
> pipeline, which removes the need for the complex state machinery often
> needed for SAX (push) transformers by transforming it into a simple
> function call stack and local variables. This is the main interest of
> Stax vs SAX.
>
> But how does a transformer expose its result to the next component in
> the chain so that this next component can also pull events in the Stax
> style?
>
> When it produces an event, a Stax transformer should put this event
> somewhere so that it can be pulled and processed by the next component.
> But pulling also means the transformer does not suspend its execution
> since it continues pulling events from the previous component. This is
> actually reflected in the Stax API which provides a pull-based
> XMLStreamReader, but only a very SAX-like XMLStreamWriter.
>
> So a Stax transformer is actually a pull input / push output component.
>
> To allow the next component in the pipeline to also be pull-based, there
> are 3 solutions (at least this is what I came up with):
>
> Buffering
> ---------
> The XMLStreamWriter where the transformer writes to buffers all events
> in a data structure similar to our XMLByteStreamCompiler, that can be
> used as a XMLStreamReader by the next component in the chain. The
> pipeline object then has to call some execute() method on every
> component in the pipeline in sequence, after having provided them with
> the proper buffer-based reader and writer.
>
> Execution is single-threaded, which fits well with all the non
> threadsafe classes and threadlocals we usually have in web applications,
> but requires buffering and thus somehow defeats the purpose of
> stream-based processing and can simply make it impossible to process
> large documents.
>
> Note however that because it is single-threaded, we can work with two
> buffers (one for input, one for output) that are reused whatever the
> number of components in the pipeline.
>
> Multithreading
> --------------
> Each component of the pipeline runs in a separate thread, and writes its
> output into an event queue that is consumed asynchronously by the next
> component in the pipeline. The event queue is presented as an
> XMLStreamReader to the next component.
>
> This approach requires very little buffering (and we can even have an
> upper bound on the event queue size). It also makes good use of the
> parallel processing capabilities of multi-core CPUs, although in web apps the
> parallelism is also handled by concurrent http requests. This is
> typically the approach that would be used with Erlang or Scala actors.
>
> Multithreading has some issues though, since the servlet API more or
> less implies that a single thread processes the request and we may have
> some concurrency issues. Web app developers also take single threading
> as a basic assumption and use threadlocals here and there.
>
> This approach also prevents the reuse of char[] buffers as is usually
> done by XML parsers since events are processed asynchronously. All char[]
> have to be copied, but this is a minor issue.
>
> Continuations
> -------------
> When a transformer sends an event to the next component in the chain,
> its execution is suspended and captured in a continuation. The
> continuation of the next pipeline component is resumed until it has
> consumed the event. We then switch back to the current component until
> it produces an event, etc, etc.
>
> This approach is single-threaded and so avoids the concurrency issues
> mentioned above, and also avoids buffering. But there is certainly a
> high overhead with the large number of continuation capturing/resuming.
> This number can be reduced though if we have some level of buffering to
> allow processing of several events in one capture/resume cycle.
>
> It also requires all the bytecode of transformers to be instrumented for
> continuations, which in itself adds quite some memory and processing
> overhead. Torsten also posted on this subject quite long ago [1].
>
>
> Conclusion
> ----------
> All things considered, I came to the conclusion that a full Stax
> pipeline either requires buffering to be reliable (but then we're no
> longer streaming), or requires very careful inspection of all components for
> multi-threading issues.
>
> So in the end, Stax probably has to be considered as a helper _inside_ a
> component to ease processing : buffer all SAX input, then pull the
> received events to avoid complex state automata.
>
> Looks like I'm in a "long mail" period and I hope I haven't lost anybody
> here :-)
>
> So, what do you think?
>
> Sylvain
>
> [1] http://vafer.org/blog/20060807003609

-- 
SCHMUTTERER+PARTNER Information Technology GmbH

Hiessbergergasse 1
A-3002 Purkersdorf

T   +43 (0) 69911127344
F   +43 (2231) 61899-99
mail to: andreas.pieber@schmutterer-partner.at

R: Re: [cocoon3] Stax Pipelines

Posted by Simone Gianni <si...@semeru.it>.
Hi all, 
since Stax is an inversion of the call flow, what we have is an inversion of the advantages and disadvantages we had with SAX. 

I'll try to explain it better. Suppose we have two schemas: one contains "LONG" elements, with lots of children and stuff inside, the other contains "SHORT" elements, with just an "id" attribute. Now suppose it is possible to translate from one to the other; for example, it could be that LONG stuff is stored in the database, and SHORT is a placeholder pointing to a LONG in the database. 

Now, we want to write two transformers. One is SHORT to LONG, which will perform some selects on the database and expand each SHORT into a LONG. The other one stores stuff in the database and converts LONG to SHORT. 

As we all know (the i18n transformer is a good example), in SAX, transforming from LONG to SHORT is a pain, because we need to keep state between multiple calls. In our example, if the LONG to SHORT transformer is SAX based, we would need to buffer all the LONG content, then store it in the DB and then emit a single SHORT. That buffering is our state. 

By contrast, this kind of transformation is quite easy in a Stax transformer, because when we encounter a LONG we can just fetch all the data we need and do everything in a single method, without having to preserve state across different calls. Such a transformer in Stax could be nearly stateless/threadsafe from an XML point of view (the database connection would be state, but that's just for the sake of the example). 
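
A rough sketch of that LONG to SHORT case with the standard javax.xml.stream API, under a few assumptions: the element names follow the example above, a plain list stands in for the database, and the class and method names (LongToShortSketch, transform, readSubtreeText) are invented for illustration. The point is that the whole LONG subtree is consumed inside one method call, with the nesting depth held in a local variable.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical illustration class; the "store" list stands in for the database.
class LongToShortSketch {

    /**
     * Copies the document, replacing every LONG subtree by an empty SHORT element.
     * Attributes and namespaces of the other elements are ignored to keep this short.
     */
    static String transform(String xml, List<String> store) throws XMLStreamException {
        XMLStreamReader in = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        StringWriter result = new StringWriter();
        XMLStreamWriter out = XMLOutputFactory.newInstance().createXMLStreamWriter(result);
        while (in.hasNext()) {
            int event = in.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                if ("LONG".equals(in.getLocalName())) {
                    // The whole subtree is handled right here, in one call.
                    store.add(readSubtreeText(in));
                    out.writeEmptyElement("SHORT");
                    out.writeAttribute("id", String.valueOf(store.size()));
                } else {
                    out.writeStartElement(in.getLocalName());
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                out.writeEndElement();
            } else if (event == XMLStreamConstants.CHARACTERS) {
                out.writeCharacters(in.getText());
            }
        }
        out.flush();
        out.close();
        return result.toString();
    }

    /** Reads up to the end tag matching the current start tag and collects the text. */
    static String readSubtreeText(XMLStreamReader in) throws XMLStreamException {
        StringBuilder text = new StringBuilder();
        int depth = 1; // the only "state", and it lives on the call stack
        while (depth > 0) {
            int event = in.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            } else if (event == XMLStreamConstants.CHARACTERS) {
                text.append(in.getText());
            }
        }
        return text.toString();
    }
}
```

A SAX version of the same transformer would need fields for the depth counter and the collected text, plus checks in startElement, endElement and characters to know whether it is currently inside a LONG.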

However suppose we are doing the SHORT to LONG translation. In this case, using SAX is by fax simpler than Stax. In fact, when we encounter a SHORT, we can fetch stuff from the DB and start bombing the next handler in the pipeline with elements as soon as they arrive from the DB. Doing it in Stax instead would require us to have a state, cause we would need to buffer data from the DB, and serve that data to the subseguent calls from our Stax consumer until the buffer is empty. Exactly the opposite problem of a SAX pipeline. 

The SAX part of this example is nothing new to Cocoon. We already have an infrastructure for buffering SAX events when we need to in our transformers, in extremis even building a DOM out of it (which we could consider the most versatile and expensive form of buffering). Couldn't we just provide such a buffer for those Stax-based transformers when they need it?

This would be an intermediate solution, because there would be an easy way to keep state between Stax calls (as there was for SAX, but the opposite way around), it would still be a pure Stax-based pipeline, buffering would be limited to the bare minimum required by the transformer, and it could be avoided altogether by reimplementing the transformer with more complex state logic if needed for performance reasons.
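
To make the idea of such a buffer concrete, here is a rough sketch of the SHORT to LONG direction on the pull side, assuming the javax.xml.stream event API. ShortToLong, load() and render() are hypothetical names; load() stands in for the database select, and each SHORT is assumed to be an empty element with an "id" attribute:

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

class ShortToLong {
    private final XMLEventReader source;
    private final XMLEventFactory events = XMLEventFactory.newInstance();
    // The buffer: events already produced but not yet pulled by the consumer.
    private final Deque<XMLEvent> pending = new ArrayDeque<>();

    ShortToLong(XMLEventReader source) {
        this.source = source;
    }

    /** Hypothetical stand-in for the database select. */
    static String load(String id) {
        return "content of " + id;
    }

    boolean hasNext() {
        return !pending.isEmpty() || source.hasNext();
    }

    XMLEvent nextEvent() throws XMLStreamException {
        if (!pending.isEmpty()) {
            return pending.poll(); // serve buffered events first
        }
        XMLEvent event = source.nextEvent();
        if (event.isStartElement()
                && "SHORT".equals(event.asStartElement().getName().getLocalPart())) {
            StartElement shortElement = event.asStartElement();
            String id = shortElement.getAttributeByName(new QName("id")).getValue();
            source.nextEvent(); // skip the end tag; assumes SHORT is empty
            pending.add(events.createCharacters(load(id)));
            pending.add(events.createEndElement("", "", "LONG"));
            return events.createStartElement("", "", "LONG");
        }
        return event;
    }

    /** Pulls everything through the transformer and renders a compact form. */
    static String render(String xml) throws XMLStreamException {
        ShortToLong reader = new ShortToLong(
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml)));
        StringBuilder sb = new StringBuilder();
        while (reader.hasNext()) {
            XMLEvent e = reader.nextEvent();
            if (e.isStartElement()) {
                sb.append('<').append(e.asStartElement().getName().getLocalPart()).append('>');
            } else if (e.isEndElement()) {
                sb.append("</").append(e.asEndElement().getName().getLocalPart()).append('>');
            } else if (e.isCharacters()) {
                sb.append(e.asCharacters().getData());
            }
        }
        return sb.toString();
    }
}
```

The pending queue is the only state the transformer has to carry between calls, and it is empty again as soon as the consumer has pulled the expanded LONG.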

This is not a solution to the SAX<->Stax cooperation problem, but my two cents on the "Is implementing a Stax based transformer easier or more complicated than a Sax one" discussion :) 

Simone 



----- Original Message -----
From: Sylvain Wallez <sy...@apache.org>
To: dev@cocoon.apache.org
Sent: Tuesday, 2 December 2008 17:16:25 GMT+0100 Europe/Berlin
Subject: Re: [cocoon3] Stax Pipelines

Reinhard Pötz wrote: 
> I've had Stax pipelines on my radar for a rather long time because I 
> think that Stax can simplify the writing of transformers a lot. 
> I proposed this idea to Alexander Schatten, an assistant professor at 
> the Vienna University of Technology and he then proposed it to his 
> students. 
> 
> A group of four students accepted to work on this as part of their 
> studies. Steven and I are coaching this group from October to January 
> and the goal is to support Stax pipeline components in Cocoon 3. 
> 
> So far the students learned more about Cocoon 3, Sax, Stax and did some 
> performance comparisons. This week we've entered the phase where the 
> students have to work on the actual Stax pipeline implementation. 
> 
> I asked the students to introduce themselves and also to present the 
> current ideas of how to implement Stax pipelines. So Andreas, Killian, 
> Michael and Jakob, the floor is yours! 
> 

I have spent some cycles on this subject and came to the surprising 
conclusion that writing Stax _pipelines_ is actually rather complex. 

A Stax transformer pulls events from the previous component in the
pipeline, which removes the need for the complex state machinery often
needed for SAX (push) transformers by turning it into a simple
function call stack and local variables. This is the main advantage of
Stax over SAX.

But how does a transformer expose its result to the next component in 
the chain so that this next component can also pull events in the Stax 
style? 

When it produces an event, a Stax transformer should put this event 
somewhere so that it can be pulled and processed by the next component. 
But pulling also means the transformer does not suspend its execution 
since it continues pulling events from the previous component. This is 
actually reflected in the Stax API which provides a pull-based 
XMLStreamReader, but only a very SAX-like XMLStreamWriter. 

So a Stax transformer is actually a pull input / push output component. 

To allow the next component in the pipeline to also be pull-based, there
are 3 solutions (at least this is what I came up with):

Buffering 
--------- 
The XMLStreamWriter that the transformer writes to buffers all events
in a data structure similar to our XMLByteStreamCompiler, which can be
used as an XMLStreamReader by the next component in the chain. The
pipeline object then has to call some execute() method on every
component in the pipeline in sequence, after having provided them with
the proper buffer-based reader and writer.

Execution is single-threaded, which fits well with all the non-threadsafe
classes and threadlocals we usually have in web applications,
but it requires buffering and thus somewhat defeats the purpose of
stream-based processing, and it can make processing large documents
simply impossible.

Note however that because it is single-threaded, we can work with two 
buffers (one for input, one for output) that are reused whatever the 
number of components in the pipeline. 

Multithreading 
-------------- 
Each component of the pipeline runs in a separate thread, and writes its 
output into an event queue that is consumed asynchronously by the next 
component in the pipeline. The event queue is presented as an 
XMLStreamReader to the next component. 

This approach requires very little buffering (and we can even have an
upper bound on the event queue size). It also makes good use of the
parallel processing capabilities of multi-core CPUs, although in web apps
the parallelism is also handled by concurrent HTTP requests. This is
typically the approach that would be used with Erlang or Scala actors.

Multithreading has some issues though, since the servlet API more or 
less implies that a single thread processes the request and we may have 
some concurrency issues. Web app developers also take single threading 
as a basic assumption and use threadlocals here and there. 

This approach also prevents the reuse of char[] buffers as is usually
done by XML parsers, since events are processed asynchronously. All char[]
have to be copied, but this is a minor issue.

Continuations 
------------- 
When a transformer sends an event to the next component in the chain, 
its execution is suspended and captured in a continuation. The 
continuation of the next pipeline component is resumed until it has 
consumed the event. We then switch back to the current component until 
it produces an event, etc, etc. 

This approach is single-threaded and so avoids the concurrency issues
mentioned above, and also avoids buffering. But there is certainly a
high overhead from the large number of continuation captures and resumes.
This number can be reduced, though, if we have some level of buffering to
allow processing of several events in one capture/resume cycle.

It also requires all the bytecode of transformers to be instrumented for
continuations, which in itself adds quite some memory and processing
overhead. Torsten also posted on this subject quite a long time ago [1].


Conclusion 
---------- 
All things considered, I came to the conclusion that a full Stax
pipeline either requires buffering to be reliable (but then we're no
longer streaming), or requires very careful inspection of all components
for multi-threading issues.

So in the end, Stax probably has to be considered as a helper _inside_ a 
component to ease processing : buffer all SAX input, then pull the 
received events to avoid complex state automata. 

Looks like I'm in a "long mail" period and I hope I haven't lost anybody 
here :-) 

So, what do you think? 

Sylvain 

[1] http://vafer.org/blog/20060807003609 

-- 
Sylvain Wallez - http://bluxte.net 


Re: [cocoon3] Stax Pipelines

Posted by Sylvain Wallez <sy...@apache.org>.
Reinhard Pötz wrote:
> I've had Stax pipelines on my radar for a rather long time because I
> think that Stax can simplify the writing of transformers a lot.
> I proposed this idea to Alexander Schatten, an assistant professor at
> the Vienna University of Technology and he then proposed it to his
> students.
>
> A group of four students accepted to work on this as part of their
> studies. Steven and I are coaching this group from October to January
> and the goal is to support Stax pipeline components in Cocoon 3.
>
> So far the students learned more about Cocoon 3, Sax, Stax and did some
> performance comparisons. This week we've entered the phase where the
> students have to work on the actual Stax pipeline implementation.
>
> I asked the students to introduce themselves and also to present the
> current ideas of how to implement Stax pipelines. So Andreas, Killian,
> Michael and Jakob, the floor is yours!
>   

I have spent some cycles on this subject and came to the surprising 
conclusion that writing Stax _pipelines_ is actually rather complex.

A Stax transformer pulls events from the previous component in the
pipeline, which removes the need for the complex state machinery often
needed for SAX (push) transformers by turning it into a simple
function call stack and local variables. This is the main advantage of
Stax over SAX.

But how does a transformer expose its result to the next component in 
the chain so that this next component can also pull events in the Stax 
style?

When it produces an event, a Stax transformer should put this event 
somewhere so that it can be pulled and processed by the next component. 
But pulling also means the transformer does not suspend its execution 
since it continues pulling events from the previous component. This is 
actually reflected in the Stax API which provides a pull-based 
XMLStreamReader, but only a very SAX-like XMLStreamWriter.

So a Stax transformer is actually a pull input / push output component.

To allow the next component in the pipeline to also be pull-based, there
are 3 solutions (at least this is what I came up with):

Buffering
---------
The XMLStreamWriter that the transformer writes to buffers all events
in a data structure similar to our XMLByteStreamCompiler, which can be
used as an XMLStreamReader by the next component in the chain. The
pipeline object then has to call some execute() method on every
component in the pipeline in sequence, after having provided them with
the proper buffer-based reader and writer.

Execution is single-threaded, which fits well with all the non-threadsafe
classes and threadlocals we usually have in web applications,
but it requires buffering and thus somewhat defeats the purpose of
stream-based processing, and it can make processing large documents
simply impossible.

Note however that because it is single-threaded, we can work with two 
buffers (one for input, one for output) that are reused whatever the 
number of components in the pipeline.
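
A rough sketch of this buffering scheme with two reused buffers might look as follows. Everything here is an assumption for illustration: the Component interface and its execute() signature are hypothetical, and plain lists of XMLEvent objects stand in for the buffer-based reader and writer:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.events.XMLEvent;

class BufferedPipeline {
    /** Hypothetical component contract: pull from in, push to out. */
    interface Component {
        void execute(Iterator<XMLEvent> in, List<XMLEvent> out);
    }

    static final XMLEventFactory EVENTS = XMLEventFactory.newInstance();

    /** Runs the components one after another on the calling thread. */
    static List<XMLEvent> run(List<XMLEvent> source, List<Component> components) {
        List<XMLEvent> input = new ArrayList<>(source);
        List<XMLEvent> output = new ArrayList<>();
        for (Component component : components) {
            output.clear();
            component.execute(input.iterator(), output);
            // Swap: this component's output becomes the next one's input,
            // so only two buffers exist however long the pipeline is.
            List<XMLEvent> swap = input;
            input = output;
            output = swap;
        }
        return input;
    }

    /** Example component: upper-cases character data, passes the rest through. */
    static Component upperCaseText() {
        return (in, out) -> {
            while (in.hasNext()) {
                XMLEvent event = in.next();
                out.add(event.isCharacters()
                        ? EVENTS.createCharacters(
                                event.asCharacters().getData().toUpperCase(Locale.ROOT))
                        : event);
            }
        };
    }

    /** Concatenates the character data of a buffer, for inspection. */
    static String textOf(List<XMLEvent> events) {
        StringBuilder text = new StringBuilder();
        for (XMLEvent event : events) {
            if (event.isCharacters()) text.append(event.asCharacters().getData());
        }
        return text.toString();
    }
}
```

Each component sees the complete output of its predecessor before it starts, which is exactly why the whole document ends up in memory.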

Multithreading
--------------
Each component of the pipeline runs in a separate thread, and writes its 
output into an event queue that is consumed asynchronously by the next 
component in the pipeline. The event queue is presented as an 
XMLStreamReader to the next component.

This approach requires very little buffering (and we can even have an
upper bound on the event queue size). It also makes good use of the
parallel processing capabilities of multi-core CPUs, although in web apps
the parallelism is also handled by concurrent HTTP requests. This is
typically the approach that would be used with Erlang or Scala actors.

Multithreading has some issues though, since the servlet API more or 
less implies that a single thread processes the request and we may have 
some concurrency issues. Web app developers also take single threading 
as a basic assumption and use threadlocals here and there.

This approach also prevents the reuse of char[] buffers as is usually
done by XML parsers, since events are processed asynchronously. All char[]
have to be copied, but this is a minor issue.
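
The multithreaded variant could be sketched roughly like this, assuming bounded java.util.concurrent queues between the components. The class ThreadedPipeline, the bound of 16 and the 5-second stall timeout are arbitrary choices for illustration, and error propagation between threads is left out. XMLEvent objects are passed around rather than a stream reader's internal buffers, which is where the copying mentioned above comes in:

```java
import java.io.StringReader;
import java.util.Locale;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class ThreadedPipeline {
    static final int QUEUE_BOUND = 16; // upper bound on buffered events

    /** Pulls one event, failing instead of hanging if upstream has died. */
    static XMLEvent take(BlockingQueue<XMLEvent> queue) throws InterruptedException {
        XMLEvent event = queue.poll(5, TimeUnit.SECONDS);
        if (event == null) throw new IllegalStateException("upstream component stalled");
        return event;
    }

    /** Generator: parses the document and pushes events into its queue. */
    static Thread generator(String xml, BlockingQueue<XMLEvent> out) {
        return new Thread(() -> {
            try {
                XMLEventReader reader = XMLInputFactory.newInstance()
                        .createXMLEventReader(new StringReader(xml));
                while (reader.hasNext()) out.put(reader.nextEvent());
            } catch (XMLStreamException | InterruptedException e) {
                throw new IllegalStateException(e);
            }
        });
    }

    /** Transformer: pulls from one queue, pushes upper-cased text to the next. */
    static Thread upperCase(BlockingQueue<XMLEvent> in, BlockingQueue<XMLEvent> out) {
        return new Thread(() -> {
            XMLEventFactory events = XMLEventFactory.newInstance();
            try {
                XMLEvent event;
                do {
                    event = take(in);
                    out.put(event.isCharacters()
                            ? events.createCharacters(
                                    event.asCharacters().getData().toUpperCase(Locale.ROOT))
                            : event);
                } while (!event.isEndDocument());
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        });
    }

    /** Wires generator -> transformer -> caller and collects the text. */
    static String run(String xml) throws InterruptedException {
        BlockingQueue<XMLEvent> parsed = new ArrayBlockingQueue<>(QUEUE_BOUND);
        BlockingQueue<XMLEvent> transformed = new ArrayBlockingQueue<>(QUEUE_BOUND);
        Thread first = generator(xml, parsed);
        Thread second = upperCase(parsed, transformed);
        first.start();
        second.start();
        StringBuilder text = new StringBuilder();
        XMLEvent event;
        do {
            event = take(transformed);
            if (event.isCharacters()) text.append(event.asCharacters().getData());
        } while (!event.isEndDocument());
        first.join();
        second.join();
        return text.toString();
    }
}
```

The end-of-document event doubles as the shutdown signal for each stage, so no separate sentinel object is needed in this sketch.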

Continuations
-------------
When a transformer sends an event to the next component in the chain, 
its execution is suspended and captured in a continuation. The 
continuation of the next pipeline component is resumed until it has 
consumed the event. We then switch back to the current component until 
it produces an event, etc, etc.

This approach is single-threaded and so avoids the concurrency issues
mentioned above, and also avoids buffering. But there is certainly a
high overhead from the large number of continuation captures and resumes.
This number can be reduced, though, if we have some level of buffering to
allow processing of several events in one capture/resume cycle.

It also requires all the bytecode of transformers to be instrumented for
continuations, which in itself adds quite some memory and processing
overhead. Torsten also posted on this subject quite a long time ago [1].


Conclusion
----------
All things considered, I came to the conclusion that a full Stax
pipeline either requires buffering to be reliable (but then we're no
longer streaming), or requires very careful inspection of all components
for multi-threading issues.

So in the end, Stax probably has to be considered as a helper _inside_ a 
component to ease processing : buffer all SAX input, then pull the 
received events to avoid complex state automata.

Looks like I'm in a "long mail" period and I hope I haven't lost anybody 
here :-)

So, what do you think?

Sylvain

[1] http://vafer.org/blog/20060807003609

-- 
Sylvain Wallez - http://bluxte.net


Re: [cocoon3] Stax Pipelines

Posted by Steven Dolg <st...@indoqa.com>.
David Crossley wrote:
> Reinhard Pötz wrote:
>   
>> I've had Stax pipelines on my radar for a rather long time because I
>> think that Stax can simplify the writing of transformers a lot.
>> I proposed this idea to Alexander Schatten, an assistant professor at
>> the Vienna University of Technology and he then proposed it to his
>> students.
>>     
>
> This is a fantastic moment for open source.
> Thanks to all involved. More of this style of
> collaboration please.
>
> -David
>
>   
Thank you, David.

I believe it was Werner Guttmann from Castor who started to do this.
Since he's a good friend and colleague we gained some insights about how 
it worked out and lots of details about the promising results they produced.
He suggested we should do the same with Cocoon.

I'm really glad we took the time to prepare two proposals and I'm pretty 
sure we will continue to do this - provided the feedback is good and 
there are students willing to work with us, of course... ;-)

Steven


Re: [cocoon3] Stax Pipelines

Posted by David Crossley <cr...@apache.org>.
Reinhard Pötz wrote:
> 
> I've had Stax pipelines on my radar for a rather long time because I
> think that Stax can simplify the writing of transformers a lot.
> I proposed this idea to Alexander Schatten, an assistant professor at
> the Vienna University of Technology and he then proposed it to his
> students.

This is a fantastic moment for open source.
Thanks to all involved. More of this style of
collaboration please.

-David

Re: [cocoon3] Stax Pipelines

Posted by Grzegorz Kossakowski <gr...@tuffmail.com>.
Reinhard Pötz wrote:
> I've had Stax pipelines on my radar for a rather long time because I
> think that Stax can simplify the writing of transformers a lot.
> I proposed this idea to Alexander Schatten, an assistant professor at
> the Vienna University of Technology and he then proposed it to his
> students.
> 
> A group of four students accepted to work on this as part of their
> studies. Steven and I are coaching this group from October to January
> and the goal is to support Stax pipeline components in Cocoon 3.
> 
> So far the students learned more about Cocoon 3, Sax, Stax and did some
> performance comparisons. This week we've entered the phase where the
> students have to work on the actual Stax pipeline implementation.
> 
> I asked the students to introduce themselves and also to present the
> current ideas of how to implement Stax pipelines. So Andreas, Killian,
> Michael and Jakob, the floor is yours!

Wow, what a surprise, Reinhard!

I won't comment on the actual proposal until I hear some details from the research that the students have
made, but I would like to say "Thank you" to all the people involved in this effort.

It's nice to hear that there will be more students involved in Cocoon!

-- 
Best regards,
Grzegorz Kossakowski