Posted to dev@cocoon.apache.org by Michael Seydl <mi...@gmail.com> on 2008/12/24 10:52:04 UTC

[C3] StAX research revealed!

Hi all!

One more mail for the student group! Behind this lurid topic hides our 
evaluation of the latest XML processing technologies regarding their 
usability in Cocoon3 (especially whether they are suited to be used in 
a streaming pipeline).
As is commonly known, we decided to use StAX as our weapon of choice 
for XML processing, but this paper should explain the whys and hows and 
especially the path we took to reach our decision, which resulted in 
using that very API.
At eleven pages it shouldn't be too big a read, and it contains all the 
necessary links to the APIs we evaluated, along with our two cents on 
each API we examined. In conclusion, we also tried to show the 
difference between the currently used SAX API and the StAX API we propose.

I hope this work sheds some light on our decision making and that 
someone dares to read it.

That's from me, I wish you all a pleasant and very merry Christmas!

Regards,
Michael Seydl

Re: [C3] StAX research revealed!

Posted by Steven Dolg <st...@indoqa.com>.
Andreas Pieber wrote:
> On Sunday 28 December 2008 08:07:57 Steven Dolg wrote:
>   
>> Sylvain Wallez wrote:
>>     
>>> Andreas Pieber wrote:
>>>       
>>>> On Saturday 27 December 2008 10:36:07 Sylvain Wallez wrote:
>>>>         
>>>>> Michael Seydl wrote:
>>>>>           
>>>>>> Hi all!
>>>>>>
>>>>>> One more mail for the student group! Behind this lurid topic hides our
>>>>>> evaluation of the latest XML processing technologies regarding their
>>>>>> usability in Cocoon3 (especially whether they are suited to be used in
>>>>>> a streaming pipeline).
>>>>>> As is commonly known, we decided to use StAX as our weapon of choice
>>>>>> for XML processing, but this paper should explain the whys and hows
>>>>>> and especially the path we took to reach our decision, which resulted
>>>>>> in using that very API.
>>>>>> At eleven pages it shouldn't be too big a read, and it contains all
>>>>>> the necessary links to the APIs we evaluated, along with our two cents
>>>>>> on each API we examined. In conclusion, we also tried to show the
>>>>>> difference between the currently used SAX API and the StAX API we
>>>>>> propose.
>>>>>>
>>>>>> I hope this work sheds some light on our decision making and that
>>>>>> someone dares to read it.
>>>>>>
>>>>>> That's from me, I wish you all a pleasant and very merry Christmas!
>>>>>>
>>>>>> Regards,
>>>>>> Michael Seydl
>>>>>>             
>>>>> Good work and an interesting read, but I don't agree with some of its
>>>>> statements!
>>>>>
>>>>> The big if/else or switch statements mentioned as a drawback of the
>>>>> cursor API (XMLStreamReader) in 1.2.4 also apply to the event API,
>>>>> since
>>>>> it provides abstract events whose type also needs to be inspected to
>>>>> decide what to do.
>>>>>           
>>>> Of course, you're right!
>>>>
>>>>         
>>>>> The drawbacks of the stream API compared to the event API are, as you
>>>>> mention, that some methods of XMLStreamReader will throw an exception
>>>>> depending on the current event's type and that the event is not
>>>>> represented as a data structure that can be passed directly to the next
>>>>> element in the pipeline or stored in an event buffer.
>>>>>
>>>>> The first point (exceptions) should not happen, unless the code is
>>>>> buggy
>>>>> and tries to get information that doesn't belong to the context. I have
>>>>> used many times the cursor API and haven't found any usability problems
>>>>> with it.
>>>>>           
>>>> Here you're right as well, but IMHO it is not necessary to add another
>>>> source of bugs if not required...
>>>>         
>>> Well, there are so many other sources of bugs... I wouldn't sacrifice
>>> efficiency for bad usage of an API. And when dealing with XML, people
>>> should know that e.g. calling getAttribute() for a text event is
>>> meaningless.
>>>
>>>       
>>>>> The second point (lack of data structure) can be easily solved by using
>>>>> an XMLEventAllocator [1] that creates an XMLEvent from the current
>>>>> state
>>>>> of an XMLStreamReader.
>>>>>           
>>>> Hmm, but if we use an XMLEventAllocator, why not directly use the
>>>> StAX event API?
>>>>         
>>> Sorry, I wasn't clear: *if* an XMLEvent is needed, then it's easy to
>>> get it from a stream.
>>>
>>>       
>>>>> The event API has the major drawback of always creating a new object
>>>>> for
>>>>> every event (since as the javadoc says "events may be cached and
>>>>> referenced after the parse has completed"). This can lead to a big
>>>>> strain on the memory system and garbage collection on a busy
>>>>> application.
>>>>>           
>>>> That's right, but keeping in mind that we want to create a pull pipe,
>>>> where the serializer pulls each event from the producer through each
>>>> transformer and writes it to an output stream, we have no other
>>>> possibility than creating an object for each event.
>>>>
>>>> Think about it a little more in detail. To be able to pull each event
>>>> you have to have the possibility to call a method looking like:
>>>>
>>>> Object next();
>>>>
>>>> on the parent of the pipelineComponent. Doing it the StAX cursor way
>>>> means increasing the complexity from one method to 10 or more which
>>>> have to be available through the parent...
>>>>         
>>> Not necessarily, depending on how the API is designed. Let's give it a
>>> try:
>>>
>>> /** A generator can pull events from somewhere and writes them to an
>>> output */
>>> interface Generator {
>>>    /** Do we still have something to produce? */
>>>    boolean hasNext();
>>>
>>>    /** Do some processing and produce some output */
>>>    void pull(XMLStreamWriter output);
>>> }
>>>
>>> /** A transformer is a generator that has an XML input */
>>> interface Transformer extends Generator {
>>>    void setInput(XMLStreamReader input);
>>> }
>>>
>>> class StaxFIFO implements XMLStreamReader, XMLStreamWriter {
>>>    Generator generator;
>>>    StaxFIFO(Generator generator) {
>>>        this.generator = generator;
>>>    }
>>>
>>>    // Implement all XMLStreamWriter methods as writing to an
>>>    // internal stream FIFO buffer
>>>
>>>    // Implement all XMLStreamReader methods as reading from an
>>>    // internal stream FIFO buffer, except hasNext() below:
>>>
>>>    boolean hasNext() {
>>>        while (eventBufferIsEmpty() && generator.hasNext()) {
>>>            // Ask the generator to produce some events
>>>            generator.pull(this);
>>>        }
>>>        return !eventBufferIsEmpty();
>>>    }
>>> }
>>>
>>> Building and executing a pipeline is then rather simple:
>>>
>>> class Pipeline {
>>>    Generator generator;
>>>    Transformer transformers[];
>>>    XMLStreamWriter serializer;
>>>
>>>    void execute() {
>>>        Generator last = generator;
>>>        for (Transformer tr : transformers) {
>>>            tr.setInput(new StaxFIFO(last));
>>>            last = tr;
>>>        }
>>>
>>>        // Pull from the whole chain to the serializer
>>>        while(last.hasNext()) {
>>>            last.pull(serializer);
>>>        }
>>>    }
>>> }
>>>
>>> Every component gets an XMLStreamWriter to write its output to
>>> (in a style equivalent to SAX), and transformers get an
>>> XMLStreamReader to read their input from.
>>>
>>> The programming model is then very simple: for every call to pull(),
>>> read something in, process it and produce the corresponding output
>>> (optional, since end of processing is defined by hasNext()). The
>>> buffers used to connect the various components allow pull() to read
>>> and process a set of related events, resulting in any number of events
>>> being written to the buffer.
>>>
>>>       
>>>>> So the cursor API is the most efficient IMO when it comes to consuming
>>>>> data, since it doesn't require creating useless event objects.
>>>>>
>>>>> Now in a pipeline context, we will want to transmit events untouched
>>>>> from one component to the next one, using some partial buffering as
>>>>> mentioned in earlier discussions. A FIFO of XMLEvent object seems to be
>>>>> the natural solution for this, but would require the use of events at
>>>>> the pipeline API level, with their associated costs mentioned above.
>>>>>           
>>>> I'm not sure if I get the point here, but we do not want to
>>>> "transmit" events. They are pulled. Therefore in most cases we simply
>>>> do not need a buffer, since events can be directly returned.
>>>>         
>>> In the previous discussions, it was considered that pulling from the
>>> previous component would lead that component to process in one pull
>>> call a number of related events, to avoid state handling that would
>>> make it even more complex than SAX. And processing these related
>>> events will certainly mean returning ("transmitting" as I said)
>>> several events. So buffering *is* needed in most non-trivial cases.
>>>
>>>       
>>>>> So what should be used for pipelines? My impression is that we should
>>>>> stick to the most efficient API and build the simple tools needed to
>>>>> buffer events from a StreamReader, taking inspiration from the
>>>>> XMLBytestreamCompiler we already have.
>>>>>           
>>>> Maybe some events could be avoided using the cursor API, but IMO the
>>>> performance we could get is not worth the simplicity we sacrifice...
>>>>         
>>> I don't agree that we sacrifice simplicity. With the above, the
>>> developer only deals with XMLStreamWriter and XMLStreamReader objects,
>>> and never has to implement them.
>>>
>>> We just need an efficient StaxFIFO class, which people shouldn't care
>>> about since it is completely hidden in the Pipeline object.
>>>
>>> Thoughts?
>>>       
>> Well, the approach outlined above will certainly work.
>>
>> Basically you're providing a buffer between every pair of components and
>> fill it as needed.
>> But you need to implement both XMLStreamWriter and XMLStreamReader and
>> optimize that for any possible thing a transformer might do.
>> In order to buffer all the data from the components you will have to
>> create some objects as well - I guess you will end up with something
>> like the XMLEvent and maintain a list of them in the StaxFIFO.
>> That's why I think an efficient (as in faster than the Event API)
>> implementation of the StaxFIFO is difficult to make.
>>
>> On the other hand I do think that the cursor API is quite a bit harder
>> to use.
>> As stated in the Javadoc of XMLStreamReader it is the lowest level for
>> reading XML data - which usually means more logic in the code using the
>> API and more knowledge in the head of the developer reading/writing the
>> code is required.
>> So I second Andreas' statement that we will sacrifice simplicity for (a
>> small amount of ?) performance.
>>
>>
>> The other thing is that - at least the way you suggested - we would need
>> a special implementation of the Pipeline interface.
>> That is something that compromises the intention behind having a
>> Pipeline API.
>> Right now we can use the new StAX components and simply put them into
>> any of the Pipeline implementations we already have.
>> Sacrificing this is completely out of the question IMO.
>>
>>     
>
> I did a little (and quite dirty) implementation of Sylvain's ideas to be able to 
> see all the advantages and drawbacks of such an approach. I came to the 
> following conclusions:
>
> First of all, we do not need to change the interfaces of the pipeline API. It is 
> possible to do it in quite a similar way as we did it for the StAX event 
> iteration API.
>
> Writing code with the streaming API makes things a little bit more complicated 
> than with the XMLEvent object. Instead of working with three methods and an 
> object you have to handle x methods (but this is my personal opinion).
>
> I'm with Steven that you'll end up with a list of XMLEvent-like objects in the 
> StAXFIFO buffer (my prototype does :) ). And this makes things worse... The 
> tasks which could/would be handled by such an API can be reduced to the 
> following cases: removing XMLEvents (nodes, whatever...), adding XMLEvents, 
> changing XMLEvents and simply letting them through. With the approach we are 
> using at the moment you're only required to buffer events in the case of adding 
> them. In all other cases a buffer is NOT required (thanks to the navigator idea 
> :) ). And that's what makes a StreamReader-StreamWriter StAXFifo-buffering 
> approach worse than the XMLEvent approach. Think about a situation with x 
> transformers and an XML document where 90% of the nodes simply don't have to be 
> handled. In this case, for each transformer, an object has to be created in the 
> buffer for each untouched node, whereas one object is enough in the XMLEvent 
> object approach.
>
> To sum it up: we add (OK, not too much, but we do) another layer of complexity 
> without increasing performance (contrary to how it may look at first glance). 
> IMHO we'll end up with more created objects than in an event-iterator approach.
>   
I am still convinced the iterator API is the way to go.
The recommendations from Sun (as cited in the PDF attached to the initial 
mail) IMO clearly point in that direction:
    * If you are programming for a particularly memory-constrained 
environment, like J2ME, you can make smaller, more efficient code with 
the cursor API.
    * If performance is your highest priority--for example, when 
creating low-level libraries or infrastructure--the cursor API is more 
efficient.
    * If you want to create XML processing pipelines, use the iterator API.
    * If you want to modify the event stream, use the iterator API.
    * If you want your application to be able to handle pluggable 
processing of the event stream, use the iterator API.
    * In general, if you do not have a strong preference one way or the 
other, using the iterator API is recommended because it is more flexible 
and extensible, thereby "future-proofing" your applications.

We're not operating in a particularly memory-constrained environment.
IMO performance is not the *highest* priority (it is important of course).

But we want to make processing pipelines, containing pluggable 
components in order to modify the event stream and use StAX as a more 
flexible and intuitive *alternative* to SAX.
If you're really all about performance you should probably stick with 
SAX anyway.
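
To make the contrast concrete, here is a minimal sketch (illustrative 
only, not code from the paper) reading the same document with both 
APIs - the cursor API queries the reader itself, while the iterator API 
hands you one event object at a time:

import java.io.StringReader;
import javax.xml.stream.*;
import javax.xml.stream.events.XMLEvent;

public class StaxStyles {
    public static void main(String[] args) throws XMLStreamException {
        String xml = "<root><item>text</item></root>";
        XMLInputFactory factory = XMLInputFactory.newInstance();

        // Cursor API: the reader *is* the current event; you query it.
        XMLStreamReader cursor =
                factory.createXMLStreamReader(new StringReader(xml));
        while (cursor.hasNext()) {
            if (cursor.next() == XMLStreamConstants.START_ELEMENT) {
                System.out.println("start: " + cursor.getLocalName());
            }
        }

        // Iterator API: each event is a standalone object that can be
        // stored, buffered or passed to the next pipeline component.
        XMLEventReader events =
                factory.createXMLEventReader(new StringReader(xml));
        while (events.hasNext()) {
            XMLEvent event = events.nextEvent();
            if (event.isStartElement()) {
                System.out.println("start: "
                        + event.asStartElement().getName().getLocalPart());
            }
        }
    }
}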


Well, we can still do a shootout between the cursor and the iterator 
API at a later time.
Actually that would be really interesting and demonstrate one of the 
most important aspects (IMO) about Cocoon 3:
Fill your pipelines with whatever you want - and if the components are 
nice and sweet others might like them as well.

> Andreas
>
>   
>> Steven
>>
>>     
>>> Sylvain
>>>       


Re: [C3] StAX research revealed!

Posted by Andreas Pieber <an...@schmutterer-partner.at>.
On Sunday 28 December 2008 08:07:57 Steven Dolg wrote:
> Sylvain Wallez wrote:
> > Andreas Pieber wrote:
> >> On Saturday 27 December 2008 10:36:07 Sylvain Wallez wrote:
> >>> Michael Seydl wrote:
> >>>> Hi all!
> >>>>
> >>>> One more mail for the student group! Behind this lurid topic hides our
> >>>> evaluation of the latest XML processing technologies regarding their
> >>>> usability in Cocoon3 (especially whether they are suited to be used in
> >>>> a streaming pipeline).
> >>>> As is commonly known, we decided to use StAX as our weapon of choice
> >>>> for XML processing, but this paper should explain the whys and hows
> >>>> and especially the path we took to reach our decision, which resulted
> >>>> in using that very API.
> >>>> At eleven pages it shouldn't be too big a read, and it contains all
> >>>> the necessary links to the APIs we evaluated, along with our two cents
> >>>> on each API we examined. In conclusion, we also tried to show the
> >>>> difference between the currently used SAX API and the StAX API we
> >>>> propose.
> >>>>
> >>>> I hope this work sheds some light on our decision making and that
> >>>> someone dares to read it.
> >>>>
> >>>> That's from me, I wish you all a pleasant and very merry Christmas!
> >>>>
> >>>> Regards,
> >>>> Michael Seydl
> >>>
> >>> Good work and an interesting read, but I don't agree with some of its
> >>> statements!
> >>>
> >>> The big if/else or switch statements mentioned as a drawback of the
> >>> cursor API (XMLStreamReader) in 1.2.4 also apply to the event API,
> >>> since
> >>> it provides abstract events whose type also needs to be inspected to
> >>> decide what to do.
> >>
> >> Of course, you're right!
> >>
> >>> The drawbacks of the stream API compared to the event API are, as you
> >>> mention, that some methods of XMLStreamReader will throw an exception
> >>> depending on the current event's type and that the event is not
> >>> represented as a data structure that can be passed directly to the next
> >>> element in the pipeline or stored in an event buffer.
> >>>
> >>> The first point (exceptions) should not happen, unless the code is
> >>> buggy
> >>> and tries to get information that doesn't belong to the context. I have
> >>> used many times the cursor API and haven't found any usability problems
> >>> with it.
> >>
> >> Here you're right as well, but IMHO it is not necessary to add another
> >> source of bugs if not required...
> >
> > Well, there are so many other sources of bugs... I wouldn't sacrifice
> > efficiency for bad usage of an API. And when dealing with XML, people
> > should know that e.g. calling getAttribute() for a text event is
> > meaningless.
> >
> >>> The second point (lack of data structure) can be easily solved by using
> >>> an XMLEventAllocator [1] that creates an XMLEvent from the current
> >>> state
> >>> of an XMLStreamReader.
> >>
> >> Hmm, but if we use an XMLEventAllocator, why not directly use the
> >> StAX event API?
> >
> > Sorry, I wasn't clear: *if* an XMLEvent is needed, then it's easy to
> > get it from a stream.
> >
> >>> The event API has the major drawback of always creating a new object
> >>> for
> >>> every event (since as the javadoc says "events may be cached and
> >>> referenced after the parse has completed"). This can lead to a big
> >>> strain on the memory system and garbage collection on a busy
> >>> application.
> >>
> >> That's right, but keeping in mind that we want to create a pull pipe,
> >> where the serializer pulls each event from the producer through each
> >> transformer and writes it to an output stream, we have no other
> >> possibility than creating an object for each event.
> >>
> >> Think about it a little more in detail. To be able to pull each event
> >> you have to have the possibility to call a method looking like:
> >>
> >> Object next();
> >>
> >> on the parent of the pipelineComponent. Doing it the StAX cursor way
> >> means increasing the complexity from one method to 10 or more which
> >> have to be available through the parent...
> >
> > Not necessarily, depending on how the API is designed. Let's give it a
> > try:
> >
> > /** A generator can pull events from somewhere and writes them to an
> > output */
> > interface Generator {
> >    /** Do we still have something to produce? */
> >    boolean hasNext();
> >
> >    /** Do some processing and produce some output */
> >    void pull(XMLStreamWriter output);
> > }
> >
> > /** A transformer is a generator that has an XML input */
> > interface Transformer extends Generator {
> >    void setInput(XMLStreamReader input);
> > }
> >
> > class StaxFIFO implements XMLStreamReader, XMLStreamWriter {
> >    Generator generator;
> >    StaxFIFO(Generator generator) {
> >        this.generator = generator;
> >    }
> >
> >    // Implement all XMLStreamWriter methods as writing to an
> >    // internal stream FIFO buffer
> >
> >    // Implement all XMLStreamReader methods as reading from an
> >    // internal stream FIFO buffer, except hasNext() below:
> >
> >    boolean hasNext() {
> >        while (eventBufferIsEmpty() && generator.hasNext()) {
> >            // Ask the generator to produce some events
> >            generator.pull(this);
> >        }
> >        return !eventBufferIsEmpty();
> >    }
> > }
> >
> > Building and executing a pipeline is then rather simple:
> >
> > class Pipeline {
> >    Generator generator;
> >    Transformer transformers[];
> >    XMLStreamWriter serializer;
> >
> >    void execute() {
> >        Generator last = generator;
> >        for (Transformer tr : transformers) {
> >            tr.setInput(new StaxFIFO(last));
> >            last = tr;
> >        }
> >
> >        // Pull from the whole chain to the serializer
> >        while(last.hasNext()) {
> >            last.pull(serializer);
> >        }
> >    }
> > }
> >
> > Every component gets an XMLStreamWriter to write its output to
> > (in a style equivalent to SAX), and transformers get an
> > XMLStreamReader to read their input from.
> >
> > The programming model is then very simple: for every call to pull(),
> > read something in, process it and produce the corresponding output
> > (optional, since end of processing is defined by hasNext()). The
> > buffers used to connect the various components allow pull() to read
> > and process a set of related events, resulting in any number of events
> > being written to the buffer.
> >
> >>> So the cursor API is the most efficient IMO when it comes to consuming
> >>> data, since it doesn't require creating useless event objects.
> >>>
> >>> Now in a pipeline context, we will want to transmit events untouched
> >>> from one component to the next one, using some partial buffering as
> >>> mentioned in earlier discussions. A FIFO of XMLEvent object seems to be
> >>> the natural solution for this, but would require the use of events at
> >>> the pipeline API level, with their associated costs mentioned above.
> >>
> >> I'm not sure if I get the point here, but we do not want to
> >> "transmit" events. They are pulled. Therefore in most cases we simply
> >> do not need a buffer, since events can be directly returned.
> >
> > In the previous discussions, it was considered that pulling from the
> > previous component would lead that component to process in one pull
> > call a number of related events, to avoid state handling that would
> > make it even more complex than SAX. And processing these related
> > events will certainly mean returning ("transmitting" as I said)
> > several events. So buffering *is* needed in most non-trivial cases.
> >
> >>> So what should be used for pipelines? My impression is that we should
> >>> stick to the most efficient API and build the simple tools needed to
> >>> buffer events from a StreamReader, taking inspiration from the
> >>> XMLBytestreamCompiler we already have.
> >>
> >> Maybe some events could be avoided using the cursor API, but IMO the
> >> performance we could get is not worth the simplicity we sacrifice...
> >
> > I don't agree that we sacrifice simplicity. With the above, the
> > developer only deals with XMLStreamWriter and XMLStreamReader objects,
> > and never has to implement them.
> >
> > We just need an efficient StaxFIFO class, which people shouldn't care
> > about since it is completely hidden in the Pipeline object.
> >
> > Thoughts?
>
> Well, the approach outlined above will certainly work.
>
> Basically you're providing a buffer between every pair of components and
> fill it as needed.
> But you need to implement both XMLStreamWriter and XMLStreamReader and
> optimize that for any possible thing a transformer might do.
> In order to buffer all the data from the components you will have to
> create some objects as well - I guess you will end up with something
> like the XMLEvent and maintain a list of them in the StaxFIFO.
> That's why I think an efficient (as in faster than the Event API)
> implementation of the StaxFIFO is difficult to make.
>
> On the other hand I do think that the cursor API is quite a bit harder
> to use.
> As stated in the Javadoc of XMLStreamReader it is the lowest level for
> reading XML data - which usually means more logic in the code using the
> API and more knowledge in the head of the developer reading/writing the
> code is required.
> So I second Andreas' statement that we will sacrifice simplicity for (a
> small amount of ?) performance.
>
>
> The other thing is that - at least the way you suggested - we would need
> a special implementation of the Pipeline interface.
> That is something that compromises the intention behind having a
> Pipeline API.
> Right now we can use the new StAX components and simply put them into
> any of the Pipeline implementations we already have.
> Sacrificing this is completely out of the question IMO.
>

I did a little (and quite dirty) implementation of Sylvain's ideas to be able to 
see all the advantages and drawbacks of such an approach. I came to the 
following conclusions:

First of all, we do not need to change the interfaces of the pipeline API. It is 
possible to do it in quite a similar way as we did it for the StAX event 
iteration API.

Writing code with the streaming API makes things a little bit more complicated 
than with the XMLEvent object. Instead of working with three methods and an 
object you have to handle x methods (but this is my personal opinion).
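
To illustrate the difference, here is a hypothetical sketch (not the 
actual Cocoon code, just the shape of the two upstream contracts; 
XMLEvent and XMLStreamException come from javax.xml.stream):

// With the event/iterator API the upstream contract of a component
// is tiny: one object carries the complete state of an event.
interface EventParent {
    boolean hasNext();
    XMLEvent nextEvent() throws XMLStreamException;
}

// With the cursor API the same contract explodes into the many
// accessors of XMLStreamReader, which must all stay consistent
// with the current event:
interface CursorParent {
    boolean hasNext();
    int next() throws XMLStreamException; // event type only...
    String getLocalName();                // ...plus one method per piece of state
    int getAttributeCount();
    String getAttributeValue(int index);
    String getText();
    // ... and so on for namespaces, processing instructions, etc.
}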

I'm with Steven that you'll end up with a list of XMLEvent-like objects in the 
StAXFIFO buffer (my prototype does :) ). And this makes things worse... The 
tasks which could/would be handled by such an API can be reduced to the 
following cases: removing XMLEvents (nodes, whatever...), adding XMLEvents, 
changing XMLEvents and simply letting them through. With the approach we are 
using at the moment you're only required to buffer events in the case of adding 
them. In all other cases a buffer is NOT required (thanks to the navigator idea 
:) ). And that's what makes a StreamReader-StreamWriter StAXFifo-buffering 
approach worse than the XMLEvent approach. Think about a situation with x 
transformers and an XML document where 90% of the nodes simply don't have to be 
handled. In this case, for each transformer, an object has to be created in the 
buffer for each untouched node, whereas one object is enough in the XMLEvent 
object approach.
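
To illustrate the pass-through case, a small sketch (assumed, simplified 
names - not the actual prototype): an untouched event is simply returned, 
so the same XMLEvent instance travels through every transformer, and only 
events a transformer actually changes cost new objects.

import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

abstract class PassThroughTransformer {
    private PassThroughTransformer parent; // the upstream component

    void setParent(PassThroughTransformer parent) { this.parent = parent; }

    boolean hasNext() { return parent.hasNext(); }

    XMLEvent nextEvent() throws XMLStreamException {
        XMLEvent event = parent.nextEvent();
        // Untouched events are forwarded as-is: no copy, no buffer entry.
        return isInteresting(event) ? transform(event) : event;
    }

    abstract boolean isInteresting(XMLEvent event);
    abstract XMLEvent transform(XMLEvent event) throws XMLStreamException;
}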

To sum it up: we add (OK, not too much, but we do) another layer of complexity 
without increasing performance (contrary to how it may look at first glance). 
IMHO we'll end up with more created objects than in an event-iterator approach.

Andreas

>
> Steven
>
> > Sylvain
-- 
SCHMUTTERER+PARTNER Information Technology GmbH

Hiessbergergasse 1
A-3002 Purkersdorf

T   +43 (0) 69911127344
F   +43 (2231) 61899-99
mail to: andreas.pieber@schmutterer-partner.at

Re: [C3] StAX research revealed!

Posted by Reinhard Pötz <re...@apache.org>.
Steven Dolg wrote:
> Sylvain Wallez wrote:
>> <snip/>
>>
>> Steven Dolg wrote:
>>> Basically you're providing a buffer between every pair of components
>>> and fill it as needed.
>>
>> Yes. Now this buffer will always contain a very limited number of
>> events, corresponding to the result of processing an amount of input
>> data that is convenient to process at once to avoid complex state
>> management (e.g. an <i18:text> tag with all its children). And so most
>> often, this buffer will contain just one event.
>>
>> Think of it as being just a bridge between the writer view used by a
>> producer and the reader view used by its consumer. These are in my
>> opinion the most convenient views to write StAX components.
>>
>>> But you need to implement both XMLStreamWriter and XMLStreamReader
>>> and optimize that for any possible thing a transformer might do.
>>> In order to buffer all the data from the components you will have to
>>> create some objects as well - I guess you will end up with something
>>> like the XMLEvent and maintaining a list of them in the StaxFIFO.
>>> That's why I think an efficient (as in faster than the Event API) 
>>> implementation of the StaxFIFO is difficult to make.
>>
>> It's certainly less trivial than maintaining a list of events, but
>> should be doable quite efficiently by using an int FIFO (to store
>> event types and attribute counts) and a String FIFO (for everything
>> else). I'll try to find a couple of hours to prototype this.
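
(For illustration, a rough sketch of such an int/String FIFO encoding - 
simplified and assumed, not the prototype Sylvain describes:)

import java.util.ArrayDeque;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class PrimitiveStaxBuffer {
    // int FIFO: event type plus attribute count per event
    private final ArrayDeque<int[]> ints = new ArrayDeque<int[]>();
    // String FIFO: names, attribute values and text, in order
    private final ArrayDeque<String> strings = new ArrayDeque<String>();

    void write(XMLStreamReader in) {
        switch (in.getEventType()) {
        case XMLStreamConstants.START_ELEMENT:
            int attrs = in.getAttributeCount();
            ints.add(new int[] { XMLStreamConstants.START_ELEMENT, attrs });
            strings.add(in.getLocalName());
            for (int i = 0; i < attrs; i++) {
                strings.add(in.getAttributeLocalName(i));
                strings.add(in.getAttributeValue(i));
            }
            break;
        case XMLStreamConstants.CHARACTERS:
            ints.add(new int[] { XMLStreamConstants.CHARACTERS, 0 });
            strings.add(in.getText());
            break;
        // ... remaining event types elided
        }
    }
}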
>>
>>> On the other hand I do think that the cursor API is quite a bit
>>> harder to use.
>>> As stated in the Javadoc of XMLStreamReader it is the lowest level
>>> for reading XML data - which usually means more logic in the code
>>> using the API and more knowledge in the head of the developer
>>> reading/writing the code is required.
>>> So I second Andreas' statement that we will sacrifice simplicity for
>>> (a small amount of ?) performance.
>>
>> I understand your point, even if I don't totally agree :-) Now it
>> should be mentioned that even with events, my proposal still
>> stands: just replace XMLStream{Reader|Writer} with
>> XMLEvent{Reader|Writer}.
>>
>>> The other thing is that - at least the way you suggested - we would
>>> need a special implementation of the Pipeline interface.
>>> That is something that compromises the intention behind having a
>>> Pipeline API.
>>> Right now we can use the new StAX components and simply put them into
>>> any of the Pipeline implementations we already have.
>>> Sacrificing this is completely out of the question IMO.
>>
>> Actually, I'm wondering if wanting a single API is not wishful
>> thinking and will in the end lead to something that is overly abstract
>> and hence difficult to understand and use, or where underlying
>> implementations will leak into the high-level abstraction.
>>
>> There is already some impedance mismatch appearing between pull and
>> push in the code:
>> - a StAXGenerator has to call initiatePullProcessing() on its
>> consumer, which in turn will have to call it on its own consumer, etc
>> until we reach the Finisher that will finally start pulling events.
>> This moves a responsibility that belongs to the pipeline down to its
>> components.
> Well I don't see the problem with that.
> From the pipeline's point of view those are normal components just like
> all the other.
> The pipeline was never intended to "care" about the internals of the
> components - so why bother that the StAXGenerator calls
> "initiatePullProcessing" on its consumer instead of calling some other
> method like e.g. "startDocument".
> 
>> - an AbstractStAXProducer only accepts a StAXConsumer, defeating the
>> idea of a unified pipeline implementation that will accept everything.
> The idea was to have pipelines being capable of processing virtually any
> data.
> But that is not the same as combining components in an arbitrary way,
> e.g. there is no sense in linking a FileGenerator with a (not yet
> existing) ImageTransformer based on Java's Imaging API.
> 
> The components must be "compatible" - that is they must understand the
> data they exchange with each other.
> We may however provide some adapters/converters to make certain "types"
> of components compatible, e.g. SAX <--> StAX.
> 
>>
>> So we should either have several APIs specifically tailored to the
>> underlying push or pull model, or make sure the unified API and its
>> implementations accept any kind of component and set the appropriate
>> conversion bridges between them.
> As I tried to state above: that will not be possible for every
> conceivable combination of components.
> At least not when thinking beyond XML - which I do.

Steven was faster than me but his comments are the same as those I wanted
to provide.

-- 
Reinhard Pötz                           Managing Director, {Indoqa} GmbH
                         http://www.indoqa.com/en/people/reinhard.poetz/

Member of the Apache Software Foundation
Apache Cocoon Committer, PMC member                  reinhard@apache.org
________________________________________________________________________

Re: [C3] StAX research revealed!

Posted by Reinhard Pötz <re...@apache.org>.
Grzegorz Kossakowski wrote:
> I just wanted to say that I share many points with Sylvain even if
> it's still hard for me to say which option I prefer.
> 
> This will require some more thinking...

As I said in http://cocoon.markmail.org/message/kkblwes6xcj3wog7 - now
you have the chance to comment on concrete code by looking at the
cocoon-stax module. More samples will follow next week.

-- 
Reinhard Pötz                           Managing Director, {Indoqa} GmbH
                         http://www.indoqa.com/en/people/reinhard.poetz/

Member of the Apache Software Foundation
Apache Cocoon Committer, PMC member                  reinhard@apache.org
________________________________________________________________________

Re: [C3] StAX research revealed!

Posted by Sylvain Wallez <sy...@apache.org>.
Jakob Spörk wrote:
> Hello,
>
> I just want to give my thoughts on the unified pipeline and data conversion
> topic. In my opinion, the pipeline can't do the data conversion, because it
> has no information about how to do this. Let's take a simple example: We
> have a pipeline processing XML documents that describe images. The first
> components process this XML data while the rest of the components do
> operations on the actual image. Now the question is: who will transform the
> XML data to image data in the middle of the pipeline? 
>
> I believe the pipeline cannot do this, because it simply does not know how to
> transform; that's a custom operation. You would need a component
> that is on the one hand an XML consumer and on the other hand an image
> producer. Providing some automatic data conversions directly in the pipeline
> may help developers that need exactly these default cases, but I believe it
> would be harder for people requiring custom data conversions (and those are
> most of the cases).
>   

Absolutely. The discussion was about having the pipeline automate the 
connection of components that deal with the same data, but with 
different representations of it. Think XML data represented as SAX, 
StAX, DOM or even text, and binary data represented as byte[], 
InputStream, OutputStream or NIO buffers.

Let's consider your example. We can have:
- an XML producer that outputs SAX events
- an XML transformer that pulls StAX events and writes SVG as StAX events 
in an XMLStreamWriter
- an SVG serializer that takes a DOM and renders it as a JPEG image on 
an output stream
- and finally an image transformer that adds a watermark to the image, 
reading an input stream and writing on an output stream.

The pipeline must not have the responsibility of transforming data from 
one paradigm to another (i.e. an XML document to a JPEG image) because 
the way to do that highly depends on the application. But the pipeline 
should allow component developers to use whatever representation of 
that data best fits their needs, and allow the user not to care about 
the actual data representation as long as the components that are added 
to the pipeline are "compatible" (i.e. StAX, SAX and DOM are 
compatible). This can be achieved by adding the necessary transcoding 
bridges between components. And if such a bridge does not exist, then we 
can throw an exception because the pipeline is obviously incorrect.
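
For illustration, here is a minimal sketch of one such transcoding 
bridge, using only the standard JAXP identity transformer (the class 
name is hypothetical):

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class SaxToDomBridge {
    // Runs the SAX producer and collects its events into a new DOM
    // document, which a DOM-consuming component can then use.
    Document bridge(XMLReader saxProducer, InputSource input) throws Exception {
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult();
        identity.transform(new SAXSource(saxProducer, input), result);
        return (Document) result.getNode();
    }
}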

Note that XML is a quite unique area where components can let data 
flow in one single direction through them (i.e. a SAX consumer producing 
SAX events). Most components that deal with binary data pull their input 
and push their output, which is actually exactly what Unix pipes do 
(read from stdin, write to stdout). So wanting a universal pipeline API 
that also works with binary data requires addressing the push/pull 
conversion problem.

Sylvain

-- 
Sylvain Wallez - http://bluxte.net


Re: [C3] StAX research revealed!

Posted by Grzegorz Kossakowski <gr...@tuffmail.com>.
Grzegorz Kossakowski wrote:
<snip/>

> I've been thinking about generic but at the same time type-safe pipelines for some time. I've designed them on paper and
> everything looked quite promising. Then I moved to implementing my ideas and got a rather disappointing result, which
> can be seen here:
> http://github.com/gkossakowski/cocoonpipelines/tree/master
> 
> The most interesting files are:
> http://github.com/gkossakowski/cocoonpipelines/tree/master/src/org/apache/cocoon/pipeline/Pipeline.java (generic and
> type-safe pipeline interface)
> 
> http://github.com/gkossakowski/cocoonpipelines/tree/master/src/org/apache/cocoon/pipeline/PipelineComponent.java
> (generic and type-safe component def.)
> 
> http://github.com/gkossakowski/cocoonpipelines/tree/master/src/org/apache/cocoon/pipeline/demo/RunPipeline.java
> (shows how to use that thing)

Argh, GitHub seems to have some problems with displaying uploaded repositories.
Here are new links:
http://github.com/gkossakowski/cocoonpipelines2/tree/master/src/org/apache/cocoon/pipeline/Pipeline.java

http://github.com/gkossakowski/cocoonpipelines2/tree/master/src/org/apache/cocoon/pipeline/PipelineComponent.java

http://github.com/gkossakowski/cocoonpipelines2/tree/master/src/org/apache/cocoon/pipeline/demo/RunPipeline.java


I hope that this time they will work for a little bit longer.

-- 
Best regards,
Grzegorz Kossakowski

Re: [C3] StAX research revealed!

Posted by Steven Dolg <st...@indoqa.com>.
Grzegorz Kossakowski wrote:
> Jakob Spörk wrote:
>   
>> Hello,
>>     
>
> Hello Jakob,
>
>   
>> I just want to give my thoughts on the unified pipeline and data conversion
>> topic. In my opinion, the pipeline can't do the data conversion, because it
>> has no information about how to do this. Let's take a simple example: We
>> have a pipeline processing XML documents that describe images. The first
>> components process this XML data while the rest of the components do
>> operations on the actual image. Now the question is: who will transform the
>> XML data to image data in the middle of the pipeline? 
>>     
>
> I agree with you that pipeline implementation should not handle data conversion because there is no generic way to
> handle it.
>
> Now I would like to answer your question: it should be another /pipeline component/ that handles data conversion.
>
>   
>> I believe the pipeline cannot do this, because it simply does not know how to
>> transform; that's a custom operation. You would need a component
>> that is on the one hand an XML consumer and on the other hand an image
>> producer. Providing some automatic data conversions directly in the pipeline
>> may help developers that need exactly these default cases, but I believe it
>> would be harder for people requiring custom data conversions (and those are
>> most of the cases).
>>
>> The actual architecture allows fitting any components into the pipeline, and
>> only the components themselves have to know if they can work with their
>> predecessor or the component following them. That allows the most flexibility
>> when thinking about any possible conversions. If a pipeline should do this,
>> you would need "plug-ins" for the pipeline that are registered and allow the
>> pipeline to do the conversions. But then, it is the responsibility of the
>> developer to register the right conversion plug-ins, and you would get
>> new problems if a pipeline requires two different converters from the same
>> to the same data type, because the pipeline cannot automatically have the
>> information which converter to use in which situation.
>>     
>
> I believe that these problems could be addressed by... the compiler. In my opinion, pipelines should be type-safe, which
> basically means that for a given pipeline fragment you know what it expects as input and what kind of output it
> gives you. The same goes for components. This eliminates the "flexibility" of having a component that accepts more than
> one kind of input or more than one kind of output. I believe that having more than one input or output type only adds to
> complexity and does not solve any problem.
>
> If a component were to accept more than one kind of input, how could a user know the list of accepted inputs? I guess
> the only way to find out would be checking the source and looking for all "instanceof" statements in its code.
>   
The same way as in Cocoon 2.2, I guess.
Users have to know that a FileReader must not be followed by any 
component, that the Serializer must be the last component of the 
pipeline and the Generator the first component.
Currently users don't need to actually read the source code to find that 
out and I don't see why this would need to change.


Of course the user of a pipeline needs to know which components he uses 
and he needs to know which combinations of components actually make sense.
But I also do expect him to know what the components he selected do and 
whether they are compatible or not.
It's not like we're building SAX components that cannot be combined with 
each other or that some StAX components won't work with some other StAX 
component.

That image data represented as a bunch of bytes cannot be passed to a 
SAX transformer is something I expect someone using Cocoon to know.
Just as I expect a certain knowledge of relational databases from someone 
using an O/R mapper.
> I would prefer a situation where components have a well-defined type of input and output, and if you want to combine
> components for which input-output pairs do not match you should add converters as intermediate components.
>
> I've been thinking about generic but at the same time type-safe pipelines for some time. I've designed them on paper and
> everything looked quite promising. Then I moved to implementing my ideas and got a rather disappointing result, which
> can be seen here:
> http://github.com/gkossakowski/cocoonpipelines/tree/master
>
> The most interesting files are:
> http://github.com/gkossakowski/cocoonpipelines/tree/master/src/org/apache/cocoon/pipeline/Pipeline.java (generic and
> type-safe pipeline interface)
>
> http://github.com/gkossakowski/cocoonpipelines/tree/master/src/org/apache/cocoon/pipeline/PipelineComponent.java
> (generic and type-safe component def.)
>
> http://github.com/gkossakowski/cocoonpipelines/tree/master/src/org/apache/cocoon/pipeline/demo/RunPipeline.java
> (shows how to use that thing)
>   
The URLs above only return "Nothing to see here yet. Move along."...
Am I doing something wrong?
>   
>> The only thing Cocoon can help with here is to provide as many "standard"
>> converters for use as possible, but it is still the responsibility of the
>> developer to use the right ones.
>>     
>
> I think Cocoon could define a much better, type-safe Pipeline API, but we are in the unfortunate situation of using a
> language that makes it extremely hard to express this kind of generic solution.
>
> Of course, I would like to be proven wrong and shown that Java is powerful enough to let us express our ideas and solve
> our problems. 
Actually I'm not sure which problems those are - as I'm sure we all have 
slightly different views on all this.
Some of the suggestions are actually hard for me to comprehend since I 
do not know which problem(s) they are trying to address.

I agree that we should try to avoid sources of mistakes as much as we can.
But trying to build a fail-proof API usually causes more harm than good IMO.

> Actually, the whole idea of a pipeline is not rocket science as it's, in essence, just ordinary function
> composition. The only unique property of pipelines I can see is that we want access to the _partial_ results of pipeline
> execution so we can make it streamable.
>   
What "_partial_ results" would you like to get from the pipeline?
And what for?
> This became more of a brain-dump than a real answer to your e-mail, Jakob, but I hope you (and others) got my point.
>
>   


Re: [C3] Pipeline component event types

Posted by Carsten Ziegeler <cz...@apache.org>.
Reinhard Pötz wrote:
> 
> The situation is similar to Spring which allows the wiring of components
> that don't fit together. As long as I get proper error messages I don't
> really have a problem with it.
Now, I think Spring config can be a nightmare, but that's a different
problem :)
Yes, proper error messages are a must.

> Currently we have 3 different pipeline implementations (noncaching,
> caching, async-caching). Your approach would multiply these three
> implementations with the content specific implementations that we would
> have to maintain separately. This doesn't sound like a promising approach.
Now, the impl should be shareable; I totally agree that having to maintain
9 (or more) implementations that vary only a little bit is a nightmare.

Carsten
-- 
Carsten Ziegeler
cziegeler@apache.org

Re: [C3] Pipeline component event types

Posted by Reinhard Pötz <re...@apache.org>.
Carsten Ziegeler wrote:
> Reinhard Pötz wrote:
>>>> Agreed. How do you know what kind of wrapper you need if you don't
>>>> know what kind of events components consume and produce?
>> My assumption is that the developer that uses the pipeline knows what he
>> does.
> :) While this assumption *should* be true, we all know that in most
> cases it is not. So I fear many people will stumble across this problem.

The situation is similar to Spring which allows the wiring of components
that don't fit together. As long as I get proper error messages I don't
really have a problem with it.

It might help to introduce a PipelineComponent type:

public interface PipelineComponent<T extends PipelineContentType> {
   ...
}

but also adds to the verbosity of the code. Hmmm.
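
For illustration, a sketch of how such typing could reject mixed 
pipelines at compile time (PipelineContentType and the content marker 
classes are hypothetical names):

interface PipelineContentType {}
final class SaxContent implements PipelineContentType {}
final class StaxContent implements PipelineContentType {}

interface PipelineComponent<T extends PipelineContentType> {
    // component methods elided
}

interface Pipeline<T extends PipelineContentType> {
    // Accepts only components of the matching content type; adding a
    // PipelineComponent<SaxContent> to a Pipeline<StaxContent> would
    // fail to compile.
    void add(PipelineComponent<T> component);
}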

> But I have one question: if we don't allow to mix different event types
> in a single pipeline (and I guess by event types we mean sax, dom, stax)
> why do we have a generic pipeline interface?
> 
> Wouldn't it be better/easier to have a sax pipeline, a dom pipeline, a
> stax pipeline, perhaps sharing a common interface?

Currently we have 3 different pipeline implementations (noncaching,
caching, async-caching). Your approach would multiply these three
implementations with the content specific implementations that we would
have to maintain separately. This doesn't sound like a promising approach.

-- 
Reinhard Pötz                           Managing Director, {Indoqa} GmbH
                         http://www.indoqa.com/en/people/reinhard.poetz/

Member of the Apache Software Foundation
Apache Cocoon Committer, PMC member                  reinhard@apache.org
________________________________________________________________________

Re: [C3] Pipeline component event types

Posted by Grzegorz Kossakowski <gr...@tuffmail.com>.
Carsten Ziegeler wrote:
>>
> wouldn't work - and maybe this wouldn't even end in an exception. The
> pipeline has no knowledge of the possible event types and which event
> types are compatible. So how can it check the validity of this pipeline?

This is delegated to the pipeline components, which have to check whether a particular way of combining components is
correct. This is a reason for my feeling of dissatisfaction.

> Hehe...no, no, it's not about the generator, transformer, serializer and
> how to chain them. For instance, the old cocoon pipeline interface made
> sure that you add the components to the pipeline in the correct order
> (first generator, then transformers, finally serializer).
> Just because something has been in one way or the other for years,
> doesn't mean that it's good and can't be made better/easier. :)

You mean to say here that the old interface didn't make sure of that?

> Yes, but that's you doing the stuff (or other people involved in
> implementing c3). For new users it is a little bit more complicated.
> Ok, granted, maybe I'm overestimating the benefit of the compile time
> check. Don't know.

Carsten, I don't think that a compile time check (or better said, just type safety) can be overestimated. The problem we
are facing is how to abstract out the common bits but still keep everything rather general.

>> But so far no working solution/prototype has been made available (or
>> did I simply miss it?).
> :) No of course you didn't miss it. Maybe I have time to look into this
> next week.

Carsten, before you start hacking anything could you have a look at:
http://github.com/gkossakowski/cocoonpipelines2/tree/6c1234a5d439aeced316723d57b9471cbcdeb0e0/src/org/apache/cocoon/pipeline

Which is some kind of solution/prototype, IMHO.

>> And those that do exist merely scratch the surface, IMO.
>> There are still things like component order: Generator, Generator,
>> Transformer, Serializer, Generator is clearly not a valid pipeline, even
>> if all components are actually using the same technology.
> Yes, but this can be checked immediately when the component is added to
> the pipeline. Of course, this is not a compile time check.

It can be a compile-time check if you define the component's type carefully. Have a look at my discussion of the
PipelineComponent definition in another e-mail.

> Maybe we don't need to change this and just properly naming things (like
> either adding SAX etc. to the class name or as a package) is enough. I
> actually don't know, but I think the easier we make it and the more we
> can check at compile time, the more we attract possible users.

Agreed.

It's not surprising that this subject is discussed so warmly. The Pipeline API is meant to be an essential part of C3 and
we don't want to change it too often, so we need to design it carefully and, more importantly, become aware of the
advantages and disadvantages of the solution we choose.

-- 
Best regards,
Grzegorz Kossakowski

Re: [C3] Pipeline component event types

Posted by Carsten Ziegeler <cz...@apache.org>.
Steven Dolg wrote:
> I don't think I understand what you mean by "same xml transportation".
Ah sorry, this was an attempt to find something better than "event types".

> Currently creating and executing a pipeline looks like:
> 
> Pipeline pipeline = ...;
> pipeline.add(new StAXGenerator(inputURL));
> pipeline.add(new StAXTransformer(myParameter));
> pipeline.add(new StAXSerializer());
> 
> pipeline.setup(outputStream);
> pipeline.execute();
> 
> It is the same when using SAX (except different components, of course)
> or when processing any other type of data (I think I should really go
> and build that Imaging module...).
Yes, sure - but:
> Pipeline pipeline = ...;
> pipeline.add(new StAXGenerator(inputURL));
> pipeline.add(new SAXTransformer(myParameter));
> pipeline.add(new StAXSerializer());
>
wouldn't work - and maybe this wouldn't even end in an exception. The
pipeline has no knowledge of the possible event types and which event
types are compatible. So how can it check the validity of this pipeline?

> Again I have the feeling that some concepts that Cocoon has for years
> now (Pipelines look like Generator -> Transformer -> Serializer; make
> sure the sitemap makes sense; know what the components actually do) are
> all of a sudden too complicated for any user to apply them safely
> without reading the source code.
> But that may be just me...
Hehe...no, no, it's not about the generator, transformer, serializer and
how to chain them. For instance, the old cocoon pipeline interface made
sure that you add the components to the pipeline in the correct order
(first generator, then transformers, finally serializer).
Just because something has been in one way or the other for years,
doesn't mean that it's good and can't be made better/easier. :)

> This simple switching of components works right now!
> While creating the StAX components we often exchanged the components
> (again all of them, not individual ones - SAX and StAX are not
> compatible without adapters)
> to compare the results from SAX and StAX.
Yes, that's my point - you have to change all components in the pipeline.

> IMO demanding that the user also selects the correct pipeline type for
> his choice of components - that need to be compatible with each other
> *and* the pipeline - is actually more than just demanding that he
> chooses components that are compatible with each other and guaranteeing
> that any (correctly implemented) pipeline will do the job.
Hmm, the user has to select the correct pipeline type, yes, but I don't
think that this is more complicated - and it would allow compile time
checking of the components.

> The components need to communicate with each other nonetheless. The
> additional "Are you compatible with me?" check is fairly easy compared
> to that (for StAX that are 3 lines of code in one abstract base class).
Hmm, ok.

> Being able to use the same Pipeline implementations (even the rather
> sophisticated asynchronous caching pipeline) has saved us probably days
> of work and that compatibility check took no more than 5 minutes once
> the interfaces for the components (which we needed anyway) were defined.
Yes, but that's you doing the stuff (or other people involved in
implementing c3). For new users it is a little bit more complicated.
Ok, granted, maybe I'm overestimating the benefit of the compile time
check. Don't know.

>> So we have something like:
>> public interface Pipeline<T extends PipelineComponent>
>>
>> and have sub interfaces for PipelineComponent for sax, dom, stax?
>>   This ensures a single implementation but gives compile-time
>> checks.
>>   
> Well that would be nice of course.
> But so far no working solution/prototype has been made available (or
> did I simply miss it?).
:) No of course you didn't miss it. Maybe I have time to look into this
next week.

> 
> And those that do exist merely scratch the surface, IMO.
> There are still things like component order: Generator, Generator,
> Transformer, Serializer, Generator is clearly not a valid pipeline, even
> if all components are actually using the same technology.
Yes, but this can be checked immediately when the component is added to
the pipeline. Of course, this is not a compile time check.


Maybe we don't need to change this and just properly naming things (like
either adding SAX etc. to the class name or as a package) is enough. I
actually don't know, but I think the easier we make it and the more we
can check at compile time, the more we attract possible users.

Carsten
-- 
Carsten Ziegeler
cziegeler@apache.org

Re: [C3] Pipeline component event types

Posted by Steven Dolg <st...@indoqa.com>.
Carsten Ziegeler wrote:
> Steven Dolg wrote:
>   
>> Carsten Ziegeler wrote:
>>     
>>> Wouldn't it be better/easier to have a sax pipeline, a dom pipeline, a
>>> stax pipeline, perhaps sharing a common interface?
>>>   
>>>       
>> From my point of view:
>> Currently the user must know which components he needs (as in "I want to
>> process XML and I'd like to do it with StAX").
>> As soon as he knows this, he just selects the components (either existing
>> or newly created) and puts them in *any* pipeline (caching/noncaching/etc.)
>>     
> But the user needs to choose the same xml transportation for all
> components in the pipeline, being it directly or through wrappers.
>   
I don't think I understand what you mean by "same xml transportation".
Currently creating and executing a pipeline looks like:

Pipeline pipeline = ...;
pipeline.add(new StAXGenerator(inputURL));
pipeline.add(new StAXTransformer(myParameter));
pipeline.add(new StAXSerializer());

pipeline.setup(outputStream);
pipeline.execute();

It is the same when using SAX (except different components, of course) 
or when processing any other type of data (I think I should really go 
and build that Imaging module...).

Again I have the feeling that some concepts that Cocoon has for years 
now (Pipelines look like Generator -> Transformer -> Serializer; make 
sure the sitemap makes sense; know what the components actually do) are 
all of a sudden too complicated for any user to apply them safely 
without reading the source code.
But that may be just me...
>   
>> If there were multiple, content-specific Pipelines he would still need to
>> know which components to use, but also which type of Pipeline.
>> If he feels the need to change to SAX (so a switch in the "event type" -
>> IMO a sub-optimal term, since not every component actually passes nice
>> events like StAX does) he also needs to change the Pipeline.
>> This may seem easy now, but imagine a larger system. Changing the
>> pipeline type can be challenging there.
>>     
> From what I understand so far in this discussion this simple switching
> does not work (or is not intended to be implemented - which is fine for
> me).
This simple switching of components works right now!
While creating the StAX components we often exchanged the components 
(again all of them, not individual ones - SAX and StAX are not 
compatible without adapters)
to compare the results from SAX and StAX.

>  So besides switching the pipeline implementation you have to
> switch all components or at least add matching wrappers around them.
>   
I'm not proposing content-specific pipeline implementations.

IMO demanding that the user also selects the correct pipeline type for 
his choice of components - that need to be compatible with each other 
*and* the pipeline - is actually more than just demanding that he 
chooses components that are compatible with each other and guaranteeing 
that any (correctly implemented) pipeline will do the job.

Since this is about making it easier for the user, I think content 
specific pipelines actually cause the opposite effect, because they add 
another element to be considered.

On top of that it also makes introducing new content types harder since 
one has to provide a pipeline implementation (and possibly interface), 
too.
The components need to communicate with each other nonetheless. The 
additional "Are you compatible with me?" check is fairly easy compared 
to that (for StAX that is 3 lines of code in one abstract base class).
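
For illustration, roughly what that check looks like - a sketch with
stand-in interfaces, not the actual C3 source:

interface PipelineComponent { }
interface StAXConsumer extends PipelineComponent { }
interface StAXProducer extends PipelineComponent {
    void setConsumer(PipelineComponent consumer);
}

abstract class AbstractStAXProducer implements StAXProducer {

    private StAXConsumer consumer;

    // the whole "Are you compatible with me?" check
    public void setConsumer(PipelineComponent consumer) {
        if (!(consumer instanceof StAXConsumer)) {
            throw new IllegalArgumentException(
                    "StAX producers only accept StAX consumers");
        }
        this.consumer = (StAXConsumer) consumer;
    }

    protected StAXConsumer getConsumer() {
        return this.consumer;
    }
}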

Being able to use the same Pipeline implementations (even the rather 
sophisticated asynchronous caching pipeline) has saved us probably days 
of work and that compatibility check took no more than 5 minutes once 
the interfaces for the components (which we needed anyway) were defined.
>   
>> And what about automatically generated pipelines (e.g. the sitemap).
>> This will be so much harder as you have to collect and analyze all
>> components first before you can actually build the pipeline to use.
>>     
> I think you have to do this anyway - as not all components fit together.
>   
Currently we have no need to select the pipeline specifically for the 
components to be used (in fact the sitemap builds the pipeline before 
even knowing the components).
If the components are really not compatible this will be detected before 
the pipeline is actually executed (in the setup phase).
>   
>> Defining a common interface for different pipeline types does not really
>> change this.
>> If the common interface is sufficient for handling and operating the
>> pipeline they are exchangable (from the callers point of view) and
>> provide the same environment we have now.
>> If the common interface is not sufficient for handling and operating the
>> pipeline it is merely a marker interface and it probably wouldn't make
>> much difference. (Although it is still useful for declaring parameter
>> types, etc.)
>>
>>
>> I may be biased here ;-) but I have yet to see the benefits of different
>> pipeline types...
>>     
> :) I have the same feeling for the opposite...if I can't just mix dom
> with sax and maybe stax, then why follow this generic approach?
>
> Hmm, maybe generics could help?
> So we have something like:
> public interface Pipeline<T extends PipelineComponent>
>
> and have sub interfaces for PipelineComponent for sax, dom, stax?
>   
> This ensures a single implementation but gives compile-time checks.
>   
Well that would be nice of course.
But so far no working solution/prototype has been made available (or 
did I simply miss it?).

And those that have been made merely scratch the surface IMO.
There are still things like component order: Generator, Generator, 
Transformer, Serializer, Generator is clearly not a valid pipeline, even 
if all components are actually using the same technology.

> Carsten
>   


Re: [C3] Pipeline component event types

Posted by Carsten Ziegeler <cz...@apache.org>.
Steven Dolg wrote:
> Carsten Ziegeler schrieb:
>> Wouldn't it be better/easier to have a sax pipeline, a dom pipeline, a
>> stax pipeline, perhaps sharing a common interface?
>>   
> From my point of view:
> Currently the user must know which components he needs (as in "I want to
> process XML and I'd like to do it with StAX").
> As soon as he knows this, he just selects the components (either existing
> or newly created) and puts them in *any* pipeline (caching/noncaching/etc.)
But the user needs to choose the same xml transportation for all
components in the pipeline, be it directly or through wrappers.

> If there were multiple, content-specific Pipelines he still needs to
> know which components, but also which type of Pipeline.
> If he feels the need to change to SAX (so a switch in the "event type" -
> IMO a sub-optimal term, since not every component actually passes nice
> events like StAX does) he also needs to change the Pipeline.
> This may seem easy now, but imagine a larger system. Changing the
> pipeline type can be challenging there.
From what I understand so far in this discussion this simple switching
does not work (or is not intended to be implemented - which is fine for
me). So besides switching the pipeline implementation you have to
switch all components or at least add matching wrappers around them.

> And what about automatically generated pipelines (e.g. the sitemap).
> This will be so much harder as you have to collect and analyze all
> components first before you can actually build the pipeline to use.
I think you have to do this anyway - as not all components fit together.

> Defining a common interface for different pipeline types does not really
> change this.
> If the common interface is sufficient for handling and operating the
> pipeline they are exchangable (from the callers point of view) and
> provide the same environment we have now.
> If the common interface is not sufficient for handling and operating the
> pipeline it is merely a marker interface and it probably wouldn't make
> much difference. (Although it is still useful for declaring parameter
> types, etc.)
> 
> 
> I may be biased here ;-) but I have yet to see the benefits of different
> pipeline types...
:) I have the same feeling for the opposite...if I can't just mix dom
with sax and maybe stax, then why follow this generic approach?

Hmm, maybe generics could help?
So we have something like:
public interface Pipeline<T extends PipelineComponent>

and have sub interfaces for PipelineComponent for sax, dom, stax?

This ensures a single implementation but gives compile-time checks.
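
Just to sketch the idea (all names made up, not a worked-out proposal):

import java.io.OutputStream;

interface PipelineComponent { }
interface SAXPipelineComponent extends PipelineComponent { }
interface StAXPipelineComponent extends PipelineComponent { }

interface Pipeline<T extends PipelineComponent> {
    void addComponent(T component);
    void setup(OutputStream output);
    void execute() throws Exception;
}

// usage: the type parameter pins the whole pipeline to one event type
// Pipeline<StAXPipelineComponent> p = new CachingPipeline<StAXPipelineComponent>();
// p.addComponent(new StAXGenerator(inputURL));  // fine
// p.addComponent(new SAXTransformer());         // compile error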

Carsten
-- 
Carsten Ziegeler
cziegeler@apache.org

Re: [C3] Pipeline component event types

Posted by Steven Dolg <st...@indoqa.com>.
Carsten Ziegeler schrieb:
> Reinhard Pötz wrote:
>   
>>>> Agreed. How do you know what kind of wrapper do you need if you don't
>>>> know what kind of events components consume and produce?
>>>>         
>> My assumption is that the developer that uses the pipeline knows what he
>> does.
>>     
> :) While this assumption *should* be true, we all know that in most
> cases it is not. So I fear many people will stumble across this problem.
>
> But I have one question: if we don't allow to mix different event types
> in a single pipeline (and I guess by event types we mean sax, dom, stax)
> why do we have a generic pipeline interface?
>
> Wouldn't it be better/easier to have a sax pipeline, a dom pipeline, a
> stax pipeline, perhaps sharing a common interface?
>   
From my point of view:
Currently the user must know which components he needs (as in "I want to 
process XML and I'd like to do it with StAX").
As soon as he knows this, he just selects the components (either existing 
or newly created) and puts them in *any* pipeline (caching/noncaching/etc.)
Done!
If he feels that he needs that little extra performance and can handle 
it in SAX as well: just change the components - Done!


If there were multiple, content-specific Pipelines he still needs to 
know which components, but also which type of Pipeline.
If he feels the need to change to SAX (so a switch in the "event type" - 
IMO a sub-optimal term, since not every component actually passes nice 
events like StAX does) he also needs to change the Pipeline.
This may seem easy now, but imagine a larger system. Changing the 
pipeline type can be challenging there.

And what about automatically generated pipelines (e.g. the sitemap)?
This will be so much harder as you have to collect and analyze all 
components first before you can actually build the pipeline to use.


Defining a common interface for different pipeline types does not really 
change this.
If the common interface is sufficient for handling and operating the 
pipeline they are exchangeable (from the caller's point of view) and 
provide the same environment we have now.
If the common interface is not sufficient for handling and operating the 
pipeline it is merely a marker interface and it probably wouldn't make 
much difference. (Although it is still useful for declaring parameter 
types, etc.)


I may be biased here ;-) but I have yet to see the benefits of different 
pipeline types...

Steven
> Carsten
>
>   


Re: [C3] Pipeline component event types

Posted by Carsten Ziegeler <cz...@apache.org>.
Reinhard Pötz wrote:
>>> Agreed. How do you know what kind of wrapper do you need if you don't
>>> know what kind of events components consume and produce?
> 
> My assumption is that the developer that uses the pipeline knows what he
> does.
:) While this assumption *should* be true, we all know that in most
cases it is not. So I fear many people will stumble across this problem.

But I have one question: if we don't allow mixing different event types
in a single pipeline (and I guess by event types we mean sax, dom, stax)
why do we have a generic pipeline interface?

Wouldn't it be better/easier to have a sax pipeline, a dom pipeline, a
stax pipeline, perhaps sharing a common interface?

Carsten

-- 
Carsten Ziegeler
cziegeler@apache.org

Re: [C3] Pipeline component event types

Posted by Grzegorz Kossakowski <gr...@tuffmail.com>.
Steven Dolg pisze:
>> It depends how you define being successful. I've managed to express
>> this rather simple idea but the code is horrible
>> thus I consider it as a failure.
>>   
> Well cleaning up working code cannot be that hard - you haven't invested
> years of time, have you ;-)

The problem is that I see no way to clean it up at the moment. I have the feeling that I've just hit limitations of
Java as a language and I can't get rid of this feeling. As I said in my original e-mail I would like to be proven
wrong so any patches are welcome. :-)

You can treat it as a nice exercise in Java generics usage. If you are curious, here are the kinds of limitations I see:
1. I would like to define that PipelineComponent.execute is a method with a signature like: Event|Nothing|Continue ->
Event|Nothing|Continue. By Event|Nothing|Continue I mean a case type where you can pass either a Nothing object, a
Continue object, or an object that extends Event. There are no case types in Java so I had to introduce the interfaces
Continue and Event. But even this does not solve the problem because the Nothing and Continue implementations will not
extend the specific event type that a component accepts. Actually this is not a problem, as conceptually Nothing and
Continue events are completely different cases and should be handled differently. The problem is how to express this
idea in Java in a concise way (see the sketch below).
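
To make point 1 concrete, here is roughly the toy encoding you end up with:

interface PipelineData { }                       // stands in for Event|Nothing|Continue
final class Nothing implements PipelineData { }
final class Continue implements PipelineData { }
interface Event extends PipelineData { }         // concrete event types extend this

interface PipelineComponent<I extends Event, O extends Event> {
    // forced to widen to PipelineData: Nothing and Continue do not extend I or O,
    // so the compiler can no longer check which event types actually flow through
    PipelineData execute(PipelineData input);
}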

2. Have a look at PipelineImpl. There are two subclasses, what an ugliness, right? But try to get rid of them. You would
need something like:

private Pipeline<T, W> pipeline;
private PipelineComponent<W, U> component;

This way we express that the component accepts what the pipeline produces, but W is not defined anywhere. What we really
need is a tuple so you could say something like:
private <W extends Event> (Pipeline<T,W>, PipelineComponent<W, U>) pipelineAndComponent;

Or something like that. Again, I have no idea how to express this in Java in a _concise_ way.
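
The closest workaround I can think of is confining W to a factory method; for
the sketch I simplify execute to a plain I -> O function and ignore
Nothing/Continue and partial results, so this is only half an answer:

interface Event { }

interface Pipeline<I extends Event, O extends Event> {
    O execute(I input);
}

interface PipelineComponent<I extends Event, O extends Event> {
    O execute(I input);
}

final class Step<T extends Event, U extends Event> implements Pipeline<T, U> {

    private final Pipeline<T, U> delegate;

    private Step(Pipeline<T, U> delegate) {
        this.delegate = delegate;
    }

    // W exists only inside this method - the closest Java gets to the tuple
    static <T extends Event, W extends Event, U extends Event> Step<T, U> of(
            final Pipeline<T, W> pipeline, final PipelineComponent<W, U> component) {
        return new Step<T, U>(new Pipeline<T, U>() {
            public U execute(T input) {
                return component.execute(pipeline.execute(input));
            }
        });
    }

    public U execute(T input) {
        return delegate.execute(input);
    }
}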

> That adding a component actually returns a different pipeline is an
> interesting approach but I'm not sure I want to declare a new variable
> for each of the new pipelines.

Yep, that's a valid concern. Actually, this kind of construct is functional-like and at the same time forces you
to use it differently. I don't want to go into details but the main idea is that pipeline construction is
handled by various functions and you are just passing around a partial pipeline without introducing any additional
variables. This is similar to method-chaining (or method combining) in Java with the difference that in functional
languages function combining is perceived as a basic programming technique. As we are probably going to stay in Java I
would like to see if this inspires someone else to come up with a casual Java counterpart.

> And method chaining is not really my idea of readable code.

Depends on your view, but I sort of agree that in most cases it's not readable.

> Also I'm wondering what return type a SAXSerializer would have or what
> event types SAX uses.

We would have to define our own type which simply implements SAX events as simple classes instead of method calls as
it's done in the standard way. I know that it's not the best thing to define our own APIs but the original idea (even if
influenced by performance considerations) of passing events by method calls wasn't that good. Anyway, we have already
had this kind of discussion when the StAX research was discussed.

> Are those event types just for the compiler or are they actually used to
> pass the data around?

They would be used for passing data around. You can see examples implementing the reworked interfaces. For example, if a
serializer produces an output stream then it just emits *one* event called OutputStreamEvent. Or if we want to have
partial results for this kind of serializer it could emit many events that would contain just fragments of the final
output. While we are at partial results, I remember you have already asked about them in some e-mail.

I would like to explain one nice "side-effect" of my design. I'll show how one functional concept - lazy expression
evaluation - can be easily implemented in pipelines. In order to explain it I'll introduce my view on pipelines
and pipeline components.

A pipeline component is just a function f: Event|Nothing|Continue -> Event|Nothing|Continue. Nothing and Continue events
will be explained later. If we have f_1, f_2, ..., f_n, a pipeline is just a function composition:
f_n * ... * f_2 * f_1 = f_n(...(f_2(f_1())))

Now, what makes a pipeline different from ordinary function composition is, in my opinion, that each of the functions can
emit a partial result based on partial input. A partial result/input is just a sequence of events where each is different
from the Continue event. The full result of a function execution is just a sequence of events ended with the Nothing
event (which is a special marking object). The property of returning partial results makes functions (pipeline
components) streamable. This results in, for example, sending the browser fragments of an HTML page as soon as they are
calculated, without waiting for the processing of all events to finish.

If you are wondering how a generator is defined, then it's just a function g: Nothing -> Event|Nothing. This definition
reflects the fact that a generator is a special function that _generates_ events out of nothing, from the pipeline's
point of view. It does not base its output on any incoming events but on some external data source that is unknown to
the pipeline and is out of its focus. Once a generator has emitted all its events, it signals this with Nothing, so its
result is a sequence of events ended with Nothing.
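
Reusing the toy types from the sketch under point 1 above, a generator could
look like this (LineEvent is made up):

import java.util.Iterator;
import java.util.List;

final class LineGenerator {

    static final class LineEvent implements Event {
        final String line;
        LineEvent(String line) { this.line = line; }
    }

    private final Iterator<String> lines;

    LineGenerator(List<String> source) {
        this.lines = source.iterator();
    }

    // g: Nothing -> Event|Nothing
    PipelineData next(Nothing ignored) {
        if (lines.hasNext()) {
            return new LineEvent(lines.next());  // an event out of "nothing"
        }
        return new Nothing();                    // the external source is exhausted
    }
}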

Now let's discuss Continue. This is a helper object that functions can emit in order to express the fact that they need
more input events in order to produce any portion of the result. Think of a transformer that replaces some fragment of
XML with another fragment of XML based on what was in the original fragment. Therefore it has to collect all events
representing the original XML fragment in order to produce new events. Here you can recognize that the word "collecting"
involves some buffering but I won't go into details as I want to focus on other aspects and not implementation details.

Having rather precise definitions before our eyes we can move to the laziness property of pipeline execution. The
definition of function f (pipeline component) does not say precisely when f may emit the Nothing event. Actually, it
wasn't part of the definition, but f must satisfy the property that it emits Nothing after receiving a finite number of
Nothing events (it's the reader's exercise to find out why). This means that f can emit Nothing as a response to any
kind of event.
Let's consider an example:
Pipeline P1: f_1 -> f_2 -> f_3
Pipeline P: P1 -> f_4 -> f_5

Now let's assume that f_1 is a generator, generating a large stream of events from a big XML file or some records. Now
let's assume that f_2 is just a simple function doing some simple transformation like text formatting. Now f_3 is a
query function that has a query defined like: NumberOfRecord() <= 20. This means that in pipeline P1 we want to extract
only the first 20 records of the big file. What f_3 does after consuming 20 records is that it just returns the Nothing
event to say that it's the end of the result for f_3.

For pipeline execution it means that after Nothing is received from f_3 the whole P1 pipeline can be discarded and
execution should continue with f_4 and f_5. It means that the rest of that big XML file won't be read.
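
Sticking to the toy types, f_3 could be as simple as this (RecordEvent is
made up):

final class FirstTwentyRecords {

    static final class RecordEvent implements Event { }

    private int seen = 0;

    // f_3: pass through the first 20 records, then end the result with Nothing
    PipelineData execute(PipelineData input) {
        if (input instanceof Nothing || seen >= 20) {
            return new Nothing();   // signal: P1 can be discarded from here on
        }
        if (input instanceof RecordEvent) {
            seen++;
            return input;           // a partial result, streamed on immediately
        }
        return new Continue();      // not a record yet: need more input
    }
}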

I won't give you a formal definition of laziness but I'm sure you've got my point. With this kind of design of pipelines
we get laziness almost for free, which is a nice addition after all, isn't it?

                                                             ---- o0o ----


Ok, this e-mail got rather lengthy but I had a chance to explain to you how I see Cocoon Pipelines on paper. That was an
occasion for me to introduce to you the concept of lazy evaluation of pipelines. For you it was an opportunity to see
what has influenced my current view on pipeline design.

Thank you for your attention.

-- 
Best regards,
Grzegorz Kossakowski

Re: [C3] Pipeline component event types

Posted by Steven Dolg <st...@indoqa.com>.
Grzegorz Kossakowski schrieb:
> Reinhard Pötz pisze:
>
>   
>> My assumption is that the developer that uses the pipeline knows what he
>> does.
>>     
>
> This is rather good assumption. The problem I can see is that developer has to check sources of each component in order
> to know what he does as components type does not express what kind of output particular component produces.
>
> Or am I missing something?
>
>   
>>>> How component can be sure that next component (its consumer) is the one
>>>> that accepts right type of events? By checking using instanceof?
>>>> My point is that once we agree to have generic pipelines that can take
>>>> components accepting/producing any kind of events then we need to invent
>>>> some mechanism that check if pipeline is built correctly. It shouldn't
>>>> be a concern of a given component.
>>>>
>>>> If we agree on above point, then my suggestion would be to look for a
>>>> way that pipeline-correctness is ensured by compiler.
>>>>         
>> I don't see any way to express this kind of check with Java and AFAICS
>> your experiments haven't been successful either.
>>     
>
> It depends how you define being successful. I've managed to express this rather simple idea but the code is horrible
> thus I consider it as a failure.
>   
Well cleaning up working code cannot be that hard - you haven't invested 
years of time, have you ;-)
> You probably haven't had a chance to look at my code due to broken GitHub. My question is if there is someone more
> clever than me that could come up with something more elegant.
>
> The check we are discussing is assured by Pipeline interface (look at modified addComponent method):
> http://github.com/gkossakowski/cocoonpipelines2/blob/master/src/org/apache/cocoon/pipeline/Pipeline.java
>   

That adding a component actually returns a different pipeline is an 
interesting approach but I'm not sure I want to declare a new variable 
for each of the new pipelines.
And method chaining is not really my idea of readable code.

Also I'm wondering what return type a SAXSerializer would have or what 
event types SAX uses.
Are those event types just for the compiler or are they actually used to 
pass the data around?

>   



Re: [C3] Pipeline component event types

Posted by Grzegorz Kossakowski <gr...@tuffmail.com>.
Reinhard Pötz pisze:

> My assumption is that the developer that uses the pipeline knows what he
> does.

This is a rather good assumption. The problem I can see is that the developer has to check the sources of each component
in order to know what he does, as a component's type does not express what kind of output a particular component
produces.

Or am I missing something?

>>> How component can be sure that next component (its consumer) is the one
>>> that accepts right type of events? By checking using instanceof?
>>> My point is that once we agree to have generic pipelines that can take
>>> components accepting/producing any kind of events then we need to invent
>>> some mechanism that check if pipeline is built correctly. It shouldn't
>>> be a concern of a given component.
>>>
>>> If we agree on above point, then my suggestion would be to look for a
>>> way that pipeline-correctness is ensured by compiler.
> 
> I don't see any way to express this kind of check with Java and AFAICS
> your experiments haven't been successful either.

It depends on how you define being successful. I've managed to express this rather simple idea but the code is horrible,
thus I consider it a failure.

You probably haven't had a chance to look at my code due to GitHub being broken. My question is if there is someone more
clever than me who could come up with something more elegant.

The check we are discussing is assured by the Pipeline interface (look at the modified addComponent method):
http://github.com/gkossakowski/cocoonpipelines2/blob/master/src/org/apache/cocoon/pipeline/Pipeline.java

-- 
Best regards,
Grzegorz Kossakowski

Re: [C3] Pipeline component event types

Posted by Reinhard Pötz <re...@apache.org>.
Grzegorz Kossakowski wrote:
> Grzegorz Kossakowski wrote:
>> Reinhard Pötz wrote:
>>   
>>> I don't believe that pipelines should contain components that support
>>> different event types or that we even need components that have
>>> different input and output events.
>>>   
>>>     
>> What about serializer? Usually, it produces events of a type different
>> from the type of events it consumes.
>>   
>>> If you want to mix your components (e.g. using a SAX component in a
>>> pipeline full of StAX components), you should put your 'alien' component
>>> into a wrapper.
>>>   
>>>     
>> Agreed. How do you know what kind of wrapper do you need if you don't
>> know what kind of events components consume and produce?

My assumption is that the developer that uses the pipeline knows what he
does.

>> How component can be sure that next component (its consumer) is the one
>> that accepts right type of events? By checking using instanceof?
>> My point is that once we agree to have generic pipelines that can take
>> components accepting/producing any kind of events then we need to invent
>> some mechanism that check if pipeline is built correctly. It shouldn't
>> be a concern of a given component.
>>
>> If we agree on above point, then my suggestion would be to look for a
>> way that pipeline-correctness is ensured by compiler.

I don't see any way to express this kind of check with Java and AFAICS
your experiments haven't been successful either.

> One more thing:
> The idea that pipeline does know about event types that components
> process solves the problem with pipeline results as you have different
> event type carrying different data.

-- 
Reinhard Pötz                           Managing Director, {Indoqa} GmbH
                         http://www.indoqa.com/en/people/reinhard.poetz/

Member of the Apache Software Foundation
Apache Cocoon Committer, PMC member                  reinhard@apache.org
________________________________________________________________________

Re: [C3] Pipeline component event types

Posted by Grzegorz Kossakowski <gr...@tuffmail.com>.
Grzegorz Kossakowski wrote:
> Reinhard Pötz wrote:
>   
>> I don't believe that pipelines should contain components that support
>> different event types or that we even need components that have
>> different input and output events.
>>   
>>     
> What about serializer? Usually, it produces events of a type different
> from the type of events it consumes.
>   
>> If you want to mix your components (e.g. using a SAX component in a
>> pipeline full of StAX components), you should put your 'alien' component
>> into a wrapper.
>>   
>>     
> Agreed. How do you know what kind of wrapper do you need if you don't
> know what kind of events components consume and produce?
>
> How component can be sure that next component (its consumer) is the one
> that accepts right type of events? By checking using instanceof?
> My point is that once we agree to have generic pipelines that can take
> components accepting/producing any kind of events then we need to invent
> some mechanism that check if pipeline is built correctly. It shouldn't
> be a concern of a given component.
>
> If we agree on above point, then my suggestion would be to look for a
> way that pipeline-correctness is ensured by compiler.
>   
One more thing:
The idea that the pipeline knows about the event types that components
process solves the problem with pipeline results, as you have different
event types carrying different data.


-- 
Best regards,
Grzegorz Kossakowski

Re: [C3] Pipeline component event types

Posted by Grzegorz Kossakowski <gr...@tuffmail.com>.
Reinhard Pötz wrote:
> I don't believe that pipelines should contain components that support
> different event types or that we even need components that have
> different input and output events.
>   
What about a serializer? Usually, it produces events of a type different
from the type of events it consumes.
> If you want to mix your components (e.g. using a SAX component in a
> pipeline full of StAX components), you should put your 'alien' component
> into a wrapper.
>   
Agreed. How do you know what kind of wrapper you need if you don't
know what kind of events components consume and produce?

How can a component be sure that the next component (its consumer) is one
that accepts the right type of events? By checking with instanceof?
My point is that once we agree to have generic pipelines that can take
components accepting/producing any kind of events, we need to invent
some mechanism that checks if the pipeline is built correctly. It shouldn't
be a concern of a given component.

If we agree on the above point, then my suggestion would be to look for a
way that pipeline correctness is ensured by the compiler.

-- 
Best regards,
Grzegorz Kossakowski

[C3] Pipeline component event types

Posted by Reinhard Pötz <re...@apache.org>.
Grzegorz Kossakowski wrote:
> Jakob Spörk pisze:
>> Hello,
> 
> Hello Jakob,
> 
>> I just want to give my thoughts to unified pipeline and data
>> conversion topic. In my opinion, the pipeline can't do the data
>> conversion, because it has no information about how to do this.
>> Let's take a simple example: We have a pipeline processing XML
>> documents that describe images. The first components process this
>> xml data while the rest of the components do operations on the
>> actual image. Now is the question, who will transform the xml data
>> to image data in the middle of the pipeline?
> 
> I agree with you that pipeline implementation should not handle data
> conversion because there is no generic way to handle it.
> 
> Now I would like to answer your question: it should be another
> /pipeline component/ that handles data conversion.
> 
>> I believe the pipeline cannot do this, because it simply do not
>> know how to transform, because that’s a custom operation. You would
>> need a component that is on the one hand a XML consumer and on the
>> other hand an image producer. Providing some automatic data
>> conversions directly in the pipeline may help developers that need
>> exactly these default cases but I believe it would be harder for
>> people requiring custom data conversions (and that are most of the
>> cases).
>> 
>> The actual architecture allows to fit any components into the
>> pipeline, and only the components itself have to know if they can
>> work with their predecessor or the component following them. That
>> allow most flexibility when thinking about any possible
>> conversions. If a pipeline should do this, you would need
>> "plug-ins" for the pipeline that are registered and allow the 
>> pipeline to do the conversions. But then, it is the responsibility
>> of the developer to register the right conversion plug-ins and you
>> would have get new problems if a pipeline requires two different
>> converters from the same to the same data type because the pipeline
>> cannot have automatically the information which converter to use in
>> which situation.
> 
> I believe that these problems could be addressed by... compiler. In
> my opinion, pipelines should be type-safe which basically means that
> for a given pipeline fragment you know what it expects on the input
> and what kind of output it gives to you. The same goes for
> components. This eliminates "flexibility" of having a component that
> accepts more than one kind of input or more than one kind of output.
> I believe that having more than one output or one input only adds to 
> complexity and does not solve any problem.
> 
> If component was going to accept more than one kind of input how a
> user could know the list of accepted inputs? I guess the only way to
> find out would be checking source and looking for all "instanceof"
> statements in its code.
> 
> I would prefer situation when components have well-defined type of
> input and output and if you one to combine components for which
> input-output pairs do not match you should add converters as
> intermediate components.
> 
> I've been thinking about generic but at the same time type-safe
> pipelines for some time. I've designed them on paper and everything
> looked quite promising. Then moved to implementation of my ideas and
> got rather disappointing result which can be seen here: 
> http://github.com/gkossakowski/cocoonpipelines/tree/master
> 
> The most interesting files are: 
> http://github.com/gkossakowski/cocoonpipelines/tree/master/src/org/apache/cocoon/pipeline/Pipeline.java
> (generic and type-safe pipeline interface)
> 
> http://github.com/gkossakowski/cocoonpipelines/tree/master/src/org/apache/cocoon/pipeline/PipelineComponent.java
>  (generic and type-safe component def.)
> 
> http://github.com/gkossakowski/cocoonpipelines/tree/master/src/org/apache/cocoon/pipeline/demo/RunPipeline.java
>  (shows how to use that thing)
> 
>> The only thing cocoon can help here with is to provide as much
>> "standard" converters for use as possible, but it is still the
>> responsibility of the developer to use the right ones.
> 
> I think Cocoon could define much better, type-safe Pipeline API but
> we are in unfortunate situation that we are using language that makes
> it extremely hard to express this kind of generic solutions.
> 
> Of course, I would like to be proven that I'm wrong and Java is
> powerful enough to let us express our ideas and solve our problems.
> Actually, the whole idea of pipeline is not a rocket science as it's,
> in essence, just ordinary function composition. The only unique
> property of pipelines I can see is that we want to access to
> _partial_ results of pipeline execution so we can make it streamable.
> 
> 
> This become more a brain-dump than a real answer to your e-mail
> Jakob, but I hope you (and others) have got my point.

I don't believe that pipelines should contain components that support
different event types or that we even need components that have
different input and output events.

If you want to mix your components (e.g. using a SAX component in a
pipeline full of StAX components), you should put your 'alien' component
into a wrapper.
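
Such a wrapper is mostly mechanical - the core is an event loop along these
lines (only a sketch, namespace and qName handling simplified):

import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

final class StAXToSAXBridge {

    // pulls StAX events and replays them as SAX calls on the wrapped handler
    void replay(XMLStreamReader in, ContentHandler sax)
            throws XMLStreamException, SAXException {
        sax.startDocument();
        while (in.hasNext()) {
            switch (in.next()) {
                case XMLStreamConstants.START_ELEMENT: {
                    AttributesImpl atts = new AttributesImpl();
                    for (int i = 0; i < in.getAttributeCount(); i++) {
                        atts.addAttribute("", in.getAttributeLocalName(i),
                                in.getAttributeLocalName(i), "CDATA",
                                in.getAttributeValue(i));
                    }
                    sax.startElement("", in.getLocalName(), in.getLocalName(), atts);
                    break;
                }
                case XMLStreamConstants.CHARACTERS:
                    sax.characters(in.getTextCharacters(), in.getTextStart(),
                            in.getTextLength());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    sax.endElement("", in.getLocalName(), in.getLocalName());
                    break;
            }
        }
        sax.endDocument();
    }
}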

-- 
Reinhard Pötz                           Managing Director, {Indoqa} GmbH
                         http://www.indoqa.com/en/people/reinhard.poetz/

Member of the Apache Software Foundation
Apache Cocoon Committer, PMC member                  reinhard@apache.org
________________________________________________________________________

Re: [C3] StAX research reveiled!

Posted by Grzegorz Kossakowski <gr...@tuffmail.com>.
Jakob Spörk pisze:
> Hello,

Hello Jakob,

> I just want to give my thoughts to unified pipeline and data conversion
> topic. In my opinion, the pipeline can't do the data conversion, because it
> has no information about how to do this. Let's take a simple example: We
> have a pipeline processing XML documents that describe images. The first
> components process this xml data while the rest of the components do
> operations on the actual image. Now is the question, who will transform the
> xml data to image data in the middle of the pipeline? 

I agree with you that the pipeline implementation should not handle data conversion because there is no generic way to
handle it.

Now I would like to answer your question: it should be another /pipeline component/ that handles data conversion.

> I believe the pipeline cannot do this, because it simply do not know how to
> transform, because that’s a custom operation. You would need a component
> that is on the one hand a XML consumer and on the other hand an image
> producer. Providing some automatic data conversions directly in the pipeline
> may help developers that need exactly these default cases but I believe it
> would be harder for people requiring custom data conversions (and that are
> most of the cases).
> 
> The actual architecture allows to fit any components into the pipeline, and
> only the components itself have to know if they can work with their
> predecessor or the component following them. That allow most flexibility
> when thinking about any possible conversions. If a pipeline should do this,
> you would need "plug-ins" for the pipeline that are registered and allow the
> pipeline to do the conversions. But then, it is the responsibility of the
> developer to register the right conversion plug-ins and you would have get
> new problems if a pipeline requires two different converters from the same
> to the same data type because the pipeline cannot have automatically the
> information which converter to use in which situation.

I believe that these problems could be addressed by... the compiler. In my opinion, pipelines should be type-safe, which
basically means that for a given pipeline fragment you know what it expects on the input and what kind of output it
gives to you. The same goes for components. This eliminates the "flexibility" of having a component that accepts more
than one kind of input or more than one kind of output. I believe that having more than one output or input only adds
complexity and does not solve any problem.

If a component were to accept more than one kind of input, how could a user know the list of accepted inputs? I guess
the only way to find out would be checking the source and looking for all "instanceof" statements in its code.

I would prefer a situation where components have a well-defined type of input and output, and if you want to combine
components whose input-output pairs do not match, you should add converters as intermediate components.
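
To illustrate with Jakob's XML-to-image example (every name here is made up):

import java.awt.image.BufferedImage;
import org.w3c.dom.Document;

interface TypedComponent<I, O> {
    O process(I input);
}

// a converter is just another component whose input type is the
// producer's output type - the compiler checks the wiring
final class XmlToImageConverter implements TypedComponent<Document, BufferedImage> {

    public BufferedImage process(Document doc) {
        int width = Integer.parseInt(doc.getDocumentElement().getAttribute("width"));
        int height = Integer.parseInt(doc.getDocumentElement().getAttribute("height"));
        return new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
    }
}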

I've been thinking about generic but at the same time type-safe pipelines for some time. I've designed them on paper and
everything looked quite promising. Then I moved to implementing my ideas and got a rather disappointing result, which
can be seen here:
http://github.com/gkossakowski/cocoonpipelines/tree/master

The most interesting files are:
http://github.com/gkossakowski/cocoonpipelines/tree/master/src/org/apache/cocoon/pipeline/Pipeline.java (generic and
type-safe pipeline interface)

http://github.com/gkossakowski/cocoonpipelines/tree/master/src/org/apache/cocoon/pipeline/PipelineComponent.java
(generic and type-safe component def.)

http://github.com/gkossakowski/cocoonpipelines/tree/master/src/org/apache/cocoon/pipeline/demo/RunPipeline.java
(shows how to use that thing)

> The only thing cocoon can help here with is to provide as much "standard"
> converters for use as possible, but it is still the responsibility of the
> developer to use the right ones.

I think Cocoon could define a much better, type-safe Pipeline API but we are in the unfortunate situation of using a
language that makes it extremely hard to express this kind of generic solution.

Of course, I would like to be proven wrong and shown that Java is powerful enough to let us express our ideas and solve
our problems. Actually, the whole idea of a pipeline is not rocket science as it's, in essence, just ordinary function
composition. The only unique property of pipelines I can see is that we want access to _partial_ results of pipeline
execution so we can make it streamable.

This became more of a brain-dump than a real answer to your e-mail, Jakob, but I hope you (and others) got my point.

-- 
Best regards,
Grzegorz Kossakowski

RE: [C3] StAX research reveiled!

Posted by Jakob Spörk <ja...@gmx.at>.
Hello,

I just want to give my thoughts on the unified pipeline and data conversion
topic. In my opinion, the pipeline can't do the data conversion, because it
has no information about how to do this. Let's take a simple example: We
have a pipeline processing XML documents that describe images. The first
components process this xml data while the rest of the components do
operations on the actual image. Now the question is: who will transform the
xml data to image data in the middle of the pipeline? 

I believe the pipeline cannot do this, because it simply does not know how to
transform; that's a custom operation. You would need a component
that is on the one hand an XML consumer and on the other hand an image
producer. Providing some automatic data conversions directly in the pipeline
may help developers that need exactly these default cases, but I believe it
would make things harder for people requiring custom data conversions (and
those are most of the cases).

The actual architecture allows fitting any components into the pipeline, and
only the components themselves have to know if they can work with their
predecessor or the component following them. That allows the most flexibility
when thinking about any possible conversions. If a pipeline should do this,
you would need "plug-ins" for the pipeline that are registered and allow the
pipeline to do the conversions. But then it is the responsibility of the
developer to register the right conversion plug-ins, and you would get
new problems if a pipeline requires two different converters from the same
to the same data type, because the pipeline cannot automatically have the
information which converter to use in which situation.

The only thing cocoon can help with here is to provide as many "standard"
converters as possible, but it is still the responsibility of the
developer to use the right ones.

Best Regards,
Jakob

-----Original Message-----
From: Grzegorz Kossakowski [mailto:grek@tuffmail.com] 
Sent: Samstag, 10. Januar 2009 23:09
To: dev@cocoon.apache.org
Subject: Re: [C3] StAX research reveiled!

Sylvain Wallez pisze:
> Again I doubt of the real value of a common unified pipeline if all the
> responsibility of ensuring proper compatibility between components
> (including possible data conversion) is delegated to components. This
> leaves a lot of complexity to component implementers (except in the
> simple straightforward push scenario), and the features of the pipeline
> will be limited to linking the components together and caching.

I've been having similar concerns thus I'm eagerly waiting for some non-XML
examples that would change my mind.

> Furthermore, people will have to take great care of choosing components
> that fit together, or they will get exceptions at pipeline execution
> time. Hmm... reminds me of some criticism about the StAX stream API :-D

LOL! :-)

> So let's agree that we disagree. I'll see what you guys come up with and
> hope I'll change my mind then.

I just wanted to say that I share many points with Sylvain even if it's
still hard for me to say which option I prefer.

This will require some more thinking...

-- 
Best regards,
Grzegorz Kossakowski


Re: [C3] StAX research reveiled!

Posted by Grzegorz Kossakowski <gr...@tuffmail.com>.
Sylvain Wallez pisze:
> Again I doubt of the real value of a common unified pipeline if all the
> responsibility of ensuring proper compatibility between components
> (including possible data conversion) is delegated to components. This
> leaves a lot of complexity to component implementers (except in the
> simple straightforward push scenario), and the features of the pipeline
> will be limited to linking the components together and caching.

I've been having similar concerns, thus I'm eagerly waiting for some non-XML examples that would change my mind.

> Furthermore, people will have to take great care of choosing components
> that fit together, or they will get exceptions at pipeline execution
> time. Hmm... reminds me of some criticism about the StAX stream API :-D

LOL! :-)

> So let's agree that we disagree. I'll see what you guys come up with and
> hope I'll change my mind then.

I just wanted to say that I share many points with Sylvain even if it's still hard for me to say which option I prefer.

This will require some more thinking...

-- 
Best regards,
Grzegorz Kossakowski

Re: [C3] StAX research reveiled!

Posted by Sylvain Wallez <sy...@apache.org>.
Steven Dolg wrote:
> Sylvain Wallez schrieb:
>> <snip/>
>>
>> Steven Dolg wrote:
>>> Basically you're providing a buffer between every pair of components 
>>> and fill it as needed.
>>
>> Yes. Now this buffer will always contain a very limited number of 
>> events, corresponding to the result of processing an amount of input 
>> data that is convenient to process at once to avoid complex state 
>> management (e.g. an <i18:text> tag with all its children). And so 
>> most often, this buffer will contain just one event.
>>
>> Think of it as being just a bridge between the writer view used by a 
>> producer and the reader view used by its consumer. These are in my 
>> opinion the most convenient views to write StAX components.
>>
>>> But you need to implement both XMLStreamWriter and XMLStreamReader 
>>> and optimize that for any possible thing a transformer might do.
>>> In order to buffer all the data from the components you will have to 
>>> create some objects as well - I guess you will end up with something 
>>> like the XMLEvent and maintaining a list of them in the StaxFIFO.
>>> That's why I think an efficient (as in faster than the Event API)  
>>> implementation of the StaxFIFO is difficult to make.
>>
>> It's certainly less trivial than maitaining a list of events, but 
>> should be doable quite efficiently by using an int FIFO (to store 
>> event types and attribute counts) and a String FIFO (for everything 
>> else). I'll try find a couple of hours to prototype this.
>>
>>> On the other hand I do think that the cursor API is quite a bit 
>>> harder to use.
>>> As stated in the Javadoc of XMLStreamReader it is the lowest level 
>>> for reading XML data - which usually means more logic in the code 
>>> using the API and more knowledge in the head of the developer 
>>> reading/writing the code is required.
>>> So I second Andreas' statement that we will sacrifice simplicity for 
>>> (a small amount of ?) performance.
>>
>> I understand your point, even if I don't totally agree :-) Now it 
>> should be mentioned that if even with events, my proposal still 
>> stands: just replace XMLStream{Reader|Writer} with 
>> XMLEvent{Reader|Writer}.
>>
>>> The other thing is that - at least the way you suggested - we would 
>>> need a special implementation of the Pipeline interface.
>>> That is something that compromises the intention behind having a 
>>> Pipeline API.
>>> Right now we can use the new StAX components and simply put them 
>>> into any of the Pipeline implementations we already have.
>>> Sacrificing this is completely out of the question IMO.
>>
>> Actually, I'm wondering if wanting a single API is not wishful 
>> thinking and will in the end lead to something that is overly 
>> abstract and hence difficult to understand and use, or where 
>> underlying implementations will leak in the high-level abstraction.
>>
>> There is already some impedence mismatch appearing between pull and 
>> push in the code:
>> - a StAXGenerator has to call initiatePullProcessing() on its 
>> consumer, which in turn will have to call it on it's own consumer, 
>> etc until we reach the Finisher that will finally start pulling 
>> events. This moves a responsibility that belongs to the pipeline down 
>> to its components.
>
> Well I don't see the problem with that.
> From the pipeline's point of view those are normal components just 
> like all the other.
> The pipeline was never intended to "care" about the internals of the 
> components - so why bothering that the StAXGenerator calls 
> "initiatePullProcessing" on its consumer instead of calling some other 
> method like e.g. "startDocument".

Hmm... the fact that every implementation has to copy/paste the exact 
same call to initiatePullProcessing (or has to extend a common abstract 
class that does it) because the pipeline expects processing to be 
started on the first component is a sign of a design problem to me.

Some responsibilities of the pipeline creep into its components because 
the pipeline is too abstract, or because there's an intermediate 
adaptation layer that's missing.

>> - an AbstractStAXProducer only accepts a StAXConsumer, defeating the 
>> idea of a unified pipeline implementation that will accept everything.
>
> The idea was to have pipelines being capable of processing virtually 
> any data.
> But that is not the same as combining components in an arbitrary way, 
> e.g. there is no sense in linking a FileGenerator with an (not yet 
> existing) ImageTransformer based on Java's Imaging API.
>
> The components must be "compatible" - that is they must understand the 
> data they exchange with each other.
> We may however provide some adapters/converters to make certain 
> "types" of components compatible, e.g. SAX <--> StAX.
>
>> So we should either have several APIs specifically tailored to the 
>> underlying push or pull model, or make sure the unified API and its 
>> implementations accept any kind of component and set the appropriate 
>> conversion bridges between them.
>
> As I tried to state above: that will not be possible for every 
> conceivable combination of components.
> At least not when thinking beyond XML - which I do.

Again I doubt the real value of a common unified pipeline if all the 
responsibility of ensuring proper compatibility between components 
(including possible data conversion) is delegated to components. This 
leaves a lot of complexity to component implementers (except in the 
simple straightforward push scenario), and the features of the pipeline 
will be limited to linking the components together and caching.

Furthermore, people will have to take great care of choosing components 
that fit together, or they will get exceptions at pipeline execution 
time. Hmm... reminds me of some criticism about the StAX stream API :-D

So let's agree that we disagree. I'll see what you guys come up with and 
hope I'll change my mind then.

Sylvain

-- 
Sylvain Wallez - http://bluxte.net


Re: [C3] StAX research reveiled!

Posted by Steven Dolg <st...@indoqa.com>.
Sylvain Wallez schrieb:
> <snip/>
>
> Steven Dolg wrote:
>> Basically you're providing a buffer between every pair of components 
>> and fill it as needed.
>
> Yes. Now this buffer will always contain a very limited number of 
> events, corresponding to the result of processing an amount of input 
> data that is convenient to process at once to avoid complex state 
> management (e.g. an <i18:text> tag with all its children). And so most 
> often, this buffer will contain just one event.
>
> Think of it as being just a bridge between the writer view used by a 
> producer and the reader view used by its consumer. These are in my 
> opinion the most convenient views to write StAX components.
>
>> But you need to implement both XMLStreamWriter and XMLStreamReader 
>> and optimize that for any possible thing a transformer might do.
>> In order to buffer all the data from the components you will have to 
>> create some objects as well - I guess you will end up with something 
>> like the XMLEvent and maintaining a list of them in the StaxFIFO.
>> That's why I think an efficient (as in faster than the Event API)  
>> implementation of the StaxFIFO is difficult to make.
>
> It's certainly less trivial than maitaining a list of events, but 
> should be doable quite efficiently by using an int FIFO (to store 
> event types and attribute counts) and a String FIFO (for everything 
> else). I'll try find a couple of hours to prototype this.
>
>> On the other hand I do think that the cursor API is quite a bit 
>> harder to use.
>> As stated in the Javadoc of XMLStreamReader it is the lowest level 
>> for reading XML data - which usually means more logic in the code 
>> using the API and more knowledge in the head of the developer 
>> reading/writing the code is required.
>> So I second Andreas' statement that we will sacrifice simplicity for 
>> (a small amount of ?) performance.
>
> I understand your point, even if I don't totally agree :-) Now it 
> should be mentioned that if even with events, my proposal still 
> stands: just replace XMLStream{Reader|Writer} with 
> XMLEvent{Reader|Writer}.
>
>> The other thing is that - at least the way you suggested - we would 
>> need a special implementation of the Pipeline interface.
>> That is something that compromises the intention behind having a 
>> Pipeline API.
>> Right now we can use the new StAX components and simply put them into 
>> any of the Pipeline implementations we already have.
>> Sacrificing this is completely out of the question IMO.
>
> Actually, I'm wondering if wanting a single API is not wishful 
> thinking and will in the end lead to something that is overly abstract 
> and hence difficult to understand and use, or where underlying 
> implementations will leak in the high-level abstraction.
>
> There is already some impedence mismatch appearing between pull and 
> push in the code:
> - a StAXGenerator has to call initiatePullProcessing() on its 
> consumer, which in turn will have to call it on it's own consumer, etc 
> until we reach the Finisher that will finally start pulling events. 
> This moves a responsibility that belongs to the pipeline down to its 
> components.
Well I don't see the problem with that.
From the pipeline's point of view those are normal components just like 
all the others.
The pipeline was never intended to "care" about the internals of the 
components - so why bother that the StAXGenerator calls 
"initiatePullProcessing" on its consumer instead of calling some other 
method like e.g. "startDocument".

> - an AbstractStAXProducer only accepts a StAXConsumer, defeating the 
> idea of a unified pipeline implementation that will accept everything.
The idea was to have pipelines capable of processing virtually any 
data.
But that is not the same as combining components in an arbitrary way, 
e.g. there is no sense in linking a FileGenerator with a (not yet 
existing) ImageTransformer based on Java's Imaging API.

The components must be "compatible" - that is they must understand the 
data they exchange with each other.
We may however provide some adapters/converters to make certain "types" 
of components compatible, e.g. SAX <--> StAX.

>
> So we should either have several APIs specifically tailored to the 
> underlying push or pull model, or make sure the unified API and its 
> implementations accept any kind of component and set the appropriate 
> conversion bridges between them.
As I tried to state above: that will not be possible for every 
conceivable combination of components.
At least not when thinking beyond XML - which I do.

>
> Sylvain
>


Re: [C3] StAX research reveiled!

Posted by Sylvain Wallez <sy...@apache.org>.
<snip/>

Steven Dolg wrote:
> Basically you're providing a buffer between every pair of components 
> and fill it as needed.

Yes. Now this buffer will always contain a very limited number of 
events, corresponding to the result of processing an amount of input 
data that is convenient to process at once to avoid complex state 
management (e.g. an <i18:text> tag with all its children). And so most 
often, this buffer will contain just one event.

Think of it as being just a bridge between the writer view used by a 
producer and the reader view used by its consumer. These are in my 
opinion the most convenient views to write StAX components.

> But you need to implement both XMLStreamWriter and XMLStreamReader and 
> optimize that for any possible thing a transformer might do.
> In order to buffer all the data from the components you will have to 
> create some objects as well - I guess you will end up with something 
> like the XMLEvent and maintaining a list of them in the StaxFIFO.
> That's why I think an efficient (as in faster than the Event API)  
> implementation of the StaxFIFO is difficult to make.

It's certainly less trivial than maintaining a list of events, but should 
be doable quite efficiently by using an int FIFO (to store event types 
and attribute counts) and a String FIFO (for everything else). I'll try 
to find a couple of hours to prototype this.
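
Just to sketch the direction (a real version would use primitive ring
buffers rather than boxed queues, and method names are illustrative):

import java.util.ArrayDeque;
import java.util.Queue;
import javax.xml.stream.XMLStreamConstants;

final class StaxFifoSketch {

    private final Queue<Integer> ints = new ArrayDeque<Integer>();   // event types, attribute counts
    private final Queue<String> strings = new ArrayDeque<String>();  // names and values

    // called from the XMLStreamWriter view
    void putStartElement(String localName, String[] attNames, String[] attValues) {
        ints.add(XMLStreamConstants.START_ELEMENT);
        ints.add(attNames.length);
        strings.add(localName);
        for (int i = 0; i < attNames.length; i++) {
            strings.add(attNames[i]);
            strings.add(attValues[i]);
        }
    }

    // the XMLStreamReader view dispatches on this
    int nextEventType() {
        return ints.remove();
    }
}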

> On the other hand I do think that the cursor API is quite a bit harder 
> to use.
> As stated in the Javadoc of XMLStreamReader it is the lowest level for 
> reading XML data - which usually means more logic in the code using 
> the API and more knowledge in the head of the developer 
> reading/writing the code is required.
> So I second Andreas' statement that we will sacrifice simplicity for 
> (a small amount of ?) performance.

I understand your point, even if I don't totally agree :-) Now it should 
be mentioned that even with events, my proposal still stands: just 
replace XMLStream{Reader|Writer} with XMLEvent{Reader|Writer}.

> The other thing is that - at least the way you suggested - we would 
> need a special implementation of the Pipeline interface.
> That is something that compromises the intention behind having a 
> Pipeline API.
> Right now we can use the new StAX components and simply put them into 
> any of the Pipeline implementations we already have.
> Sacrificing this is completely out of the question IMO.

Actually, I'm wondering if wanting a single API is not wishful thinking 
and will in the end lead to something that is overly abstract and hence 
difficult to understand and use, or where underlying implementations 
will leak into the high-level abstraction.

There is already some impedance mismatch appearing between pull and push 
in the code:
- a StAXGenerator has to call initiatePullProcessing() on its consumer, 
which in turn will have to call it on its own consumer, etc. until we 
reach the Finisher that will finally start pulling events. This moves a 
responsibility that belongs to the pipeline down to its components.
- an AbstractStAXProducer only accepts a StAXConsumer, defeating the 
idea of a unified pipeline implementation that will accept everything.

So we should either have several APIs specifically tailored to the 
underlying push or pull model, or make sure the unified API and its 
implementations accept any kind of component and set the appropriate 
conversion bridges between them.

Sylvain

-- 
Sylvain Wallez - http://bluxte.net


Re: [C3] StAX research reveiled!

Posted by Steven Dolg <st...@indoqa.com>.
Sylvain Wallez schrieb:
> Andreas Pieber wrote:
>> On Saturday 27 December 2008 10:36:07 Sylvain Wallez wrote:
>>  
>>> Michael Seydl wrote:
>>>    
>>>> Hi all!
>>>>
>>>> One more mail for the student group! Behind this lurid topic hides our
>>>> evaluation of the latest XML processing technologies regarding their
>>>> usability in Cocoon3 (especially if there are suited to be used in a
>>>> streaming pipeline).
>>>> As it's commonly know we decided to use StAX as our weapon of choice
>>>> to do the XML, but this paper should explain the whys and hows and
>>>> especially the way we took to come to our decision, which resulted in
>>>> using the very same API.
>>>> Eleven pages should be a to big read and it contains all necessary
>>>> links to all the APIs we evaluated and also line wise our two cents
>>>> about the API we observed. Concludingly we also tried to show the
>>>> difference between the currently used SAX and the of us proposed StAX
>>>> API.
>>>>
>>>> I hope this work sheds some light on our decision making and taking
>>>> and that someone dares to read it.
>>>>
>>>> That's from me, I wish you all a pleasant and very merry Christmas!
>>>>
>>>> Regards,
>>>> Michael Seydl
>>>>       
>>> Good work and interesting read, but don't agree with some of its
>>> statements!
>>>
>>> The big if/else or switch statements mentioned as a drawback of the
>>> cursor API (XMLStreamReader) in 1.2.4 also apply to the event API, 
>>> since
>>> it provides abstract events whose type needs also to be inspected to
>>> decide what to do.
>>>     
>>
>> Of course, you're right!
>>
>>  
>>> The drawbacks of the stream API compared to the event API are, as you
>>> mention, that some methods of XMLStreamReader will throw an exception
>>> depending on the current event's type and that the event is not
>>> represented as a data structure that can be passed directly to the next
>>> element in the pipeline or stored in an event buffer.
>>>
>>> The first point (exceptions) should not happen, unless the code is 
>>> buggy
>>> and tries to get information that doesn't belong to the context. I have
>>> used many times the cursor API and haven't found any usability problems
>>> with it.
>>>     
>>
>> Also here you're right, but IMHO it is not necessary to add another 
>> source of bugs if not required...
>>   
>
> Well, there are so many other sources of bugs... I wouldn't sacrifice 
> efficiency to guard against bad usage of an API. And when dealing with 
> XML, people should know that e.g. calling getAttribute() for a text 
> event is meaningless.
>
>>> The second point (lack of data structure) can be easily solved by using
>>> an XMLEventAllocator [1] that creates an XMLEvent from the current 
>>> state
>>> of an XMLStreamReader.
>>>     
>>
>> Mhm, but if we use an XMLEventAllocator, why not directly use the 
>> StAX event API?
>>   
>
> Sorry, I wasn't clear: *if* an XMLEvent is needed, then it's easy to 
> get it from a stream.
>
>>> The event API has the major drawback of always creating a new object 
>>> for
>>> every event (since as the javadoc says "events may be cached and
>>> referenced after the parse has completed"). This can lead to a big
>>> strain on the memory system and garbage collection on a busy 
>>> application.
>>>     
>>
>> That's right, but bearing in mind that we want to create a pull pipe, 
>> where the serializer pulls each event from the producer through each 
>> transformer and writes it to an output stream, we have no other 
>> possibility than creating an object for each event.
>>
>> Think about it a little more in detail. To be able to pull each event 
>> you have to have the possibility to call a method looking like:
>>
>> Object next();
>>
>> on the parent of the pipelineComponent. Doing it the StAX cursor way 
>> means increasing the complexity from one method to 10 or more, which 
>> have to be available through the parent...
>>   
>
> Not necessarily, depending on how the API is designed. Let's give it a 
> try:
>
> /** A generator can pull events from somewhere and writes them to an 
> output */
> interface Generator {
>    /** Do we still have something to produce? */
>    boolean hasNext();
>
>    /** Do some processing and produce some output */
>    void pull(XMLStreamWriter output);
> }
>
> /** A transformer is a generator that has an XML input */
> interface Transformer extends Generator {
>    void setInput(XMLStreamReader input);
> }
>
> class StaxFIFO implements XMLStreamReader, XMLStreamWriter {
>    Generator generator;
>    StaxFIFO(Generator generator) {
>        this.generator = generator;
>    }
>
>    // Implement all XMLStreamWriter methods as writing to an
>    // internal stream FIFO buffer
>
>    // Implement all XMLStreamReader methods as reading from an
>    // internal stream FIFO buffer, except hasNext() below:
>
>    boolean hasNext() {
>        while (eventBufferIsEmpty() && generator.hasNext()) {
>            // Ask the generator to produce some events
>            generator.pull(this);
>        }
>        return !eventBufferIsEmpty();
>    }
> }
>
> Building and executing a pipeline is then rather simple :
>
> class Pipeline {
>    Generator generator;
>    Transformer transformers[];
>    XMLStreamWriter serializer;
>
>    void execute() {
>        Generator last = generator;
>        for (Transformer tr : transformers) {
>            tr.setInput(new StaxFIFO(previous);
>            last = tr;
>        }
>
>        // Pull from the whole chain to the serializer
>        while(last.hasNext()) {
>            last.pull(serializer);
>        }
>    }
> }
>
> Every component gets an XMLStreamWriter to write its output to 
> (in a style equivalent to SAX), and transformers get an 
> XMLStreamReader to read their input from.
>
> The programming model is then very simple: for every call to pull(), 
> read something in, process it and produce the corresponding output 
> (optional, since end of processing is defined by hasNext()). The 
> buffers used to connect the various components allow pull() to read 
> and process a set of related events, resulting in any number of events 
> being written to the buffer.
>
>>> So the cursor API is the most efficient IMO when it comes to consuming
>>> data, since it doesn't require creating useless event objects.
>>>
>>> Now in a pipeline context, we will want to transmit events untouched
>>> from one component to the next one, using some partial buffering as
>>> mentioned in earlier discussions. A FIFO of XMLEvent object seems to be
>>> the natural solution for this, but would require the use of events at
>>> the pipeline API level, with their associated costs mentioned above.
>>>     
>>
>> I'm not sure if I get the point here, but we do not like to 
>> "transmit" events. They are pulled. Therefore in most cases we simply 
>> do not need a buffer, since events could be directly returned.
>>   
>
> In the previous discussions, it was considered that pulling from the 
> previous component would lead that component to process in one pull 
> call a number of related events, to avoid state handling that would 
> make it even more complex than SAX. And processing these related 
> events will certainly mean returning ("transmitting" as I said) 
> several events. So buffering *is* needed in most non-trivial cases.
>
>>> So what should be used for pipelines ? My impression is that we should
>>> stick to the most efficient API and build the simple tools needed to
>>> buffer events from a StreamReader, taking inspiration from the
>>> XMLBytestreamCompiler we already have.
>>>     
>>
>> Maybe some events could be avoided using the cursor API, but IMO the 
>> performance we could get is not worth the simplicity we sacrifice...
>>   
>
> I don't agree that we sacrifice simplicity. With the above, the 
> developer only deals with XMLStreamWriter and XMLStreamReader objects, 
> and never has to implement them.
>
> We just need an efficient StaxFIFO class, which people shouldn't care 
> about since it is completely hidden in the Pipeline object.
>
> Thoughts?
Well, the approach outlined above will certainly work.

Basically you're providing a buffer between every pair of components and 
filling it as needed.
But you need to implement both XMLStreamWriter and XMLStreamReader and 
optimize them for any possible thing a transformer might do.
In order to buffer all the data from the components you will have to 
create some objects as well - I guess you will end up with something 
like XMLEvent and maintain a list of them in the StaxFIFO.
That's why I think an efficient (as in faster than the Event API)  
implementation of the StaxFIFO is difficult to make.

On the other hand I do think that the cursor API is quite a bit harder 
to use.
As stated in the Javadoc of XMLStreamReader, it is the lowest level for 
reading XML data - which usually means more logic in the code using the 
API, and more knowledge required of the developer reading or writing 
that code.
So I second Andreas' statement that we will sacrifice simplicity for (a 
small amount of?) performance.


The other thing is that - at least the way you suggested - we would need 
a special implementation of the Pipeline interface.
That is something that compromises the intention behind having a 
Pipeline API.
Right now we can use the new StAX components and simply put them into 
any of the Pipeline implementations we already have.
Sacrificing this is completely out of the question IMO.


Steven
>
> Sylvain
>


Re: [C3] StAX research reveiled!

Posted by Sylvain Wallez <sy...@apache.org>.
Sylvain Wallez wrote:

<snip/>

> class Pipeline {
>    Generator generator;
>    Transformer transformers[];
>    XMLStreamWriter serializer;
>
>    void execute() {
>        Generator last = generator;
>        for (Transformer tr : transformers) {
>            tr.setInput(new StaxFIFO(previous);
Should be of course "tr.setInput(new StaxFIFO(last));" -- Thunderbird 
doesn't do Java syntax check ;-)
>            last = tr;
>        }
>
>        // Pull from the whole chain to the serializer
>        while(last.hasNext()) {
>            last.pull(serializer);
>        }
>    }
> }

Sylvain

-- 
Sylvain Wallez - http://bluxte.net


Re: [C3] StAX research reveiled!

Posted by Sylvain Wallez <sy...@apache.org>.
Andreas Pieber wrote:
> On Saturday 27 December 2008 10:36:07 Sylvain Wallez wrote:
>   
>> Michael Seydl wrote:
>>     
>>> Hi all!
>>>
>>> One more mail for the student group! Behind this lurid topic hides our
>>> evaluation of the latest XML processing technologies regarding their
>>> usability in Cocoon3 (especially if there are suited to be used in a
>>> streaming pipeline).
>>> As it's commonly know we decided to use StAX as our weapon of choice
>>> to do the XML, but this paper should explain the whys and hows and
>>> especially the way we took to come to our decision, which resulted in
>>> using the very same API.
>>> Eleven pages should be a to big read and it contains all necessary
>>> links to all the APIs we evaluated and also line wise our two cents
>>> about the API we observed. Concludingly we also tried to show the
>>> difference between the currently used SAX and the of us proposed StAX
>>> API.
>>>
>>> I hope this work sheds some light on our decision making and taking
>>> and that someone dares to read it.
>>>
>>> That's from me, I wish you all a pleasant and very merry Christmas!
>>>
>>> Regards,
>>> Michael Seydl
>>>       
>> Good work and interesting read, but don't agree with some of its
>> statements!
>>
>> The big if/else or switch statements mentioned as a drawback of the
>> cursor API (XMLStreamReader) in 1.2.4 also apply to the event API, since
>> it provides abstract events whose type needs also to be inspected to
>> decide what to do.
>>     
>
> Of course, you're right!
>
>   
>> The drawbacks of the stream API compared to the event API are, as you
>> mention, that some methods of XMLStreamReader will throw an exception
>> depending on the current event's type and that the event is not
>> represented as a data structure that can be passed directly to the next
>> element in the pipeline or stored in an event buffer.
>>
>> The first point (exceptions) should not happen, unless the code is buggy
>> and tries to get information that doesn't belong to the context. I have
>> used many times the cursor API and haven't found any usability problems
>> with it.
>>     
>
> Also here you're right, but IMHO it is not necessary to add another source of 
> bugs if not required...
>   

Well, there are so many other sources of bugs... I wouldn't sacrifice 
efficiency to guard against bad usage of an API. And when dealing with 
XML, people should know that e.g. calling getAttribute() for a text 
event is meaningless.

>> The second point (lack of data structure) can be easily solved by using
>> an XMLEventAllocator [1] that creates an XMLEvent from the current state
>> of an XMLStreamReader.
>>     
>
> Mhm, but if we use an XMLEventAllocator, why not directly use the StAX event API?
>   

Sorry, I wasn't clear: *if* an XMLEvent is needed, then it's easy to 
get it from a stream.

>> The event API has the major drawback of always creating a new object for
>> every event (since as the javadoc says "events may be cached and
>> referenced after the parse has completed"). This can lead to a big
>> strain on the memory system and garbage collection on a busy application.
>>     
>
> That's right, but bearing in mind that we want to create a pull pipe, where the 
> serializer pulls each event from the producer through each transformer and 
> writes it to an output stream, we have no other possibility than creating an 
> object for each event.
>
> Think about it a little more in detail. To be able to pull each event you have 
> to have the possibility to call a method looking like:
>
> Object next();
>
> on the parent of the pipelineComponent. Doing it the StAX cursor way means 
> increasing the complexity from one method to 10 or more, which have to be 
> available through the parent...
>   

Not necessarily, depending on how the API is designed. Let's give it a try:

/** A generator can pull events from somewhere and writes them to an 
output */
interface Generator {
    /** Do we still have something to produce? */
    boolean hasNext();

    /** Do some processing and produce some output */
    void pull(XMLStreamWriter output);
}

/** A transformer is a generator that has an XML input */
interface Transformer extends Generator {
    void setInput(XMLStreamReader input);
}

class StaxFIFO implements XMLStreamReader, XMLStreamWriter {
    Generator generator;
    StaxFIFO(Generator generator) {
        this.generator = generator;
    }

    // Implement all XMLStreamWriter methods as writing to an
    // internal stream FIFO buffer

    // Implement all XMLStreamReader methods as reading from an
    // internal stream FIFO buffer, except hasNext() below:

    boolean hasNext() {
        while (eventBufferIsEmpty() && generator.hasNext()) {
            // Ask the generator to produce some events
            generator.pull(this);
        }
        return !eventBufferIsEmpty();
    }
}

Building and executing a pipeline is then rather simple :

class Pipeline {
    Generator generator;
    Transformer transformers[];
    XMLStreamWriter serializer;

    void execute() {
        Generator last = generator;
        for (Transformer tr : transformers) {
            tr.setInput(new StaxFIFO(previous);
            last = tr;
        }

        // Pull from the whole chain to the serializer
        while(last.hasNext()) {
            last.pull(serializer);
        }
    }
}

Every component gets an XMLStreamWriter to write its output to (in 
a style equivalent to SAX), and transformers get an XMLStreamReader 
to read their input from.

The programming model is then very simple: for every call to pull(), 
read something in, process it and produce the corresponding output 
(optional, since end of processing is defined by hasNext()). The buffers 
used to connect the various components allow pull() to read and process 
a set of related events, resulting in any number of events being written 
to the buffer.
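
As an illustration of that model, a transformer written against the 
interfaces above could look like this sketch (the upper-casing of 
element names is of course just an example, and attributes are dropped 
for brevity):

import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

class UpperCaseTransformer implements Transformer {
    private XMLStreamReader input;

    public void setInput(XMLStreamReader input) {
        this.input = input;
    }

    public boolean hasNext() {
        try {
            return input.hasNext();
        } catch (XMLStreamException e) {
            throw new RuntimeException(e);
        }
    }

    /** Each pull() consumes one input event and writes its transformed
        counterpart to the output. */
    public void pull(XMLStreamWriter output) {
        try {
            switch (input.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    output.writeStartElement(input.getLocalName().toUpperCase());
                    break;
                case XMLStreamConstants.CHARACTERS:
                    output.writeCharacters(input.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    output.writeEndElement();
                    break;
            }
        } catch (XMLStreamException e) {
            throw new RuntimeException(e);
        }
    }
}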

>> So the cursor API is the most efficient IMO when it comes to consuming
>> data, since it doesn't require creating useless event objects.
>>
>> Now in a pipeline context, we will want to transmit events untouched
>> from one component to the next one, using some partial buffering as
>> mentioned in earlier discussions. A FIFO of XMLEvent object seems to be
>> the natural solution for this, but would require the use of events at
>> the pipeline API level, with their associated costs mentioned above.
>>     
>
> I'm not sure if I get the point here, but we do not like to "transmit" events. 
> They are pulled. Therefore in most cases we simply do not need a buffer, since 
> events could be directly returned.
>   

In the previous discussions, it was considered that pulling from the 
previous component would lead that component to process in one pull call 
a number of related events, to avoid state handling that would make it 
even more complex than SAX. And processing these related events will 
certainly mean returning ("transmitting" as I said) several events. So 
buffering *is* needed in most non-trivial cases.

>> So what should be used for pipelines ? My impression is that we should
>> stick to the most efficient API and build the simple tools needed to
>> buffer events from a StreamReader, taking inspiration from the
>> XMLBytestreamCompiler we already have.
>>     
>
> Maybe some events could be avoided using the cursor API, but IMO the performance 
> we could get is not worth the simplicity we sacrifice...
>   

I don't agree that we sacrifice simplicity. With the above, the 
developer only deals with XMLStreamWriter and XMLStreamReader objects, 
and never has to implement them.

We just need an efficient StaxFIFO class, which people shouldn't care 
about since it is completely hidden in the Pipeline object.

Thoughts?

Sylvain

-- 
Sylvain Wallez - http://bluxte.net


Re: [C3] StAX research reveiled!

Posted by Andreas Pieber <an...@schmutterer-partner.at>.
On Saturday 27 December 2008 10:36:07 Sylvain Wallez wrote:
> Michael Seydl wrote:
> > Hi all!
> >
> > One more mail for the student group! Behind this lurid topic hides our
> > evaluation of the latest XML processing technologies regarding their
> > usability in Cocoon3 (especially if there are suited to be used in a
> > streaming pipeline).
> > As it's commonly know we decided to use StAX as our weapon of choice
> > to do the XML, but this paper should explain the whys and hows and
> > especially the way we took to come to our decision, which resulted in
> > using the very same API.
> > Eleven pages should be a to big read and it contains all necessary
> > links to all the APIs we evaluated and also line wise our two cents
> > about the API we observed. Concludingly we also tried to show the
> > difference between the currently used SAX and the of us proposed StAX
> > API.
> >
> > I hope this work sheds some light on our decision making and taking
> > and that someone dares to read it.
> >
> > That's from me, I wish you all a pleasant and very merry Christmas!
> >
> > Regards,
> > Michael Seydl
>
> Good work and interesting read, but don't agree with some of its
> statements!
>
> The big if/else or switch statements mentioned as a drawback of the
> cursor API (XMLStreamReader) in 1.2.4 also apply to the event API, since
> it provides abstract events whose type needs also to be inspected to
> decide what to do.

Of course, you're right!

>
> The drawbacks of the stream API compared to the event API are, as you
> mention, that some methods of XMLStreamReader will throw an exception
> depending on the current event's type and that the event is not
> represented as a data structure that can be passed directly to the next
> element in the pipeline or stored in an event buffer.
>
> The first point (exceptions) should not happen, unless the code is buggy
> and tries to get information that doesn't belong to the context. I have
> used many times the cursor API and haven't found any usability problems
> with it.

Also here you're right, but IMHO it is not necessary to add another source of 
bugs if not required...

>
> The second point (lack of data structure) can be easily solved by using
> an XMLEventAllocator [1] that creates an XMLEvent from the current state
> of an XMLStreamReader.

Mhm, but if we use an XMLEventAllocator, why not directly use the StAX event API?

>
> The event API has the major drawback of always creating a new object for
> every event (since as the javadoc says "events may be cached and
> referenced after the parse has completed"). This can lead to a big
> strain on the memory system and garbage collection on a busy application.

That's right, but bearing in mind that we want to create a pull pipe, where the 
serializer pulls each event from the producer through each transformer and 
writes it to an output stream, we have no other possibility than creating an 
object for each event.

Think about it a little more in detail. To be able to pull each event you have 
to have the possibility to call a method looking like:

Object next();

on the parent of the pipelineComponent. Doing it the StAX cursor way means 
increasing the complexity from one method to 10 or more, which have to be 
available through the parent...
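
To make the contrast concrete (both interface names are invented for 
the example):

import javax.xml.stream.events.XMLEvent;

// With events, one method suffices - at the cost of one object per event:
interface EventParent {
    XMLEvent next();
}

// The cursor style instead forces the parent to mirror many of the
// XMLStreamReader accessors:
interface CursorParent {
    int next();                          // event type
    String getLocalName();
    int getAttributeCount();
    String getAttributeLocalName(int index);
    String getAttributeValue(int index);
    String getText();
    // ... and so on for the remaining cursor accessors
}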

>
> So the cursor API is the most efficient IMO when it comes to consuming
> data, since it doesn't require creating useless event objects.
>
> Now in a pipeline context, we will want to transmit events untouched
> from one component to the next one, using some partial buffering as
> mentioned in earlier discussions. A FIFO of XMLEvent object seems to be
> the natural solution for this, but would require the use of events at
> the pipeline API level, with their associated costs mentioned above.

I'm not sure if I get the point here, but we do not like to "transmit" events. 
They are pulled. Therefore in most cases we simply do not need a buffer, since 
events could be directly returned.

>
> So what should be used for pipelines ? My impression is that we should
> stick to the most efficient API and build the simple tools needed to
> buffer events from a StreamReader, taking inspiration from the
> XMLBytestreamCompiler we already have.

Maybe some events could be avoided using the cursor API, but IMO the performance 
we could get is not worth the simplicity we sacrifice...

>
> Sylvain

Andreas

>
> [1]
> https://stax-utils.dev.java.net/nonav/javadoc/api/javax/xml/stream/util/XMLEventAllocator.html
-- 
SCHMUTTERER+PARTNER Information Technology GmbH

Hiessbergergasse 1
A-3002 Purkersdorf

T   +43 (0) 69911127344
F   +43 (2231) 61899-99
mail to: andreas.pieber@schmutterer-partner.at

Re: [C3] StAX research reveiled!

Posted by Sylvain Wallez <sy...@apache.org>.
Michael Seydl wrote:
> Hi all!
>
> One more mail for the student group! Behind this lurid topic hides our 
> evaluation of the latest XML processing technologies regarding their 
> usability in Cocoon3 (especially if there are suited to be used in a 
> streaming pipeline).
> As it's commonly know we decided to use StAX as our weapon of choice 
> to do the XML, but this paper should explain the whys and hows and 
> especially the way we took to come to our decision, which resulted in 
> using the very same API.
> Eleven pages should be a to big read and it contains all necessary 
> links to all the APIs we evaluated and also line wise our two cents 
> about the API we observed. Concludingly we also tried to show the 
> difference between the currently used SAX and the of us proposed StAX 
> API.
>
> I hope this work sheds some light on our decision making and taking 
> and that someone dares to read it.
>
> That's from me, I wish you all a pleasant and very merry Christmas!
>
> Regards,
> Michael Seydl

Good work and an interesting read, but I don't agree with some of its statements!

The big if/else or switch statements mentioned as a drawback of the 
cursor API (XMLStreamReader) in 1.2.4 also apply to the event API, since 
it provides abstract events whose type also needs to be inspected to 
decide what to do.

The drawbacks of the stream API compared to the event API are, as you 
mention, that some methods of XMLStreamReader will throw an exception 
depending on the current event's type and that the event is not 
represented as a data structure that can be passed directly to the next 
element in the pipeline or stored in an event buffer.

The first point (exceptions) should not happen, unless the code is buggy 
and tries to get information that doesn't belong to the context. I have 
used the cursor API many times and haven't found any usability problems 
with it.
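
Typical cursor code checks the event type anyway, so the guard against 
calling a method out of context comes for free. A minimal example:

import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ElementCounter {
    /** Counts elements with the given local name. */
    int count(XMLStreamReader reader, String name) throws XMLStreamException {
        int count = 0;
        while (reader.hasNext()) {
            // getLocalName() is only called when the current event is a
            // START_ELEMENT, i.e. when it is legal to call it.
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && reader.getLocalName().equals(name)) {
                count++;
            }
        }
        return count;
    }
}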

The second point (lack of data structure) can be easily solved by using 
an XMLEventAllocator [1] that creates an XMLEvent from the current state 
of an XMLStreamReader.
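
For example (how the allocator instance is obtained depends on the StAX 
implementation - XMLInputFactory has getEventAllocator()/setEventAllocator(), 
and stax-utils provides an implementation):

import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.XMLEvent;
import javax.xml.stream.util.XMLEventAllocator;

class EventOnDemand {
    private final XMLEventAllocator allocator;

    EventOnDemand(XMLEventAllocator allocator) {
        this.allocator = allocator;
    }

    /** Materializes an XMLEvent from the cursor's current state - only
        when a buffered data structure is actually needed. */
    XMLEvent snapshot(XMLStreamReader reader) throws XMLStreamException {
        return allocator.allocate(reader);
    }
}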

The event API has the major drawback of always creating a new object for 
every event (since, as the javadoc says, "events may be cached and 
referenced after the parse has completed"). This can put a big strain on 
the memory system and garbage collection in a busy application.

So the cursor API is the most efficient IMO when it comes to consuming 
data, since it doesn't require creating useless event objects.

Now in a pipeline context, we will want to transmit events untouched 
from one component to the next one, using some partial buffering as 
mentioned in earlier discussions. A FIFO of XMLEvent objects seems to be 
the natural solution for this, but would require the use of events at 
the pipeline API level, with their associated costs mentioned above.

So what should be used for pipelines? My impression is that we should 
stick to the most efficient API and build the simple tools needed to 
buffer events from a StreamReader, taking inspiration from the 
XMLBytestreamCompiler we already have.

Sylvain

[1] 
https://stax-utils.dev.java.net/nonav/javadoc/api/javax/xml/stream/util/XMLEventAllocator.html

-- 
Sylvain Wallez - http://bluxte.net