Posted to dev@cocoon.apache.org by Daniel Fagerstrom <da...@nada.kth.se> on 2002/12/17 01:43:58 UTC

[RT] Input Pipelines (long)

Input Pipelines
===============

There is, IMO, a need for better support for input handling in
Cocoon. I believe that the introduction of "input pipelines" can be an
important step in this direction. In the rest of this (long) RT I will
discuss use cases for them, propose a possible definition of input
pipelines, compare them with the existing pipeline concept in Cocoon
(henceforth called output pipelines), discuss what kinds of components
would be useful in them and how they can be used in the sitemap and
from flowscripts, and also relate them to the current discussion about
how to reuse functionality ("Cocoon services") between blocks.

Use cases
---------

There is an ongoing trend of packaging all kinds of applications as web
applications or decomposing them into sets of web services. At the same
time web browsers are more and more becoming a universal GUI for all
kinds of applications (e.g. XUL).

This leads to an increasing need for handling structured input data
in web applications. SOAP might be the most important example, but we
also have XML-RPC and certainly numerous home-brewed formats, some of
which might even be binary non-xml legacy formats. WebDAV is another
example of xml input, and the next generation of form handling,
XForms, uses xml as its transport format.

As people are building more and more advanced Cocoon systems there is
also a growing need for reusing functionality in a structured way;
there have been discussions about how to package and reuse "Cocoon
services" in the context of blocks ([1] and [2]). Here there is also a
need for handling xml input.

The company I work for builds data warehouses. Some of our customers
are starting to get interested in using the functionality of the data
warehouses not only from the web interfaces that we usually build
but also as parts of their own webapps. This means that, besides
Cocoon's flexibility in presenting data in different forms, we also
want flexibility in accepting requests for the data in different input
formats.

There is thus a world of input beyond the request parameters, and a
world of rapidly growing importance.

Does Cocoon support the abovementioned use cases? Yes and no: there
are numerous components that implement SOAP, WebDAV, parts of XForms
etc. But while the components designed for publishing are highly
reusable in various contexts, this is not the case for input
components. IMO the reason for this is that Cocoon as a framework does
not have much support for input handling.

IMO Cocoon could be as good at handling input as it currently is at
creating output, by reusing exactly the same concept: pipelines. We
cannot, however, use the existing "output pipelines" as is; there are
some asymmetries in their design that make them unsuitable for input.

The term "input pipeline" has sometimes been used on the list; it is
time to try to define what it could be.

What is an Input Pipeline
-------------------------

An input pipeline typically starts by reading octet data from the
input stream of the request object. The input data could be xml,
tab-separated data, text structured according to a certain grammar,
binary legacy formats like Excel or Word, or anything else that can be
translated to xml. The first step in the input pipeline is an adapter
from octet data to sax events. This sounds quite similar to a
generator; we will return to this in the next section.
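
Such an adapter is easy to picture in plain Java. The sketch below
(JDK only; the class names are invented for illustration and are not
existing Cocoon components) reads the octet stream handed over by the
environment and pushes sax events at a consumer:

```java
import java.io.InputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical first step of an input pipeline: octets in, sax events out.
// With no src attribute given, the stream would come from the environment
// (the servlet request input stream), i.e. "standard input".
class InputStreamGenerator {
    void generate(InputStream in, DefaultHandler consumer) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(in), consumer);
        } catch (Exception e) {
            // non-well-formed input: fail fast here; see the section on
            // error handling for what should really happen
            throw new RuntimeException(e);
        }
    }
}

// A trivial consumer, standing in for the rest of the pipeline.
class ElementCounter extends DefaultHandler {
    int elements = 0;
    @Override
    public void startElement(String uri, String local, String qName,
                             org.xml.sax.Attributes atts) {
        elements++;  // each start tag is one sax event
    }
}
```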

The xml from the first step of the pipeline might not be structured in
a form that suits the data model we would like to use internally in
the system. Reasons for this can be that the xml input is supposed to
follow some standard or some customer-defined format. Input adapters
for legacy formats will probably produce xml that mirrors the input
format and repeats all kinds of idiosyncrasies from it. There is thus
a need to transform the input xml to an xml format better suited to
our application-specific needs. One or several xslt transformer steps
would therefore be useful in the input pipeline.

As a last step in the input pipeline the sax events should be adapted
to some binary format so that e.g. the business logic in the system
can be applied to it. The xml input could e.g. be serialized to an
octet stream for storage in a file (as text, xml, pdf, images, ...),
transformed to java objects for storage in the session object, or be
put into an xml db or a relational db.
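
As a rough illustration, a content handler like the following (plain
JDK; the class name and the use of a Map standing in for a session
attribute are my own invention) could play the role of such a final
adapter, folding leaf elements of the sax stream into a java object:

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical last step of an input pipeline: sax events in, an
// in-memory java object out (a Map here, standing in for a session
// attribute or a db row).
class SessionObjectSerializer extends DefaultHandler {
    final Map<String, String> model = new HashMap<>();  // the stored result
    private StringBuilder text;

    @Override
    public void startElement(String uri, String local, String qName,
                             Attributes atts) {
        text = new StringBuilder();        // start collecting character data
    }
    @Override
    public void characters(char[] ch, int start, int len) {
        if (text != null) text.append(ch, start, len);
    }
    @Override
    public void endElement(String uri, String local, String qName) {
        if (text != null) model.put(qName, text.toString());
        text = null;                       // leaf elements become map entries
    }
}
```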

Isn't this exactly what an output pipeline does?

Comparison to Output Pipelines
------------------------------

Both an input and an output pipeline consist of an adaptor from
a binary format to sax events, followed by a (possibly empty) sequence
of transformers that take sax events as input as well as output. The
last step is an adaptor from sax events back to a binary format. The
main difference (and the one I will focus on) is how the binary input
and output are connected to the pipeline.

Let us look at an example of an output pipeline:

<match pattern="*.html">
   <generate type="xml" src="{1}.xml"/>
   <transform type="xsl" src="foo.xsl"/>
   <serialize type="html"/>
</match>

The input to the pipeline is controlled from the sitemap by the src
attribute of the generator, while the output from the serializer can't
be controlled from the sitemap; the context in which the sitemap is
used is responsible for directing the output to an appropriate
place. If the pipeline is used from a servlet, the output is
directed to the output stream of the response object in the servlet.
If it is used from the command line, the output is redirected to a
file. If it is used in the cocoon: protocol, the output is
redirected to be used as input for the src attribute of e.g. a
generator or a transformer (cf. Carsten's and my posts in
[1] about the semantics of the cocoon: protocol).

Here is another example:

<match pattern="bar.pdf">
   <generate type="xsp" src="bar.xsp"/>
   <transform type="xsl" src="foo.xsl"/>
   <serialize type="pdf"/>
</match>

In this case the binary input is taken from the object model and the
component manager in Cocoon, and the input file to the generator,
"bar.xsp", describes how to extract the input and how to structure it
as an xml document.

If we compare a Cocoon output pipeline with a unix pipeline, it always
ignores standard input and always writes to standard output. An input
pipeline would be the opposite: it would always read from standard
input and ignore standard output. In Cocoon this would mean that the
input source is set by the context. In a servlet, input would be
taken from the input stream of the request object. We could also have
a writable cocoon: protocol where the input stream is set by the
user of the protocol; more about that later (see also my post in the
thread [1]).

An example:

<match pattern="**.xls">
   <generate type="xls"/>
   <transform type="xsl" src="foo.xsl"/>
   <serialize type="xml" dest="context://repository/{1}.xml"/>
</match>

Here the generator reads an Excel document from the input stream that
is supplied by the context and translates it to some xml format. The
serializer writes its xml input to the file system. I reused the names
generator and serializer partly because I didn't find any good names
(deserializer is the inverse of serializer, but what is the inverse of
a generator?), and partly because IMO it would be the best solution if
the generator and serializer from output pipelines could be extended
to be usable in input pipelines as well. Several of the existing
generators would be highly usable in input pipelines if they were
modified so that they read from "standard input" when no src attribute
is given. There are also some serializers that would be useful in
input pipelines; in that case the output stream given in the dest
attribute should be used instead of the one supplied by the
context. It can of course be problematic to extend the definition of
generators and serializers, as it might lead to back compatibility
problems.

Another example of an input pipeline:

<match pattern="in">
   <generate type="textparser">
     <parameter name="grammar" value="example.txt"/>
   </generate>
   <transform type="xsl" src="foo.xsl"/>
   <serialize type="xsp" src="toSql.xsp"/>
</match>

In this example the serializer modifies the content of components that
can be found in the object model and the component manager. We use a
hypothetical "output xsp" language to describe how to modify the
environment. Such a language could be a little like xslt in the
sense that it recursively applies templates (rules) with matching
xpath patterns, but the templates would contain custom tags that have
side effects instead of just emitting xml. Could such a language be
implemented in Jelly? It would be useful to have custom tags that
modify the session object, write to sql databases, connect with
business logic and so on.
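
A tiny sketch of how such a rule-driven, side-effecting last step
could be interpreted (plain Java; all names are invented, and a Map
stands in for the session or a db):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical interpreter for an "output xsp"-like language: rules
// matched against incoming elements trigger side effects (store in the
// session, issue sql, call business logic) instead of emitting xml.
class SideEffectSerializer {
    // rule table: element name -> action taking (elementName, textContent)
    private final Map<String, BiConsumer<String, String>> rules = new HashMap<>();

    void rule(String element, BiConsumer<String, String> action) {
        rules.put(element, action);
    }

    // called by the pipeline for each leaf element seen in the sax stream
    void element(String name, String text) {
        BiConsumer<String, String> action = rules.get(name);
        if (action != null) action.accept(name, text);  // side effect only
    }
}
```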

Error Handling
--------------

Error handling in input pipelines is even more important than in
output pipelines: we must protect the system against non-well-formed
input, and the user must be given detailed enough information about
what is wrong, while in many cases having no access to log files or to
the internals of the system.

Examples of things that can go wrong are that the input is not
parsable or that it is not valid with respect to some grammar or
schema. If we want input pipelines to work in streaming mode, without
unnecessary buffering, it is impossible to know that the input data is
correct until all of it is processed. This means that a serializer
might already have stored some parts of the pipeline data when an
error is detected. I think that serializers for which faulty input
data would be unacceptable should use some kind of transactions, and
that they should be notified when something goes wrong earlier in the
pipeline so that they are able to roll back the transaction.
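
The buffering variant of such a transactional serializer is simple to
sketch (plain Java; the names are invented): nothing reaches the real
destination until the whole pipeline has succeeded.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical serializer with transactional behaviour: everything it
// writes is buffered, and only commit() moves the buffer to the real
// destination. An error earlier in the pipeline triggers rollback().
class TransactionalSerializer {
    private final OutputStream destination;
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    TransactionalSerializer(OutputStream destination) {
        this.destination = destination;
    }

    void write(byte[] data) {
        buffer.write(data, 0, data.length);   // staged, not yet visible
    }

    void commit() {                           // only if all input was valid
        try {
            buffer.writeTo(destination);
            destination.flush();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    void rollback() {
        buffer.reset();                       // upstream error: discard all
    }
}
```

A fully streaming serializer would instead open a real transaction
(e.g. on the db) and roll it back; buffering is shown here only
because it keeps the sketch self-contained.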

I have not studied the error handling system in Cocoon; maybe there
already are mechanisms that could be used in input pipelines as well?

In Sitemaps
-----------

In a sitemap an input pipeline could be used e.g. for implementing a
web service:

<match pattern="myservice">
   <generate type="xml">
     <parameter name="scheme" value="myInputFormat.scm"/>
   </generate>
   <transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
   <serialize type="dom-session" non-terminating="true">
     <parameter name="dom-name" value="input"/>
   </serialize>
   <select type="pipeline-state">
     <when test="success">
       <act type="my-business-logic"/>
       <generate type="xsp" src="collectTheResult.xsp"/>
       <serialize type="xml"/>
     </when>
     <when test="non-valid">
       <!-- produce an error document -->
     </when>
   </select>
</match>

Here we first have an input pipeline that reads and validates xml
input, transforms it to some appropriate format, and stores the result
as a dom tree in a session attribute. A serializer normally means that
the pipeline should be executed and the sitemap thereafter exited. I
used the attribute non-terminating="true" to mark that the input
pipeline should be executed but that there is more to do in the
sitemap afterwards.

After the input pipeline there is a selector that selects the output
pipeline depending on whether the input pipeline succeeded. This use
of selection has some relation to the discussion about pipe-aware
selection (see [3] and the references therein). It would solve at
least my main use cases for pipe-aware selection, without having its
drawbacks: Stefano considered pipe-aware selection a mix of concerns;
selection should be based on meta data (pipeline state) rather than on
data (pipeline content). There were also some people who didn't like
my use of buffering of all input to the pipe-aware selector. IMO the
use of selectors above solves both of these issues.

The output pipeline starts with an action that takes care of the
business logic for the application. This is IMHO a more legitimate use
for actions than the current mix of input handling and business logic.

In Flowscripts
--------------

IIRC the discussion and examples of input for flowscripts have so far
mainly dealt with request-parameter-based input. If we want to use
flowscripts for describing e.g. web service flow, more advanced input
handling is needed. IMO it would be an excellent SoC to use output
pipelines for the presentation of the data used in the system, input
pipelines for going from input to system data, java objects (or some
other programming language) for describing the business logic working
on the data within the system, and flowscripts for connecting all this
in an appropriate temporal order.
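
Reduced to plain Java for illustration (in Cocoon the ordering would
live in a flowscript, and every name below is invented), the proposed
division of labour looks like this:

```java
// Sketch of the proposed separation of concerns: the flow layer only
// supplies the temporal order; input pipeline, business logic and
// output pipeline each own their step.
class ServiceFlow {
    String handle(String payload) {
        String data = inputPipeline(payload);   // input -> system data
        if (data == null) return "<error/>";    // state-dependent selection
        String result = businessLogic(data);    // logic on internal data only
        return outputPipeline(result);          // presentation
    }

    // stand-ins for the three concerns:
    String inputPipeline(String payload) {
        return payload.isEmpty() ? null : payload.trim();
    }
    String businessLogic(String data) {
        return data.toUpperCase();
    }
    String outputPipeline(String result) {
        return "<response>" + result + "</response>";
    }
}
```

The point is only the ordering: the flow layer never touches xml
parsing or business rules itself.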

For Reuseability Between Blocks
-------------------------------

There have been some discussions about how to reuse functionality
between blocks in Cocoon (see the threads [1] and [2] for
background). IMO (cf. my post in the thread [1]), a natural way of
exporting pipeline functionality is to extend the cocoon pseudo
protocol so that it accepts input as well as produces output. The
protocol should also be extended so that input as well as output can
be any octet stream, not just xml.

If we extend generators so that their input can be set by the
environment (as proposed in the discussion about input pipelines), we
have what is needed for creating a writable cocoon protocol. The web
service example in the section "In Sitemaps" could then also be used
as an internal service, exported from a block.

Both input and output for the extended cocoon protocol can be both
xml and non-xml, which gives us four cases:

xml input, xml output: could be used from a "pipeline" transformer;
the input to the transformer is redirected to the protocol and the
output from the protocol is redirected to the output of the
transformer.

non-xml input, xml output: could be used from a generator.

xml input, non-xml output: could be used from a serializer.

non-xml input, non-xml output: could be used from a reader if the
input is ignored, from a "writer" if the output is ignored, and from a
"reader-writer" if both are used.

Generators that accept xml should of course also accept sax events
for efficiency reasons, and serializers that produce xml should for
the same reason also be able to produce sax events.

Conclusion
----------

The ability to handle structured input (e.g. xml) in a convenient way
will probably be an important requirement for webapp frameworks in the
near future.

By removing the asymmetry between generators and serializers (letting
the input of a generator be set by the context and the output of a
serializer be set from the sitemap), Cocoon could IMO become as good
at handling input as it is today at producing output.

This would also make it possible to introduce a writable as well as
readable Cocoon pseudo protocol, which would be a good way to export
functionality from blocks.

There are of course many open questions, e.g. how to implement these
ideas without introducing too much back incompatibility.

What do you think?

/Daniel Fagerstrom

References
----------

[1] [RT] Using pipeline as sitemap components (long)
http://marc.theaimsgroup.com/?t=103787330400001&r=1&w=2

[2] [RT] reconsidering pipeline semantics
http://marc.theaimsgroup.com/?t=102562575200001&r=2&w=2

[3] [Contribution] Pipe-aware selection
http://marc.theaimsgroup.com/?l=xml-cocoon-dev&m=101735848009654&w=2






Re: [RT] Input Pipelines: Storage and Selection (was Re: [RT] Input Pipelines (long))

Posted by Stefano Mazzocchi <st...@apache.org>.
Sorry it took me so long.

Daniel Fagerstrom wrote:
> The discussion about input pipelines can be divided into two parts:
> 1. Improving the handling of the input stream in Cocoon. This is needed
> for web services; it is also needed to make it possible to implement
> a writable cocoon: protocol, something that IMO would be very useful for
> reusing functionality in Cocoon, especially from blocks.
> 
> 2. The second part of the proposal is to use two pipelines, executed in
> sequence, to respond to input in Cocoon. The first pipeline (called the
> input pipeline) is responsible for reading the input, from request
> parameters or from the input stream, transforming it to an appropriate
> format and storing it in e.g. a session parameter, a file or a db. After
> the input pipeline there is an ordinary (output) pipeline that is
> responsible for generating the response. The output pipeline is executed
> after the execution of the input pipeline has completed; as a
> consequence, actions and selections in the output pipeline can depend
> e.g. on whether the handling of input succeeded and on the data that
> was stored by the input pipeline.
> 
> Here I will focus on your comments on the second part of the proposal.

Ok.

I'm leaving a bunch of stuff uncut because I don't know where to cut the 
context.

>  > Daniel Fagerstrom wrote:
> <snip/>
>  >> In Sitemaps
>  >> -----------
>  >>
>  >> In a sitemap an input pipeline could be used e.g. for implementing a
>  >> web service:
>  >>
>  >> <match pattern="myservice">
>  >>   <generate type="xml">
>  >>     <parameter name="scheme" value="myInputFormat.scm"/>
>  >>   </generate>
>  >>   <transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
>  >>   <serialize type="dom-session" non-terminating="true">
>  >>     <parameter name="dom-name" value="input"/>
>  >>   </serialize>
>  >>   <select type="pipeline-state">
>  >>     <when test="success">
>  >>       <act type="my-business-logic"/>
>  >>       <generate type="xsp" src="collectTheResult.xsp"/>
>  >>       <serialize type="xml"/>
>  >>     </when>
>  >>     <when test="non-valid">
>  >>       <!-- produce an error document -->
>  >>     </when>
>  >>   </select>
>  >> </match>
>  >>
>  >> Here we have first an input pipeline that reads and validates xml
>  >> input, transforms it to some appropriate format and store the result
>  >> as a dom-tree in a session attribute. A serializer normally means that
>  >> the pipeline should be executed and thereafter an exit from the
>  >> sitemap. I used the attribute non-terminating="true", to mark that
>  >> the input pipeline should be executed but that there is more to do in
>  >> the sitemap afterwards.
>  >>
>  >> After the input pipeline there is a selector that select the output
>  >> pipeline depending of if the input pipeline succeed or not. This use
>  >> of selection have some relation to the discussion about pipe-aware
>  >> selection (see [3] and the references therein). It would solve at
>  >> least my main use cases for pipe-aware selection, without having its
>  >> drawbacks: Stefano considered pipe-aware selection mix of concern,
>  >> selection should be based on meta data (pipeline state) rather than on
>  >> data (pipeline content). There were also some people who didn't like
>  >> my use of buffering of all input to the pipe-aware selector. IMO the
>  >> use of selectors above solves booth of these issues.
>  >>
>  >> The output pipeline start with an action that takes care about the
>  >> business logic for the application. This is IMHO a more legitimate use
>  >> for actions than the current mix of input handling and business logic.
>  >
>  >
>  > Wouldn't the following pipeline achieve the same functionality you want
>  > without requiring changes to the architecture?
>  >
>  > <match pattern="myservice">
>  >   <generate type="payload"/>
>  >   <transform type="validator">
>  >     <parameter name="scheme" value="myInputFormat.scm"/>
>  >   </transform>
>  >   <select type="pipeline-state">
>  >     <when test="valid">
>  >       <transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
>  >       <transform type="my-business-logic"/>
>  >       <serialize type="xml"/>
>  >     </when>
>  >     <otherwise>
>  >       <!-- produce an error document -->
>  >     </otherwise>
>  >   </select>
>  > </match>
> 
> Yes, it would achieve about the same functionality as I want, and it
> could easily be implemented with the help of the small extensions of the
> sitemap interpreter that I implemented for pipe-aware selection [3].
> 
> I think it could be interesting to do a detailed comparison of the
> differences between our proposals: how the input stream and validation
> are handled, how the selection based on pipeline state is performed,
> whether storage of the input is done in a serializer or in a
> transformer, and how the new output is created.

Ok, let's go.

> Input Stream
> ------------
> 
> For input stream handling you used
> 
>   <generate type="payload"/>
> 
> Is the payload generator equivalent to the StreamGenerator? Or does it
> do something more, like switching parsers depending on the mime type of
> the input stream?

I really don't think this is important. We are basically discussing if 
the current sitemap architecture is good enough for what you want.

Once the Cocoon Environment is more balanced toward input, you can have 
a uber-payload-generator that does everything and brews beer, or you can 
have your own small personal generator that does what you want.

My point was: why ask for two pipelines when you can do the same
thing with one?

> I used
> 
>   <generate type="xml"/>
> 
> The idea is that if no src attribute is given, the sitemap interpreter
> automatically connects the generator to the input stream of the
> environment (the input stream from the http request in the servlet case;
> in other cases it is less clear). This behavior was inspired by the
> handling of std input in unix pipelines.

Hmmm, interesting concept indeed, but I wonder if it's really meaningful
in our context. I mean, maybe there are generators that don't need src 
and don't rely on input. But an idiotic TimeGenerator is the only one I 
can think of... and that really doesn't stand up as an argument, does it?

> Nicola Ken proposed:
> 
>   <generate type="xml" src="inputstream://"/>
> 
> I prefer this solution to mine as it doesn't require any change to
> the sitemap interpreter, and I also believe that it is easier to
> understand as it is more explicit. It also (as Nicola Ken has explained)
> gives good SoC: the uri in the src attribute describes where to read
> the resource from, e.g. input stream, file, cvs, http, ftp, etc., and
> the generator is responsible for how to parse the resource. If we
> develop an input stream protocol, all the work invested in the existing
> generators can immediately be reused in web services.

It is true that this reduces the number of required generators. But
there is something about it that disturbs me, even if I can't really
tell you what it is rationally... hmmm...

> Validation
> ----------
> 
> Should validation be part of the parsing of input as in:
> 
>   <generate type="xml">
>     <parameter name="scheme" value="myInputFormat.scm"/>
>   </generate>
> 
> or should it be a separate transformation step:
> 
>   <transform type="validator">
>     <parameter name="scheme" value="myInputFormat.scm"/>
>   </transform>
> 
> or maybe the responsibility of the protocol as Nicola Ken proposed in 
> one of his posts:
> 
>   <generate type="xml" src="inputstream:myInputFormat.scm"/>
> 
> This is not a question about architecture but rather one about finding 
> "best practices".
> 
> I don't think validation should be part of the protocol. 

I disagree. Quite strongly, actually. Consider xinclude or any xml
expansion that changes the stream infoset. You could have valid
templates and valid fragments and still have invalid results
(namespaces make the whole thing very tricky... and in the future
we'll need the ability to mix tons of them; think FO+SVG+MathML for a
normal example).

Now, if our xml-processing architecture is balanced enough, people might 
want to use xinclude transformers to juice-up their SOAP-processing 
pipelines. At that point, where do you validate?

Keeping the validation at a separate level helps because:

  1) validation becomes explicit and infoset-transparent, in the spirit 
of RelaxNG.

  2) multiple validation is possible (in the spirit of Xpipe)

  3) pipeline authors are more aware of validation issues as pipeline 
processing stages.

> It means that
> the protocol has to take care of the parsing, and that would muddle the
> SoC, where the protocol is responsible for locating and delivering the
> stream and the generator is responsible for parsing it, that Nicola Ken
> has argued for in his other posts.

Well, the problem is that relating the concept of validation to the 
concept of parsing and infoset production/augmentation is a *MISTAKE* 
that the XML specification perpetuated from the SGML days.

Please, let's stop it once and for all. Putting validation as an
implicit stage of parsing would set us back at least 5 years in markup
technologies design.

> Should validation be part of the generator or a transform step? I don't 
> know. 

Transformation, for the simple reason that you might need to validate a 
pipeline more than once.

> If the input is not xml, as for the ParserGenerator, I guess that
> the validation must take place in the generator. If the xml parser
> validates the input as part of the parsing, it is more practical to let
> the generator be responsible for validation (IIRC Xerces2 has an
> internal pipeline structure and performs validation in a
> transformer-like way, so for Xerces2 it would probably be as efficient
> to do validation in a transformer as in a generator).

Note that the fact of including the *location* of a schema inside a 
document is another huge mistake perpetuated because XML failed to 
describe schema catalogs.

A document should indicate what "type" of document it is (something like 
the public DTD identifier) and let the system find out *how* to validate 
that document type.

> Otherwise it seems to
> give better SoC to separate the parsing and the validation steps, so
> that we can have one validation transformer for each schema language.

No, not if the description of the document is done properly (NOTE:
even JClark still hasn't figured out a way to address the issue).

I would do it like this

  <?xml version="1.0"?>
  <document xml:type="http://apache.org/document/1.1/">
   ...
  </document>

and then it's up to the processor to understand how to validate a 
document type indicated by that URI.

NOTE: it's not a namespace URI, but an identifier for the type of
document that we are using. Of course, the same identifier can be used
in both cases. For example:

  <?xml version="1.0"?>
  <d:document
    xml:type="http://apache.org/document/1.1/"
    xmlns:d="http://apache.org/document/1.1/">
   ...
  </d:document>

> In some cases it might be practical to augment the xml document with
> error information, to be able to give more exact user feedback on where
> the errors are located. For such applications it seems more natural to
> me to have validation in a transformer.
> 
> A question that might have architectural consequences is how the 
> validation step should report validation errors.

Agreed.

> If the input is not
> parseable at all, there is not much more to do than throw an exception
> and let the ordinary internal error handler report the situation. If
> some of the elements or attributes in the input have the wrong type, we
> probably want to return more detailed feedback than just the internal
> error page. Some possible validation error reporting mechanisms are:
> storing an error report object in the environment, e.g. in the object
> model; augmenting the xml document with error reporting attributes or
> elements; throwing an exception object that contains a detailed error
> description object; or a combination of some of these mechanisms.
> 
> Mixing data and state information was considered a bad practice in
> the discussion about pipe-aware selection (see references in [3]);
> that rules out using only augmentation of the xml document as the
> error reporting mechanism. Throwing an exception would AFAIU lead to
> difficulties in giving customized error reports. So I believe it would
> be best to put some kind of state-describing object in the environment
> and possibly combine this with augmentation of the xml document.

Yes, that would be my assumption too. And in case there is the need to 
incorporate those validation mistakes back into the content, a 
transformer (maybe even an XSLT stylesheet) can do that.

This seems the cleanest solution to me.

> Pipe State Dependent Selection
> ------------------------------
> 
> For selecting response based on if the input document is valid or not 
> you suggest the following:
> 
> ...
>   <transform type="validator">
>     <parameter name="scheme" value="myInputFormat.scm"/>
>   </transform>
>   <select type="pipeline-state">
>     <when test="valid">
>       <transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
> ...
> 
> As I mentioned earlier this could easily be implemented with the 
> "pipe-aware selection" code I submitted in [3]. Let us see how it would 
> work:
> 
> The PipelineStateSelector cannot be executed at pipeline construction
> time as ordinary selectors are.

Gosh, you're right, I didn't think about that.

> The pipeline before the selector 
> including the ValidatorTransformer must have been executed before the 
> selection is performed. This can be implemented by letting the 
> PipelineStateSelector implement a special marker interface, say 
> PipelineStateAware, so that it can have special treatment in the 
> selection part of the sitemap interpreter.

yes

> When the sitemap interpreter gets a PipelineStateAware selector, it
> first ends the currently constructed pipeline with a serializer that
> stores its sax input in e.g. a dom tree; the pipeline is processed and
> the dom tree with the cached result is stored in e.g. the object model.
> In the next step the selector is executed, and it can base its decision
> on the result from the first part of the pipeline. If the
> ValidationTransformer puts a validation result descriptor in the object
> model, the PipelineStateSelector can perform tests on this result
> descriptor. In the last step a new pipeline is constructed where the
> generator reads from the stored dom tree and, in the example above, the
> first transformer will be an XSLTTransformer.

we are reaching the point where pipeline selection cannot be processed 
"a-priori" but must include information on the run-time environment.

As much as I didn't like pipe-aware selection, I do agree that 
validation-aware selection is a special pipe-aware selection but it *IS* 
very important and must be taken in to consideration.

Hmmm, this kinda sheds a totally different light on the concept of
selection (which has an interesting side effect in making selectors and
matchers even more different than they are today).

> An alternative and more explicit way to describe the pipeline state 
> dependent selection above, is:
> 
> ...
>   <transform type="validator">
>     <parameter name="scheme" value="myInputFormat.scm"/>
>   </transform>
>   <serialize type="object-model-dom" non-terminating="true">
>     <parameter name="name" value="validated-input"/>
>   </serialize>
>   <select type="pipeline-state">
>     <when test="valid">
>       <generate type="object-model-dom">
>         <parameter name="name" value="validated-input"/>
>       </generate>
>       <transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
> ...
> 
> Here the extensions to the current Cocoon semantics are put in the
> serializer instead of the selector. The sitemap interpreter treats a
> non-terminating serializer as an ordinary serializer in the sense that
> it puts the serializer at the end of the current pipeline and executes
> it. The difference is that instead of returning to the caller of the
> sitemap interpreter, it creates a new current pipeline and continues to
> interpret the components after the serializer, in this case a selector.
> The sitemap interpreter will also ignore the output stream of the
> serializer; the serializer is supposed to have side effects. The new
> current pipeline will then get an ObjectModelDOMGenerator as generator
> and an XSLTTransformer as its first transformer.

No, I'm sorry but I don't like this. I totally don't like the abuse of 
serializers for this concept of 'intermediate-non-sax-stream' components. 
It's potentially very dangerous, I see an incredible potential for abuse.

What do others think about this concept of pipelining pipelines? isn't 
this kind of recursion the mark of FS?

> I prefer this construction to the more implicit one because it is more 
> obvious what it does and because it gives more freedom in how to store 
> the user input. 

True, but it also gives people more ability to abuse the system. Think 
about internal pipelines, and views, and resources and aggregation... 
have you thought about all the potential uses of this pipelining of 
pipelines in all current sitemap use cases?

you are, in fact, proposing a *MAJOR* change in the way the pipelines 
are set up. In short, more freedom and less pipeline granularity... but 
sometimes it's good to make it harder for people to come up with 
something... so they *THINK* about it.

Maybe I'm being too conservative, but I'm very afraid of all those 
unplanned (and unwanted) changes that these new chained pipelines could 
produce... (besides, how do you stop them from wanting more than two 
pipelines? should we? would you also like to chain a pipeline with a 
reader and then another pipeline?)

> Some people seem to prefer to store user input 
> in Java beans; in some applications session parameters might be a better 
> place than the object model.

I've seen the ugliest sitemaps coming out of exactly that concept of 
storing everything in the sitemap and then parsing it back into the 
pipeline... believe me, it's more abused than used correctly as it is 
right now.


> 
> Pipelines with Side Effects
> ---------------------------
> 
> A common pattern in pipelines that handle input (at least in the 
> applications that I write) is that the first half of the pipeline takes 
> care of the input and ends with a transformer that stores the input. The 
> transformer can be e.g. the SQLTransformer (with insert or update 
> statements), the WriteDOMSessionTransformer or the 
> SourceWritingTransformer. These transformers have side effects, they 
> store something, and return an xml document that tells whether the 
> operation succeeded or not. A conclusion from the threads about pipe-aware 
> selection was that sending meta data, like whether the operation succeeded 
> or not, in the pipeline is a bad practice, and especially that we should 
> not allow selection based on such content. Given that these transformers 
> basically translate xml input to a binary format and generate an xml 
> output that we are supposed to ignore, it would IMO be more natural to 
> see them as some kind of serializer.
> 
> The second half of the pipeline creates the response; here it is less 
> obvious what transformer to use. I normally use an XSLTTransformer, 
> typically ignoring its input stream and only creating an xml document that 
> is rendered into e.g. html in a subsequent transformer.
> 
> I think that it would be more natural to replace the pattern:
> 
>   ...
>   <transform type="store something, return state info"/>
>   <transform type="create a response document, ignore input"/>
>   ...
> 
> with
> 
>   ...
>   <serialize type="store something, put state info in the environment"
>              non-terminating="true"/>
>   <generate type="create a response document" src="response document"/>
>   ...
> 
> If we give the serializer a destination attribute as well, all the 
> existing serializers could be used for storing input in files etc.
> 
>   ...
>   <serialize type="xml" dest="xmldb://..." non-terminating="true"/>

Now, let me ask you something: how much have you been playing with the 
FlowScript?

A while ago I proposed the ability to call a pipeline from the 
flowscript while specifying the output stream that the serializer should 
use. Basically, the flow can then use a pipeline as a tool to do stuff 
without necessarily being tied to the client.

In all your discussion you have been placing a bunch of flow logic (how 
to move from one pipeline to the next) into the sitemap. I'd suggest to 
move it where it belongs (the flow) and let the sitemap do its job 
(defining pipelines that others can use).

Why? well, while the concept of stateless output is inherently 
declarative, the concept of stateless input + output is declarative for 
the match and procedural for its internals.

So, I wonder, why don't we leave the declarative part to the sitemaps 
and use the flow as our procedural glue?

>   ...
> 
> This would give the same SoC that I argued in favour of in the context 
> of input: the serializer is responsible for how to serialize from xml to 
> the binary data format and the destination is responsible for where to 
> store the data.

This can be achieved with a flow method that includes a way to specify 
the output stream (or a WriteableSource, probably better) that the 
serializer has to use.
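
To make the idea concrete, here is a toy sketch in plain Java of a
pipeline call that serializes to whatever stream the caller supplies
(the method name and everything else here are illustrative stand-ins,
not an existing Cocoon API):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Toy model of "detaching" a pipeline from the client: the caller (the
// flow) hands the pipeline the output stream it should serialize to.
public class DetachedPipeline {

    /** Run a (fake) pipeline and serialize its result to the given stream. */
    static void processPipelineTo(String document, OutputStream out) {
        try {
            out.write(document.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // The "flow" decides the destination: here a buffer standing in
        // for a WriteableSource such as a file or an xmldb collection.
        ByteArrayOutputStream store = new ByteArrayOutputStream();
        processPipelineTo("<doc>stored</doc>", store);
        System.out.println(new String(store.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

The point of the sketch is only the inversion: the pipeline stays a
declarative producer, while the procedural decision of where its output
goes lives with the caller.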

> Conclusion
> ----------
> 
> I am afraid that I pose more questions than I answer in this RT. Many of 
> them are of a "best practice" character, do not have any architectural 
> consequences, and do not have to be answered right now. There are 
> however some questions that need an answer:
> 
> How should pipeline components, like the validation transformer, report 
> state information? Placing some kind of state object in the object model 
> would be one possibility, but I don't know.

The real problem is not where to store the data, IMO, but the fact that 
you showed that there is a serious need for run-time selection that 
can't be addressed with today's architecture.

> We seem to agree that there is a need for selection in pipelines 
> based on the state of the computation in the pipeline that precedes the 
> selection. 

Yes. I finally got to this conclusion.

> Here we have two proposals:
> 
> 1. Introduce pipeline state aware selectors (e.g. by letting the 
> selector implement a marker interface), and give such selectors special 
> treatment in the sitemap interpreter.
> 
> 2. Extend the semantics of serializers so that the sitemap interpreter 
> can continue to interpret the sitemap after a serializer, (e.g. by a new 
> non-terminating attribute for serializers).
> 
> I prefer the second proposal.

I prefer the first :)

> Both proposals can be implemented with no backward-compatibility problems 
> at all by requiring the selectors or serializers that need the extended 
> semantics to implement a special marker interface, and by adding code 
> that reacts to the marker interface in the sitemap interpreter.

Yes, I see that.

> To use serializers more generally for storing things, as I proposed 
> above, the Serializer interface would need to extend the 
> SitemapModelComponent interface.

Don't know about that. I like serializers the way they are, but I'd like 
to be able to detach them from the client output stream by using the 
flowscript.

-- 
Stefano Mazzocchi                               <st...@apache.org>
--------------------------------------------------------------------



---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org


[RT] Input Pipelines: Storage and Selection (was Re: [RT] Input Pipelines (long))

Posted by Daniel Fagerstrom <da...@nada.kth.se>.
Stefano Mazzocchi wrote:
 > Hmmm, maybe deep architectural discussions are good during holiday
 > seasons... we'll see :)
Not for me, I've been away from computers for a while. But you and 
Nicola Ken seem to have had an interesting discussion :)

The discussion about input pipelines can be divided into two parts:
1. Improving the handling of the input stream in Cocoon. This is needed 
for web services; it is also needed to make it possible to implement 
a writable cocoon:-protocol, something that IMO would be very useful for 
reusing functionality in Cocoon, especially from blocks.

2. The second part of the proposal is to use two pipelines, executed in 
sequence, to respond to input in Cocoon. The first pipeline (called the 
input pipeline) is responsible for reading the input, from request 
parameters or from the input stream, transforming it to an appropriate 
format and storing it in e.g. a session parameter, a file or a db. After 
the input pipeline there is an ordinary (output) pipeline that is 
responsible for generating the response. The output pipeline is executed 
after the execution of the input pipeline has completed; as a 
consequence, actions and selections in the output pipeline can depend 
e.g. on whether the handling of the input succeeded and on the data that 
was stored by the input pipeline.

Here I will focus on your comments on the second part of the proposal.

 > Daniel Fagerstrom wrote:
<snip/>
 >> In Sitemaps
 >> -----------
 >>
 >> In a sitemap an input pipeline could be used e.g. for implementing a
 >> web service:
 >>
 >> <match pattern="myservice">
 >>   <generate type="xml">
 >>     <parameter name="scheme" value="myInputFormat.scm"/>
 >>   </generate>
 >>   <transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
 >>   <serialize type="dom-session" non-terminating="true">
 >>     <parameter name="dom-name" value="input"/>
 >>   </serialize>
 >>   <select type="pipeline-state">
 >>     <when test="success">
 >>       <act type="my-business-logic"/>
 >>       <generate type="xsp" src="collectTheResult.xsp"/>
 >>       <serialize type="xml"/>
 >>     </when>
 >>     <when test="non-valid">
 >>       <!-- produce an error document -->
 >>     </when>
 >>   </select>
 >> </match>
 >>
 >> Here we have first an input pipeline that reads and validates xml
 >> input, transforms it to some appropriate format and stores the result
 >> as a dom tree in a session attribute. A serializer normally means that
 >> the pipeline should be executed and thereafter an exit from the
 >> sitemap. I used the attribute non-terminating="true" to mark that
 >> the input pipeline should be executed but that there is more to do in
 >> the sitemap afterwards.
 >>
 >> After the input pipeline there is a selector that select the output
 >> pipeline depending of if the input pipeline succeed or not. This use
 >> of selection have some relation to the discussion about pipe-aware
 >> selection (see [3] and the references therein). It would solve at
 >> least my main use cases for pipe-aware selection, without having its
 >> drawbacks: Stefano considered pipe-aware selection mix of concern,
 >> selection should be based on meta data (pipeline state) rather than on
 >> data (pipeline content). There were also some people who didn't like
 >> my use of buffering of all input to the pipe-aware selector. IMO the
 >> use of selectors above solves booth of these issues.
 >>
 >> The output pipeline starts with an action that takes care of the
 >> business logic for the application. This is IMHO a more legitimate use
 >> for actions than the current mix of input handling and business logic.
 >
 >
 > Wouldn't the following pipeline achieve the same functionality you want
 > without requiring changes to the architecture?
 >
 > <match pattern="myservice">
 >   <generate type="payload"/>
 >   <transform type="validator">
 >     <parameter name="scheme" value="myInputFormat.scm"/>
 >   </transform>
 >   <select type="pipeline-state">
 >     <when test="valid">
 >       <transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
 >       <transform type="my-business-logic"/>
 >       <serialize type="xml"/>
 >     </when>
 >     <otherwise>
 >       <!-- produce an error document -->
 >     </otherwise>
 >   </select>
 > </match>

Yes, it would achieve about the same functionality as I want, and it 
could easily be implemented with the help of the small extensions of the 
sitemap interpreter that I implemented for pipe-aware selection [3].

I think it could be interesting to do a detailed comparison between the 
differences in our proposals: How the input stream and validation is 
handled, how the selection based on pipeline state is performed, if 
storage of the input is done in a serializer or in a transformer, and 
how the new output is created.

Input Stream
------------

For input stream handling you used

   <generate type="payload"/>

Is the payload generator equivalent to the StreamGenerator? Or does it 
do something more, like switching parsers depending on the MIME type of 
the input stream?

I used

   <generate type="xml"/>

The idea is that if no src attribute is given, the sitemap interpreter 
automatically connects the generator to the input stream of the 
environment (the input stream from the http request in the servlet case; 
in other cases it is less clear). This behavior was inspired by the 
handling of standard input in Unix pipelines.

Nicola Ken proposed:

   <generate type="xml" src="inputstream://"/>

I prefer this solution to mine as it doesn't require any change 
to the sitemap interpreter. I also believe that it is easier to 
understand as it is more explicit. It also (as Nicola Ken has explained) 
gives a good SoC: the uri in the src attribute describes where to read 
the resource from, e.g. input stream, file, cvs, http, ftp, etc., and the 
generator is responsible for how to parse the resource. If we develop an 
input stream protocol, all the work invested in the existing generators 
can immediately be reused in web services.

Validation
----------

Should validation be part of the parsing of input as in:

   <generate type="xml">
     <parameter name="scheme" value="myInputFormat.scm"/>
   </generate>

or should it be a separate transformation step:

   <transform type="validator">
     <parameter name="scheme" value="myInputFormat.scm"/>
   </transform>

or maybe the responsibility of the protocol as Nicola Ken proposed in 
one of his posts:

   <generate type="xml" src="inputstream:myInputFormat.scm"/>

This is not a question about architecture but rather one about finding 
"best practices".

I don't think validation should be part of the protocol. It means that 
the protocol would have to take care of the parsing, and that would 
muddle the SoC that Nicola Ken has argued for in his other posts, where 
the protocol is responsible for locating and delivering the stream and 
the generator is responsible for parsing it.

Should validation be part of the generator or a transform step? I don't 
know. If the input is not xml, as for the ParserGenerator, I guess that 
the validation must take place in the generator. If the xml parser 
validates the input as part of the parsing it is more practical to let 
the generator be responsible for validation (IIRC Xerces2 has an 
internal pipeline structure and performs validation in a transformer-like 
way, so for Xerces2 it would probably be as efficient to do 
validation in a transformer as in a generator). Otherwise it seems to 
give better SoC to separate the parsing and the validation steps, so that 
we can have one validation transformer for each schema language.

In some cases it might be practical to augment the xml document with 
error information, to be able to give more exact user feedback on where 
the errors are located. For such applications it seems more natural to me 
to have validation in a transformer.

A question that might have architectural consequences is how the 
validation step should report validation errors. If the input is not 
parseable at all there is not much more to do than throw an exception 
and let the ordinary internal error handler report the situation. If 
some of the elements or attributes in the input have the wrong type we 
probably want to return more detailed feedback than just the internal 
error page. Some possible validation error reporting mechanisms are: 
storing an error report object in the environment, e.g. in the object 
model; augmenting the xml document with error reporting attributes or 
elements; throwing an exception object that contains a detailed error 
description object; or a combination of some of these mechanisms.

Mixing data and state information was considered to be a bad practice in 
the discussion about pipe-aware selection (see references in [3]); that 
rules out using only augmentation of the xml document as the error 
reporting mechanism. Throwing an exception would AFAIU lead to 
difficulties in giving customized error reports. So I believe it would be 
best to put some kind of state-describing object in the environment and 
possibly combine this with augmentation of the xml document.
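
As a sketch of what such a state-describing object might look like (plain
Java; the class and its methods are purely illustrative, not an existing
Cocoon class):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative state object a validation transformer could place in the
// object model: the selector later asks isValid(), an error page can
// render getErrors() for detailed user feedback.
public class ValidationResult {
    private final List<String> errors = new ArrayList<>();

    /** Record one validation problem with its location in the document. */
    public void addError(String location, String message) {
        errors.add(location + ": " + message);
    }

    /** The document is valid iff no errors were recorded. */
    public boolean isValid() {
        return errors.isEmpty();
    }

    public List<String> getErrors() {
        return Collections.unmodifiableList(errors);
    }
}
```

Keeping the state in an object like this, rather than in the sax stream,
is what lets the selector test "meta data" without the pipeline content
being polluted.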

Pipe State Dependent Selection
------------------------------

For selecting response based on if the input document is valid or not 
you suggest the following:

...
   <transform type="validator">
     <parameter name="scheme" value="myInputFormat.scm"/>
   </transform>
   <select type="pipeline-state">
     <when test="valid">
       <transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
...

As I mentioned earlier this could easily be implemented with the 
"pipe-aware selection" code I submitted in [3]. Let us see how it would 
work:

The PipelineStateSelector cannot be executed at pipeline construction 
time as ordinary selectors are. The pipeline before the selector, 
including the ValidatorTransformer, must have been executed before the 
selection is performed. This can be implemented by letting the 
PipelineStateSelector implement a special marker interface, say 
PipelineStateAware, so that it can get special treatment in the 
selection part of the sitemap interpreter.

When the sitemap interpreter gets a PipelineStateAware selector it first 
ends the currently constructed pipeline with a serializer that stores its 
sax input in e.g. a dom tree; the pipeline is processed and the dom 
tree with the cached result is stored in e.g. the object model. In the 
next step the selector is executed and it can base its decision on the 
result from the first part of the pipeline. If the ValidationTransformer 
puts a validation result descriptor in the object model, the 
PipelineStateSelector can perform tests on this result descriptor. In 
the last step a new pipeline is constructed where the generator reads 
from the stored dom tree, and in the example above, the first 
transformer will be an XSLTransformer.
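
The special treatment described above can be sketched as a toy model in
plain Java (no Cocoon classes; every name here is illustrative, not the
real Cocoon API):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model: when the interpreter reaches a state-aware selector it
// first runs the pipeline built so far, caches the result, records state
// in the "object model", and only then evaluates the selector.
public class StateAwareSelection {

    /** Marker for selectors that need the preceding pipeline run first. */
    interface PipelineStateAware {}

    /** A selector that tests a state flag placed in the object model. */
    static class PipelineStateSelector implements PipelineStateAware {
        boolean select(String test, Map<String, Object> objectModel) {
            return test.equals(objectModel.get("validation-state"));
        }
    }

    /** Run the first pipeline half, cache its output, then select. */
    static String process(String input, Map<String, Object> objectModel) {
        // "Validator transformer": records state instead of mixing it
        // into the document (a stand-in check, not real validation).
        boolean valid = input.startsWith("<");
        objectModel.put("validation-state", valid ? "valid" : "non-valid");
        objectModel.put("validated-input", input);   // the cached "dom tree"

        PipelineStateSelector selector = new PipelineStateSelector();
        if (selector.select("valid", objectModel)) {
            // Second pipeline half reads from the cached result.
            return "stored:" + objectModel.get("validated-input");
        }
        return "error-document";
    }

    public static void main(String[] args) {
        System.out.println(process("<doc/>", new HashMap<>()));
        System.out.println(process("not xml", new HashMap<>()));
    }
}
```

The essential point is the two-phase evaluation: selection happens only
after the first half has run and published its state, never on the raw
pipeline content.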

An alternative and more explicit way to describe the pipeline state 
dependent selection above, is:

...
   <transform type="validator">
     <parameter name="scheme" value="myInputFormat.scm"/>
   </transform>
   <serialize type="object-model-dom" non-terminating="true">
     <parameter name="name" value="validated-input"/>
   </serialize>
   <select type="pipeline-state">
     <when test="valid">
       <generate type="object-model-dom">
         <parameter name="name" value="validated-input"/>
       </generate>
       <transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
...

Here the extensions to the current Cocoon semantics are put in the 
serializer instead of the selector. The sitemap interpreter treats a 
non-terminating serializer as an ordinary serializer in the sense that it 
puts the serializer at the end of the current pipeline and executes it. 
The difference is that instead of returning to the caller of the 
sitemap interpreter, it creates a new current pipeline and continues to 
interpret the components after the serializer, in this case a selector. 
The sitemap interpreter will also ignore the output stream of the 
serializer; the serializer is supposed to have side effects. The new 
current pipeline will then get an ObjectModelDOMGenerator as generator 
and an XSLTTransformer as its first transformer.
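
The non-terminating serializer variant can be sketched the same way (toy
Java, all names illustrative, not the real Cocoon API):

```java
import java.util.HashMap;
import java.util.Map;

// Toy interpreter: a non-terminating serializer ends the current
// pipeline, performs its side effect (storing the document), and
// interpretation continues with a fresh pipeline instead of returning
// to the caller.
public class NonTerminatingSerializer {

    static final Map<String, String> objectModel = new HashMap<>();

    /** "object-model-dom" serializer: side effect only, stores the
     *  document under a name; its output stream is ignored. */
    static void serialize(String name, String document) {
        objectModel.put(name, document);
    }

    /** "object-model-dom" generator: reads the stored document back. */
    static String generate(String name) {
        return objectModel.get(name);
    }

    static String run(String input) {
        // Pipeline 1: a stand-in transform, then the non-terminating
        // serialize step.
        serialize("validated-input", input.toUpperCase());
        // The interpreter does NOT return here; pipeline 2 starts with a
        // generator reading what the serializer stored.
        return "<response>" + generate("validated-input") + "</response>";
    }

    public static void main(String[] args) {
        System.out.println(run("payload"));
    }
}
```

Note how the hand-off between the two pipelines is purely through the
stored document, which is what makes the serializer's output stream
irrelevant.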

I prefer this construction to the more implicit one because it is more 
obvious what it does and because it gives more freedom in how to store 
the user input. Some people seem to prefer to store user input 
in Java beans; in some applications session parameters might be a better 
place than the object model.

Pipelines with Side Effects
---------------------------

A common pattern in pipelines that handle input (at least in the 
applications that I write) is that the first half of the pipeline takes 
care of the input and ends with a transformer that stores the input. The 
transformer can be e.g. the SQLTransformer (with insert or update 
statements), the WriteDOMSessionTransformer or the 
SourceWritingTransformer. These transformers have side effects, they 
store something, and return an xml document that tells whether the 
operation succeeded or not. A conclusion from the threads about pipe-aware 
selection was that sending meta data, like whether the operation succeeded 
or not, in the pipeline is a bad practice, and especially that we should 
not allow selection based on such content. Given that these transformers 
basically translate xml input to a binary format and generate an xml 
output that we are supposed to ignore, it would IMO be more natural to 
see them as some kind of serializer.

The second half of the pipeline creates the response; here it is less 
obvious what transformer to use. I normally use an XSLTTransformer, 
typically ignoring its input stream and only creating an xml document that 
is rendered into e.g. html in a subsequent transformer.

I think that it would be more natural to replace the pattern:

   ...
   <transform type="store something, return state info"/>
   <transform type="create a response document, ignore input"/>
   ...

with

   ...
   <serialize type="store something, put state info in the environment"
              non-terminating="true"/>
   <generate type="create a response document" src="response document"/>
   ...

If we give the serializer a destination attribute as well, all the 
existing serializers could be used for storing input in files etc.

   ...
   <serialize type="xml" dest="xmldb://..." non-terminating="true"/>
   ...

This would give the same SoC that I argued in favour of in the context 
of input: the serializer is responsible for how to serialize from xml to 
the binary data format and the destination is responsible for where to 
store the data.

Conclusion
----------

I am afraid that I pose more questions than I answer in this RT. Many of 
them are of a "best practice" character, do not have any architectural 
consequences, and do not have to be answered right now. There are 
however some questions that need an answer:

How should pipeline components, like the validation transformer, report 
state information? Placing some kind of state object in the object model 
would be one possibility, but I don't know.

We seem to agree that there is a need for selection in pipelines 
based on the state of the computation in the pipeline that precedes the 
selection. Here we have two proposals:

1. Introduce pipeline state aware selectors (e.g. by letting the 
selector implement a marker interface), and give such selectors special 
treatment in the sitemap interpreter.

2. Extend the semantics of serializers so that the sitemap interpreter 
can continue to interpret the sitemap after a serializer, (e.g. by a new 
non-terminating attribute for serializers).

I prefer the second proposal.

Both proposals can be implemented with no backward-compatibility problems 
at all by requiring the selectors or serializers that need the extended 
semantics to implement a special marker interface, and by adding code 
that reacts to the marker interface in the sitemap interpreter.

To use serializers more generally for storing things, as I proposed 
above, the Serializer interface would need to extend the 
SitemapModelComponent interface.

------

What do you think?

Daniel Fagerstrom

<snip/>

[3] [Contribution] Pipe-aware selection
http://marc.theaimsgroup.com/?l=xml-cocoon-dev&m=101735848009654&w=2







Re: [RT] Input Pipelines (long)

Posted by Nicola Ken Barozzi <ni...@apache.org>.
Stefano Mazzocchi wrote:
> Nicola Ken Barozzi wrote:
> 
>>> This said, do we really want to abstract our Environment objects so 
>>> that they are capable of handling all web, CLI and mail environments? 
>>> Isn't this FS?
>>
>> Is the Environment itself FS?
> 
> Good question.
> 
> Many people that don't like Cocoon told me so. I'm still debating in 
> between myself since we came out with that concept two years ago. I 
> still haven't decided.

IMHO it's useful, but it still needs a bit of work for the new 
environments that will be done now.

>> We have been using it just to make a CLI that users seems to hate 
>> because it's slow
> 
> *some* users.

Yes, some.

>> while making angry many developers that had to change all the objects 
>> that had HttpXXX servlet APIs hardcoded, to use our abstraction.
> 
> No, I don't buy that. The reason why we provided a way to obtain the 
> original Servlet request was exactly to avoid them having to do it.

Yeah, but *if* they found out how to do it (not everyone did easily, if 
ever), they had to change the code that got that, because originally 
they just got the Request, which afterwards was not the right object...

>> And now, we should let the dependency leak in again?
> 
> Nicola, just because you didn't know the dependency was there it doesn't 
> mean that it's *leaking* in. It has been that way since the day we 
> created the environment.

Yup, but it was deemed a minor hack just to get a specific feature 
used. But now it's not so specific, since getting the input stream from 
the Request is not only for servlets. I don't want this hack, which was 
used for one specific case, to leak into the general abstraction 
definition and be used as a normal works-for-me.

> Looks hacky? well, yes and no. When the JDK introduced Java2D they came 
> out with a new Graphics2D object but the paint() method passed a Graphics 
> object. So it's up to you to down-cast it.
> 
>  void paint(Graphics g) {
>    Graphics2D g2d = (Graphics2D) g;
>    ...
>  }
> 
> It's terrible, I know. It hurts my elegance feeling like nuts.
> 
> So does our abstracted Environment... but the Servlet API is not 
> abstracted enough and the abstraction job is a stinking hard one! 
> Especially if you have to provide back-compatibility.
> 
> I'm all in favor of adding input capabilities to the environment, but 
> only after a sound and well-thoughtout discussion.

I agree. That's why I was discussing with Vadim, and now with you :-)

>> If we really want an environment, we should make it as generic as 
>> *reasonably* possible.
> 
> I agree.
> 
>> Now the HttpServlet request has a getInputStream. We don't.
>> The day we make Cocoon work directly in Avalon, we will break 
>> every Cocoon app using it, unless the Avalon container implements the 
>> same HttpServlet classes... which simply makes our environment 
>> abstraction unnecessary, since the HttpServlet classes become the used 
>> abstraction.
> 
> 
> Look, I agree. I just don't want to add things to such a critical 
> contract without *extremely* careful thinking.

Same here.

>> Aha, here you say it too.
>>
>>   Environment = Request + ServletRequest
> 
> 
> Oh, yes. I've always known that Request didn't have a way to get 
> input... but people stated that it was *impossible* to do so and this is 
> where I got nervous.
> 
>> So Cocoon is intrinsically asymmetric unless we are in a servlet 
>> environment?
> 
> 
> Today? yes.
> 
> Must it be so? no.
> 
> Is it easy to abstract the input out of any possible client/server 
> architecture environment? god no!

Yup. In Morphos I abstracted it by using "Object", but it's really a 
leaky abstraction generally speaking.

> Is it true that all client/server architectures are symmetric? NO!!!
> 
>> Why *servlet* and not *web*? Shall we decide that all symmetric 
>> environments give a ServletRequest? Is the ServletRequest then part of 
>> the contract?
> 
> No, I much rather see input abstracted in our Environment. I'm just 
> concerned about careful thinking.
> 

[...]

>> The fact is that Generators should not care where the source comes 
>> from, just take an object and transform it to xml.
> 
> 
> If that was the case, we shouldn't need pluggable generators, but just 
> different sources and one parsing generator. But we would be back to the 
> same thing, just with different names and sitemap semantics.

I disagree. The above text is correct only if the source gives you xml 
data, which is not necessarily the case.

A source can give you a stream that can contain xml, html, pdf, doc, 
whatever, and all of these need different generators.

>> By mixing the locator phase with the generator phase, we lose 
>> easy-to-get flexibility.
> 
> Careful here: I agree that the difference between a source and generator 
> is subtle, especially since we added a method for a source to generate 
> sax events directly.
> 
> But I find the concept of 'locating' a resource is very weak in our 
> current sitemap context.

I don't understand what you mean here.

>> In fact, I would not see as bad this:
>>
>>   <map:locate src="blah.xml"/>
>>   <map:generate type="xml"/>
>>   <map:transform src="cocoon:/givemetheinput">
>>   <map:serialize/>
>>
>> This has come out of the Morphos effort, where it has been more than 
>> evident that locating a resource and injecting it into the pipeline 
>> are different concerns.
> 
> I don't see this 'more than evident'-ness.

Errr, the 'more than evident'-ness came from creating Morphos, not from 
the above snippet.

Let me try to explain.

When I want to generate a SAX event stream, I have to do two things:

  1) get the stuff
  2) transform the stuff to xml

For example, if I want to do it with an XML parser, I can do:

   //what to get
   String urlString = "...";

   //get it
   URL location = new URL(urlString);

   //parse it
   xmlparser.parse(location);

Imagine that I want the parser to parse from an xmldb.
I just need to be able to make the URL open the correct stream, and give 
that stream to the parser.

The URL is not *the* data, but a *handle* to the data.
The string, instead, is nothing. Just a string.

The URL, the "locator", is what takes the string and is able to get the 
stream that that string points to.

The parser just takes a URL and generates SAX events from the stuff that 
the URL (locator) gets for it.

 > What does your above locator do? what is the difference between
 > that and a Reader?

Good question. Not much, other than the fact that a locator should only 
get the source, while a reader can be made to be a stream "transformer".
It's quite easy, if we want, to make multiple readers in a pipeline, and 
that would be really different from a locator.

Anyway the sources are good enough, no real need for a "locator"; I just 
put it there to try and explain the separation between locating a 
resource and generating SAX events from it.
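
A minimal sketch of that separation (plain Java; the method names and the
fake "sax-events" result are illustrative, not a real API):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Toy split of the two concerns: a "locator" resolves a URI to a stream,
// a "generator" turns that stream into (here, fake) events. Swapping the
// locator (file, xmldb, inputstream://) never touches the generator.
public class LocatorGenerator {

    /** Locator: where the bytes come from (a map stands in for sources). */
    static InputStream locate(Map<String, String> sources, String uri) {
        return new ByteArrayInputStream(
            sources.get(uri).getBytes(StandardCharsets.UTF_8));
    }

    /** Generator: how the bytes are parsed, independent of their origin. */
    static String generate(InputStream in) {
        try {
            return "sax-events("
                + new String(in.readAllBytes(), StandardCharsets.UTF_8) + ")";
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> sources = Map.of("inputstream://", "<doc/>");
        System.out.println(generate(locate(sources, "inputstream://")));
    }
}
```

Because the generator only ever sees an InputStream, any new protocol
added on the locator side is immediately usable by every existing
generator.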

>> The cocoon protocol is roughly the equivalent of the locator.
> 
> Maybe I'm dumb, but I don't get this.

Errr I meant the Cocoon Sources, sorry.

>> The mailet wrapper is something I'm writing now, since I'm using james 
>> in my intranet, and I see the pain of it not being easy to make a 
>> Cocoon mailet.
> 
> That's great. We were waiting for people to be willing to use cocoon in 
> their mail system before attacking the SMTP part of the Environment 
> abstraction.
> 
> And *that* will require careful thinking about input, since that's where 
> SMTP is focused on. Unlike HTTP that is focused on output.

Yup.

>> Let's not talk about using it as a bean! How can I simply give cocoon 
>> a stream to process!
> 
> I'm in favor of a discussion about abstracting the Environment further to 
> be more input-friendly also for mail environments, but this must come 
> out of a deep discussion *and* after some *real-life* requirements.
> 
> What I'm opposed to is symmetry-driven architectural design.

Listen, my needs came from real use-cases, not symmetry-driven 
architectural design. I just happened to chime in on this thread because 
part of what was discussed here matched my needs.

>>> Interface Elegance driven design is one step too close to FS from 
>>> where I stand.
>>>
>>> But if there are *real* needs (means stuff that can't be done nicely 
>>> today with what we have), I'm more than welcome to discuss how to 
>>> move forward.
>>
>> As I said, moving from a servlet container to a non-servlet web 
>> container would break things, unless we have it implement the 
>> httpservlet methods.
>>
>> You say that not all environments have the need of it, and it's true, 
>> but a *class* of environments do.
> 
> 
> Correct.
> 
> Summarizing this thread a little:
> 
>  1) I don't think Cocoon pipelines are asymmetric.

It's irrelevant anyway. Even if they were, who cares, as long as it 
works well.

>  2) I agree that the Environment is asymmetric.
> 
>  3) I would like to see an effort to make the Environment more 
> symmetric with respect to input

(*)
Actually, the input pipelines discussion, as I understood it, is simply 
about the possibility of executing two pipelines per request, with the 
flow in the middle.

Cocoon2 is something that gets data from a Request, mainly the URL and 
some params, and generates a response with some xml stuff.

In web services though, the "Request" needs to be actually created from 
an xml stream coming in.

Hence the talk about input pipelines, which would be the pipelines that 
work on the request stream to generate the request that drives the 
normal Cocoon process.

By separating the processing into two steps, it has been shown how we 
can meet the need for selecting based on pipeline content without 
actually doing it: first we process the xml with an input pipeline, 
create an intermediate "Request", and then select based on that data.

This two-step process makes Cocoon seem asymmetric, because it currently 
cannot do this explicitly, while the two-step version seems more 
symmetric, etc.
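A sketch of what that two-step arrangement might look like in sitemap terms (hypothetical syntax: the `payload` generator, the `soap2request.xsl` stylesheet and the `dispatch` flow function are all invented for illustration):

```xml
<!-- step 1: an input pipeline turns the request stream
     into an intermediate "Request" document -->
<map:match pattern="service">
  <map:generate type="payload"/>
  <map:transform src="soap2request.xsl"/>
  <!-- step 2: flow inspects the intermediate data and
       selects the output pipeline that serves the response -->
  <map:call function="dispatch"/>
</map:match>
```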

>  4) I would like to see Environment abstract enough to work in a Mailet 
> environment
> 
>  5) I would like this effort to be driven by real-life needs rather than 
> purity and symmetry-driven architectural design (since we've seen that 
> it often leads to very bad mistakes!)

Ok, wait for a new thread on this. Let's keep this thread for the real 
input pipeline discussion. (*)

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------


---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org


Re: [RT] Input Pipelines (long)

Posted by Stefano Mazzocchi <st...@apache.org>.
Nicola Ken Barozzi wrote:

>> This said, do we really want to abstract our Environment objects so 
>> that they are capable of handling all web, CLI and mail environments? 
>> Isn't this FS?
> 
> 
> Is the Environment itself FS?

Good question.

Many people that don't like Cocoon told me so. I'm still debating with 
myself since we came up with that concept two years ago. I still 
haven't decided.

> We have been using it just to make a CLI that users seem to hate 
> because it's slow

*some* users.

> while making angry many developers that had to change 
> all the objects that have HttpXXX servlet APIs hardcoded to use our 
> abstraction.

No, I don't buy that. The reason why we provided a way to obtain the 
original Servlet request was exactly to avoid them having to do it.

> And now, we should let the dependency leak in again?

Nicola, just because you didn't know the dependency was there it doesn't 
mean that it's *leaking* in. It has been that way since the day we 
created the environment.

Looks hacky? well, yes and no. When the JDK introduced Java2D they came 
out with a new Graphics2D class, but the paint() method still receives 
a Graphics object. So it's up to you to down-cast it.

  void paint(Graphics g) {
    Graphics2D g2d = (Graphics2D) g;
    ...
  }

It's terrible, I know. It hurts my elegance feeling like nuts.

So does our abstracted Environment.... but the Servlet API is not 
abstracted enough, and the abstraction job is a stinking hard one! 
Especially if you have to provide back-compatibility.
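The same down-cast trick, sketched for an environment abstraction (the `Request`/`HttpRequest` interfaces and method names below are invented for illustration, they are not the actual Cocoon contracts): only the HTTP flavour exposes input, and a component that needs the payload has to down-cast, Graphics2D-style.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Scanner;

public class EnvironmentSketch {

    // Invented, context-abstracted request: no input stream here.
    interface Request { String getParameter(String name); }

    // Invented HTTP-specific flavour: this one carries a payload.
    interface HttpRequest extends Request { InputStream getInputStream(); }

    // A component that wants the payload must down-cast, like paint() does.
    static String readPayload(Request req) {
        if (req instanceof HttpRequest) {
            InputStream in = ((HttpRequest) req).getInputStream();
            return new Scanner(in).useDelimiter("\\A").next();
        }
        return "(no input in this environment)";
    }

    public static void main(String[] args) {
        // Stub HTTP request with a canned payload, for demonstration only.
        HttpRequest http = new HttpRequest() {
            public String getParameter(String name) { return null; }
            public InputStream getInputStream() {
                return new ByteArrayInputStream("<soap/>".getBytes());
            }
        };
        System.out.println(readPayload(http));
    }
}
```

Non-HTTP environments would simply skip the input-handling branch, which is the "context-specific hooks" style of API rather than the fully abstracted one.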

I'm all in favor of adding input capabilities to the environment, but 
only after a sound and well-thought-out discussion.

> If we really want an environment, we should make it as generic as 
> *reasonably* possible.

I agree.

> Now the HTTPServlet request has a getInputStream. We don't.
> The day we will make Cocoon work directly in Avalon, we will break every 
> Cocoon app using it, unless the Avalon container implements the same 
> HTTPServlet classes... which simply makes our environment abstraction 
> unnecessary, since the HTTPServlet classes become the used abstraction.

Look, I agree. I just don't want to add things to such a critical 
contract without *extremely* careful thinking.

> Aha, here you say it too.
> 
>   Environment = Request + ServletRequest

Oh, yes. I've always known that Request didn't have a way to get 
input... but people stated that it was *impossible* to do so and this is 
where I got nervous.

> So Cocoon is intrinsically asymmetric unless we are in a servlet 
> environment?

Today? yes.

Must it be so? no.

Is it easy to abstract the input out of any possible client/server 
architecture environment? god no!

Is it true that all client/server architectures are symmetric? NO!!!

> Why *servlet* and not *web*? Shall we decide that all 
> symmetric environments give a ServletRequest? Is the ServletRequest then 
> part of the contract?

No, I'd much rather see input abstracted in our Environment. I'm just 
concerned about careful thinking.

>> Hmmm, between
>>
>>  <map:generate type="file" src="input:web:/"/>
>>
>> and
>>
>>  <map:generate type="payload"/>
>>
>> I would choose the second.
>>
>> A full URI scheme for simply getting an input stream is too much and 
>> it might be *very* dangerous since people will very easily abuse it 
>> like this
>>
>>  <map:generate src="blah.xml"/>
>>  <map:transform src="input:web:/">
>>
>> which might lead to *serious* security concerns and cross-site 
>> scripting problems with injected XSLT
> 
> 
> Nobody prevents us from making it usable only from generators.

Yes, something does: usability coherence. You can't make a protocol 
available everywhere and another available only in some spots.

> And BTW, 
> what you call a *serious* security concern is in fact something that has 
> been asked for. 

This is not an argument.

> But you still cannot prevent people from shooting 
> themselves in the foot, and use
> 
>    <map:generate type="payload"/>
>    <map:serialize/>
> 
> and then calling it here
> 
>   <map:generate src="blah.xml"/>
>   <map:transform src="cocoon:/givemetheinput">
>   <map:serialize/>

Oh, totally. I can't prevent people from doing something I consider a 
mistake, but I can avoid making it *easy* for them to do it.

This is what designing a framework is all about: it's not the "there is 
always more than one way of doing it" FS-inflated paradigm, it's the 
"this is the way we consider best, if you don't like it, use something 
else or convince us of a better alternative".

> The fact is that Generators should not care where the source comes from; 
> they should just take an object and transform it to xml.

If that were the case, we wouldn't need pluggable generators, but just 
different sources and one parsing generator. But we would be back to the 
same thing, just with different names and sitemap semantics.

> By mixing the locator phase with the generator phase, we lose 
> easy-to-get flexibility.

Careful here: I agree that the difference between a source and a 
generator is subtle, especially since we added a method for a source to 
generate sax events directly.

But I find that the concept of 'locating' a resource is very weak in 
our current sitemap context.

> In fact, I would not see as bad this:
> 
>   <map:locate src="blah.xml"/>
>   <map:generate type="xml"/>
>   <map:transform src="cocoon:/givemetheinput">
>   <map:serialize/>
> 
> This has come out of the Morphos effort, where it has been more than 
> evident that locating a resource and injecting it into the pipeline are 
> different concerns.

I don't see this 'more than evident'-ness.

What does your above locator do? What is the difference between that 
and a Reader?

> The cocoon protocol is roughly the equivalent of the locator.

Maybe I'm dumb, but I don't get this.

> The mailet wrapper is something I'm writing now, since I'm using james 
> in my intranet, and I see the pain of not having it easy to make a 
> Cocoon mailet.

That's great. We were waiting for people to be willing to use cocoon in 
their mail system before attacking the SMTP part of the Environment 
abstraction.

And *that* will require careful thinking about input, since that's where 
SMTP is focused on. Unlike HTTP that is focused on output.

> Let's not talk about using it as a bean! How can I simply give cocoon a 
> stream to process!

I'm in favor of a discussion about abstracting the Environment further to 
be more input-friendly also for mail environments, but this must come 
out of a deep discussion *and* after some *real-life* requirements.

What I'm opposed to is symmetry-driven architectural design.

>> Interface Elegance driven design is one step too close to FS from 
>> where I stand.
>>
>> But if there are *real* needs (means stuff that can't be done nicely 
>> today with what we have), I'm more than welcome to discuss how to move 
>> forward.
> 
> 
> As I said, moving from a servlet container to a non-servlet web 
> container would break things, unless we have it implement the 
> httpservlet methods.
> 
You say that not all environments have the need of it, and it's true, 
> but a *class* of environments do.

Correct.

Summarizing this thread a little:

  1) I don't think Cocoon pipelines are asymmetric.

  2) I agree that the Environment is asymmetric.

  3) I would like to see an effort to make the Environment more 
symmetric with respect to input

  4) I would like to see Environment abstract enough to work in a Mailet 
environment

  5) I would like this effort to be driven by real-life needs rather 
than purity and symmetry-driven architectural design (since we've seen 
that it often leads to very bad mistakes!)

-- 
Stefano Mazzocchi                               <st...@apache.org>
--------------------------------------------------------------------





Re: [RT] Input Pipelines (long)

Posted by Nicola Ken Barozzi <ni...@apache.org>.
Stefano Mazzocchi wrote:
> Nicola Ken Barozzi wrote:
> 
>>
>> (my comments, based on the discussions that are going on lately and my 
>> work on the blocks move and doc writing)
> 
> 
> cool. thanks for sharing.
> 
>> Stefano Mazzocchi wrote:
>> [...]
>>
>>>> If we compare a Cocoon output pipeline with a unix pipeline, it always
>>>> ignores standard input and always writes to standard output.
>>>
>>>
>>>
>>>
>>> Sorry, but this is plain wrong.
>>>
>>> Cocoon already ships generators that do *NOT* ignore the request input. 
>>
>> Look at the Request interface.
>> There is no method to get the input.
> 
> Right. But the above sentence remains wrong, maybe the Cocoon Request 
> object doesn't have a method to get input which is abstracted from the 
> context, but it's *wrong* to say that there is no way to get input from 
> the user.

Being picky, but "There is no method to get the input." is correct in 
the sense that there is no class method to get the input.

> The fact that the Request object doesn't contain input is because we 
> couldn't agree on what *input* meant in a context-abstracted situation.
> 
> So, as Vadim, I agree that we should get it in only *after* we know what 
> context-abstracted input means.

Yup, Vadim has easily convinced me too :-)

[...]
>> Vadim has proposed, after some discussion, to add the possibility of 
>> returning n streams, that can be used for example in mails or in any 
>> system that inputs multitype data.
> 
> 
> There are two ways of implementing an API:
> 
>  1) forcing common ground: that is creating a sufficiently abstracted 
> way to look at the problem
> 
>  2) leaving context-specific hooks: the component connects to the 
> context-specific hooks.
> 
> Java has a known history of using pattern #1, but recently this has been 
> challenged very seriously (see Eclipse SWT vs. Swing) and with some 
> *great* achievements.

Ok, then let's ditch the environment altogether 8->

> This said, do we really want to abstract our Environment objects so that 
> they are capable of handling all web, CLI and mail environments? Isn't 
> this FS?

Is the Environment itself FS?
We have been using it just to make a CLI that users seem to hate 
because it's slow, while making angry many developers that had to change 
all the objects that have HttpXXX servlet APIs hardcoded to use our 
abstraction.
And now, we should let the dependency leak in again?

If we really want an environment, we should make it as generic as 
*reasonably* possible.

Now the HTTPServlet request has a getInputStream. We don't.
The day we will make Cocoon work directly in Avalon, we will break every 
Cocoon app using it, unless the Avalon container implements the same 
HTTPServlet classes... which simply makes our environment abstraction 
unnecessary, since the HTTPServlet classes become the used abstraction.

> I'm not stating, just asking.

[...]

>> The fact is that the request is (down-to-earth) a URI, and a response 
>> is a stream. This is not symmetry.
> 
> 
> ??? what about those PUT WebDAV requests that might have a 10Mb payload 
> and return a simple two line http response with an error code?
> 
> lack of symmetry is perceived because of the way the web currently works 
> that is 90% of the HTTP requests are probably GET, 9.99% POST and 0.01% 
> all the other HTTP actions.
> 
> But there is *nothing* intrinsically asymmetric in the web nor in how 
> cocoon pipelines work (if you consider your Environment as Request + 
> ServletRequest)

Aha, here you say it too.

   Environment = Request + ServletRequest

So Cocoon is intrinsically asymmetric unless we are in a servlet 
environment? Why *servlet* and not *web*? Shall we decide that all 
symmetric environments give a ServletRequest? Is the ServletRequest then 
part of the contract?

>>> 2) what is this pipeline returning to the requesting client? This is 
>>> not SMTP, we have to return something. Sure, we might simply return 
>>> an HTTP header with some error code depending on the result of the 
>>> serialization, but then people will ask how to control that part.
>>
>> [...]
>>
>>>> Several of the existing
>>>> generators would be highly usable in input pipelines if they were
>>>> modified in such a way that they read from "standard input" when no
>>>> src attribute is given.
>>>
>>> I lost you here.
>>
>> My take: If you use a Generator with a source protocol, it's more 
>> flexible. Add a protocol that gets data from request input, and you're 
>> done.
> 
> Hmmm, between
> 
>  <map:generate type="file" src="input:web:/"/>
> 
> and
> 
>  <map:generate type="payload"/>
> 
> I would choose the second.
> 
> A full URI scheme for simply getting an input stream is too much and it 
> might be *very* dangerous since people will very easily abuse it like this
> 
>  <map:generate src="blah.xml"/>
>  <map:transform src="input:web:/">
> 
> which might lead to *serious* security concerns and cross-site scripting 
> problems with injected XSLT

Nobody prevents us from making it usable only from generators. And BTW, 
what you call a *serious* security concern is in fact something that has 
been asked for. But you still cannot prevent people from shooting 
themselves in the foot, and use

    <map:generate type="payload"/>
    <map:serialize/>

and then calling it here

   <map:generate src="blah.xml"/>
   <map:transform src="cocoon:/givemetheinput">
   <map:serialize/>

The fact is that Generators should not care where the source comes 
from; they should just take an object and transform it to xml.
By mixing the locator phase with the generator phase, we lose 
easy-to-get flexibility.

In fact, I would not see as bad this:

   <map:locate src="blah.xml"/>
   <map:generate type="xml"/>
   <map:transform src="cocoon:/givemetheinput">
   <map:serialize/>

This has come out of the Morphos effort, where it has been more than 
evident that locating a resource and injecting it into the pipeline are 
different concerns.

The cocoon protocol is roughly the equivalent of the locator.

>> [...]
>>
>>
>>> Wouldn't the following pipeline achieve the same functionality you 
>>> want without requiring changes to the architecture?
>>>
>>>  <match pattern="myservice">
>>>   <generate type="payload"/>
>>>   <transform type="validator">
>>>     <parameter name="scheme" value="myInputFormat.scm"/>
>>>   </transform>
>>>   <select type="pipeline-state">
>>>    <when test="valid">
>>>     <transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
>>>     <transform type="my-business-logic"/>
>>>     <serialize type="xml"/>
>>>    </when>
>>>    <otherwise>
>>>     <!-- produce an error document -->
>>>    </otherwise>
>>>   </select>
>>>  </match>
>>
>>
>>
>> I basically asked the same thing... but we cannot have a generic 
>> payload generator yet.
> 
> 
> Who said we should? Is there a *real* (non theory-driven) need for such 
> a thing?

> I've been using the request generator with good satisfaction even for 
> web services-like stuff and I don't need to send any input from the 
> command line (do you?) and a Mailet wrapper for Cocoon is yet to be seen.

The mailet wrapper is something I'm writing now, since I'm using james 
in my intranet, and I see the pain of not having it easy to make a 
Cocoon mailet.
Let's not talk about using it as a bean! How can I simply give cocoon a 
stream to process!

> Interface Elegance driven design is one step too close to FS from where 
> I stand.
> 
> But if there are *real* needs (means stuff that can't be done nicely 
> today with what we have), I'm more than welcome to discuss how to move 
> forward.

As I said, moving from a servlet container to a non-servlet web 
container would break things, unless we have it implement the 
httpservlet methods.

You say that not all environments have the need of it, and it's true, 
but a *class* of environments do.

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------




Re: [RT] Input Pipelines (long)

Posted by Stefano Mazzocchi <st...@apache.org>.
Nicola Ken Barozzi wrote:
> 
> (my comments, based on the discussions that are going on lately and my 
> work on the blocks move and doc writing)

cool. thanks for sharing.

> Stefano Mazzocchi wrote:
> [...]
> 
>>> If we compare a Cocoon output pipeline with a unix pipeline, it always
>> ignores standard input and always writes to standard output.
>>
>>
>>
>> Sorry, but this is plain wrong.
>>
>> Cocoon already ships generators that do *NOT* ignore the request input. 
> 
> 
> Look at the Request interface.
> There is no method to get the input.

Right. But the above sentence remains wrong, maybe the Cocoon Request 
object doesn't have a method to get input which is abstracted from the 
context, but it's *wrong* to say that there is no way to get input from 
the user.

The fact that the Request object doesn't contain input is because we 
couldn't agree on what *input* meant in a context-abstracted situation.

So, as Vadim, I agree that we should get it in only *after* we know what 
context-abstracted input means.

>> Extending those components to perform higher-level functionality is 
>> *NOT* an architectural problem. Or at least, I don't see why it should 
>> be.
> 
> 
> If a Request has input, we should at least put it in the interface.

See above.

> Vadim has proposed, after some discussion, to add the possibility of 
> returning n streams, that can be used for example in mails or in any 
> system that inputs multitype data.

There are two ways of implementing an API:

  1) forcing common ground: that is creating a sufficiently abstracted 
way to look at the problem

  2) leaving context-specific hooks: the component connects to the 
context-specific hooks.

Java has a known history of using pattern #1, but recently this has been 
challenged very seriously (see Eclipse SWT vs. Swing) and with some 
*great* achievements.

This said, do we really want to abstract our Environment objects so that 
they are capable of handling all web, CLI and mail environments? Isn't 
this FS?

I'm not stating, just asking.

> [...]
> 
>>> In a servlet, input would be
>>> taken from the input stream of the request object. We could also have
>>> a writable cocoon: protocol where the input stream would be set by the
>>> user of the protocol, more about that later, (see also my post in the
>>> thread [1]).
>>>
>>> An example:
>>>
>>> <match pattern="**.xls"/>
>>>   <generate type="xls"/>
>>>   <transform type="xsl" src="foo.xsl"/>
>>>   <serialize type="xml" dest="context://repository/{1}.xml"/>
>>> </match>
>>
>>
>>
>> I see two things here:
>>
>> 1) the current pipeline components don't seem to be asymmetric (and 
>> this goes somewhat against what you wrote at the beginning of your 
>> email), the asymmetry is in the fact that the serializer output is 
>> *always* bound to the client response. Am I right on this assumption?
> 
> 
> The fact is that the request is (down-to-earth) a URI, and a response is 
> a stream. This is not symmetry.

??? what about those PUT WebDAV requests that might have a 10Mb payload 
and return a simple two line http response with an error code?

lack of symmetry is perceived because of the way the web currently works 
that is 90% of the HTTP requests are probably GET, 9.99% POST and 0.01% 
all the other HTTP actions.

But there is *nothing* intrinsically asymmetric in the web nor in how 
cocoon pipelines work (if you consider your Environment as Request + 
ServletRequest)

>> 2) what is this pipeline returning to the requesting client? This is 
>> not SMTP, we have to return something. Sure, we might simply return an 
>> HTTP header with some error code depending on the result of the 
>> serialization, but then people will ask how to control that part.
> 
> 
> [...]
> 
> 
>>> Several of the existing
>>> generators would be highly usable in input pipelines if they were
>>> modified in such a way that they read from "standard input" when no
>>> src attribute is given.
>>
>>
>> I lost you here.
> 
> 
> My take: If you use a Generator with a source protocol, it's more 
> flexible. Add a protocol that gets data from request input, and you're 
> done.

Hmmm, between

  <map:generate type="file" src="input:web:/"/>

and

  <map:generate type="payload"/>

I would choose the second.

A full URI scheme for simply getting an input stream is too much and it 
might be *very* dangerous since people will very easily abuse it like this

  <map:generate src="blah.xml"/>
  <map:transform src="input:web:/">

which might lead to *serious* security concerns and cross-site scripting 
problems with injected XSLT

> [...]
> 
> 
>> Wouldn't the following pipeline achieve the same functionality you 
>> want without requiring changes to the architecture?
>>
>>  <match pattern="myservice">
>>   <generate type="payload"/>
>>   <transform type="validator">
>>     <parameter name="scheme" value="myInputFormat.scm"/>
>>   </transform>
>>   <select type="pipeline-state">
>>    <when test="valid">
>>     <transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
>>     <transform type="my-business-logic"/>
>>     <serialize type="xml"/>
>>    </when>
>>    <otherwise>
>>     <!-- produce an error document -->
>>    </otherwise>
>>   </select>
>>  </match>
> 
> 
> I basically asked the same thing... but we cannot have a generic payload 
> generator yet.

Who said we should? Is there a *real* (non theory-driven) need for such 
a thing?

I've been using the request generator with good satisfaction even for 
web services-like stuff and I don't need to send any input from the 
command line (do you?) and a Mailet wrapper for Cocoon is yet to be seen.

Interface Elegance driven design is one step too close to FS from where 
I stand.

But if there are *real* needs (means stuff that can't be done nicely 
today with what we have), I'm more than welcome to discuss how to move 
forward.

-- 
Stefano Mazzocchi                               <st...@apache.org>
--------------------------------------------------------------------





Re: [RT] Input Pipelines (long)

Posted by Nicola Ken Barozzi <ni...@apache.org>.
(my comments, based on the discussions that are going on lately and my 
work on the blocks move and doc writing)

Stefano Mazzocchi wrote:
[...]
>> If we compare a Cocoon output pipeline with a unix pipeline, it always
>> ignores standard input and always writes to standard output.
> 
> 
> Sorry, but this is plain wrong.
> 
> Cocoon already ships generators that do *NOT* ignore the request input. 

Look at the Request interface.
There is no method to get the input.

> Extending those components to perform higher-level functionality is 
> *NOT* an architectural problem. Or at least, I don't see why it should be.

If a Request has input, we should at least put it in the interface.

Vadim has proposed, after some discussion, to add the possibility of 
returning n streams, that can be used for example in mails or in any 
system that inputs multitype data.

[...]
>> In a servlet, input would be
>> taken from the input stream of the request object. We could also have
>> a writable cocoon: protocol where the input stream would be set by the
>> user of the protocol, more about that later, (see also my post in the
>> thread [1]).
>>
>> An example:
>>
>> <match pattern="**.xls"/>
>>   <generate type="xls"/>
>>   <transform type="xsl" src="foo.xsl"/>
>>   <serialize type="xml" dest="context://repository/{1}.xml"/>
>> </match>
> 
> 
> I see two things here:
> 
> 1) the current pipeline components don't seem to be asymmetric (and this 
> goes somewhat against what you wrote at the beginning of your email), 
> the asymmetry is in the fact that the serializer output is *always* 
> bound to the client response. Am I right on this assumption?

The fact is that the request is (down-to-earth) a URI, and a response is 
a stream. This is not symmetry.

> 2) what is this pipeline returning to the requesting client? This is not 
> SMTP, we have to return something. Sure, we might simply return an HTTP 
> header with some error code depending on the result of the 
> serialization, but then people will ask how to control that part.

[...]


>> Several of the existing
>> generators would be highly usable in input pipelines if they were
>> modified in such a way that they read from "standard input" when no
>> src attribute is given.
> 
> I lost you here.

My take: If you use a Generator with a source protocol, it's more 
flexible. Add a protocol that gets data from request input, and you're done.

[...]


> Wouldn't the following pipeline achieve the same functionality you want 
> without requiring changes to the architecture?
> 
>  <match pattern="myservice">
>   <generate type="payload"/>
>   <transform type="validator">
>     <parameter name="scheme" value="myInputFormat.scm"/>
>   </transform>
>   <select type="pipeline-state">
>    <when test="valid">
>     <transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
>     <transform type="my-business-logic"/>
>     <serialize type="xml"/>
>    </when>
>    <otherwise>
>     <!-- produce an error document -->
>    </otherwise>
>   </select>
>  </match>

I basically asked the same thing... but we cannot have a generic payload 
generator yet.

[...]

>> The ability to handle structured input (e.g. xml) in a convenient way,
>> will probably be an important requirement on webapp frameworks in the
>> near future.
> 
> 
> Agreed.
> 
>> By removing the asymmetry between generators and serializers, by letting
>> the input of a generator be set by the context and the output of a
>> serializer be set from the sitemap, Cocoon could IMO be as good in
>> handling input as it is today in producing output.
> 
> 
> I don't understand what you mean by 'setting the input by the context'.
> 
> As far as allowing the serializer to have a destination semantic in the 
> sitemap, I'd be against it because I see it more harmful than useful.
> 
> I do agree that serializers should not be connected only to the servlet 
> output stream, but this is not a concern of the pipeline itself, but of 
> who assembles the pipeline... and, IMO, the flow logic is what is 
> closest to that that we have today.
> 
>> This would also make it possible to introduce a writable as well as
>> readable Cocoon pseudo protocol, that would be a good way to export
>> functionality from blocks.
> 
> I agree that a writeable cocoon: protocol is required, especially for 
> blocks, but this doesn't mean we have to change the sitemap semantics 
> for that.
> 
>> There are of course many open questions, e.g. how to implement those
>> ideas without introducing too much back-incompatibility.
> 
> The best idea is to avoid changing what doesn't require changes and 
> work to minimize architectural changes from that point on.

Yup, exactly.

> But enough for now.
> 
> And thanks for keeping up with the input-oriented discussions :-)

Indeed.


-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------




Re: [RT] Input Pipelines (long)

Posted by Stefano Mazzocchi <st...@apache.org>.
Hmmm, maybe deep architectural discussions are good during holiday 
seasons... we'll see :)

Daniel Fagerstrom wrote:
> Input Pipelines
> ===============
> 
> There is, IMO, a need for better support for input handling in
> Cocoon. I believe that the introduction of "input pipelines" can be an
> important step in this direction. In the rest of this (long) RT I will
> discuss use cases for them, a possible definition of input pipelines,
> compare them with the existing pipeline concept in Cocoon (henceforth
> called output pipelines), discuss what kind of components that would
> be useful in them, how they can be used in the sitemap and from
> flowscripts, and also relate them to the current discussion about how
> to reuse functionality "Cocoon services" between blocks.

Cool, let's rock and roll.

> Use cases
> ---------
> 
> There is an ongoing trend of packaging all kinds of applications as web
> applications or to decompose them as sets of web services. At the same
> time web browsers are more and more becoming a universal GUI for all
> kinds of applications (e.g. XUL).
> 
> This leads to an increasing need for handling of structured input data
> in web applications. SOAP might be the most important example, we also
> have XML-RPC and most certainly numerous home-brewed formats, some might
> even be binary non-xml legacy formats. WebDAV is another example of
> xml-input, and next generation form handling, XForms, use xml as
> transport format.
> 
> As people are building more and more advanced Cocoon-systems there is
> also a growing need for reusing functionality in a structured way,
> there have been discussions about how to package and reuse "Cocoon
> services" in the context of blocks [1] and [2]. Here there is also a
> need for handling xml-input.
> 
> The company I work for builds data warehouses; some of our customers are
> starting to get interested in using the functionality of the data
> warehouses, not only from the web interfaces that we usually build
> but also as parts of their own webapps. This means that we want,
> besides Cocoons flexibility in presenting data in different forms,
> also flexibility in asking for the data through different input
> formats.
> 
> There is thus a world of input beyond the request parameters, and a
> world of rapidly growing importance.

I acknowledge that and I think everybody here does.

> Does Cocoon support the abovementioned use cases? Yes and no: there
> are numerous components that implement SOAP, WebDAV, parts of XForms
> etc. But while the components designed for publishing are highly
> reusable in various contexts, this is not the case for input
> components. 

Stop.

Before we go on I would like to point out that there is a *huge* 
difference between poor 'reusability of components' depending on their 
implementation or depending on architectural limitations of the 
component framework.

> IMO the reason for this is that Cocoon as a framework does
> not have much support for input handling.

This is obviously debatable, but I do agree with you that it's worth 
considering to challenge the very architecture of the framework and test 
its balance toward input and output.

So, no matter what result this discussion will bring, it will be a good 
design challenge.

> IMO Cocoon could be as good in handling input as it currently is in
> creating output, by reusing exactly the same concept: pipelines. We
> can, however, not use the existing "output pipelines" as is; there are
> some asymmetries in their design that make them unsuitable for input.

I fail to see the asymmetries, but let's keep going.

> The term "input pipeline" has sometimes been used on the list, it is
> time to try to define what it could be.
> 
> What is an Input Pipeline
> -------------------------
> 
> An input pipeline typically starts by reading octet data from the
> input stream of the request object. The input data could be xml, tab
> separated data, text that is structured according to a certain
> grammar, binary legacy formats like Excel or Word or anything else
> that could be translated to xml. The first step in the input pipeline
> is an adapter from octet data to sax events. This sounds quite
> similar to a generator; we will return to this in the next section.

This sounds so similar to a generator that I fail to see any difference 
from what a generator is... that is: would you need any additional method 
in an interface that describes such a 'generator for input pipelines'? 
I'm not being ironic, but honestly curious.

> The structure of the xml from the first step in the pipeline might not
> be in a form that is suitable for the data model that we would like to
> use internally in the system. Reasons for this can be that the xml
> input is supposed to follow some standard or some customer defined
> format. Input adapters for legacy formats will probably produce xml
> that is similar to the input format and repeat all kinds of
> idiosyncrasies from that format. There is thus a need to transform the
> input xml to an xml format more suited to our application specific
> needs. One or several xslt-transformer steps would therefore be
> useful in the input pipeline.

And these sound like transformers to me, unless I'm really missing a 
big piece of the puzzle.

> As a last step in the input pipeline the sax events should be adapted
> to some binary format so that e.g. the business logic in the system
> can be applied to it. The xml input could e.g. be serialized to an
> octet stream for storage in a file (as text, xml, pdf, images, ...),
> transformed to java objects for storage in the session object, be put
> into an xml db or into an relational db.

Ah, now I'm starting to get it: you want to detach the pipeline output 
to the response!

Yes, I've been thinking about this a lot and I think I do have a 
solution (more below)

> Isn't this exactly what an output pipeline does?
> 
> Comparison to Output Pipelines
> ------------------------------
> 
> Both an input and an output pipeline consist of an adaptor from
> a binary format to sax events followed by a (possibly empty) sequence
> of transformers that take sax events as input as well as output. The
> last step is an adaptor from sax events to a binary format. The main
> difference (and the one I will focus on) is how the binary input and
> output is connected to the pipeline.
> 
> Let us look at an example of an output pipeline:
> 
> <match pattern="*.html">
>   <generate type="xml" src="{1}.xml"/>
>   <transform type="xsl" src="foo.xsl"/>
>   <serialize type="html"/>
> </match>
> 
> The input to the pipeline is controlled from the sitemap by the src
> attribute in the generator, while the output from the serializer can't
> be controlled from the sitemap, the context in which the sitemap is
> used is responsible for directing the output to an appropriate
> place. If the pipeline is used from a servlet, the output will be
> directed to the output stream of the response object in the servlet. If
> it is used from the command line, the output will be redirected to a
> file. If it is used in the cocoon: protocol the output will be
> redirected to be used as input from the src attribute of e.g. a
> generator or a transformer (cf. Carsten's and my posts in
> [1] about the semantics of the cocoon: protocol).
> 
> Here is another example:
> 
> <match pattern="bar.pdf">
>   <generate type="xsp" src="bar.xsp"/>
>   <transform type="xsl" src="foo.xsl"/>
>   <serialize type="pdf"/>
> </match>
> 
> In this case the binary input is taken from the object model and the
> component manager in Cocoon and the input file to the generator,
> "bar.xsp" describes how to extract the input and how to structure it
> as an xml document.
> 
> If we compare a Cocoon output pipeline with a unix pipeline, it always
> ignores standard input and always writes to standard output.

Sorry, but this is plain wrong.

Cocoon already ships generators that do *NOT* ignore the request input. 
Extending those components to perform higher-level functionality is 
*NOT* an architectural problem. Or at least, I don't see why it should be.

> An input
> pipeline would be the opposite: it would always read from standard
> input and ignore standard output. In Cocoon this would mean that the
> input source would be set by the context.

What context? do you imply that input pipelines don't work out of 
request parameter matching?

> In a servlet, input would be
> taken from the input stream of the request object. We could also have
> a writable cocoon: protocol where the input stream would be set by the
> user of the protocol, more about that later, (see also my post in the
> thread [1]).
> 
> An example:
> 
> <match pattern="**.xls">
>   <generate type="xls"/>
>   <transform type="xsl" src="foo.xsl"/>
>   <serialize type="xml" dest="context://repository/{1}.xml"/>
> </match>

I see two things here:

1) the current pipeline components don't seem to be asymmetric (and this 
goes somewhat against what you wrote at the beginning of your email), 
the asymmetry is in the fact that the serializer output is *always* 
bound to the client response. Am I right on this assumption?

2) what is this pipeline returning to the requesting client? This is not 
SMTP, we have to return something. Sure, we might simply return an HTTP 
header with some error code depending on the result of the 
serialization, but then people will ask how to control that part.

> Here the generator reads an Excel document from the input stream that
> is submitted by the context, and translates it to some xml format. The
> serializer writes its xml input to the file system. I reused the names
> generator and serializer partly because I didn't find any good names
> (deserializer is the inverse to serializer, but what is the inverse of
> a generator?)

There is none, because the opposite of generation would be destruction 
and you are definitely *not* destroying something, but still *generating* 
it. Where the data the generator uses comes from is *not* an 
architectural concern and should not modify the component's name.

>, and partly because it IMO would be the best solution if
> the generator and serializer from output pipelines can be extended to
> be usable in input pipelines as well.

I don't see the need to change anything in pipeline components. IoC 
keeps serializers totally unaware of where they are writing and 
Generators already have access to all request input.

> Several of the existing
> generators would be highly usable in input pipelines if they were
> modified in such a way that they read from "standard input" when no
> src attribute is given.

I lost you here.

> There are also some serializers that would be
> useful in the input pipelines as well; in this case the output stream
> given in the dest attribute should be used instead of the one that is
> supplied by the context. It can of course be problematic to extend the
> definition of generators and serializers as it might lead to
> back-compatibility problems.

Please, tell me what kind of changes to those interfaces you think you'd 
require to implement what you are proposing. It will be much easier to 
follow.

> Another example of an input pipeline:
> 
> <match pattern="in">
>   <generate type="textparser">
>     <parameter name="grammar" value="example.txt"/>
>   </generate>
>   <transform type="xsl" src="foo.xsl"/>
>   <serialize type="xsp" src="toSql.xsp"/>
> </match>
> 
> In this example the serializer modifies the content of components that
> can be found in the object model and the component manager. We use a
> hypothetical "output xsp" language to describe how to modify the
> environment. Such a language could be a little bit like xslt in the
> sense that it recursively applies templates (rules) with matching
> xpath patterns. But the template would contain custom tags that have
> side effects instead of just emitting xml. Could such a language be
> implemented in Jelly? It would be useful with custom tags that modify
> the session object, write to sql databases, and connect with business
> logic and so on.

This example is a security nightmare.

> Error Handling
> --------------
> 
> Error handling in input pipelines is even more important than in
> output pipelines: we must protect the system against non-well-formed
> input, and the user must be given detailed enough information about
> what's wrong, while in many cases having no access to log files or
> to the internals of the system.
> 
> Examples of things that can go wrong are that the input is not parsable
> or that it isn't valid with respect to some grammar or schema. If we
> want input pipelines to work in streaming mode, without unnecessary
> buffering, it is impossible to know that the input data is correct
> until all of it is processed. This means that the serializer might
> already have stored some parts of the pipeline data when an error is
> detected. I think that serializers for which faulty input data would be
> unacceptable should use some kind of transactions, and that they should
> be notified when something goes wrong earlier in the pipeline so that
> they are able to roll back the transaction.
> 
> I have not studied the error handling system in Cocoon, maybe there
> already are mechanisms that could be used in input pipelines as well?

It's entirely possible to have 'ValidationTransformers' that trigger an 
exception if something is wrong, and this exception will be picked up by 
the usual error handler.
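To make the 'ValidationTransformer' idea concrete, here is a minimal, language-neutral sketch in Python (not Cocoon's actual SAX interfaces; all names are illustrative): a validating step passes events through untouched and fails fast on bad input, and an outer handler plays the role of the sitemap's error handler.

```python
# Hypothetical sketch of a validating transformer in an event pipeline.
# Events are modeled as (tag, text) tuples instead of real SAX calls.

class InvalidInputError(Exception):
    """Raised by the validator when the input violates the schema."""

def validating_transformer(events, allowed_tags):
    """Pass events through unchanged; fail fast on unknown elements."""
    for tag, text in events:
        if tag not in allowed_tags:
            raise InvalidInputError(f"unexpected element: {tag}")
        yield tag, text

def run_pipeline(events, allowed_tags):
    """Run the pipeline; on error, switch to an 'error document'."""
    try:
        return list(validating_transformer(events, allowed_tags))
    except InvalidInputError as err:
        # the "usual error handler": select an error pipeline instead
        return [("error", str(err))]
```

Because the validator streams, nothing is buffered; the error simply propagates as an exception, which is exactly what lets the sitemap's error handler take over.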

> 
> In Sitemaps
> -----------
> 
> In a sitemap an input pipeline could be used e.g. for implementing a
> web service:
> 
> <match pattern="myservice">
>   <generate type="xml">
>     <parameter name="scheme" value="myInputFormat.scm"/>
>   </generate>
>   <transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
>   <serialize type="dom-session" non-terminating="true">
>     <parameter name="dom-name" value="input"/>
>   </serialize>
>   <select type="pipeline-state">
>     <when test="success">
>       <act type="my-business-logic"/>
>       <generate type="xsp" src="collectTheResult.xsp"/>
>       <serialize type="xml"/>
>     </when>
>     <when test="non-valid">
>       <!-- produce an error document -->
>     </when>
>   </select>
> </match>
> 
> Here we have first an input pipeline that reads and validates xml
> input, transforms it to some appropriate format and stores the result
> as a dom-tree in a session attribute. A serializer normally means that
> the pipeline should be executed and thereafter an exit from the
> sitemap. I used the attribute non-terminating="true", to mark that
> the input pipeline should be executed but that there is more to do in
> the sitemap afterwards.
> 
> After the input pipeline there is a selector that selects the output
> pipeline depending on whether the input pipeline succeeded or not. This
> use of selection has some relation to the discussion about pipe-aware
> selection (see [3] and the references therein). It would solve at
> least my main use cases for pipe-aware selection, without having its
> drawbacks: Stefano considered pipe-aware selection a mix of concerns,
> selection should be based on meta data (pipeline state) rather than on
> data (pipeline content). There were also some people who didn't like
> my use of buffering of all input to the pipe-aware selector. IMO the
> use of selectors above solves both of these issues.
> 
> The output pipeline starts with an action that takes care of the
> business logic for the application. This is IMHO a more legitimate use
> for actions than the current mix of input handling and business logic.

Wouldn't the following pipeline achieve the same functionality you want 
without requiring changes to the architecture?

  <match pattern="myservice">
   <generate type="payload"/>
   <transform type="validator">
     <parameter name="scheme" value="myInputFormat.scm"/>
   </transform>
   <select type="pipeline-state">
    <when test="valid">
     <transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
     <transform type="my-business-logic"/>
     <serialize type="xml"/>
    </when>
    <otherwise>
     <!-- produce an error document -->
    </otherwise>
   </select>
  </match>

> In Flowscripts
> --------------
> 
> IIRC the discussion and examples of input for flowscripts so far have
> mainly dealt with request-parameter-based input. If we want to use
> flowscripts for describing e.g. web service flow, more advanced input
> handling is needed. IMO it would be an excellent SoC to use output
> pipelines for the presentation of the data used in the system, input
> pipelines for going from input to system data, java objects (or some
> other programming language) for describing business logic working on
> the data within the system, and flowscripts for connecting all this in
> an appropriate temporal order.

A while ago, I proposed the addition of a new flowscript method that 
would be something like this

  invoquePipeline(uri, parameters, outputStream)

that means that the flow will be calling the pipeline associated with 
the given URI, but the serializer will write on the given outputStream.

Since there were already too many irons in the fire, I wanted to see the 
flowscript settle down before starting to push for this again, but your 
RT brings back pressure on this concept and I think this is all we need 
to remove the asymmetry from cocoon pipelines.
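The proposed flow-level call can be sketched as follows (Python stand-in, not Cocoon's flowscript API; `invoke_pipeline` and `upper_pipeline` are hypothetical names): the key point is that the caller, not the servlet, supplies the stream the serializer writes to.

```python
import io

# Hypothetical sketch of invoquePipeline(uri, parameters, outputStream):
# the pipeline's serialized bytes go to whatever stream the caller
# supplies, instead of being hard-wired to the servlet response.

def invoke_pipeline(pipeline, parameters, output_stream):
    """Run a pipeline and direct its serialized output to the
    caller-supplied stream."""
    data = pipeline(parameters)           # generate + transform steps
    output_stream.write(data.encode())    # serialize to the given stream

def upper_pipeline(params):
    """Stand-in pipeline: wraps an upper-cased payload in <doc>."""
    return f"<doc>{params['payload'].upper()}</doc>"

# The caller chooses the destination: a buffer, a file, the response...
buf = io.BytesIO()
invoke_pipeline(upper_pipeline, {"payload": "hi"}, buf)
```

The same mechanism would let flow logic pipe one pipeline's output into a file, a session attribute, or another pipeline, which is the detachment the RT asks for.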

> For Reusability Between Blocks
> ------------------------------
> 
> There have been some discussions about how to reuse functionality
> between blocks in Cocoon (see the threads [1] and [2] for
> background). IMO (cf. my post in the thread [1]), a natural way of
> exporting pipeline functionality is by extending the cocoon pseudo
> protocol, so that it accepts input as well as produces output. The
> protocol should also be extended so that input as well as output can
> be any octet stream, not just xml.

The above flowscript method could use the URI to connect to 
block-contained pipelines... but I'm not sure if this would cover the 
entire solution space.

> If we extend generators so that their input can be set by the
> environment (as proposed in the discussion about input pipelines), we
> have what is needed for creating a writable cocoon protocol. The web
> service example in the section "In Sitemaps" could also be used as an
> internal service, exported from a block.
> 
> Both input and output for the extended cocoon protocol can be both
> xml and non-xml; this gives us 4 cases:
> 
> xml input, xml output: could be used from a "pipeline"-transformer,
> the input to the transformer is redirected to the protocol and the
> output from the protocol is redirected to the output of the
> transformer.
> 
> non-xml input, xml output: could be used from a generator.
> 
> xml input, non-xml output: could be used from a serializer.
> 
> non-xml input, non-xml output: could be used from a reader if the
> input is ignored, from a "writer" if the output is ignored and from a
> "reader-writer" if both are used.
> 
> Generators that accept xml should of course also accept sax events
> for efficiency reasons, and serializers that produce xml should for
> the same reason also be able to produce sax events.

I still can't see any difference between a reader and a writer (or an 
input-generator vs. output-generator) in terms of interface methods. 
They look totally similar to me. It's the way the sitemap uses them that 
changes their behavior. IoC should enforce that.
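The IoC point above can be made concrete with a small sketch (Python stand-in, not Cocoon's Java Generator interface; `xml_generator` is a hypothetical name): one and the same component works for publishing and for input handling, because it only sees a stream handed to it by the environment and never decides where that stream comes from.

```python
import io

# Sketch: a "generator" that turns an octet stream into events.
# Whether the stream is a file named in a src attribute or the
# request body is decided entirely by the caller (the environment),
# so no interface change is needed for "input pipelines".

def xml_generator(stream):
    """Adapt an octet stream to (event, text) tuples; trivial
    stand-in for real SAX parsing."""
    text = stream.read().decode()
    return [("characters", text)]

# Two different environments, one unchanged component:
file_like = io.BytesIO(b"from a src attribute")
request_body = io.BytesIO(b"from the request input stream")
```

This is the sense in which the sitemap's usage, not the component's interface, makes something an "input" or an "output" component.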

> Conclusion
> ----------
> 
> The ability to handle structured input (e.g. xml) in a convenient way,
> will probably be an important requirement on webapp frameworks in the
> near future.

Agreed.

> By removing the asymmetry between generators and serializers, by letting
> the input of a generator be set by the context and the output of a
> serializer be set from the sitemap, Cocoon could IMO be as good in
> handling input as it is today in producing output.

I don't understand what you mean by 'setting the input by the context'.

As far as allowing the serializer to have a destination semantic in the 
sitemap, I'd be against it because I see it more harmful than useful.

I do agree that serializers should not be connected only to the servlet 
output stream, but this is not a concern of the pipeline itself, but of 
who assembles the pipeline... and, IMO, the flow logic is what is 
closest to that that we have today.

> This would also make it possible to introduce a writable as well as
> readable Cocoon pseudo protocol, that would be a good way to export
> functionality from blocks.

I agree that a writable cocoon: protocol is required, especially for 
blocks, but this doesn't mean we have to change the sitemap semantics 
for that.

> There are of course many open questions, e.g. how to implement those
> ideas without introducing too much back-incompatibility.

The best idea is to avoid changing what doesn't require changes and 
work to minimize architectural changes from that point on.

But enough for now.

And thanks for keeping up with the input-oriented discussions :-)

-- 
Stefano Mazzocchi                               <st...@apache.org>
--------------------------------------------------------------------





Re: [RT] Input Pipelines (long)

Posted by Daniel Fagerstrom <da...@nada.kth.se>.
Nicola Ken Barozzi wrote:
> 
> Daniel Fagerstrom wrote:
> [...]
> 
> Cocoon is symmetric, if you see it as it really is, a system that 
> transforms a Request into a Response.
> 
> The problem arises in the way we have defined the request and the 
> response: The Request is a URL, the response is a Stream.
> 
> So actually Cocoon transforms URIs in a stream.
> 
> The sitemap is the system that demultiplexes URIs by associating them 
> with the actual source of the data. This makes cocoon richer than a system 
> that just hands an entity to transform: Cocoon uses indirect references 
> (URLs) instead.
> 
> The Stream as an input is a specialization, so I can say in the request 
> to get stuff from the stream.
> 
> More on this later.
> 
>> In a sitemap an input pipeline could be used e.g. for implementing a
>> web service:
>>
>> <match pattern="myservice">
>>   <generate type="xml">
>>     <parameter name="scheme" value="myInputFormat.scm"/>
>>   </generate>
>>   <transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
>>   <serialize type="dom-session" non-terminating="true">
>>     <parameter name="dom-name" value="input"/>
>>   </serialize>
>>   <select type="pipeline-state">
>>     <when test="success">
>>       <act type="my-business-logic"/>
>>       <generate type="xsp" src="collectTheResult.xsp"/>
>>       <serialize type="xml"/>
>>     </when>
>>     <when test="non-valid">
>>       <!-- produce an error document -->
>>     </when>
>>   </select>
>> </match>
> 
> 
> What you correctly point out is that the Generation phase does not get 
> the source, but just transforms it to SAX.
<snip/>
> But IMHO this has a deficiency of fixing the source from the input.
My intention was that when no src attribute is given, the generator 
should read the input stream.

> Think about having good Source Protocols.
> 
> We could write:
> 
>  <match pattern="myservice">
>    <generate type="xml" src="inputstream:myInputFormat.scm"/>
>    ...
>  </match>
> 
> This can easily make all my Generators able to work with the new system 
> right away.
This seems to be a better solution. Can you please expand on why you put 
the schema in the inputstream: protocol?

> 
>> Here we have first an input pipeline that reads and validates xml
>> input, transforms it to some appropriate format and store the result
>> as a dom-tree in a session attribute. A serializer normally means that
>> the pipeline should be executed and thereafter an exit from the
>> sitemap. I used the attribute non-terminating="true", to mark that
>> the input pipeline should be executed but that there is more to do in
>> the sitemap afterwards.
> 
> 
> Pipelines can already call one another.
> We add the serializer at the end, but it's basically skipped, thus 
> making your pipeline example.
The idea is to use two pipelines, executed in sequence, for processing a 
post. First comes the input pipeline, which is responsible for reading 
the input data, transforming it to an appropriate format and storing it. 
After that the stored data can be used by the business logic, which can 
be called from an action. After the action an ordinary output pipeline 
is executed for publishing the result of the business logic, for sending 
the next form page etc.

In this scenario the serializer in the input pipeline is responsible for 
storing the input data and can thus not be skipped. Furthermore as we 
are going to execute two pipelines in sequence, the first serializer 
must not mean an exit from the sitemap as it normally would do.

I think it is better SoC and reuse of components to let a serializer be 
responsible for storing input data than to use transformers for that. 
The WriteDOMSessionTransformer, the SourceWritingTransformer, the 
SQLTransformer used for inserting data and the session transformer would 
IMHO be more natural as serializers.
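The store-and-roll-back behaviour argued for above (a serializer that stores input, combined with the transaction idea from the error-handling section) can be sketched like this (Python stand-in; `DomSessionSerializer` and the session dict are hypothetical, not Cocoon components):

```python
# Sketch of a serializer-as-store with transaction semantics: events
# are buffered and only committed to the session if the whole input
# pipeline finishes without error, so faulty input leaves no trace.

class DomSessionSerializer:
    def __init__(self, session, name):
        self.session, self.name = session, name
        self.buffer = []

    def event(self, item):
        self.buffer.append(item)

    def commit(self):
        self.session[self.name] = list(self.buffer)

def run_input_pipeline(events, session):
    """Returns the pipeline state used by the following selector."""
    ser = DomSessionSerializer(session, "input")
    try:
        for e in events:
            if e is None:               # stand-in for a parse error
                raise ValueError("broken input")
            ser.event(e)
        ser.commit()                    # store only on success
        return "success"
    except ValueError:
        return "non-valid"              # nothing was stored
```

The returned state string is exactly what a pipeline-state selector would branch on, without ever inspecting pipeline content.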

> I would think that with the blocks discussion there has been some 
> advancement on the definition of pipeline fragments.
> I didn't follow it closely though, anyone care to comment?
> 
>> After the input pipeline there is a selector that select the output
>> pipeline depending of if the input pipeline succeed or not. This use
>> of selection have some relation to the discussion about pipe-aware
>> selection (see [3] and the references therein). It would solve at
>> least my main use cases for pipe-aware selection, without having its
>> drawbacks: Stefano considered pipe-aware selection mix of concern,
>> selection should be based on meta data (pipeline state) rather than on
>> data (pipeline content). There were also some people who didn't like
>> my use of buffering of all input to the pipe-aware selector. IMO the
>> use of selectors above solves booth of these issues.
> 
> 
> I don't see this. Can you please expand here?
1. Selection should be based on pipeline state instead of pipeline data. 
   First the input pipeline is executed and is able to set the state of 
the pipeline. After that ordinary selects can be used for deciding how 
to construct the output pipeline. The selectors for the output pipeline 
have no access to pipeline content and are used in exactly the same way 
as selectors always are used.

2. No use of buffering within the pipeline. IIRC some people were 
concerned that pipe-aware selection, based on buffering of the sax 
events before the selection, could be very inefficient if there is much 
data in the pipeline. As my main use case for pipe-aware selection was 
to use it after transformers with side effects, and after validation of 
user-submitted input data, I never saw this as a problem: the amount of 
data in the mentioned cases typically is quite small. Anyway, with input 
pipelines selection is restricted to cases where the input was going to 
be stored by the system anyhow.

> [...]
> 
>> In Flowscripts
>> --------------
>>
>> IIRC the discussion and examples of input for flowscripts so far have
>> mainly dealt with request-parameter-based input. If we want to use
>> flowscripts for describing e.g. web service flow, more advanced input
>> handling is needed. IMO it would be an excellent SoC to use output
>> pipelines for the presentation of the data used in the system, input
>> pipelines for going from input to system data, java objects (or some
>> other programming language) for describing business logic working on
>> the data within the system, and flowscripts for connecting all this in
>> an appropriate temporal order.
> 
> 
> Hmmm, this seems like a compelling use case.
> Could you please add a concrete use-case/example for this?
> Thanks :-)
One use case (if combined with persistent storage of continuations) 
would be a workflow system.

Besides that, input pipelines are IMO very useful for handling request 
parameters from forms as well. In all webapps that we build at my 
company, we use absolute xpaths as request parameter names and then use 
a generator that builds an xml document from the name/value pairs. This 
xml input is then possibly transformed to another format and thereafter 
stored in a db or as a dom tree in a session attribute.

A flowscript that uses input pipelines might look like:

handleForm("formPage1.html", "storeData1");
if (objectModel["state"] == "success")
   doBusinessLogic1(...);
...

Where formPage1.html is an output pipeline that produces a form and 
storeData1 handles and stores the input.
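The xpath-named-parameter scheme described above can be sketched as follows (Python stand-in; `params_to_xml` is a hypothetical helper, not a shipped Cocoon generator, and it assumes all parameter paths share one root element):

```python
import xml.etree.ElementTree as ET

# Sketch: fold request parameters whose names are absolute xpaths
# (e.g. "/person/name") into a single XML document, the way the
# form-handling generator described above would.

def params_to_xml(params):
    root = None
    for path, value in params.items():
        parts = [p for p in path.split("/") if p]
        if root is None:
            root = ET.Element(parts[0])     # first path fixes the root
        node = root
        for name in parts[1:]:
            child = node.find(name)
            if child is None:               # create missing elements
                child = ET.SubElement(node, name)
            node = child
        node.text = value
    return ET.tostring(root, encoding="unicode")
```

The resulting document is then an ordinary pipeline payload, ready for an xslt step and a storing serializer.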
> 
>> For Reusability Between Blocks
>> ------------------------------
>>
>> There have been some discussions about how to reuse functionality
>> between blocks in Cocoon (see the threads [1] and [2] for
>> background). IMO (cf. my post in the thread [1]), a natural way of
>> exporting pipeline functionality is by extending the cocoon pseudo
>> protocol, so that it accepts input as well as produces output. The
>> protocol should also be extended so that input as well as output can
>> be any octet stream, not just xml.
>>
>> If we extend generators so that their input can be set by the
>> environment (as proposed in the discussion about input pipelines), we
>> have what is needed for creating a writable cocoon protocol. The web
>> service example in the section "In Sitemaps" could also be used as an
>> internal service, exported from a block.
>>
>> Both input and output for the extended cocoon protocol can be both
>> xml and non-xml; this gives us 4 cases:
>>
>> xml input, xml output: could be used from a "pipeline"-transformer,
>> the input to the transformer is redirected to the protocol and the
>> output from the protocol is redirected to the output of the
>> transformer.
>>
>> non-xml input, xml output: could be used from a generator.
>>
>> xml input, non-xml output: could be used from a serializer.
>>
>> non-xml input, non-xml output: could be used from a reader if the
>> input is ignored, from a "writer" if the output is ignored and from a
>> "reader-writer" if both are used.
>>
>> Generators that accept xml should of course also accept sax events
>> for efficiency reasons, and serializers that produce xml should for
>> the same reason also be able to produce sax events.
> 
> 
> Also this seems interesting.
> 
> Please add concrete examples here to, possibly applied to blocks.
> I know it's hard, but it would really help.
What I tried to describe is just a somewhat different approach to how to 
describe reusable pipeline fragments between blocks, so for use cases 
please see Sylvain's and Stefano's original posts in the threads [1] and [2].

Let's take a look at an example from Sylvain's post (in [1]) to illustrate 
what I have in mind:

    <map:match pattern="a_page">
      <map:generate src="an_xdoc.xml"/>
      <map:transform type="pipeline" src="xdoc2skinnedHtml"/>
      <map:serialize type="html"/>
    </map:match>

    <map:match pattern="xdoc2skinnedHtml">
      <map:generate type="dont_care"/>
      <map:transform type="i18n"/>
      <map:transform type="xdoc2html.xsl"/>
      <map:transform type="htmlskin.xsl"/>
      <map:serialize type="dont_care"/>
    </map:match>

Here the idea is that when xdoc2skinnedHtml is used from a pipeline 
transformer, the generator and the serializer are not used and only the 
sub pipeline consisting of the three transformers in the middle is used. 
This behaviour is inspired by the cocoon: protocol where the serializer 
is skipped.

Several people found the removal of parts of generators and 
serializers, depending on the usage context of the pipeline, confusing. 
Carsten wrote that:
"It is correct, that internally in most cases the serializer
of a pipeline is ignored, when the cocoon protocol is used.
But this is only because of performance."
And that a pipeline used from the cocoon protocol is supposed to end with 
an xml serializer. I agree with this and think that it would be better 
to express the example above as (cf. my post in [1]):

    <map:match pattern="a_page">
      <map:generate src="an_xdoc.xml"/>
      <map:transform type="pipeline" src="cocoon:xdoc2skinnedHtml"/>
      <map:serialize type="html"/>
    </map:match>

    <map:match pattern="xdoc2skinnedHtml">
      <map:generate src="inputstream:xdoc.scm"/>
      <map:transform type="i18n"/>
      <map:transform type="xdoc2html.xsl"/>
      <map:transform type="htmlskin.xsl"/>
      <map:serialize type="xml"/>
    </map:match>

Here the cocoon: protocol is supposed to be a writable source. The 
function of the pipeline transformer is that it serializes its xml 
input, redirects it to the writable source in the src attribute, parses 
the xml output stream from the source, and outputs the result from the 
parser as sax events. Of course the serialize-parse steps should be 
optimized away, but this should be considered an implementation detail, 
not part of the semantics.

By further generalizing the cocoon: protocol so that it allows non-xml 
output (and input), it can be used for the pipeline serializer that 
Sylvain proposed as well. For the pipeline generator the cocoon: 
protocol can be used as is.

> 
> It seems that what you propose Cocoon already mostly has, but it's more 
> the use-case and some minor additions that have to be put forward.
> 
>> Conclusion
>> ----------
>>
>> The ability to handle structured input (e.g. xml) in a convenient way,
>> will probably be an important requirement on webapp frameworks in the
>> near future.
>>
>> By removing the asymmetry between generators and serializers, by letting
>> the input of a generator be set by the context and the output of a
>> serializer be set from the sitemap, Cocoon could IMO be as good in
>> handling input as it is today in producing output.
> 
> 
> Cocoon already does this, no?
> Can't we use the cocoon:// protocol to get the results of a pipeline 
> from another one? What would change?
As said above, the cocoon protocol should be writable as well as 
readable and allow for non-xml input and output. The block protocol 
could use the same ideas and thus give a good way of exporting 
functionality.

To realize the above ideas we would need to implement the inputstream 
protocol, which in turn would require that the Request interface is 
extended with a getInputStream() method. The cocoon protocol should be 
extended as described. The proposed extension of the serializer for use 
in input pipelines would require serializers to implement 
SitemapModelComponent.
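
A sketch of what the extended Request interface might look like 
(hypothetical interface and method names, not the actual Cocoon 
environment API):

```java
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical extension of Cocoon's environment Request interface:
 * besides the usual parameter access, it exposes the raw request body
 * so that an "inputstream:" protocol can feed it into a generator.
 */
public interface InputAwareRequest {
    /** Existing request-parameter access, shown for context. */
    String getParameter(String name);

    /**
     * Proposed addition: the octet stream of the request body,
     * e.g. a POSTed SOAP envelope or an XForms instance document.
     */
    InputStream getInputStream() throws IOException;
}
```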

Thank you for your comments.

/Daniel Fagerstrom

<snip/>

>> References
>> ----------
>>
>> [1] [RT] Using pipeline as sitemap components (long)
>> http://marc.theaimsgroup.com/?t=103787330400001&r=1&w=2
>>
>> [2] [RT] reconsidering pipeline semantics
>> http://marc.theaimsgroup.com/?t=102562575200001&r=2&w=2
>>
>> [3] [Contribution] Pipe-aware selection
>> http://marc.theaimsgroup.com/?l=xml-cocoon-dev&m=101735848009654&w=2
---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org


Re: [RT] Input Pipelines (long)

Posted by Nicola Ken Barozzi <ni...@apache.org>.
Daniel Fagerstrom wrote:
[...]

Cocoon is symmetric, if you see it as it really is: a system that 
transforms a Request into a Response.

The problem arises in the way we have defined the request and the 
response: The Request is a URL, the response is a Stream.

So actually Cocoon transforms URIs into a stream.

The sitemap is the system that demultiplexes URIs by associating them 
with the actual source of the data. This makes Cocoon richer than a 
system that is just handed an entity to transform: Cocoon uses indirect 
references (URLs) instead.

The Stream as an input is a specialization: in the request I can say 
to get stuff from the stream.

More on this later.

> In a sitemap an input pipeline could be used e.g. for implementing a
> web service:
> 
> <match pattern="myservice">
>   <generate type="xml">
>     <parameter name="scheme" value="myInputFormat.scm"/>
>   </generate>
>   <transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
>   <serialize type="dom-session" non-terminating="true">
>     <parameter name="dom-name" value="input"/>
>   </serialize>
>   <select type="pipeline-state">
>     <when test="success">
>       <act type="my-business-logic"/>
>       <generate type="xsp" src="collectTheResult.xsp"/>
>       <serialize type="xml"/>
>     </when>
>     <when test="non-valid">
>       <!-- produce an error document -->
>     </when>
>   </select>
> </match>

What you correctly point out is that the Generation phase should not get 
the source, but just transform it to SAX.

  <match pattern="myservice">
    <generate type="xml">
      <parameter name="scheme" value="myInputFormat.scm"/>
    </generate>
    <transform type="xsl" src="myInputFormat2MyStorageFormat.xsl"/>
    <serialize type="dom-session" non-terminating="true">
      <parameter name="dom-name" value="input"/>
    </serialize>
    <select type="pipeline-state">
      <when test="success">
        <act type="my-business-logic"/>
        <generate type="xsp" src="collectTheResult.xsp"/>
        <serialize type="xml"/>
      </when>
      <when test="non-valid">
        <!-- produce an error document -->
      </when>
    </select>
  </match>

But IMHO this has the deficiency of fixing the source to the input.
Think about having good Source Protocols.

We could write:

  <match pattern="myservice">
    <generate type="xml" src="inputstream:myInputFormat.scm"/>
    ...
  </match>

This can easily make all my Generators able to work with the new system 
right away.
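
Such an inputstream: source could be sketched roughly as follows 
(hypothetical class names, not actual Cocoon API): it ignores the URI 
path and simply resolves to the body of the current request, so any 
existing generator can read POSTed data through its normal src attribute:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Hypothetical sketch of an "inputstream:" source. The request body is
 * captured from the environment when the source is resolved; a generator
 * then reads it exactly as it would read any other source.
 */
public class InputStreamSourceSketch {
    private final byte[] requestBody; // captured from the current request

    public InputStreamSourceSketch(byte[] requestBody) {
        this.requestBody = requestBody;
    }

    /** What a generator calls to obtain the document to parse. */
    public InputStream getInputStream() throws IOException {
        return new ByteArrayInputStream(requestBody);
    }
}
```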

> Here we have first an input pipeline that reads and validates xml
> input, transforms it to some appropriate format and store the result
> as a dom-tree in a session attribute. A serializer normally means that
> the pipeline should be executed and thereafter an exit from the
> sitemap. I used the attribute non-terminating="true", to mark that
> the input pipeline should be executed but that there is more to do in
> the sitemap afterwards.

Pipelines can already call one another.
We add the serializer at the end, but it's basically skipped, thus 
giving your pipeline example.

I would think that with the blocks discussion there has been some 
advancement on the definition of pipeline fragments.
I didn't follow it closely though, anyone care to comment?

> After the input pipeline there is a selector that selects the output
> pipeline depending on whether the input pipeline succeeded or not. This use
> of selection has some relation to the discussion about pipe-aware
> selection (see [3] and the references therein). It would solve at
> least my main use cases for pipe-aware selection, without having its
> drawbacks: Stefano considered pipe-aware selection mix of concern,
> selection should be based on meta data (pipeline state) rather than on
> data (pipeline content). There were also some people who didn't like
> my use of buffering of all input to the pipe-aware selector. IMO the
> use of selectors above solves both of these issues.

I don't see this. Can you please expand here?

[...]
> In Flowscripts
> --------------
> 
> IIRC the discussion and examples of input for flowscripts this far have
> mainly dealt with request-parameter-based input. If we want to use
> flowscripts for describing e.g. web service flow, more advanced input
> handling is needed. IMO it would be an excellent SOC to use output
> pipelines for the presentation of the data used in the system, input
> pipelines for going from input to system data, java objects (or some
> other programming language) for describing business logic working on
> the data within the system, and flowscripts for connecting all this in
> an appropriate temporal order.

Hmmm, this seems like a compelling use case.
Could you please add a concrete use-case/example for this?
Thanks :-)

> For Reuseability Between Blocks
> -------------------------------
> 
> There have been some discussions about how to reuse functionality
> between blocks in Cocoon (see the threads [1] and [2] for
> background). IMO (cf. my post in the thread [1]), a natural way of
> exporting pipeline functionality is by extending the cocoon pseudo
> protocol, so that it accepts input as well as produces output. The
> protocol should also be extended so that input as well as output can
> be any octet stream, not just xml.
> 
> If we extend generators so that their input can be set by the
> environment (as proposed in the discussion about input pipelines), we
> have what is needed for creating a writable cocoon protocol. The web
> service example in the section "In Sitemaps" could also be used as an
> internal service, exported from a block.
> 
> Both input and output for the extended cocoon protocol can be both
> xml and non-xml; this gives us 4 cases:
> 
> xml input, xml output: could be used from a "pipeline"-transformer,
> the input to the transformer is redirected to the protocol and the
> output from the protocol is redirected to the output of the
> transformer.
> 
> non-xml input, xml output: could be used from a generator.
> 
> xml input, non-xml output: could be used from a serializer.
> 
> non-xml input, non-xml output: could be used from a reader if the
> input is ignored, from a "writer" if the output is ignored, and from a
> "reader-writer" if both are used.
> 
> Generators that accept xml should of course also accept sax-events
> for efficiency reasons, and serializers that produce xml should for the
> same reason also be able to produce sax-events.

Also this seems interesting.

Please add concrete examples here too, possibly applied to blocks.
I know it's hard, but it would really help.

It seems that what you propose Cocoon already mostly has, but it's more 
the use-case and some minor additions that have to be put forward.

> Conclusion
> ----------
> 
> The ability to handle structured input (e.g. xml) in a convenient way,
> will probably be an important requirement on webapp frameworks in the
> near future.
> 
> By removing the asymmetry between generators and serializers, by letting
> the input of a generator be set by the context and the output of a
> serializer be set from the sitemap, Cocoon could IMO be as good in
> handling input as it is today in producing output.

Cocoon already does this, no?
Can't we use the cocoon:// protocol to get the results of a pipeline 
from another one? What would change?

> This would also make it possible to introduce a writable as well as
> readable Cocoon pseudo protocol, that would be a good way to export
> functionality from blocks.

Please expand on this.

> There are of course many open questions, e.g. how to implement those
> ideas without introducing too much backward incompatibility.

If we see the use cases, it would be much easier.

Your ideas are interesting, and I see this asymmetry too.
If you expand on the above areas, it would really help me.

Thanks :-)

> 
> References
> ----------
> 
> [1] [RT] Using pipeline as sitemap components (long)
> http://marc.theaimsgroup.com/?t=103787330400001&r=1&w=2
> 
> [2] [RT] reconsidering pipeline semantics
> http://marc.theaimsgroup.com/?t=102562575200001&r=2&w=2
> 
> [3] [Contribution] Pipe-aware selection
> http://marc.theaimsgroup.com/?l=xml-cocoon-dev&m=101735848009654&w=2


-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------

