Posted to dev@cocoon.apache.org by Andreas Hochsteger <e9...@student.tuwien.ac.at> on 2003/02/09 21:47:51 UTC
[PROPOSAL] Cocoon Science Fiction
Hi Cocooners!
Sorry for this (very) long proposal below, but I think it's definitely worth a
read. If not, at least you can give me some feedback about your opinion ;-)
Bye,
Andreas Hochsteger
1 Contents
==========
1 Contents
2 Prologue
3 Introduction
4 Pipeline Types
5 Data Formats
5.1 Data Format Definition
5.2 Inheritance
5.3 A word about MIME Types
5.4 Data Handlers
5.5 Data Format Determination
6 Pipeline Components
6.1 Producers
6.2 Consumers
6.3 Converters
6.4 Filters
6.5 Aggregators
6.6 Actions
6.7 Redirectors
6.8 Matchers
6.9 Branches
6.10 Exceptions
7 Protocol Independence
7.1 Web Services
7.2 Mail Server
7.3 Mailing List Manager
7.4 What else?
8 Protocol Handler
8.1 Component Definition
8.2 Protocol Binding
8.3 The Handler's Task
8.4 Mapping to Pipelines
9 Pipelines as Pipeline Components
9.1 Producer Pipelines
9.2 Consumer Pipelines
9.3 Converter Pipelines
9.4 Filter Pipelines
9.5 Action Pipelines
10 Configuration Files
10.1 cocoon.xconf
10.2 components.xconf
10.3 protocols.xconf
10.4 bindings.xconf
10.5 protocol-mappings.xconf
10.6 data-formats.xconf
10.7 sitemap.xmap
10.8 Config File Hierarchy
11 Converting old sitemaps to new sitemaps
11.1 Generators
11.2 Transformers
11.3 Readers
11.4 Serializers
11.5 Selectors
12 Use Cases
12.1 File Upload
12.2 Combining several pipelines
12.3 Unix Pipes
12.4 Image Processing
12.5 PDF decompiling
12.6 Music Processing
13 Conclusion
14 TODO
15 References
16 Appendix
16.1 Data Formats
16.2 Pipeline Components
2 Prologue
==========
I wrote most of this proposal, and another unfinished one, while I had to
stay in hospital for two weeks at the end of November 2002. Luckily I was
armed with my notebook loaded with a CVS snapshot of Cocoon and the great
Cocoon book by Matthew Langham and Carsten Ziegeler.
So I could finally do something productive ;-)
After returning home I had no time to finish it and submit it to the public.
In the meantime some discussion on similar topics came up on the cocoon-dev
mailing list (see [1]) and I forced myself to find some time again to work on
this proposal and finally publish it there.
Perhaps I'll find some time to convert it to an XML format (e.g. DocBook) and
write a converter to publish it on the Cocoon Documentation Wiki, but first
let's discuss it a bit on the mailing list.
WARNING:
I have to say that this proposal is intended for open-minded people only,
who aren't afraid to take a look beyond the limits. Anything I'm writing
here might be total crap to you, so feel free to ignore it, or send your
flames to /dev/null ;-)
If you are still interested, please join this journey to a world where no man
has gone before ...
3 Introduction
==============
I like the Cocoon pipeline processing concept very much.
I like it so much that I think it is a pity to limit it to XML
processing (although I agree that this is the most interesting
application).
I'm sure some of you have wanted to build applications the same way Unix
shell pipes work. Cocoon was a big step in this direction, but it is
only applicable to processing XML data. There are so many cases where
pipeline processing of data (no matter if it is XML, plain text or binary
data) is done today, but we are lacking a generic and declarative way to unify
these processing steps. Cocoon is well suited for this task through its
clean and easy to understand yet powerful pipeline concept.
4 Pipeline Types
================
I tried to design several pipeline variants, but after thinking a while they
were all still too limited for the way I wanted them to work.
So here's another try, starting with some hypotheses:
1. A pipeline can produce data
2. A pipeline can consume data
3. A pipeline can convert data
4. A pipeline can filter data
5. A pipeline can accept a certain data format as input
6. A pipeline can produce a certain data format as output
7. Pipeline components follow the same hypotheses (1-6)
8. Only pipeline components with compatible data formats can be arranged next
to each other
Based on these hypotheses you can construct pipelines, which just consume
data, just produce data, both consume and produce data or even neither
consume nor produce data (even this can make sense, as you'll see in section
"9.5 Action Pipelines").
I think these hypotheses are simple enough to understand and flexible enough
to base the rest of this proposal on. So let's try ...
To define a pipeline we need to be able to specify the input and output
format.
We can do this with the help of these two attributes:
- input-format="..."
- output-format="..."
They additionally specify the default input format for the first processing
component and the default output format for the last processing component.
Example:
<map:pipeline input-format="format1" output-format="format2">
...
</map:pipeline>
This pipeline consumes the data format "format1" and produces the data format
"format2". Which data formats are possible and how they are specified is
shown in the next section.
5 Data Formats
==============
With "data format" I mean something like XML, plain text, png, mp3, ...
I'm not yet really sure here, how we should specify data formats, so I'll try
to start with some requirements:
1. They should be easy to remember and to specify ;-)
2. It should be possible to create derived data formats (-> inheritance)
3. It should be possible to specify additional information (e.g. MIME type,
DTD/Schema for XML, ...)
4. Pipelines which accept a certain data format as input can be fed with
derived data formats
5. We should not reinvent standards, which are already suited for this task
(but I fear, there does not yet exist something suitable)
To make it easier for us to begin with the task of defining data formats,
let's assume, we have three basic data formats called "abstract", "binary"
and "text". The format "abstract" will be explained later, but "binary" and
"text" should be clear to everyone.
5.1 Data Format Definition
--------------------------
Here's a try to specify a hierarchy of data formats:
<data:formats>
<!-- #### Super data format #### -->
<!--
The following format is the base for all other formats (-> compare to
java.lang.Object).
Although it is called the 'any' data format, this name is not prepended to the
paths of derived data formats, as is the case for all other data formats.
-->
<data:format name="any"
impl="org.apache.cocoon.data.handler.text.DefaultHandler">
<data:param-def name="mime-type" default="application/octet-stream"/>
<data:param-def name="spec" default=""/> <!-- URL to the specification of
this data format -->
</data:format>
<!-- #### Abstract data formats #### -->
<data:format name="abstract"
impl="org.apache.cocoon.data.handler.abstract.DefaultHandler"/>
<data:format name="image" extends="/abstract"
impl="org.apache.cocoon.data.handler.abstract.ImageHandler">
<data:param-def name="depth" default=""/>
<data:param-def name="width" default=""/>
<data:param-def name="height" default=""/>
</data:format>
<data:format name="music" extends="/abstract"
impl="org.apache.cocoon.data.handler.abstract.MusicHandler">
<data:param-def name="channels" default=""/>
</data:format>
<data:format name="sound" extends="/abstract"
impl="org.apache.cocoon.data.handler.abstract.SoundHandler">
<data:param-def name="samplesize" default=""/>
<data:param-def name="samplerate" default=""/>
<data:param-def name="channels" default=""/>
</data:format>
<!--
Multiple inheritance is used for video, which extends image and sound.
Is there a better way to specify multiple base formats? -->
<data:format name="video" extends="/abstract/image /abstract/sound"
impl="org.apache.cocoon.data.handler.abstract.VideoHandler">
<data:param-def name="framerate" default=""/>
</data:format>
<data:format name="vector" extends="/abstract"
impl="org.apache.cocoon.data.handler.abstract.VectorHandler">
<data:param-def name="unit" default=""/>
<data:param-def name="width" default=""/>
<data:param-def name="height" default=""/>
</data:format>
<data:format name="3d" extends="/abstract/vector"
impl="org.apache.cocoon.data.handler.abstract.3DHandler">
<data:param-def name="depth" default=""/>
</data:format>
<!-- #### Binary based data formats #### -->
<data:format name="binary"
impl="org.apache.cocoon.data.handler.binary.DefaultHandler">
<data:param-def name="endian" default="little"/>
</data:format>
<!-- MS OLE based data formats -->
<data:format name="ole" extends="/binary"
impl="org.apache.cocoon.data.handler.binary.ole.DefaultHandler"/>
<data:format name="msword" extends="/binary/ole"
impl="org.apache.cocoon.data.handler.binary.ole.MSWordHandler"/>
<data:format name="msexcel" extends="/binary/ole"
impl="org.apache.cocoon.data.handler.binary.ole.MSExcelHandler"/>
<!-- Linux ELF based data formats -->
<data:format name="elf" extends="/binary"
impl="org.apache.cocoon.data.handler.binary.elf.DefaultHandler">
<data:param-def name="architecture" default="x86"/>
</data:format>
<data:format name="executable" extends="/binary/elf"
impl="org.apache.cocoon.data.handler.binary.elf.ExecutableHandler"/>
<data:format name="shared" extends="/binary/elf"
impl="org.apache.cocoon.data.handler.binary.elf.SharedLibraryHandler"/>
<!-- #### Text based data formats #### -->
<data:format name="text"
impl="org.apache.cocoon.data.handler.text.DefaultHandler">
<data:param-def name="encoding" default="UTF-8"/>
<data:parameter name="mime-type" value="text/plain"/>
</data:format>
<data:format name="xml" extends="/text"
impl="org.apache.cocoon.data.handler.xml.DefaultHandler">
<!-- this handler deals with SAX events inside the pipeline -->
<data:param-def name="schema-type" default="xsd"/> <!-- other possible
values: dtd, rng, schematron, ... -->
<data:param-def name="schema" default=""/>
<data:parameter name="mime-type" value="text/xml"/>
</data:format>
<data:format name="xhtml" extends="/text/xml"
impl="org.apache.cocoon.data.handler.xml.XHTMLHandler">
<data:parameter name="mime-type" value="text/html"/>
<data:parameter name="schema"
value="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"/>
</data:format>
</data:formats>
It's just a first sketch, but I think you get the idea.
Above you can see the super data format 'any' and some abstract, binary and
text data formats, which show you how to specify inherited data formats. If no
extends="..." attribute is given, the format is automatically derived from the
data format 'any'.
References to data formats are made by using a path which identifies the
respective data format. This path is built by appending the data format name
to the path of the parent data format, separated by a slash. The
super data format is an exception to this rule and is just called 'any'. It
is not part of the path of derived data formats, to keep them shorter. It is
possible to use relative data format paths too. E.g. if a pipeline consumes
/text/xml and a converter generates XHTML from it, the converter can use
output-format="xhtml" instead of output-format="/text/xml/xhtml". The name
'any' is reserved for the super data format, and it is not allowed to
name derived data formats after it.
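The path rules above can be sketched in a few lines of Python. This is only an
illustration of my reading of the rules: the function name resolve_format and
the idea of resolving a relative path against the format of the enclosing
pipeline (or previous component) are assumptions, not part of the proposal.

```python
def resolve_format(ref, context="/"):
    """Turn a data format reference into an absolute data format path.

    `context` is the absolute path a relative reference is resolved against
    (assumed here to be the format consumed by the enclosing pipeline).
    """
    if ref in ("any", "none"):
        # Reserved names never become part of a path.
        return ref
    if ref.startswith("/"):
        # Already an absolute path.
        return ref
    # Relative path: append it to the context, separated by a slash.
    return context.rstrip("/") + "/" + ref


# A converter inside a pipeline that consumes /text/xml:
print(resolve_format("xhtml", context="/text/xml"))
```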
'none' is another reserved name, which is used if a pipeline does not consume
data (input-format="none") or produce data (output-format="none"). It is the
default for all pipelines, unless it is overridden by the pipelines or their
components.
The formats defined above can be referenced with the following strings:
- any
- /abstract/image
- /abstract/music
- /abstract/sound
- /abstract/video
- /abstract/vector
- /abstract/vector/3d
- /binary
- /binary/ole
- /binary/ole/msword
- /binary/ole/msexcel
- /binary/elf
- /binary/elf/executable
- /binary/elf/shared
- /text
- /text/xml
- /text/xml/xhtml
See section "16.1 Data Formats" for more examples.
One enhancement of this scheme might be useful: Specification of version
numbers or format variants.
One way might be to append the version number to the end separated by a slash,
but I think this will mix different concerns. My suggestion would be to
specify them by appending the version information in brackets as the
following shows:
- /text/xml/xhtml[1.0]
- /text/xml/xhtml[1.1]
Instead of:
- /text/xml/xhtml/1.0
- /text/xml/xhtml/1.1
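A minimal sketch of how such a bracketed version could be split off from the
path, assuming at most one version suffix at the very end of the reference.
The function name split_version and the error handling are hypothetical.

```python
import re

# A path without brackets, optionally followed by one "[version]" suffix.
_FORMAT_RE = re.compile(r"^(?P<path>[^\[\]]+)(?:\[(?P<version>[^\[\]]+)\])?$")


def split_version(ref):
    """Split '/text/xml/xhtml[1.1]' into ('/text/xml/xhtml', '1.1').

    The version part is None if the reference carries no bracketed suffix.
    """
    match = _FORMAT_RE.match(ref)
    if match is None:
        raise ValueError(f"malformed data format reference: {ref!r}")
    return match.group("path"), match.group("version")


print(split_version("/text/xml/xhtml[1.1]"))
print(split_version("/text/xml"))
```

Keeping the version out of the path means that compatibility checks on the
path itself never have to know about versions.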
5.2 Inheritance
---------------
A pipeline which consumes a certain data format can be fed with derived data
formats too.
Take the following pipeline as example:
<map:pipeline input-format="/text/xml">
...
</map:pipeline>
This pipeline would consume the data format "/text/xml/xhtml" without
problems, but would throw an exception if you fed it with the data format
"/text".
5.3 A word about MIME Types
---------------------------
If you ask me why I don't use the standardized MIME types (see [2]) to
specify data formats, I can give you the following reasons:
MIME types fulfill the requirements from above only partly. They support just
two levels of classification, and they are purpose-oriented. The data formats
I suggest, in contrast, are content-oriented (/text/xml/svg vs. image/svg-xml).
So the two serve different purposes.
I know the importance of supporting the MIME type standard, and so the
parameter 'mime-type' is part of the super data format 'any' and thus is
available for every other data format too. By specifying a certain data
format, you always have a MIME type associated; in the worst case the MIME
type from the super data format 'any' (application/octet-stream) is used.
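The fallback described above can be sketched as a parameter lookup that walks
up the inheritance chain. The table below copies a few formats and MIME types
from section 5.1; the dictionary layout and the name lookup_param are
assumptions made for this illustration, and multiple inheritance is ignored.

```python
# A small excerpt of the format hierarchy from section 5.1.
FORMATS = {
    "any": {"parent": None, "params": {"mime-type": "application/octet-stream"}},
    "/text": {"parent": "any", "params": {"mime-type": "text/plain"}},
    "/text/xml": {"parent": "/text", "params": {"mime-type": "text/xml"}},
    "/text/xml/xhtml": {"parent": "/text/xml", "params": {"mime-type": "text/html"}},
    "/binary": {"parent": "any", "params": {}},
    "/binary/elf": {"parent": "/binary", "params": {}},
}


def lookup_param(fmt, name):
    """Return the nearest value of `name`, searching from `fmt` up to 'any'."""
    while fmt is not None:
        entry = FORMATS[fmt]
        if name in entry["params"]:
            return entry["params"][name]
        fmt = entry["parent"]
    return None


print(lookup_param("/text/xml/xhtml", "mime-type"))  # defined on xhtml itself
print(lookup_param("/binary/elf", "mime-type"))      # falls back to 'any'
```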
5.4 Data Handlers
-----------------
I'm not yet sure what the data handlers should actually do, but I can think of
two options: either they define an interface which must be implemented by the
pipeline components that operate on a certain data format (do we need two
handlers here: input-handler and output-handler?), or they are concrete
components which can be used by the pipeline components to consume or produce
this data format. I think some discussion on this topic would be useful.
5.5 Data Format Determination
-----------------------------
In many cases I've written the input- and output-format along with the
pipeline components, but it is also possible to specify them in the
<map:components/> section, or implicitly by implementing a certain component
interface, and therefore to omit them in the pipeline.
Here's a suggested order of data format determination:
1. Input-/output-Format specified directly with a pipeline component
<map:produce type="uri" ref="docs/file.xml" output-format="/text/xml"/>
2. Input-/output-Format specified by the component declaration
<map:filters>
<map:filter name="prettyxml" input-format="/text/xml"
output-format="/text/xml" ... />
</map:filters>
3. Output-/input-Format specified by the previous or following pipeline
component
<map:produce type="uri" ref="docs/file.xhtml"
output-format="/text/xml/xhtml"/>
<!-- input- and output-format="/text/xml/xhtml" from previous pipeline
component -->
<map:filter type="prettyxml"/>
4. Input-/output-Format specified directly with a pipeline
<map:pipeline input-format="/text/xml" output-format="/text/xml">
<map:filter type="prettyxml"/>
...
</map:pipeline>
5. If nothing from above matches then assume "none".
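The order above boils down to "first specified value wins". A minimal sketch,
assuming each source yields either a format string or nothing; the function
and parameter names are hypothetical:

```python
def determine_format(on_use=None, on_declaration=None,
                     from_neighbour=None, on_pipeline=None):
    """Pick a component's data format following steps 1 to 5 above.

    on_use          -- format given directly on the pipeline component (1)
    on_declaration  -- format given in the component declaration (2)
    from_neighbour  -- format of the previous/following component (3)
    on_pipeline     -- format given on the enclosing pipeline (4)
    """
    for candidate in (on_use, on_declaration, from_neighbour, on_pipeline):
        if candidate is not None:
            return candidate
    return "none"  # step 5: nothing matched


# The 'prettyxml' filter placed after a producer of /text/xml/xhtml, with no
# format of its own and no declared format:
print(determine_format(from_neighbour="/text/xml/xhtml",
                       on_pipeline="/text/xml"))
```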
6 Pipeline Components
=====================
Now that we have the big picture of the pipelines and a flexible way to specify
the data formats which flow through them, we can move on to specifying the
pipeline components.
To allow a fresh and clean design, forget all known pipeline components like
generators, transformers, serializers, ... and what you know about their
functionality. I'll use the same names where this makes sense, but keep in
mind that we are not only talking about processing XML data, so their
functionality may be different.
Currently Cocoon pipeline components all work with XML data. In this
proposal the components are meant to process any data format available, and
I'm sure you'll agree that great care has to be taken to manage the huge
amount of possible pipeline components. One problem here is the flat
specification of component names. As a solution I'd suggest using
hierarchical path names for components and grouping related
components under the same path.
6.1 Producers
-------------
They simply produce a data stream, possibly by reading data from a data
repository. Producers are used if no data is consumed from the pipeline and
are usually placed at the beginning of a pipeline.
Component definition:
<map:producers default="uri">
...
<!--
The following producer is similar to the old file generator but can produce
any data format.
I renamed 'file' to 'uri' since it does not only read files, but any
resource which can be expressed by a URI and whose protocol is known.
-->
<map:producer name="uri"
impl="org.apache.cocoon.pipeline.producer.URIProducer" output-format="any"/>
<!-- The next producer might be identical to the old file generator. -->
<map:producer name="xml/uri"
impl="org.apache.cocoon.pipeline.producer.xml.URIProducer"
output-format="/text/xml"/>
...
</map:producers>
Usage examples:
<map:produce type="uri" output-format="/binary/ole/msword"
ref="docs/{1}.doc"/>
<map:produce type="xml/uri" ref="xmldb:xindice://localhost:4080/db/{1}"/>
6.2 Consumers
-------------
They consume a data stream, possibly by writing it to a data repository.
Consumers are used if no data should be produced by the pipeline and are
usually placed at the end of a pipeline.
For a typical use of consumers in a web environment, some result has to be
sent back to the client. Here I'd suggest using <map:redirect-to/> to redirect
to another pipeline (perhaps depending on the result of the consumer ->
success/error).
Component definition:
<map:consumers default="uri">
...
<map:consumer name="uri"
impl="org.apache.cocoon.pipeline.consumer.URIConsumer" input-format="any"/>
<map:consumer name="xml/uri"
impl="org.apache.cocoon.pipeline.consumer.xml.URIConsumer"
input-format="/text/xml"/>
<map:consumer name="http/response"
impl="org.apache.cocoon.pipeline.consumer.http.ResponseConsumer"
input-format="/text/xml"/>
...
</map:consumers>
Usage example (with redirection):
<map:consume type="xml/uri" ref="xmldb:xindice://localhost:4080/db/{1}"/>
<!-- map:branch is explained below under "Branches" -->
<map:branch type="status">
<map:case match="success">
<map:redirect-to ref="success-page"/>
</map:case>
<map:default>
<map:redirect-to ref="error-page"/>
</map:default>
</map:branch>
6.3 Converters
--------------
They convert a data stream from one data format into another one.
Component definition:
<map:converters default="http/response">
...
<map:converter name="http/response"
impl="org.apache.cocoon.pipeline.converter.http.ResponseConverter"
input-format="any" output-format="/text/http/response"/>
<map:converter name="xhtml2html"
impl="org.apache.cocoon.pipeline.converter.xml.XHTML2HTMLConverter"
input-format="/text/xml/xhtml" output-format="/text/sgml/html"/>
...
</map:converters>
This example converts XHTML to HTML:
<map:convert type="xhtml2html"/>
This example converts any data format to an HTTP response (without delivering
it; this is the task of the consumer "http/response"!):
<map:convert type="http/response"/>
6.4 Filters
-----------
They modify a data stream while keeping the data format.
Component definition:
<map:filters default="xml/xslt">
...
<map:filter name="xml/xslt"
impl="org.apache.cocoon.pipeline.filter.XSLTFilter" input-format="/text/xml"
output-format="/text/xml"/>
<!-- unix grep (regular expression filter) -->
<map:filter name="text/grep"
impl="org.apache.cocoon.pipeline.filter.text.GrepFilter" input-format="/text"
output-format="/text"/>
<!-- unix wc (word count) -->
<map:filter name="text/wc"
impl="org.apache.cocoon.pipeline.filter.text.WordCount" input-format="/text"
output-format="/text"/>
...
</map:filters>
Usage examples:
<map:filter type="xml/xslt" ref="stylesheets/news2page.xsl"/>
<map:filter type="xml/xslt" ref="stylesheets/page2xhtml.xsl"
output-format="/text/xml/xhtml"/>
<map:filter type="text/grep">
<map:parameter name="pattern" value="my grep pattern"/>
</map:filter>
<map:filter type="text/wc">
<map:parameter name="mode" value="linecount"/>
</map:filter>
The second filter might look like a converter to you, but its output format is
still compatible with "/text/xml" ("/text/xml/xhtml" is derived from
"/text/xml"), and thus it can be treated as a filter.
Theoretically you can do the work of a filter with a converter, but
that is often not what people intend to do. Why should they use a converter
when they want to filter the data? Practically, a filter is a special case of
a converter where the input- and output-format are equivalent. So a filter
with the data format "/text/xml" might be just an alias
for <map:convert input-format="/text/xml" output-format="/text/xml" .../>
which keeps the sitemap simpler to understand.
6.5 Aggregators
---------------
They aggregate multiple data streams of the same format into one data stream.
There can be multiple implementations of aggregators just like this is the
case for producers.
Component definition:
<map:aggregators default="append">
...
<map:aggregator name="append"
impl="org.apache.cocoon.pipeline.aggregator.AppendAggregator"
input-format="any" output-format="any"/>
<map:aggregator name="sound/mixer"
impl="org.apache.cocoon.pipeline.aggregator.sound.MixerAggregator"
input-format="/abstract/sound" output-format="/abstract/sound"/>
...
</map:aggregators>
Here's an example, how to aggregate different sound tracks into one:
<map:aggregate type="sound/mixer">
<!-- All parts have the same output-format ("/abstract/sound") -->
<map:part ref="song/drums">
<map:parameter name="volume" value="0.8"/>
</map:part>
<map:part ref="song/keyboard">
<map:parameter name="volume" value="0.7"/>
</map:part>
<map:part ref="song/guitar">
<map:parameter name="volume" value="0.8"/>
</map:part>
<map:part ref="song/bass">
<map:parameter name="volume" value="0.7"/>
</map:part>
<map:part ref="song/voice">
<map:parameter name="volume" value="1.0"/>
</map:part>
</map:aggregate>
6.6 Actions
-----------
They are somewhat similar to the actions already existing in Cocoon. They
neither produce data nor consume data and therefore don't directly affect the
data stream. They only affect the way the pipeline components work.
6.7 Redirectors
---------------
They are the same as those already existing in Cocoon, with the exception
that the attribute 'uri' is renamed to 'ref' for consistency.
Example:
<map:redirect-to ref="redirected-page"/>
6.8 Matchers
------------
They have practically the same functionality as in current Cocoon. I'd suggest
one extension though, to provide a kind of polymorphism for URLs. This way it's
possible to write pipelines for different input data formats while using
identical URLs.
Component definition:
<map:matchers default="wildcard">
<map:matcher name="wildcard"
impl="org.apache.cocoon.pipeline.matcher.WildcardURIMatcher"/>
...
</map:matchers>
Example with polymorphic URI matching:
<map:pipeline input-format="/text/xml">
<map:match pattern="upload/*">
<map:consume ref="xmldb:xindice://localhost:4080/db/{1}"/>
</map:match>
</map:pipeline>
<map:pipeline input-format="/binary">
<map:match pattern="upload/*">
<map:consume ref="files/binaries/{1}"/>
</map:match>
</map:pipeline>
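How the right pipeline could be picked for such a polymorphic URL can be
sketched as follows, combining the URI pattern with the inheritance rule from
section 5.2. Everything here is an assumption for illustration: the names
accepts and select_pipeline, the list of dictionaries, and the use of Python's
fnmatch (whose '*' also matches '/', which a real wildcard matcher may not do).

```python
from fnmatch import fnmatchcase

# The two pipelines from the example above, reduced to what matters here.
PIPELINES = [
    {"input_format": "/text/xml", "pattern": "upload/*",
     "target": "xmldb:xindice://localhost:4080/db/{1}"},
    {"input_format": "/binary", "pattern": "upload/*",
     "target": "files/binaries/{1}"},
]


def accepts(accepted, offered):
    """True if `offered` is `accepted` itself or derived from it (by path)."""
    if accepted == "any":
        return True
    return offered == accepted or offered.startswith(accepted + "/")


def select_pipeline(uri, data_format):
    """Return the first pipeline whose input format and pattern both match."""
    for pipeline in PIPELINES:
        if (accepts(pipeline["input_format"], data_format)
                and fnmatchcase(uri, pipeline["pattern"])):
            return pipeline
    return None


print(select_pipeline("upload/report", "/text/xml/xhtml")["target"])
print(select_pipeline("upload/tool", "/binary/elf/executable")["target"])
```

A format which is derived from neither "/text/xml" nor "/binary" (for example
"/text") selects no pipeline here; in the proposal this would probably be a
case for a data-format exception.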
6.9 Branches
------------
They affect the path the data stream takes through the pipeline. Branches are
somewhat similar to selectors, but they are more like the control structures
in Java (if, switch, ...). Matching works similarly to <map:match/>
constructs.
The expression you want to test is given by the attribute 'test'. The
type of test is specified by the attribute 'type', where 'xpath' may be the
most useful type and therefore the default. You can use other types like
'browser' for browser-dependent branching.
The following example tests one value and compares it to different cases to
determine the right choice. The cases are tested in order and every matching
case is executed (depending on the attribute 'continue'). If no case matches,
then the <map:default/> path is taken, if available. The case matcher can be
compared to the <map:match/> component, thus different pattern types are
possible (pattern, regexp, ...).
The <map:branch> element uses several attributes which are explained below:
- type: Type of branch to use
- test: Information about what should be used for branching
- data-type: XML Schema based data type (see [3]) for correct comparison
(esp. for dates)
- continue: Specifies, if matching should be continued after a successful
match
Component definition:
<map:branches default="value">
<!-- This selector uses the value of the attribute 'test' for branching -->
<map:selector name="value"
impl="org.apache.cocoon.pipeline.branch.ValueBranch"/>
<!-- This selector uses the user agent string for branching -->
<map:selector name="browser"
impl="org.apache.cocoon.pipeline.branch.BrowserBranch"/>
<!-- This selector uses an XPath expression for branching -->
<map:selector name="xpath"
impl="org.apache.cocoon.pipeline.branch.XPathBranch"/>
<!-- This selector uses the error status of the last called component for
branching -->
<map:selector name="status"
impl="org.apache.cocoon.pipeline.branch.StatusBranch"/>
...
</map:branches>
Example:
<map:branch type="xpath" test="/document/metadata/status"
data-type="xsd:string" continue="false">
<map:case match="archive">
<map:consume type="uri"
ref="xmldb:xindice://localhost:4080/db/archive/{1}"/>
</map:case>
<map:case match="live">
<map:consume type="uri" ref="xmldb:xindice://localhost:4080/db/live/{1}"/>
</map:case>
<map:default>
<map:consume type="uri" ref="xmldb:xindice://localhost:4080/db/draft/{1}"/>
</map:default>
</map:branch>
The next example allows more flexible tests by specifying a different condition
in the attribute 'test' for every case. Theoretically it's possible
that multiple case statements match. You can control the behavior with the
attribute 'continue', which by default is 'false' and means that the first
matching case gets executed and the <map:branch>...</map:branch> block is
left. If you set it to 'true', then after executing this case the
<map:branch/> block is not left, and the following case statements are
evaluated too. The level of granularity is left up to you: you can set
'continue' directly on the <map:branch> element, thus setting the default
behavior for all <map:case> elements. Additionally you can set it on certain
<map:case> statements which should be treated specially.
<map:produce ... output-format="/text/xml"/>
<map:branch>
<map:case type="xpath" test="/document/metadata/online-date &lt; date()"
continue="true" data-type="xsd:date">
<map:consume type="uri" ref="xmldb:xindice://localhost:4080/db/live/{1}"/>
</map:case>
<map:case ...>
...
</map:case>
</map:branch>
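The 'continue' semantics described above can be sketched like this. It is a
simplified model: each case is reduced to a triple of "did it match", a label
for what it would execute, and its own 'continue' setting (None meaning "use
the default from <map:branch>"). The function name run_branch is hypothetical.

```python
def run_branch(cases, default=None, continue_default=False):
    """Return the labels of the cases that would be executed, in order.

    cases            -- list of (matched, label, continue_flag_or_None)
    default          -- label of the <map:default/> path, if there is one
    continue_default -- value of 'continue' on the <map:branch> element
    """
    executed = []
    for matched, label, case_continue in cases:
        if not matched:
            continue
        executed.append(label)
        # A setting on the case overrides the default of the branch.
        keep_going = continue_default if case_continue is None else case_continue
        if not keep_going:
            break
    if not executed and default is not None:
        executed.append(default)
    return executed


# Two matching cases, 'continue' left at its default of false:
print(run_branch([(True, "live", None), (True, "archive", None)]))
# The first case sets continue="true", so the second one is evaluated too:
print(run_branch([(True, "live", True), (True, "archive", None)]))
```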
6.10 Exceptions
---------------
If some error occurs in the pipeline, you can throw and catch exceptions. This
is necessary, since the introduction of data formats can cause problems when
feeding a pipeline with the wrong data format. But there are many other
cases where exception handling in the sitemap can be useful. To make it
easier to understand, I'll base them on the Java exceptions.
To throw an exception you can use <map:throw type="some type" message="some
message"/> where type stands for an exception type and message for an optional
description of the exception. If you have to pass values to the exception
you want to throw, you can use <map:parameter name="..." value="..."/> inside
the <map:throw>...</map:throw> block. The exception can then be caught with
<map:catch type="some type">...</map:catch> which can be located in different
scopes as you can see below.
Component definition:
<map:exceptions>
<map:exception name="data-format"
impl="org.apache.cocoon.pipeline.exception.DataFormatException"/>
...
</map:exceptions>
The order in which the scopes of the exception handlers are searched can be
seen from the following examples:
1. Local exception handlers
<map:pipeline>
<map:match pattern="exception-test">
...
<map:throw type="sometype" message="This is a message explaining the
error."/>
...
<map:catch type="sometype">
...
</map:catch>
</map:match>
</map:pipeline>
2. Pipeline exception handlers
<map:pipeline>
<map:match pattern="exception-test">
...
<map:throw type="sometype" message="This is a message explaining the
error."/>
...
</map:match>
...
<map:exception-handlers>
<map:catch type="sometype">
...
</map:catch>
</map:exception-handlers>
</map:pipeline>
3. Global exception handlers
<map:pipeline>
<map:match pattern="exception-test">
...
<map:throw type="sometype" message="This is a message explaining the
error."/>
...
</map:match>
</map:pipeline>
<map:exception-handlers>
<map:catch type="sometype">
...
</map:catch>
</map:exception-handlers>
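The search order shown by the three examples can be sketched as a lookup from
the innermost to the outermost scope. The function name find_handler and the
dictionaries mapping exception types to handlers are assumptions made for this
illustration only.

```python
def find_handler(exc_type, local, pipeline, global_):
    """Search the handler scopes in the order local, pipeline, global.

    Each scope is a dict mapping an exception type to its handler.
    Returns (scope_name, handler), or None if nobody catches the exception.
    """
    scopes = (("local", local), ("pipeline", pipeline), ("global", global_))
    for scope_name, handlers in scopes:
        if exc_type in handlers:
            return scope_name, handlers[exc_type]
    return None


# 'sometype' is caught on the pipeline level and globally; the pipeline wins.
print(find_handler("sometype",
                   local={},
                   pipeline={"sometype": "pipeline-error-page"},
                   global_={"sometype": "global-error-page"}))
```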
7 Protocol Independence
=======================
Currently Cocoon is tightly bound to certain protocols by running an instance
of it in a certain environment (servlet, CLI), and it's not (easily) possible
to handle different invocation protocols from the same instance. To abstract
the transport protocols (through the use of certain consumers or producers) we
already have a good working base. What is missing is binding a protocol to a
certain port, but we should not duplicate work here which is better left to
other software like Apache or Tomcat. We just need to find a way (which I'm
sure already exists somewhere) to serve different ports with different
protocols. I think the Servlet specification is general enough to support more
than HTTP/HTTPS and can help us here.
Assuming that we have solved the port binding issue, we need some
abstraction of the transport protocol. What I mean here is that I'd like to
use pipelines independently of the way the request has been sent to Cocoon
and how the response has to be sent back to the client.
To solve this we need something like a protocol handler, which maps requests
from certain protocols to certain pipelines. The mapping itself is a very
abstract thing and heavily depends on the protocol used.
Let's assume we have even solved the protocol handler issue. I'd like to sketch
some possible use cases below before we continue.
7.1 Web Services
----------------
As many of you know, there are two popular styles of Web Services:
SOAP and REST.
Both have their own advantages and disadvantages, but I'd like to concentrate
on SOAP and on its transport protocol independence, because REST-style Web
Services are already possible with Cocoon.
SOAP allows us to use any transport protocol to deliver SOAP messages. Mostly
HTTP(S) is used for this, but there are many cases where you have to use
other protocols (like SMTP, FTP, ...).
Whatever protocol you choose to invoke your Web Services, the result should
always be the same, and the response should be delivered back through (mostly)
the same protocol. This is one of the greatest advantages of protocol
independence.
What you want to do now is to implement the Web Service as a bunch of
pipelines and let the protocol handler be responsible for invoking the same
pipeline no matter which protocol has been used.
7.2 Mail Server
---------------
Nothing prevents you from implementing a mail server which integrates
various data sources and exposes its functionality via the
traditional protocols (SMTP, POP, IMAP), but also via HTTP, WAP, as a Web
Service, and whatever else you want.
7.3 Mailing List Manager
------------------------
Mailing list managers typically provide several functions (subscribe,
unsubscribe, deliver mail, suspend, archive, search, ...) and manage a list
of subscribed users. Once again, you can write such a service once and expose
its functionality through traditional protocols (HTTP, SMTP, ...), but also
as a Web Service.
7.4 What else?
--------------
Perhaps you realize that this way you are free to implement any application
you want using the easy declarative pipeline processing concept. How to
connect your application to the outside world is a separate issue which you
can decide later and specify independently of the application.
8 Protocol Handler
==================
This component has been mentioned several times now, so it is time to try to
explain it in more detail.
Currently Cocoon pipelines are primarily written for HTTP communication. A
request is sent from a client to the server and enters a certain pipeline via
the <map:match/> statements. The end of a pipeline always generates the
response which is sent back to the client. As you can see, even if you can
theoretically run Cocoon in several environments, the servlet environment
with the HTTP(S) protocol is the one which is used in most cases. So most
pipelines are dependent on the HTTP protocol.
I'd suggest introducing an abstraction layer between direct pipeline
invocation and the request from the client through a certain protocol. I'll
try my best to make this as clear as possible ...
8.1 Component Definition
------------------------
Let's begin by defining the protocol handlers in the <map:components/>
section:
<map:protocols default="http">
<map:protocol name="http" impl="org.apache.cocoon.protocol.HTTPProtocol"/>
<map:protocol name="https" impl="org.apache.cocoon.protocol.HTTPSProtocol"/>
<map:protocol name="ftp" impl="org.apache.cocoon.protocol.FTPProtocol"/>
<map:protocol name="smtp" impl="org.apache.cocoon.protocol.SMTPProtocol"/>
<map:protocol name="pop3" impl="org.apache.cocoon.protocol.POP3Protocol"/>
<map:protocol name="imap" impl="org.apache.cocoon.protocol.IMAPProtocol"/>
...
</map:protocols>
8.2 Protocol Binding
--------------------
After we have all possible protocols defined, we have to bind them to certain
ports.
Here I'd suggest the following:
<map:bindings>
<map:bind protocol="http" port="80"/>
<map:bind protocol="http" port="8080"/>
<map:bind protocol="ftp" port="21"/>
<map:bind protocol="https" port="443"/>
<map:bind protocol="smtp" port="25"/>
<map:bind protocol="pop3" port="110"/>
<map:bind protocol="pop3s" port="995"/>
<map:bind protocol="imap" port="143"/>
<map:bind protocol="imaps" port="993"/>
</map:bindings>
Tomcat, for example, already does such kind of binding in the config file
server.xml. Perhaps we don't really need this protocol mapping in Cocoon, but
we should check first, if we can get all the information we need from the
servlet container in a portable way (without depending on Tomcat!).
8.3 The Handler's Task
----------------------
Well, what does a protocol handler actually do?
First it knows how to communicate with a certain protocol. That's obviously
the most important thing but that's not enough for us.
The second task is to determine which pipeline has to be invoked. It does this
on the basis of the information it gets from the request and decides by the
use of certain mapping rules which pipeline has to be invoked.
The third task is to automatically provide a producer or consumer, depending
on the request or response and the pipeline which has to be invoked.
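The three tasks above could be sketched in Python roughly as follows. This is only an illustration, not Cocoon code: the class and method names (ProtocolHandler, parse, select_pipeline, wrap), the toy request formats and the PIPELINES registry are all assumptions made up for this sketch.

```python
from typing import Callable, Dict

# Hypothetical registry: pipeline name -> function from input data to output data.
PIPELINES: Dict[str, Callable[[str], str]] = {
    "news/today": lambda data: "<news>" + data + "</news>",
}


class ProtocolHandler:
    """Base class for the three tasks; every name here is illustrative."""

    def parse(self, raw: str) -> Dict[str, str]:
        """Task 1: understand the protocol (raw request -> target and body)."""
        raise NotImplementedError

    def select_pipeline(self, request: Dict[str, str]) -> str:
        """Task 2: decide which pipeline to invoke (identity mapping here)."""
        return request["target"]

    def wrap(self, result: str) -> str:
        """Task 3: act as the protocol-specific consumer for the result."""
        raise NotImplementedError

    def handle(self, raw: str) -> str:
        request = self.parse(raw)
        pipeline = PIPELINES[self.select_pipeline(request)]
        return self.wrap(pipeline(request["body"]))


class HttpLikeHandler(ProtocolHandler):
    def parse(self, raw: str) -> Dict[str, str]:
        # Toy request format: "GET /news/today", blank line, body.
        head, _, body = raw.partition("\n\n")
        return {"target": head.split(" ", 1)[1].lstrip("/"), "body": body}

    def wrap(self, result: str) -> str:
        return "200 OK\n\n" + result


class MailLikeHandler(ProtocolHandler):
    def parse(self, raw: str) -> Dict[str, str]:
        # Toy mail format: a "Cocoon-Pipeline: <name>" header, blank line, body.
        head, _, body = raw.partition("\n\n")
        target = head.split("Cocoon-Pipeline: ", 1)[1].splitlines()[0]
        return {"target": target, "body": body}

    def wrap(self, result: str) -> str:
        return "Subject: result\n\n" + result
```

Both handlers end up invoking the same "news/today" pipeline; only the parsing of the request and the packaging of the response differ per protocol.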
8.4 Mapping to Pipelines
------------------------
Mapping a request from a certain protocol to a certain pipeline can be a
difficult task and depends heavily on the protocol itself. So I can only give
you an example of a possible solution.
<map:mappings>
<map:protocol name="http">
<!-- maps the URI of all http requests directly to all pipelines -->
<map:map type="request-uri" from="**" to="**"/>
<map:pipeline type="request"> <!-- The components of this pipeline are
executed before the sitemap pipeline components -->
<map:produce type="http/request" output-format="/text/http/request"/>
<map:convert type="http/request2any" input-format="/text/http/request"
output-format="any"/>
</map:pipeline>
<map:pipeline type="response"> <!-- The components of this pipeline are
executed after the sitemap pipeline components -->
<map:convert type="http/any2response" input-format="any"
output-format="/text/http/response"/>
<map:consume type="http/response" input-format="/text/http/response"/>
</map:pipeline>
</map:protocol>
<map:protocol name="smtp">
<!-- maps content of the mail header "Cocoon-Pipeline" directly to all
pipelines -->
<map:map type="header" from="Cocoon-Pipeline: **" to="post/**"/>
<map:pipeline type="post"> <!-- The components of this pipeline are
executed after the sitemap pipeline components -->
<map:convert type="smtp/any2post" input-format="any"
output-format="/text/smtp"/>
<map:consume type="smtp" input-format="/text/smtp"/>
</map:pipeline>
</map:protocol>
<map:protocol name="pop3">
<!-- maps content of the mail header "Cocoon-Pipeline" directly to all
pipelines -->
<map:map type="header" from="Cocoon-Pipeline: **" to="**"/>
<map:pipeline type="deliver"> <!-- The components of this pipeline are
executed before the sitemap pipeline components -->
<map:produce type="pop3" output-format="/text/pop[3]"/>
<map:convert type="pop3/pop2any" input-format="/text/pop[3]"
output-format="any"/>
</map:pipeline>
</map:protocol>
<map:protocol name="ftp">
<!-- maps the upload of a file under /home/ftp-user/upload/ to the
pipelines starting with "upload/" -->
<map:map type="put" from="/home/ftp-user/upload/**" to="upload/**"/>
<!-- maps the download of a file under /home/ftp-user/ directly to all
pipelines -->
<map:map type="get" from="/home/ftp-user/**" to="**"/>
<map:pipeline type="put"> <!-- The components of this pipeline are executed
before the sitemap pipeline components -->
<map:produce type="ftp-put" output-format="/text/ftp/put"/>
</map:pipeline>
<map:pipeline type="get"> <!-- The components of this pipeline are executed
after the sitemap pipeline components -->
<map:consume type="ftp-get" input-format="/text/ftp/get"/>
</map:pipeline>
</map:protocol>
</map:mappings>
The only thing I don't like here is the use of <map:map/>, because I'm sure
that this will cause misunderstandings. I'd suggest using another namespace
prefix.
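To make the mapping rules a bit more concrete, here is a minimal Python sketch of how the ftp rules above might be evaluated. It assumes that every "from" pattern ends in a single trailing "**" which captures the rest of the value and is substituted into the "to" pattern; the function name map_to_pipeline and the tuple layout are invented for this sketch, and a real handler would need richer pattern matching.

```python
from typing import List, Optional, Tuple

# (request type, from pattern, to pattern), copied from the ftp example above.
FTP_RULES: List[Tuple[str, str, str]] = [
    ("put", "/home/ftp-user/upload/**", "upload/**"),
    ("get", "/home/ftp-user/**", "**"),
]


def map_to_pipeline(
    rules: List[Tuple[str, str, str]], request_type: str, value: str
) -> Optional[str]:
    """Return the pipeline path for the first matching rule, or None.

    Assumption: each source pattern ends in one trailing "**" that captures
    the remainder of the value and replaces "**" in the target pattern.
    """
    for rule_type, source, target in rules:
        if rule_type != request_type or not source.endswith("**"):
            continue
        prefix = source[:-2]
        if value.startswith(prefix):
            return target.replace("**", value[len(prefix):], 1)
    return None
```

With these rules, an upload of /home/ftp-user/upload/news/a.html would be routed to the pipeline "upload/news/a.html", while a download of /home/ftp-user/docs/x.xml would be routed to "docs/x.xml".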
9 Pipelines as Pipeline Components
==================================
Based on the assumptions made so far we can define rules for pipelines, which
implicitly turn them into pipeline components themselves:
9.1 Producer Pipelines
----------------------
Pipelines which produce data and don't consume anything are called producer
pipelines. The following example produces data in the format "/text/xml", but
does not consume any data, so it must have a producer component at the
beginning of the pipeline but no consumer at the end.
Example:
<map:pipeline output-format="/text/xml">
<map:match pattern="producer-pipeline">
<map:produce ... />
...
</map:match>
</map:pipeline>
You can use this pipeline as a producer in other pipelines by writing:
<map:produce ref="cocoon:/producer-pipeline"/>
9.2 Consumer Pipelines
----------------------
Pipelines which consume data and don't produce data are called consumer
pipelines. The following example consumes data in the format "/text/xml", but
does not produce any data, so it must have a consumer component at the end of
the pipeline but no producer at the beginning.
Example:
<map:pipeline input-format="/text/xml">
<map:match pattern="consumer-pipeline">
...
<map:consume ... />
</map:match>
</map:pipeline>
You can use this pipeline as a consumer in other pipelines by writing:
<map:consume ref="cocoon:/consumer-pipeline"/>
9.3 Converter Pipelines
-----------------------
Pipelines which consume a certain data format and produce a certain
(different) data format are called converter pipelines. The following example
converts data from the format "/text/xml/xhtml" to "/text/sgml/html", so it
neither has a producer at the beginning of the pipeline nor a consumer at the
end of the pipeline.
Example:
<map:pipeline input-format="/text/xml/xhtml" output-format="/text/sgml/html">
<map:match pattern="converter-pipeline">
...
</map:match>
</map:pipeline>
You can use this pipeline as a converter in other pipelines by writing:
<map:convert ref="cocoon:/converter-pipeline"/>
9.4 Filter Pipelines
--------------------
Pipelines which consume a certain data format and produce the same (or a
compatible) data format are called filter pipelines. The following example
filters data with the format "/text/xml", so it neither has a producer at the
beginning of the pipeline nor a consumer at the end of the pipeline.
Example:
<map:pipeline input-format="/text/xml" output-format="/text/xml">
<map:match pattern="filter-pipeline">
...
</map:match>
</map:pipeline>
You can use this pipeline as a filter in other pipelines by writing:
<map:filter ref="cocoon:/filter-pipeline"/>
9.5 Action Pipelines
--------------------
Pipelines which neither consume nor produce data are called action pipelines.
They can produce data internally through a producer and consume it again with
a consumer, but no data from outside of the pipeline is flowing in or out.
Example:
<map:pipeline>
<map:match pattern="action-pipeline">
<map:produce ... />
...
<map:consume ... />
</map:match>
</map:pipeline>
You can use this pipeline as an action in other pipelines by writing:
<map:act ref="cocoon:/action-pipeline"/>
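The five rules of this section can be summarised in a small Python sketch that classifies a pipeline purely by its declared formats. The function name pipeline_kind is made up, None stands for a missing attribute, and "compatible" is assumed here to mean "equal or derived from", which may be narrower than what is intended above. The special format "any" is only handled through plain equality.

```python
from typing import Optional


def pipeline_kind(input_format: Optional[str], output_format: Optional[str]) -> str:
    """Classify a pipeline by its declared formats (None = attribute absent)."""
    if input_format is None and output_format is None:
        return "action"
    if input_format is None:
        return "producer"
    if output_format is None:
        return "consumer"
    # Assumption: "compatible" means the output format is the input format
    # itself or lies below it in the data format hierarchy.
    if output_format == input_format or output_format.startswith(input_format + "/"):
        return "filter"
    return "converter"
```

For the examples in this section, (None, "/text/xml") gives "producer", ("/text/xml", None) gives "consumer", ("/text/xml/xhtml", "/text/sgml/html") gives "converter" and ("/text/xml", "/text/xml") gives "filter".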
10 Configuration Files
======================
With so many new sitemap declarations it is hard to keep the sitemap
manageable. To solve this problem I'd suggest splitting it up into different
files, each of which deals with a separate concern.
10.1 cocoon.xconf
-----------------
This configuration file has the same functionality as in current Cocoon
versions. Its main purpose is to register and configure Avalon components.
10.2 components.xconf
---------------------
In this file all the pipeline components are defined (see section "6 Pipeline
Components").
It uses its own namespace (e.g. http://apache.org/cocoon/component/1.0).
10.3 protocols.xconf
--------------------
In this file all the protocols are defined (see section "8 Protocol Handler").
It uses its own namespace (e.g. http://apache.org/cocoon/protocol/1.0).
10.4 bindings.xconf
-------------------
In this file all the protocol port bindings are defined (see section "8
Protocol Handler").
It uses its own namespace (e.g. http://apache.org/cocoon/binding/1.0).
10.5 protocol-mappings.xconf
----------------------------
In this file the mapping to sitemap pipelines are defined (see section "8
Protocol Handler").
It uses its own namespace (e.g. http://apache.org/cocoon/mapping/1.0).
10.6 data-formats.xconf
-----------------------
In this file all the data formats are defined (see section "5 Data Formats").
It uses its own namespace (e.g. http://apache.org/cocoon/format/1.0).
10.7 sitemap.xmap
-----------------
This file holds all the pipelines (see section "6 Pipeline Components").
It uses its own namespace (e.g. http://apache.org/cocoon/sitemap/3.0).
To be more flexible, the content of the configuration files can also be placed
inside the sitemap. This makes things easier for small sitemaps. For large
sitemaps I'd suggest using references to those files instead, to keep the
configuration manageable. This way you can even share the same files between
different sitemaps just by referencing the same config file.
Here's a rough sketch of the structure of sitemap.xmap:
<map:sitemap xmlns:map="http://apache.org/cocoon/sitemap/3.0">
...
<map:components> <!-- optional: ref="components.xconf" -->
<map:protocols ref="protocols.xconf"/>
<map:bindings ref="bindings.xconf" />
<map:formats ref="formats.xconf" />
<map:mappings ref="mappings.xconf" />
<map:producers ... />
<map:consumers ... />
<map:converters ... />
<map:filters ... />
<map:exceptions ... />
</map:components>
...
</map:sitemap>
All sub elements of <map:components> can place their configuration directly as
sub elements inside the sitemap or can be swapped out to external files which
are referenced by the ref="..." attribute.
I'm still unsure if we should really place everything below <map:components>,
since there are some configurations involved which don't specify new
components (e.g. bindings and mappings). Perhaps we can find a more
meaningful element name or split it up into different sections. Let's see
what some discussion on this topic will bring us ...
10.8 Config File Hierarchy
--------------------------
Here's an overview of the config file hierarchy as it looks for now:
cocoon.xconf (references the main sitemap.xmap with the treeprocessor
declaration)
|
+-sitemap.xmap
  |
  +-components.xconf
    |
    +-protocols.xconf
    |
    +-bindings.xconf
    |
    +-mappings.xconf
    |
    +-formats.xconf
    |
    +-producers.xconf
    |
    +-consumers.xconf
    |
    +-converters.xconf
    |
    +-filters.xconf
    |
    +-exceptions.xconf
11 Converting old sitemaps to new sitemaps
==========================================
Some of you might wonder whether this new concept is flexible enough to
provide at least the same functionality as Cocoon does today. I'll give you
some examples of how old pipeline components can be translated to the new
pipeline components.
The most important thing to remember is that all of the old pipeline
components (except the reader) work with the data format "/text/xml" or
derived formats. So theoretically the old implementation of these
components does not differ very much from their new implementation.
11.1 Generators
---------------
This is simply a producer which takes no input data and produces the
output-format "/text/xml".
Here's an example:
<map:generate type="file" src="doc/{1}.xml"/>
Maps to:
<map:produce type="uri" ref="doc/{1}.xml" output-format="/text/xml"/>
You can also think of an XMLProducer, where the output-format is implicitly
set to "/text/xml", so you don't have to provide it every time you use the
producer. Of course this applies to all other components too.
11.2 Transformers
-----------------
They simply consume XML and produce XML, so they are actually filters.
Here's an example:
<map:transform type="xslt" src="stylesheets/news2xhtml.xsl"/>
Maps to:
<map:filter type="xml/xslt" ref="stylesheets/news2xhtml.xsl"/>
Since filters don't change the data format, you don't need to specify the
input- and output-format, because they are either specified implicitly in the
component definition, or default to the input/output-format of the
surrounding pipeline components.
11.3 Readers
------------
They simply read a file and deliver it, so they are actually producers.
Here's an example:
<map:read src="welcome/cocoon.gif" mime-type="image/gif"/>
Maps to:
<map:produce ref="welcome/cocoon.gif" output-format="/binary/gif"/>
NOTE 1:
The MIME type is implicitly contained in every data format. So the
output-format "/binary/gif" results in the MIME type "image/gif".
NOTE 2:
There's one difference between the reader and the producer concerning the
delivering of resources. The reader actually delivered them after reading,
which is not the case with the producer. This is actually done automatically
by the protocol handler which appends certain (configurable) pipeline
components to consumer pipelines (see section "8 Protocol Handler").
11.4 Serializers
----------------
They definitely convert XML to another format and therefore behave like
converters.
Here's an example:
<map:serialize type="svg2png" mime-type="image/png"/>
Maps to:
<map:convert type="svg2png" input-format="/text/xml/svg"
output-format="/binary/png"/>
The other tasks of a serializer, like preparing the response of the pipeline
(HTTP headers, mime-type, ...), are done by the respective protocol handlers,
which for example append the following components to the end of the consumer
pipeline (see section "8 Protocol Handler"):
<map:convert type="http/any2response" input-format="any"
output-format="/text/http/response"/>
<map:consume type="http/response" input-format="/text/http/response"/>
11.5 Selectors
--------------
The functionality of <map:select>...</map:select> is fully supported by the
more flexible <map:branch>...</map:branch> concept and can be easily
converted.
Here's an example:
<map:select type="browser">
<map:when test="wap">
...
</map:when>
<map:when test="netscape">
...
</map:when>
<map:otherwise>
...
</map:otherwise>
</map:select>
Maps to:
<map:branch type="browser">
<map:case match="wap">
...
</map:case>
<map:case match="netscape">
...
</map:case>
<map:default>
...
</map:default>
</map:branch>
12 Use Cases
============
This section gives you some examples which show you the possibilities of this
proposed architecture.
NOTE:
To make the examples easier to understand, I've included the
input/output-format attributes on some of the pipeline components. Keep in
mind that you don't need to specify them every time. Usually you'll only
define them once per component in the components section, or they are
implicitly set by surrounding components or the pipeline itself.
12.1 File Upload
----------------
This example uploads an HTML news file, extracts the XML content and stores it
in an XML database.
<map:pipeline input-format="/text/sgml/html">
<map:match pattern="upload/news/*.html">
<map:convert type="html2xhtml" output-format="/text/xml/xhtml"/>
<map:filter type="xml/xslt" ref="xhtml2news.xsl"
output-format="/text/xml/newsml"/>
<map:consume type="uri"
ref="xmldb:xindice://localhost:4080/db/news/{1}.xml"/>
</map:match>
</map:pipeline>
12.2 Combining several pipelines
--------------------------------
In this example we are combining 3 pipelines:
1. This one generates data in a certain format:
<map:pipeline output-format="/text/sgml/html">
<map:match pattern="news/*.html">
<map:produce type="uri" ref="documents/news/{1}.xml"
output-format="/text/xml/newsml"/>
<map:filter type="xml/xslt" ref="stylesheets/news2xhtml.xsl"
output-format="/text/xml/xhtml"/>
<map:convert type="xhtml2html" output-format="/text/sgml/html"/>
</map:match>
</map:pipeline>
2. This one consumes data in a certain format:
<map:pipeline input-format="/text/sgml/html">
<map:match pattern="upload/news/*.html">
<map:convert type="html2xhtml" output-format="/text/xml/xhtml"/>
<map:filter type="xml/xslt" ref="xhtml2news.xsl"
output-format="/text/xml/newsml"/>
<map:consume type="uri"
ref="xmldb:xindice://localhost:4080/db/news/{1}.xml"/>
</map:match>
</map:pipeline>
3. This one references both pipelines and combines them into a new one:
<map:pipeline>
<map:match pattern="replicate/news/*.html">
<map:produce type="uri" ref="cocoon:/news/{1}.html"/>
<map:consume type="uri" ref="cocoon:/upload/news/{1}.html"/>
</map:match>
</map:pipeline>
12.3 Unix Pipes
---------------
This is a universal filter pipeline, which counts the number of lines of text
data flowing through the pipeline. The optional argument can be used to grep
each line.
<map:pipeline input-format="/text" output-format="/text">
<map:match pattern="filter/count/lines/**">
<map:filter type="text/grep"> <!-- unix grep (regular expression filter)
-->
<map:parameter name="pattern" value="{1}"/>
</map:filter>
<map:filter type="text/wc"> <!-- unix wc (word count) -->
<map:parameter name="mode" value="linecount"/>
</map:filter>
</map:match>
</map:pipeline>
This pipeline uses the filter from above to analyze Apache's access_log for
certain requests:
<map:pipeline output-format="/text">
<map:match pattern="statistics/forms/*">
<map:produce ref="file:///var/log/httpd/access_log"/> <!-- like unix cat
(list file contents) -->
<map:filter ref="cocoon:/filter/count/lines/forms/login.html"/> <!-- unix
grep followed by wc (count the matching lines) -->
<!-- Result is the number of requests to the file /forms/login.html in the
Apache access log -->
</map:match>
</map:pipeline>
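For readers who think in code rather than in shell commands, the two filters correspond roughly to the following Python sketch. The function names and the sample log lines are made up for illustration; they only show how a grep-like filter and a line counter compose, not how such components would be implemented in Cocoon.

```python
import re
from typing import Iterable, Iterator


def grep(lines: Iterable[str], pattern: str) -> Iterator[str]:
    """Pass through only the lines that match the regular expression."""
    regex = re.compile(pattern)
    return (line for line in lines if regex.search(line))


def count_lines(lines: Iterable[str]) -> int:
    """Count the lines flowing through, like "wc -l"."""
    return sum(1 for _ in lines)


# Made-up sample data standing in for an access log.
SAMPLE_LOG = [
    '10.0.0.1 - - "GET /forms/login.html HTTP/1.0" 200',
    '10.0.0.2 - - "GET /index.html HTTP/1.0" 200',
    '10.0.0.3 - - "POST /forms/login.html HTTP/1.0" 302',
]

# Chain the two filters, as the pipeline above does.
login_requests = count_lines(grep(SAMPLE_LOG, re.escape("forms/login.html")))
```

Because grep returns a lazy generator, the log is streamed through both steps line by line, which is close in spirit to data flowing through a pipeline.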
12.4 Image Processing
---------------------
This pipeline takes several image formats and converts them to the abstract
image format, which can be used by format-independent image filters:
<!-- Since we don't know the concrete image format for the input we have to
use 'any' -->
<map:pipeline input-format="any" output-format="/abstract/image">
<map:match pattern="convert/to-image/*.*">
<map:branch test="{2}">
<map:case match="jpg|jpeg|JPG|JPEG">
<map:convert type="jpg2image" input-format="/binary/jpeg"/>
</map:case>
<map:case match="gif|GIF">
<map:convert type="gif2image" input-format="/binary/gif"/>
</map:case>
<map:default>
<map:throw type="input-format" message="{2} is not a supported input
image type."/>
</map:default>
</map:branch>
</map:match>
</map:pipeline>
This pipeline takes the abstract image format and converts it to certain
specific image formats:
<!-- Since we don't know the concrete image format for the output we have to
use 'any' -->
<map:pipeline input-format="/abstract/image" output-format="any">
<map:match pattern="convert/from-image/*.*">
<map:branch test="{2}">
<map:case match="jpg|jpeg|JPG|JPEG">
<map:convert type="image2jpg" output-format="/binary/jpeg"/>
</map:case>
<map:case match="gif|GIF">
<map:convert type="image2gif" output-format="/binary/gif"/>
</map:case>
<map:default>
<map:throw type="output-format" message="{2} is not a supported output
image type."/>
</map:default>
</map:branch>
</map:match>
</map:pipeline>
This is an example for an abstract image filter pipeline, which is independent
from the specific image data format. It prepares an image for character
recognition:
<map:pipeline input-format="/abstract/image" output-format="/abstract/image">
<map:match pattern="filter/image/prepare-ocr">
<map:filter type="image/histogram">
<map:parameter name="equalize" value="full"/>
</map:filter>
<map:filter type="image/2greyscale" />
<map:filter type="image/2bw">
<map:parameter name="method" value="threshold"/>
<map:parameter name="level" value="0.5"/>
</map:filter>
</map:match>
</map:pipeline>
This pipeline invokes the pipelines from above and shows how these pipelines
can be reused as pipeline components themselves:
<!-- Since we don't know the image format we have to use 'any' as input and
output format -->
<map:pipeline input-format="any" output-format="any">
<map:match pattern="filter/any-image/prepare-ocr/*">
<map:convert ref="cocoon:/convert/to-image/{1}"/>
<map:filter ref="cocoon:/filter/image/prepare-ocr"/>
<map:convert ref="cocoon:/convert/from-image/{1}"/>
<!--
Since the output format of the converter above is a certain image data
format,
it overrides the default for this pipeline (any).
-->
</map:match>
</map:pipeline>
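To give an idea of what the abstract image filters in "filter/image/prepare-ocr" might do internally, here is a small Python sketch. It assumes an abstract image is simply a list of rows of (r, g, b) tuples with values from 0 to 255, converts to greyscale by plain channel averaging and leaves out the histogram equalization step; all of this is a simplification chosen for this sketch, not part of the proposal.

```python
from typing import List, Tuple

RGBImage = List[List[Tuple[int, int, int]]]
GreyImage = List[List[float]]


def to_greyscale(image: RGBImage) -> GreyImage:
    """Average the three channels and scale to the range 0.0 .. 1.0."""
    return [[(r + g + b) / (3 * 255) for (r, g, b) in row] for row in image]


def to_black_and_white(image: GreyImage, level: float = 0.5) -> List[List[int]]:
    """Threshold method: 1 (white) at or above the level, else 0 (black)."""
    return [[1 if value >= level else 0 for value in row] for row in image]


def prepare_ocr(image: RGBImage) -> List[List[int]]:
    # The histogram equalization from the sitemap example is left out here.
    return to_black_and_white(to_greyscale(image), level=0.5)
```

Since both functions only work on the abstract pixel representation, they do not care whether the image originally came from a JPEG or a GIF file, which is the point of the abstract image format.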
12.5 PDF decompiling
--------------------
This pipeline decompiles a PDF document into an intermediate XML format (see
[4]), transforms it to a custom XML format (extract data) and stores it in an
XML database. Depending on the success state, the client gets redirected to
different response pages.
<map:pipeline input-format="/binary/pdf">
<map:match pattern="import/*.pdf">
<map:convert type="pdf2xml" output-format="/text/xml/pdf-xml"/>
<!-- Here we have an intermediate XML stream -->
<map:filter type="xml/xslt" ref="stylesheets/pdfxml2docxml.xsl"/>
<!-- Here we have an XML stream with the extracted information -->
<map:consume type="uri"
ref="xmldb:xindice://localhost:4080/db/news/{1}.xml"/>
<map:branch type="consume/status">
<map:case match="success">
<map:redirect-to uri="success-page"/>
</map:case>
<map:default>
<map:redirect-to uri="error-page"/>
</map:default>
</map:branch>
</map:match>
</map:pipeline>
12.6 Music Processing
---------------------
This pipeline generates a printable music score from a MIDI file (without
XML):
<map:pipeline input-format="/binary/midi" output-format="/binary/pdf">
<map:match pattern="convert/midi2pdf/*">
<map:convert type="midi2musixtex" output-format="/text/tex/musixtex"/>
<map:convert type="tex2dvi" input-format="/text/tex"
output-format="/binary/dvi"/>
<map:convert type="dvi2pdf" output-format="/binary/pdf"/>
</map:match>
</map:pipeline>
The next pipeline uses MidiXML, an XML format which is part of MusicXML and is
available for representing music data (see [5] and [6]). It converts the
binary MIDI format to MidiXML, selects the keyboard channel, transposes it 5
pitches up and converts it back to the MIDI format.
<map:pipeline input-format="/binary/midi" output-format="/binary/midi">
<map:match pattern="filter/custom/*">
<map:convert type="midi2xml" output-format="/text/xml/midixml"/>
<map:filter type="midixml/select-channel">
<map:parameter name="name" value="keyboard"/>
</map:filter>
<map:filter type="midixml/transpose">
<map:parameter name="value" value="+5"/>
</map:filter>
<map:convert type="xml2midi" output-format="/binary/midi"/>
</map:match>
</map:pipeline>
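The two filters in the middle of this pipeline could behave roughly like the following Python sketch. It assumes that note events are plain dictionaries with a "channel" name and a MIDI "note" number, and that "5 pitches up" means five semitones; the event layout and function names are invented for this sketch and do not reflect the actual MidiXML structure.

```python
from typing import Dict, List, Union

Event = Dict[str, Union[str, int]]


def select_channel(events: List[Event], name: str) -> List[Event]:
    """Keep only the note events of the named channel."""
    return [event for event in events if event["channel"] == name]


def transpose(events: List[Event], value: int) -> List[Event]:
    """Shift every note by `value` semitones, clamped to the MIDI range 0..127."""
    return [
        {**event, "note": max(0, min(127, int(event["note"]) + value))}
        for event in events
    ]
```

Chaining them as transpose(select_channel(events, "keyboard"), 5) mirrors the order of the two <map:filter/> elements above.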
13 Conclusion
=============
You might ask why we should change so much in Cocoon.
First I think the new components are much more flexible and at least as easy
to understand as the old ones: If you want to produce a data stream you use a
producer, if you want to consume it you use a consumer, if you want to
convert it you use a converter and if you want to filter it you use a filter.
To control the data flow you can use the <map:branch/> component.
A possible migration path could be to support both sitemap versions, since the
pipeline components either have different names or provide the same
functionality. So a new sitemap implementation could be backward compatible
to older sitemap versions. This could make the transition for the user as
easy as possible.
Additionally it might be possible to provide a migration script (e.g. via XSL)
which reads an old sitemap and converts it to the new format. Since
everything from the old sitemap can be expressed in the new sitemap and can
be formally translated (see section "11 Converting old sitemaps to new
sitemaps") this should not be a big issue.
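As a rough feasibility check, part of such a translation can be sketched in a few lines of Python instead of XSL. The sketch below covers only generators, transformers and selectors from section 11, assumes the old sitemap uses the namespace http://apache.org/cocoon/sitemap/1.0, and takes the new namespace from the sketch in section 10.7. The function name migrate and the attribute handling are assumptions for illustration; readers and serializers would additionally need a table of MIME types and data formats.

```python
import xml.etree.ElementTree as ET

OLD_NS = "http://apache.org/cocoon/sitemap/1.0"
NEW_NS = "http://apache.org/cocoon/sitemap/3.0"  # namespace proposed in 10.7

# Old element name -> new element name (subset of section 11).
RENAMES = {
    "generate": "produce",
    "transform": "filter",
    "select": "branch",
    "when": "case",
    "otherwise": "default",
}


def migrate(old: ET.Element) -> ET.Element:
    """Recursively copy `old`, rewriting old sitemap elements to new ones."""
    prefix = "{" + OLD_NS + "}"
    if not old.tag.startswith(prefix):
        # Elements from other namespaces are copied unchanged.
        new = ET.Element(old.tag, dict(old.attrib))
    else:
        name = old.tag[len(prefix):]
        attrib = dict(old.attrib)
        if name == "generate":
            attrib = {
                "type": "uri",
                "ref": attrib.get("src", ""),
                "output-format": "/text/xml",
            }
        elif name == "transform":
            attrib = {
                "type": "xml/" + attrib.get("type", "xslt"),
                "ref": attrib.get("src", ""),
            }
        elif name == "when":
            attrib = {"match": attrib.get("test", "")}
        new = ET.Element("{" + NEW_NS + "}" + RENAMES.get(name, name), attrib)
    new.text, new.tail = old.text, old.tail
    new.extend([migrate(child) for child in old])
    return new
```

Calling migrate() on the root element of a parsed old sitemap returns a new tree that can be written out with ET.tostring(); elements the sketch does not know about keep their name and attributes and are only moved to the new namespace.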
14 TODO
=======
1. Which concrete role do the data handlers play?
Do we need an input and output data handler or just one?
Do we need data handlers at all?
2. Define and manage a list of data formats (central internet repository?)
Perhaps it's possible to coordinate the work for MIME types and data
formats.
3. The number of components possibly explodes very fast.
Therefore we should take care to design good package structures and
namespaces to overcome this problem.
4. The protocol handlers have to be worked out more precisely.
5. The parameters of a data format actually reflect its meta data.
Support for RDF/OWL (see [7] and [8]) would definitely make sense to get
one step further to the semantic web.
15 References
=============
[1] [RT] Input Pipelines (long) (thread on cocoon-dev initiated by Daniel
Fagerstrom on Dec 17th 2002)
http://www.mail-archive.com/cocoon-dev@xml.apache.org/msg25503.html
[2] MIME Media Types
http://www.iana.org/assignments/media-types/
[3] XML Schema Datatypes
http://www.w3.org/TR/xmlschema-2/
[4] JPedal
Open Source library written in Java which can extract data from PDF
documents.
http://www.jpedal.org/
[5] MusicXML
http://www.recordare.com/xml.html
[6] XEMO
http://www.xemo.org/
Project XEMO is an open source, modular software environment for the
development and delivery of interactive music, audio and sound applications.
It is written in Java and supports MusicXML.
[7] RDF - Resource Description Framework
http://www.w3.org/RDF/
[8] OWL - Web Ontology Language based on RDF
http://www.w3.org/TR/2002/WD-owl-ref-20021112/
16 Appendix
===========
This new architecture opens up a whole new level of flexibility and
integration for data processing in general, which so far has only been
possible for XML processing. Here I'd like to give you an idea of some
further data formats and components, and I'm sure you can think of even
more. Remember: Only your mind is the limit ;-)
16.1 Data Formats
-----------------
In this section you can find a proposed list of data formats, which gives an
overview of how they could be structured.
No data format:
- none (used, if nothing is produced/consumed)
Super data format:
- any (base data format for abstract, binary and text)
Abstract data formats (used by components which are independent of a concrete
file format):
- /abstract/image
- /abstract/music
- /abstract/sound
- /abstract/vector
- /abstract/vector/3d
- /abstract/video
Binary data formats:
- /binary
- /binary/au
- /binary/avi
- /binary/avi/indeo
- /binary/avi/indeo[4.1]
- /binary/avi/indeo[5.0]
- /binary/avi/divx
- /binary/bmp
- /binary/bmp/os2
- /binary/bmp/windows
- /binary/elf
- /binary/elf/executable
- /binary/elf/shared
- /binary/gif
- /binary/gif[87a]
- /binary/gif[89a]
- /binary/mp3
- /binary/mpeg
- /binary/ogg
- /binary/ole
- /binary/ole/msexcel
- /binary/ole/mspowerpoint
- /binary/ole/msword
- /binary/tiff
- /binary/tiff/jpeg
- /binary/tiff/lzw
- /binary/tiff/packbits
- /binary/tiff/zip
- /binary/wav
- /binary/...
Text data formats:
- /text
- /text/http
- /text/http/request
- /text/http/request[0.9]
- /text/http/request[1.0]
- /text/http/request[1.1]
- /text/http/response
- /text/http/response[0.9]
- /text/http/response[1.0]
- /text/http/response[1.1]
- /text/sgml
- /text/sgml/docbook
- /text/sgml/docbook/simple
- /text/sgml/html
- /text/sgml/html[3.0]
- /text/sgml/html[4.0]
- /text/sgml/html[4.1]
- /text/sgml/html/frameset
- /text/sgml/html/strict
- /text/sgml/html/transitional
- /text/tex
- /text/tex/latex
- /text/tex/musixtex
- /text/xml
- /text/xml/docbook
- /text/xml/docbook/simple
- /text/xml/rdf
- /text/xml/rdf/rss
- /text/xml/svg
- /text/xml/xhtml
- /text/xml/xhtml[1.0]
- /text/xml/xhtml[1.1]
- /text/...
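One way to read the lists above is that a format is derived from every format whose path is a prefix of its own path, with "any" as the common base and an optional parameter such as a version in square brackets. Under these assumptions (which are my reading of the lists, not a definition from section 5), a compatibility check could look like the following Python sketch; the function names are made up, and neither "none" nor formats that extend several base formats are covered.

```python
from typing import Optional, Tuple


def split_format(data_format: str) -> Tuple[str, Optional[str]]:
    """Split "/text/xml/xhtml[1.0]" into ("/text/xml/xhtml", "1.0")."""
    if data_format.endswith("]") and "[" in data_format:
        path, _, parameter = data_format[:-1].partition("[")
        return path, parameter
    return data_format, None


def is_a(data_format: str, expected: str) -> bool:
    """True if `data_format` equals `expected` or is derived from it."""
    if expected == "any":
        return True
    path, parameter = split_format(data_format)
    expected_path, expected_parameter = split_format(expected)
    if expected_parameter is not None:
        # A parameterised format only accepts the same path and parameter.
        return path == expected_path and parameter == expected_parameter
    return path == expected_path or path.startswith(expected_path + "/")
```

A component that declares input-format "/text/xml" would then accept "/text/xml/xhtml[1.0]", while a component that expects "/text/xml/xhtml[1.0]" would reject "/text/xml/xhtml[1.1]".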
16.2 Pipeline Components
------------------------
Image Processing (bringing Photoshop to Cocoon ;-):
- BlurFilter
- AquarellFilter
- NoiseFilter
- SharpenFilter
- ExtrudeFilter
- ReliefFilter
- HistogramFilter
- ...
Sound Processing (bringing Arts/SOX/Cubase to Cocoon ;-):
- EqualizerFilter
- DistortionFilter
- ChorusFilter
- DelayFilter
- FlangerFilter
- VolumeFilter
- PitchShifterFilter
- MixerAggregator
- SequenceAggregator
- MP32SoundConverter
- Sound2MP3Converter
- Ogg2SoundConverter
- Sound2OggConverter
- ...
Video Processing (bringing Premiere to Cocoon ;-):
- BlendingAggregator
- MixerAggregator
- EffectsFilter
- AVI2VideoConverter
- Video2AVIConverter
- MPG2VideoConverter
- Video2MPGConverter
- ...
For video processing it would be nice to be able to process the audio part of
the video with sound processing components and the image part of the video
with the image processing components (maximum component reuse!). This demands
that the abstract video data format is composed of the abstract sound format
and a sequence of abstract image formats which is done by extending both
/abstract/image and /abstract/sound formats in the declaration of
/abstract/video (see section "5 Data Formats").
Vector Graphics Processing (bringing Corel Draw/Illustrator to Cocoon ;-):
- BooleanFilter (Union, intersection, ...)
- TranslationFilter (Move, rotate, resize, ...)
- VectorAggregator (Aggregate different vector graphics)
- SVG2VectorConverter
- Vector2SVGConverter
- WMF2VectorConverter
- Vector2WMFConverter
- CDR2VectorConverter
- Vector2CDRConverter
- ...
Music Processing (bringing Arts/Cubase/Capella/Sibelius/Finale to Cocoon ;-):
- PitchShifterFilter
- Midi2MusicConverter
- Music2MidiConverter
- Music2ImageConverter (render music score for printing)
- Image2MusicConverter (you know Capella Scan?)
- Music2SoundConverter (render music to synthesized sound)
- Sound2MusicConverter (extract music data from sound data)
- ...
3D Graphics Processing (bringing 3D Studio/POV-Ray to Cocoon ;-):
- TranslationFilter (Move, rotate, resize, ...)
- 3DAggregator (Aggregate different 3D graphics)
- ParticleFilter
- ExplosionFilter
- 3DS23DConverter
- 3D23DSConverter
- DXF23DConverter
- 3D2DXFConverter
- POV23DConverter
- 3D2POVConverter
- 3D2ImageConverter (render an image)
- 3D2VideoConverter (render an animated scene)
- ...
Re: [PROPOSAL] Cocoon Science Fiction
Posted by Nicola Ken Barozzi <ni...@apache.org>.
Andreas Hochsteger wrote, On 09/02/2003 21.47:
> Hi Cocooners!
>
> Sorry for this (very) long proposal below, but I think it's definitely worth a
> read. If not, at least you can give me some feedback about your opinion ;-)
First of all, let me compliment you on a well thought-out and
well written RT.
Second, I will try to reply in a shorter form ;-) , so please do add
things that I left out.
- blocks
blocks will/should be reusable pipelines, in a way similar to what
you envision.
- data formats
This was discussed already when Cocoon was being designed, and
the result is that the sitemap does not check for consistency
of what it is given. This "feature" has never really been a
problem IMHO, so I'd be reluctant to introduce this concept
strongly in the sitemap. A simple validation-transformer in
the right places should suffice.
- binary data
There you are :-)
Using Cocoon for binary data transformation is easier than many
may think. Change our pipeline implementation to pipe the
result of one reader into another, and it's done. (sort of ;-)
I have been investigating such a system with Morphos.
- description:
http://cvs.apache.org/viewcvs.cgi/*checkout*/jakarta-commons-sandbox/morphos/src/java/org/apache/commons/morphos/package.html?rev=HEAD&content-type=text/html
- code:
http://cvs.apache.org/viewcvs/jakarta-commons-sandbox/morphos/
I would like to move the effort here, if others agree.
It would go in the scratchpad,
<hint> or in a brand-new "sandbox" </hint>
- branches
This has been proposed too and, IIRC, strongly rejected by
some as FS (flexibility syndrome). This is true for publishing,
but a lot less so in the flexible transformation engine you envision.
I'd keep this last in the list ;-)
- protocol independence
We have already, as you say, an environment. What you propose is
to use the same Cocoon instance as a back-end to multiple
simultaneous protocol frontends (mail, http, etc).
Cocoon is a transformation system, so it should not really itself
bother about how to get the data, i.e. it's not a server.
Would it be a compelling use-case to use a single Cocoon instance
with multiple protocols? Not sure, I don't have the need now...
--
Nicola Ken Barozzi nicolaken@apache.org
- verba volant, scripta manent -
(discussions get forgotten, just code remains)
---------------------------------------------------------------------
Re: [PROPOSAL] Cocoon Science Fiction
Posted by Neeme Praks <ne...@apache.org>.
I agree with Stefano on his point that if this kind of machinery is to
be done, then it should not be done inside cocoon.
Instead, it would be much more sensible to do this on a more general
level: write an Avalon-based server for dealing with any data type, into which
it would be possible to plug a Cocoon "block" as well.
Then you could have a block interface like this:
serialized data -> block -> serialized data
And Cocoon could be just one of the many implementations for this.
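A minimal sketch of that contract, written in Python for brevity. The names `Block`, `UppercaseBlock`, `ReverseBlock` and `run_chain` are hypothetical illustrations and are not part of Cocoon or Avalon:

```python
from typing import Iterable, Protocol


class Block(Protocol):
    """Hypothetical contract: serialized data in, serialized data out."""

    def process(self, data: bytes) -> bytes:
        ...


class UppercaseBlock:
    """Toy block that treats its input as UTF-8 text and upper-cases it."""

    def process(self, data: bytes) -> bytes:
        return data.decode("utf-8").upper().encode("utf-8")


class ReverseBlock:
    """Toy block that works on raw bytes and knows nothing about text."""

    def process(self, data: bytes) -> bytes:
        return data[::-1]


def run_chain(blocks: Iterable[Block], data: bytes) -> bytes:
    """Feed the serialized output of each block into the next one."""
    for block in blocks:
        data = block.process(data)
    return data
```

Under this assumption a Cocoon-backed block would be just another object with a `process` method; whatever structure it uses internally (SAX events, for instance) stays hidden behind the serialized boundary.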
Rgds,
Neeme
Stefano Mazzocchi wrote:
[..snip...]
---------------------------------------------------------------------
Re: [PROPOSAL] Cocoon Science Fiction
Posted by Stefano Mazzocchi <st...@apache.org>.
Andreas Hochsteger wrote:
> Hi Cocooners!
>
> Sorry for this (very) long proposal below, but I think it's definitely worth a
> read. If not, at least you can give me some feedback about your opinion ;-)
Andreas,
thanks for taking the time to write this. It is much appreciated. See
my personal comments inline. NOTE: they are 'personal' comments and must
be treated as such; they never represent the Cocoon development
community, only my personal vision of things.
[snip]
> WARNING:
> I have to say that this proposal is intended for open-minded people only,
> who aren't afraid to take a look beyond the limits.
I think I can state I'm not afraid to look beyond limits, especially my
own, especially those I can't see until others point them out to me. At the same
time, I like not to turn off my 'critical mode' while I do so. Please
don't misinterpret this as fear of going forward, but as caution in
doing so.
[snip]
> 3 Introduction
> ==============
>
> I like the Cocoon pipeline processing concept very much.
> I like it so much, that I think it is a pity to limit it only to XML
> processing (although I agree that this is the most interesting
> application).
These two sentences are antithetical and/or imprecise.
The Cocoon pipeline model is different from the more general
Pipe&Filters design pattern because it deals with structured data,
unlike P&F, which deals with non-structured data.
The Cocoon pipeline is *not* literally limited to XML. It is entirely
possible to have not-well-formed XML content flow into the pipeline
(even if this is avoided as a general pattern).
It is correct to say that cocoon pipelines are limited to SAX events and
SAX events are a particular kind of structured data.
With these corrections, you are basically stating that limiting pipelines
to a particular type of structured data is limiting.
While I understand your concept, I strongly disagree: SAX provides a
multidimensional structured data space which is suitable for *any* kind
of data structure.
True, maybe not as efficiently as other formats, but removing a fixed
contract between pipeline components will require a pluggable and
metadata-driven parsing/serialization stage between each component.
I don't see any value in this compared to the current approach of SAX
adaptation of external data to the internal model.
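To make "SAX adaptation of external data" concrete, here is a small sketch using Python's standard `xml.sax` module rather than Cocoon's Java generators. The function `csv_to_sax`, the class `CellCounter` and the element names `rows`, `row` and `cell` are assumptions made for illustration only:

```python
import csv
import io
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

NO_ATTRS = AttributesImpl({})


def csv_to_sax(text: str, handler: ContentHandler) -> None:
    """Adapt non-XML input (CSV text) into a stream of SAX events."""
    handler.startDocument()
    handler.startElement("rows", NO_ATTRS)
    for row in csv.reader(io.StringIO(text)):
        handler.startElement("row", NO_ATTRS)
        for cell in row:
            handler.startElement("cell", NO_ATTRS)
            handler.characters(cell)
            handler.endElement("cell")
        handler.endElement("row")
    handler.endElement("rows")
    handler.endDocument()


class CellCounter(ContentHandler):
    """A second consumer: it only counts <cell> start events."""

    def __init__(self) -> None:
        super().__init__()
        self.cells = 0

    def startElement(self, name, attrs):
        if name == "cell":
            self.cells += 1


if __name__ == "__main__":
    out = io.StringIO()
    # XMLGenerator is a stock SAX consumer that serializes events to text.
    csv_to_sax("a,b\n1,2\n", XMLGenerator(out, encoding="utf-8"))
    print(out.getvalue())
```

The adapter is written once per external format, and every downstream component (a serializer like `XMLGenerator`, a counter, a transformer) consumes the same event contract without knowing the data started life as CSV.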
> I'm sure some of you wanted to be able to build applications the same way
> Unix shell pipes work. Cocoon was a big step in this direction, but it was
> only applicable for processing XML data.
*only XML* is misleading. *based on SAX* is the sentence. I've never
perceived this as a limitation, but as a paradigm shift.
Topologically speaking, the solution space is rotated, but its size is
not reduced.
> There are so many cases where
> pipeline processing of data (no matter if it is XML, plain text or binary
> data) is done today but we are lacking a generic and declarative way to unify
> these processing steps. Cocoon is best suited for this task through its
> clean and easy to understand yet powerful pipeline concept.
If you want to create pipelines for general data, why use Cocoon? Just
use the UNIX pipe or use servlet filters or Apache 2.0 modules or any
type of 'byte-oriented' (thus unstructured data) pipe&filters modules.
If you remove the structure from the data that flows through the pipeline, Cocoon
will not be Cocoon anymore. This is not evolution, it is extinction.
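For contrast, a byte-oriented pipe&filters chain of the kind mentioned above can be sketched with plain generators. The names `pipe`, `strip_cr` and `upper` are hypothetical, and nothing here is Cocoon API:

```python
from typing import Callable, Iterable, Iterator

# A byte-oriented filter sees only opaque chunks, never elements or events.
ByteFilter = Callable[[Iterable[bytes]], Iterator[bytes]]


def strip_cr(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Drop carriage returns from every chunk."""
    for chunk in chunks:
        yield chunk.replace(b"\r", b"")


def upper(chunks: Iterable[bytes]) -> Iterator[bytes]:
    """Upper-case ASCII letters in every chunk."""
    for chunk in chunks:
        yield chunk.upper()


def pipe(source: Iterable[bytes], *filters: ByteFilter) -> Iterator[bytes]:
    """Chain filters lazily, in the spirit of a UNIX shell pipe."""
    stream: Iterator[bytes] = iter(source)
    for byte_filter in filters:
        stream = byte_filter(stream)
    return stream


if __name__ == "__main__":
    print(b"".join(pipe([b"ab\r\n", b"cd"], strip_cr, upper)))
```

Chunk boundaries here are arbitrary, so a filter that needs to reason about an element, a frame or a record has to re-parse the stream itself. That re-parsing is the cost the fixed SAX contract avoids.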
> 4 Pipeline Types
> ================
>
> I tried to design several pipeline variants but after thinking a while they
> all were still too limited for the way I wanted them to work.
>
> So here's another try by giving some hypotheses first:
> 1. A pipeline can produce data
> 2. A pipeline can consume data
> 3. A pipeline can convert data
> 4. A pipeline can filter data
> 5. A pipeline can accept a certain data format as input
> 6. A pipeline can produce a certain data format as output
> 7. Pipeline components follow the same hypotheses (1-6)
> 8. Only pipeline components with compatible data formats can be arranged next
> to each other
Ah, here you hint that you don't want to remove data structured-ness in
the pipeline, just want to add *other* data structures besides SAX events.
Ok, this is worth investigating.
> Based on these hypotheses you can construct pipelines, which just consume
> data, just produce data, both consume and produce data or even neither
> consume nor produce data (even this can make sense, as you'll see in section
> "9.5 Action Pipelines").
> I think these hypotheses are simple enough to understand and flexible enough
> to base this further proposal on. So let's try ...
>
> To define a pipeline we need to be able to specify the input and output
> format.
> We can do this by the help of these two attributes:
> - input-format="..."
> - output-format="..."
>
> They additionally specify the default input format for the first processing
> component and the default output format for the last processing component.
>
> Example:
> <map:pipeline input-format="format1" output-format="format2">
> ...
> </map:pipeline>
>
> This pipeline consumes the data format "format1" and produces the data format
> "format2". Which data formats are possible and how they are specified is
> shown in the next section.
>
>
> 5 Data Formats
> ==============
>
> With "data format" I mean something like XML, plain text, png, mp3, ...
> I'm not yet really sure here, how we should specify data formats, so I'll try
> to start with some requirements:
> 1. They should be easy to remember and to specify ;-)
> 2. It should be possible to create derived data formats (-> inheritance)
> 3. It should be possible to specify additional information (e.g. MIME type,
> DTD/Schema for XML, ...)
> 4. Pipelines which accept a certain data format as input can be fed with
> derived data formats
> 5. We should not reinvent standards, which are already suited for this task
> (but I fear, there does not yet exist something suitable)
You are asking for a very abstract parsing grammar. Note, however, that
it is pretty easy to point to examples where these grammars would have to be
so complex that maintaining them would be a nightmare.
Think of a BNF-like grammar that is able to explain concepts like XML
namespacing or HyTime Architectural Forms.
> To make it easier for us to begin with the task of defining data formats,
> let's assume, we have three basic data formats called "abstract", "binary"
> and "text". The format "abstract" will be explained later, but "binary" and
> "text" should be clear to everyone.
Binary and text are unstructured data streams. You are falling back.
> 5.1 Data Format Definition
> --------------------------
>
> Here's a try to specify a hierarchy of data formats:
>
> <data:formats>
> <!-- #### Super data format #### -->
> <!--
> The following format is the base for all other formats (-> compare to
> java.lang.Object)
> Although it is called the 'any' data format, this name is not prepended to the
> paths of derived data formats, unlike the names of all other formats.
> -->
> <data:format name="any"
> impl="org.apache.cocoon.data.handler.text.DefaultHandler">
> <data:param-def name="mime-type" default="application/octet-stream"/>
> <data:param-def name="spec" default=""/> <!-- URL to the specification of
> this data format -->
> </data:format>
>
> <!-- #### Abstract data formats #### -->
> <data:format name="abstract"
> impl="org.apache.cocoon.data.handler.abstract.DefaultHandler"/>
> <data:format name="image" extends="/abstract"
> impl="org.apache.cocoon.data.handler.abstract.ImageHandler">
> <data:param-def name="depth" default=""/>
> <data:param-def name="width" default=""/>
> <data:param-def name="height" default=""/>
> </data:format>
> <data:format name="music" extends="/abstract"
> impl="org.apache.cocoon.data.handler.abstract.MusicHandler">
> <data:param-def name="channels" default=""/>
> </data:format>
> <data:format name="sound" extends="/abstract"
> impl="org.apache.cocoon.data.handler.abstract.SoundHandler">
> <data:param-def name="samplesize" default=""/>
> <data:param-def name="samplerate" default=""/>
> <data:param-def name="channels" default=""/>
> </data:format>
> <!--
> Multiple inheritance is used for video, which extends image and sound.
> Is there a better way to specify multiple base formats? -->
> <data:format name="video" extends="/abstract/image /abstract/sound"
> impl="org.apache.cocoon.data.handler.abstract.VideoHandler">
> <data:param-def name="framerate" default=""/>
> </data:format>
> <data:format name="vector" extends="/abstract"
> impl="org.apache.cocoon.data.handler.abstract.VectorHandler">
> <data:param-def name="unit" default=""/>
> <data:param-def name="width" default=""/>
> <data:param-def name="height" default=""/>
> </data:format>
> <data:format name="3d" extends="/abstract/vector"
> impl="org.apache.cocoon.data.handler.abstract.3DHandler">
> <data:param-def name="depth" default=""/>
> </data:format>
>
> <!-- #### Binary based data formats #### -->
> <data:format name="binary"
> impl="org.apache.cocoon.data.handler.binary.DefaultHandler">
> <data:param-def name="endian" default="little"/>
> </data:format>
>
> <!-- MS OLE based data formats -->
> <data:format name="ole" extends="/binary"
> impl="org.apache.cocoon.data.handler.binary.ole.DefaultHandler"/>
> <data:format name="msword" extends="/binary/ole"
> impl="org.apache.cocoon.data.handler.binary.ole.MSWordHandler"/>
> <data:format name="msexcel" extends="/binary/ole"
> impl="org.apache.cocoon.data.handler.binary.ole.MSExcelHandler"/>
>
> <!-- Linux ELF based data formats -->
> <data:format name="elf" extends="/binary"
> impl="org.apache.cocoon.data.handler.binary.elf.DefaultHandler">
> <data:param-def name="architecture" default="x86"/>
> </data:format>
> <data:format name="executable" extends="/binary/elf"
> impl="org.apache.cocoon.data.handler.binary.elf.ExecutableHandler"/>
> <data:format name="shared" extends="/binary/elf"
> impl="org.apache.cocoon.data.handler.binary.elf.SharedLibraryHandler"/>
>
> <!-- #### Text based data formats #### -->
> <data:format name="text"
> impl="org.apache.cocoon.data.handler.text.DefaultHandler">
> <data:param-def name="encoding" default="UTF-8"/>
> <data:parameter name="mime-type" value="text/plain"/>
> </data:format>
> <data:format name="xml" extends="/text"
> impl="org.apache.cocoon.data.handler.xml.DefaultHandler">
> <!-- this handler deals with SAX events inside the pipeline -->
> <data:param-def name="schema-type" default="xsd"/> <!-- other possible
> values: dtd, rng, schematron, ... -->
> <data:param-def name="schema" default=""/>
> <data:parameter name="mime-type" value="text/xml"/>
> </data:format>
> <data:format name="xhtml" extends="/text/xml"
> impl="org.apache.cocoon.data.handler.xml.XHTMLHandler">
> <data:parameter name="mime-type" value="text/html"/>
> <data:parameter name="schema"
> value="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"/>
> </data:format>
> </data:formats>
>
> It's just a first sketch, but I think you got the idea.
>
> Above you can see the super data format 'any', some abstract, text and binary
> data formats, which show you how to specify inherited data formats. If no
> extends="..." attribute is given, it is automatically derived from the data
> format 'any'.
>
> References to data formats are done by using a path which specifies the
> respective data format. This path is built by appending the specified data
> format name to the path of the parent data format, separated by a slash. The
> super data format is an exception to this rule and is just called 'any'. It
> is not part of the path for derived data formats to make them shorter. It is
> possible to use relative data format paths too. E.g. a pipeline consumes
> /text/xml, a converter generates XHTML from it and thus can use
> output-format="xhtml" instead of output-format="/text/xml/xhtml". The name
> 'any' is reserved only for the super data format and it is not allowed to
> name derived data formats after it.
>
> 'none' is another reserved name which is used if a pipeline does not consume
> data (input-format="none") or produce data (output-format="none"). It is the
> default for all pipelines, if it is not overwritten by pipelines or their
> components.
>
>
> The examples from above can be used by using the following strings for
> specifying data formats:
>
> - any
> - /abstract/image
> - /abstract/music
> - /abstract/sound
> - /abstract/video
> - /abstract/vector
> - /abstract/vector/3d
> - /binary
> - /binary/ole
> - /binary/ole/msword
> - /binary/ole/msexcel
> - /binary/elf
> - /binary/elf/executable
> - /binary/elf/shared
> - /text
> - /text/xml
> - /text/xml/xhtml
>
> See section "16.1 Data Formats" for more examples.
>
> One enhancement of this scheme might be useful: Specification of version
> numbers or format variants.
> One way might be to append the version number to the end separated by a slash,
> but I think this will mix different concerns. My suggestion would be to
> specify them by appending the version information in brackets as the
> following shows:
>
> - /text/xml/xhtml[1.0]
> - /text/xml/xhtml[1.1]
>
> Instead of:
>
> - /text/xml/xhtml/1.0
> - /text/xml/xhtml/1.1
>
>
> 5.2 Inheritance
> ---------------
>
> A pipeline which consumes a certain data format can be fed with derived data
> formats too.
> Take the following pipeline as example:
>
> <map:pipeline input-format="/text/xml">
> ...
> </map:pipeline>
>
> This pipeline would consume the data format "/text/xml/xhtml" without
> problems, but leads to an exception if you feed it with the data format
> "/text".
>
>
> 5.3 A word about MIME Types
> ---------------------------
>
> If you ask me, why don't I use the standardized MIME types (see [2]) to
> specify data formats, I can give you the following reasons:
> MIME types fulfill the requirements from above only partly. They support just
> two levels of classification and they are purpose-oriented. The data formats
> I suggest are instead content-oriented (/text/xml/svg vs. image/svg+xml).
> So both serve different purposes.
>
> I know the importance of supporting the MIME type standard, and so the
> parameter 'mime-type' is part of the super data format 'any' and thus is
> available for every other data format too. By specifying a certain data
> format, you always have a MIME type associated, in the worst case the MIME
> type from the super data format 'any' (application/octet-stream) is used.
From what I see so far, you are describing nothing different (from an
architectural point of view) from what we already have.
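For readers following the quoted sections 5.1 and 5.2, the path and inheritance rules can be sketched as below. The table `PARENTS` copies a few parent links from the quoted `<data:formats>` sketch; the function names `accepts` and `resolve` are hypothetical and no such API exists in Cocoon:

```python
from typing import Dict, List

# Parent links taken from the quoted sketch. A format declared without an
# explicit "extends" derives from the super format "any".
PARENTS: Dict[str, List[str]] = {
    "/abstract": ["any"],
    "/abstract/image": ["/abstract"],
    "/abstract/sound": ["/abstract"],
    "/abstract/video": ["/abstract/image", "/abstract/sound"],
    "/text": ["any"],
    "/text/xml": ["/text"],
    "/text/xml/xhtml": ["/text/xml"],
}


def accepts(expected: str, actual: str) -> bool:
    """Return True if `actual` is `expected` or is derived from it."""
    if expected == actual:
        return True
    return any(accepts(expected, parent) for parent in PARENTS.get(actual, []))


def resolve(path: str, base: str) -> str:
    """Turn a relative format path into an absolute one against `base`."""
    if path in ("any", "none") or path.startswith("/"):
        return path
    return base.rstrip("/") + "/" + path


if __name__ == "__main__":
    print(accepts("/text/xml", "/text/xml/xhtml"))  # derived format is accepted
    print(accepts("/text/xml", "/text"))            # base format is rejected
    print(resolve("xhtml", "/text/xml"))
```

Note that `/abstract/video` needs an explicit parent table: with two base formats its path cannot be derived by appending the name to a single parent path, which is the open question the quoted XML comment raises.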
> 5.4 Data Handlers
> -----------------
>
> I'm not very sure, what the data handlers actually do, but I can think of
> either defining an interface, which must be implemented by the pipeline
> components which operate with a certain data format (do we need two handlers
> here: input-handler and output-handler?) or they are concrete components
> which can be used by the pipeline components to consume or produce this data
> format. I think some discussion on this topic might not be bad.
Here you hit the nerve.
If you plan on having a different interface of data-handling for each
data-type (or data-type family), the permutation of components will kill
you.
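One way to read the "permutation" remark, offered here as an interpretation rather than as Stefano's own arithmetic: if every data-type family has its own handler interface, components that bridge families must exist per ordered pair, whereas a single shared contract needs only one adapter in and one adapter out per family. The function names below are hypothetical:

```python
def pairwise_bridges(families: int) -> int:
    """Bridges needed when every ordered pair of families needs its own one."""
    return families * (families - 1)


def shared_contract_adapters(families: int) -> int:
    """Adapters needed with one common contract: one in, one out per family."""
    return 2 * families


if __name__ == "__main__":
    for n in (4, 10, 20):
        print(n, pairwise_bridges(n), shared_contract_adapters(n))
```

The first count grows quadratically and the second linearly, so the gap widens quickly as data-type families are added.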
> 5.5 Data Format Determination
> -----------------------------
>
> In many cases, I've written the input- and output-format along with the
> pipeline components, but it is also possible to specify them in the
> <map:components/> section or implicitly by implementing a certain component
> interface and therefore omitting it in the pipeline.
>
> Here's a suggested order of data format determination:
>
> 1. Input-/output-Format specified directly with a pipeline component
> <map:produce type="uri" ref="docs/file.xml" output-format="/text/xml"/>
> 2. Input-/output-Format specified by the component declaration
> <map:filters>
> <map:filter name="prettyxml" input-format="/text/xml"
> output-format="/text/xml" ... />
> </map:filters>
> 3. Output-/input-Format specified by the previous or following pipeline
> component
> <map:produce type="uri" ref="docs/file.xhtml"
> output-format="/text/xml/xhtml"/>
> <!-- input- and output-format="/text/xml/xhtml" from previous pipeline
> component -->
> <map:filter type="prettyxml"/>
> 4. Input-/output-Format specified directly with a pipeline
> <map:pipeline input-format="/text/xml" output-format="/text/xml">
> <map:filter type="prettyxml"/>
> ...
> </map:pipeline>
> 5. If nothing from above matches then assume "none".
eheh, I wish it was that easy ;-)
Suppose you have a component that operates on the svg: namespace of a
SAX stream only, what is the input type?
if data types are monodimensional, the above is feasible, but Cocoon
pipelines are *already* multi-dimensional and the above can't possibly
work (this has been discussed extensively before for pipeline validation)
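The determination order quoted in 5.5 amounts to a first-match lookup, sketched below with a hypothetical function `determine_format`. It models only the one-dimensional case, where a format is a single path string, which is exactly the assumption questioned above:

```python
from typing import Optional


def determine_format(
    on_component: Optional[str] = None,
    on_declaration: Optional[str] = None,
    from_neighbour: Optional[str] = None,
    on_pipeline: Optional[str] = None,
) -> str:
    """Return the first format found, following steps 1 to 5 of section 5.5."""
    for candidate in (on_component, on_declaration, from_neighbour, on_pipeline):
        if candidate is not None:
            return candidate
    return "none"  # step 5: nothing matched


if __name__ == "__main__":
    # A filter declared with /text/xml, placed after an XHTML producer.
    print(determine_format(on_declaration="/text/xml",
                           from_neighbour="/text/xml/xhtml"))
```

A component that only touches the svg: namespace of a mixed SAX stream has no single path to return here, so a lookup like this cannot express its input type.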
>
> 6 Pipeline Components
> =====================
[snip]
Assuming you have several structured pipelines:
- SAX -> all xml/sgml content
- output/input streams -> unstructured text/binary
- OLE -> all OLE-based files (word, excel, blah blah)
- MPEG -> all MPEG-based framed multimedia (MPEG1/2, mp3)
why would you want to mix them into the same system?
I mean, if you want to apply structured-pipeline architectures to, say,
audio editing, you are welcome to do so, but why in hell should Cocoon
have to deal with this?
You are very close to winning the prize for the FS-award of the year :)
It *would* make sense to add these complexities only if processing
performed in different realms could be interoperated. But I can't see how.
what does it mean to perform an XSLT transformation on a video stream?
what does it mean to perform audio mixing on an email?
It would not make any sense to add functionality to Cocoon that does
not belong in the realm of its problem space. It would only dilute the
effort with additional complexity, only for the sake of flexibility.
> 7 Protocol Independence
> =======================
>
> Currently Cocoon is tightly bound to certain protocols by running an instance
> of it in a certain environment (servlet, CLI) and it's not (easily) possible to
> handle different invocation protocols from the same instance. To abstract the
> transport protocols (through the use of certain consumers or producers) we
> already have a good working base. What is missing is binding a protocol to a
> certain port, but we should not duplicate work here, which is better left to
> other software like Apache or Tomcat. We just need to find a way (which I'm
> sure, that already exists somewhere) to serve different ports with different
> protocols. I think the Servlet specification is general enough to not only
> support HTTP/HTTPS and can help us here.
The servlet API is bound to the request/response paradigm and implicitly
assumes that response goes to the same address of the request. This is
not even close to being general enough for protocol abstraction.
> Given the case, that we have solved the port binding issue, we need some
> abstraction of the transport protocol. What I mean here is that I'd like to
> use pipelines independent from the way the request has been sent to Cocoon
> and how it has to be sent back to the client.
>
> To solve this we need something like a protocol handler, which maps requests
> from certain protocols to certain pipelines. The mapping itself is a very
> abstract thing and heavily depends on the used protocol.
This will make cocoon overlap with protocol-handling concerns.
> Let's assume, we even solved the protocol handler issue, I'd like to sketch
> some possible use cases below, before we continue.
>
>
> 7.1 Web Services
> ----------------
>
> As many of you know, there are two popular styles of Web Services:
> SOAP and REST.
> Both have their own advantages and disadvantages but I'd like to concentrate
> on SOAP and on its transport protocol independence, because REST-style Web
> Services are already possible with Cocoon.
>
> SOAP allows us to use any transport protocol to deliver SOAP messages. Mostly
> HTTP(S) is used for this, but there are many cases where you have to use
> other protocols (like SMTP, FTP, ...).
> Whatever protocol you choose to invoke your Web Services, the result should
> always be the same and the response should be delivered back through (mostly)
> the same protocol. This is one of the greatest advantages of protocol
> independence.
No, this is not protocol independence. This is transport independence;
you are still dependent on SOAP as a protocol.
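The distinction can be sketched as one message handler reached through two transport adapters. Everything below is a hypothetical illustration: `handle_message` stands in for a pipeline, and the dictionaries stand in for real HTTP requests and mail messages:

```python
from typing import Dict


def handle_message(body: str) -> str:
    """Transport-agnostic logic; it still depends on the message format."""
    return "<echo>" + body + "</echo>"


def via_http(request: Dict[str, str]) -> Dict[str, object]:
    """HTTP-like adapter: the response goes back on the same connection."""
    return {"status": 200, "body": handle_message(request["body"])}


def via_mail(message: Dict[str, str]) -> Dict[str, str]:
    """Mail-like adapter: the response is a new message to the sender."""
    return {
        "to": message["from"],
        "subject": "Re: " + message["subject"],
        "body": handle_message(message["body"]),
    }


if __name__ == "__main__":
    print(via_http({"body": "ping"}))
    print(via_mail({"from": "a@example.org", "subject": "test", "body": "ping"}))
```

Both adapters can be swapped freely, but `handle_message` keeps its dependency on the envelope it understands, which is the sense in which only the transport has been abstracted.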
> What you want to do now is to implement the Web Service as a bunch of
> pipelines and let the protocol handler be responsible for invoking the same
> pipeline no matter which protocol has been used.
>
>
> 7.2 Mail Server
> ---------------
>
> Nothing prevents you from implementing a mail server which can
> integrate various data sources and expose its functionality via the
> traditional protocols (SMTP, POP, IMAP) but also via HTTP, WAP, as a Web
> Service, and whatever else you want.
>
>
> 7.3 Mailing List Manager
> ------------------------
>
> Mailing list managers typically provide several functions (subscribe,
> unsubscribe, deliver mail, suspend, archive, search, ...) and manage a list
> of subscribed users. Once again, you can write such a service once and expose
> its functionality through traditional protocols (HTTP, SMTP, ...) but also
> as Web Service.
>
>
> 7.4 What else?
> --------------
>
> Perhaps you realize that this way you are free to implement every application
> you want by the use of the easy declarative pipeline processing concept. How
> to connect your application to the world outside is a separate issue which
> you can decide later and specify independently of the application.
>
>
> 8 Protocol Handler
> ==================
I don't think Cocoon should implement protocol handlers. Cocoon is a
data producer, should not deal with transport.
We already have enough problems trying to come up with an Environment
that could work with both email and web (which have orthogonal
client/server paradigms); I don't want to further increase the
complexity down this road.
[snip]
> 11 Converting old sitemaps to new sitemaps
> ==========================================
>
> Some of you might be interested, if this new concept is flexible enough to
> provide at least the same functionality as Cocoon does today.
Yes, I agree that the architecture you describe can be seen as an
'extension' of what Cocoon has today, and that it is therefore possible to rewrite
current sitemaps in the model you propose.
Yet I fail to see the advantage of doing so, since you don't gain any
functionality in the problem space where Cocoon lives.
> 12 Use Cases
> ============
you provide fancy use cases but they show me the power of the structured
pipe&filter design pattern, they don't tell me why we should do this in
cocoon.
because it's cool, or because it's doable are not very good arguments
around here.
> 13 Conclusion
> =============
>
> You might ask, why should we change so much from Cocoon?
exactly.
> First I think the new components are much more flexible and at least as easy
> to understand as the old ones: If you want to produce a data stream you use a
> producer, if you want to consume it you use a consumer, if you want to
> convert it you use a converter and if you want to filter it you use a filter.
that is your personal view and can't stand as an objective argument.
> To control the data flow you can use the <map:branch/> component.
>
> A possible migration path could be to support both sitemap versions, since the
> pipeline components either have different names or provide the same
> functionality. So a new sitemap implementation could be backward compatible
> to older sitemap versions. This could make the transition for the user as
> easy as possible.
>
> Additionally it might be possible to provide a migration script (e.g. via XSL)
> which reads an old sitemap and converts it to the new format. Since
> everything from the old sitemap can be expressed in the new sitemap and can
> be formally translated (see section "11 Converting old sitemaps to new
> sitemaps") this should not be a big issue.
You don't say *why* we should do this. What do we gain? why should we do
audio/video processing on the server side? why should we introduce
components that work on just one pipeline model and can't be shared with
others?
Oh, you definitely win my vote for the FS of the year award :)
--
Stefano Mazzocchi <st...@apache.org>
Pluralitas non est ponenda sine necessitate [William of Ockham]
--------------------------------------------------------------------
Re: [PROPOSAL] Cocoon Science Fiction
Posted by Andreas Hochsteger <e9...@student.tuwien.ac.at>.
Hi!
On Monday 10 February 2003 0:58, Niclas Hedhman wrote:
> Didn't read all right now (no time), and maybe you mention it towards the
> end, but shouldn't this be part of the "BLOCKS" handling??
No, I haven't mentioned it in the proposal yet, but I had the same thought
before.
It would definitely make sense to support this in blocks too (imagine
data formats as part of block interfaces!). But I think we should first
discuss this proposal a bit and what it would affect.
>
> Niclas
>
> On Monday 10 February 2003 04:47, Andreas Hochsteger wrote:
> > Hi Cocooners!
> >
> > Sorry for this (very) long proposal below, but I think it's definitely
> > worth a read. If not, at least you can give me some feedback about your
> > opinion ;-)
>
--
Bye,
Andreas Hochsteger
---------------------------------------------------------------------
Re: [PROPOSAL] Cocoon Science Fiction
Posted by SAXESS - Hussayn Dabbous <da...@saxess.com>.
I'm not familiar with the BLOCKS handling. Are you saying
that BLOCKS will allow piping arbitrary content?
That would be a great enhancement for one of my apps,
where I have to deal with a mixture of XML content and
non-XMLIZABLE content ...
I'll have to read this BLOCKS proposal.
Can you give me a pointer?
Hussayn
Niclas Hedhman wrote:
> Didn't read all right now (no time), and maybe you mention it towards the end,
> but shouldn't this be part of the "BLOCKS" handling??
>
> Niclas
>
> On Monday 10 February 2003 04:47, Andreas Hochsteger wrote:
>
>>Hi Cocooners!
>>
>>Sorry for this (very) long proposal below, but I think it's definitely
>>worth a read. If not, at least you can give me some feedback about your
>>opinion ;-)
--
Dr. Hussayn Dabbous
SAXESS Software Design GmbH
Neuenhöfer Allee 125
50935 Köln
Telefon: +49-221-56011-0
Fax: +49-221-56011-20
E-Mail: dabbous@saxess.com
---------------------------------------------------------------------
Re: [PROPOSAL] Cocoon Science Fiction
Posted by Niclas Hedhman <ni...@hedhman.org>.
Didn't read all right now (no time), and maybe you mention it towards the end,
but shouldn't this be part of the "BLOCKS" handling??
Niclas
On Monday 10 February 2003 04:47, Andreas Hochsteger wrote:
> Hi Cocooners!
>
> Sorry for this (very) long proposal below, but I think it's definitely
> worth a read. If not, at least you can give me some feedback about your
> opinion ;-)