Posted to dev@cocoon.apache.org by Sylvain Wallez <sy...@anyware-tech.com> on 2003/08/13 12:02:04 UTC
[RT] Views for readers
Frederic's question about search engine integration led me to
wonder how Cocoon's Lucene integration could transparently index
Word & PDF documents along with XML-produced documents.
There exist some text-extraction libraries for Word & PDF (e.g.
http://www.textmining.org/). Now how can we integrate this as
transparently as possible into Cocoon's search functionality?
The Lucene indexer crawls a website and asks for a particular view
("content") which is used to fill the index. But Word and PDF documents
being binary files, they're handled by a <map:read> statement, which
does not handle views. On the other hand, this use case shows that
having views on binary content may make sense: a "normal" request
just sends back the binary content, while a view can use a text/XML
extraction on these binary files.
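For reference, a minimal sketch of the situation, assuming a "content" view bound to a label. The file names and the stylesheet are made up for illustration. The XML pipeline gives the view a label to branch from, while the reader offers no such hook:

```xml
<map:views>
  <!-- the view the indexer asks for: serializes whatever carries the label -->
  <map:view name="content" from-label="content">
    <map:serialize type="xml"/>
  </map:view>
</map:views>

<map:pipelines>
  <map:pipeline>
    <!-- XML-produced page: the view can branch off at the labelled generator -->
    <map:match pattern="*.html">
      <map:generate src="docs/{1}.xml" label="content"/>
      <map:transform src="stylesheets/doc2html.xsl"/>
      <map:serialize type="html"/>
    </map:match>

    <!-- binary document: no generator, hence no place to put a label -->
    <map:match pattern="*.doc">
      <map:read src="docs/{1}.doc" mime-type="application/msword"/>
    </map:match>
  </map:pipeline>
</map:pipelines>
```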
So the question is: how could views be plugged into readers? I must say
that I don't have an answer, as views contain transformers and a
serializer, but no generator. So how could we express in the sitemap
that a particular view on a reader should "replace" that reader by a
particular generator ? Or should this go through some special readers
that could also act as generators ?
Or maybe these are silly thoughts and we should use a <map:select>
directing to a <map:read> or <map:generate> depending on the view. But
this introduces explicit view management in the pipelines, which doesn't
seem nice to me.
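To make the drawback concrete, here is roughly what that explicit selection might look like. It is only a sketch: it assumes the stock request-parameter selector and that the view is requested through the "cocoon-view" parameter, and the "wordToXml" generator and the stylesheet are hypothetical components that would have to be written.

```xml
<map:match pattern="docs/*.doc">
  <map:select type="request-parameter">
    <map:parameter name="parameter-name" value="cocoon-view"/>
    <!-- the indexer asked for the "content" view: extract the text as XML -->
    <map:when test="content">
      <map:generate type="wordToXml" src="docs/{1}.doc"/>
      <map:transform src="stylesheets/word2content.xsl"/>
      <map:serialize type="xml"/>
    </map:when>
    <!-- normal request: send the binary as-is -->
    <map:otherwise>
      <map:read src="docs/{1}.doc" mime-type="application/msword"/>
    </map:otherwise>
  </map:select>
</map:match>
```

Every binary match would need this block repeated, and the view name ends up hard-coded in the pipeline.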
Any thoughts ?
Sylvain
--
Sylvain Wallez Anyware Technologies
http://www.apache.org/~sylvain http://www.anyware-tech.com
{ XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects }
Orixo, the opensource XML business alliance - http://www.orixo.com
Re: [RT] Views for readers
Posted by Stefano Mazzocchi <st...@apache.org>.
On Thursday, Aug 14, 2003, at 14:22 Europe/Rome, Sylvain Wallez wrote:
> Jeff Turner wrote:
>
> <snip/>
>
>> Isn't the problem there that a <map:read> is a whole little pipeline
>> unto itself? If it were broken into two atomic operations:
>>
>> <map:generate type="binary" src="foo.doc"/>
>> <map:serialize type="binary"/>
>>
>> then we could have a <map:view from-position="first"/> using a
>> content-aware pipeline, and everything would work.
>>
>> I have the feeling that handling non-XML content in Cocoon is Just
>> Wrong, and that <map:read> is just a hack. The fact that it doesn't
>> integrate with Views is a symptom of this. In a theoretically pure
>> world, we'd either make Cocoon an XML-only framework and kill
>> <map:read>, or make Cocoon a generic data pipelining framework
>> capable of handling and transforming binary content.
>>
>> Well it's a RT after all.. ;)
>>
>
> Content-aware and binary pipelines in the same post? Wow! Yes, it's
> definitely a RT ;-P
I am against both content-aware selection and binary pipelines.
I still have to see a need for them that cannot be solved with
machinery already in place or with the newly proposed RequestFactories.
--
Stefano.
Re: [RT] Views for readers
Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Jeff Turner wrote:
<snip/>
>Isn't the problem there that a <map:read> is a whole little pipeline unto itself? If it were broken into two atomic operations:
>
><map:generate type="binary" src="foo.doc"/>
><map:serialize type="binary"/>
>
>then we could have a <map:view from-position="first"/> using a content-aware pipeline, and everything would work.
>
>I have the feeling that handling non-XML content in Cocoon is Just Wrong, and that <map:read> is just a hack. The fact that it doesn't integrate with Views is a symptom of this. In a theoretically pure world, we'd either make Cocoon an XML-only framework and kill <map:read>, or make Cocoon a generic data pipelining framework capable of handling and transforming binary content.
>
>Well it's a RT after all.. ;)
>
Content-aware and binary pipelines in the same post? Wow! Yes, it's
definitely a RT ;-P
Sylvain
Re: [RT] Views for readers
Posted by Nicola Ken Barozzi <ni...@apache.org>.
Sylvain Wallez wrote, On 14/08/2003 14.30:
> Nicola Ken Barozzi wrote:
>
>>
>> Jeff Turner wrote, On 14/08/2003 14.17:
>>
>> ...
>>
>>> Isn't the problem there that a <map:read> is a whole little pipeline
>>> unto
>>> itself? If it were broken into two atomic operations:
>>>
>>> <map:generate type="binary" src="foo.doc"/>
>>> <map:serialize type="binary"/>
>>>
>>> then we could have a <map:view from-position="first"/> using a
>>> content-aware pipeline, and everything would work.
>>
>> Well, why can't the view simply start from a reader?
>>
>> <map:read src="foo.doc"/>
>
> Because a view finishes a partial XML pipeline, meaning it requires a
> generator to be already present...
That's because of how we define a view now ;-)
If we just had pipelines that handle both binary and XML data, the view
would finish a partial pipeline, in this case one starting from binary.
>>> I have the feeling that handling non-XML content in Cocoon is Just
>>> Wrong,
>>> and that <map:read> is just a hack. The fact that it doesn't integrate
>>> with Views is a symptom of this. In a theoretically pure world, we'd
>>> either make Cocoon an XML-only framework and kill <map:read>, or make
>>> Cocoon a generic data pipelining framework capable of handling and
>>> transforming binary content.
>>
>> Well, it can be done easily by allowing more than one reader and by
>> allowing readers in the xml pipeline.
>>
>> Some time back I had proposed the following to be possible (and got
>> touted as the usual FS man)
>>
>> <map:read src="foo1.doc"/>
>> <map:read type="stripstuff"/>
>> <map:read type="otherfilter"/>
>
> Mhhh... I guess "stripstuff" and "otherfilter" are actually
> <map:transform-binary> and not <map:read> as they do have an input. Now
> how do we "close" the pipeline ? Is there a <map:serialize-binary> ?
Since streams are just streams, they don't need to be adapted like XML,
so there is really no notion of Generator or Serializer, only of
filters. So the reader is just a filter: when it sits in the middle of
a pipeline, it is simply given a stream and has to output to a stream.
So there is no need to open, and no need to close.
>> And also:
>>
>> <map:read src="foo1.doc"/>
>> <map:generate src="foo1.doc"/>
>> <map:serialize src="foo1.doc"/>
>> <map:read type="zip"/>
>
>
> Wow! What's the result of this ??
Oops, a bit too quick.
<!-- remove encryption or do other stream preprocessing -->
<map:read type="decrypt" src="foo1.doc"/>
<!-- normal generation but from the previous reader output -->
<map:generate type="doc2xml"/>
<!-- optional transforms -->
<!-- give back html -->
<map:serialize type="html"/>
<!-- zip that result so that it takes less bandwidth -->
<map:read type="zip"/>
>> We can already do this BTW by using the Cocoon protocol, but it's
>> such a hack!
>
> Sounds interesting. Can you elaborate on the hack ?
<map:match pattern="mypage.html">
<map:read src="internal/mypage.html" type="zip"/>
</map:match>
<map:match pattern="internal/mypage.html">
<!-- generate, transform, serialize... -->
</map:match>
BTW, you may be interested in my RT about aspected pipeline
snippets. Basically it would make it possible
to insert pipeline components inside all pipelines using certain rules.
--
Nicola Ken Barozzi nicolaken@apache.org
- verba volant, scripta manent -
(discussions get forgotten, just code remains)
Re: [RT] Views for readers
Posted by Tony Collen <co...@umn.edu>.
Upayavira wrote:
> On 14 Aug 2003 at 15:34, Bertrand Delacretaz wrote:
>
>
>>I find this more understandable (but dunno about implementation):
>>
>><!-- if reader is executed, the rest is not -->
>><map:read src="docs/{1}.doc" unless-view="wordToXml"/>
>><map:generate src="docs/{1}.doc" type="wordToXml"/>
>><map:transform...
>
>
> Simplifying further:
> <map:read src="docs/{1}.doc" view-generator="wordToXml"/>
>
> Surely that'd do it?
this might be better, because what happens when someone comes along doing this:
<map:read src="docs/{1}.doc" unless-view="wordToXml"/>
<map:generate src="docs/{2}.doc" type="wordToXml"/>
....
Then the same request represents two different "sources", which could be either confusing or very
useful. I don't fully understand the implications of everything.
Just tossing my $0.02 in... it's early and I'm tired :)
Tony
Re: [RT] Views for readers
Posted by Upayavira <uv...@upaya.co.uk>.
On 14 Aug 2003 at 15:34, Bertrand Delacretaz wrote:
> I find this more understandable (but dunno about implementation):
>
> <!-- if reader is executed, the rest is not -->
> <map:read src="docs/{1}.doc" unless-view="wordToXml"/>
> <map:generate src="docs/{1}.doc" type="wordToXml"/>
> <map:transform...
Simplifying further:
<map:read src="docs/{1}.doc" view-generator="wordToXml"/>
Surely that'd do it?
Regards, Upayavira
Re: [RT] Views for readers
Posted by Stefano Mazzocchi <st...@apache.org>.
On Thursday, Aug 14, 2003, at 19:07 Europe/Rome, Miles Elam wrote:
> Vadim Gritsenko wrote:
>
>> Here is another wild (or not?) thought.
>
>
> Not so wild to me.
>
>> All this discussion comes down to the requirement of generating some
>> XML out of the content usually served by the reader, if that's
>> possible (and it is possible for some of the types of the content),
>> in order to feed this XMLized content into the view. This generated
>> XML is somewhat "equivalent" to the binary representation for the
>> purpose of view building. So, I'm coming to the conclusion that some
>> types of readers can be paired with the generator producing
>> "equivalent", but XMLized, content. The best place to indicate such
>> pairing is the time when you declare a reader:
>
>
> <snip idea="interesting"/>
>
> The syntax looks a bit ugly to me, but the idea seems much more sane
> to me.
>
>> PS: Modifying sitemap syntax to allow reader/generator pairs with
>> some "unless" attributes looks awful to me.
>
>
> Complete agreement. One of the reasons for the sitemap (*the*
> reason?) is for the simple and easy management of a site. Some recent
> proposals seem to be pushing in the direction of Apache HTTPd's
> mod_rewrite; A lot of flexibility by adding "just one more construct."
>
> From the mod_rewrite page:
>
> "The great thing about mod_rewrite is it gives you all the
> configurability and flexibility of Sendmail. The downside to
> mod_rewrite is that it gives you all the configurability and
> flexibility of Sendmail."
>
> -- Brian Behlendorf
> Apache Group
>
> "Despite the tons of examples and docs, mod_rewrite is voodoo.
> Damned cool voodoo, but still voodoo."
>
> -- Brian Moore
> bem@news.cmc.net
>
> It'd be a shame if the sitemap became a cousin to mod_rewrite despite
> the cool voodoo.
I can hardly agree more!
>
> - Miles Elam
>
>
> P.S. I shudder to think of what will happen to search index creation
> times when multi-megabyte Word documents and the like are sent down
> the pipe. The parsers, however efficient they may turn out to be,
> will still have to contend with seemingly endless streams of seemingly
> pointless formatting cruft. I'm sure we've all seen 10MB files that
> would be <100K in proper HTML. Ah well...'tis the cost of
> progress, I guess.
Cocoon is not about binary files and should *NOT* touch them. Readers were
implemented as helpers. Multi-views for binary files belong to the
repository level, not to the publishing level!!!
I haven't read all the email left (300 more to go after 5 days offline)
but I strongly hope you haven't implemented this or I'll scream!!!
--
Stefano.
Re: [RT] Views for readers
Posted by Miles Elam <mi...@pcextremist.com>.
Vadim Gritsenko wrote:
> Here is another wild (or not?) thought.
Not so wild to me.
> All this discussion comes down to the requirement of generating some
> XML out of the content usually served by the reader, if that's
> possible (and it is possible for some of the types of the content), in
> order to feed this XMLized content into the view. This generated XML
> is somewhat "equivalent" to the binary representation for the purpose
> of view building. So, I'm coming to the conclusion that some types of
> readers can be paired with the generator producing "equivalent", but
> XMLized, content. The best place to indicate such pairing is the time
> when you declare a reader:
<snip idea="interesting"/>
The syntax looks a bit ugly to me, but the idea seems much more sane to me.
> PS: Modifying sitemap syntax to allow reader/generator pairs with some
> "unless" attributes looks awful to me.
Complete agreement. One of the reasons for the sitemap (*the* reason?)
is for the simple and easy management of a site. Some recent proposals
seem to be pushing in the direction of Apache HTTPd's mod_rewrite; A
lot of flexibility by adding "just one more construct."
From the mod_rewrite page:
"The great thing about mod_rewrite is it gives you all the
configurability and flexibility of Sendmail. The downside to
mod_rewrite is that it gives you all the configurability and
flexibility of Sendmail."
-- Brian Behlendorf
Apache Group
"Despite the tons of examples and docs, mod_rewrite is voodoo.
Damned cool voodoo, but still voodoo."
-- Brian Moore
bem@news.cmc.net
It'd be a shame if the sitemap became a cousin to mod_rewrite despite
the cool voodoo.
- Miles Elam
P.S. I shudder to think of what will happen to search index creation
times when multi-megabyte Word documents and the like are sent down the
pipe. The parsers, however efficient they may turn out to be, will
still have to contend with seemingly endless streams of seemingly
pointless formatting cruft. I'm sure we've all seen 10MB files that
would be <100K in proper HTML. Ah well...'tis the cost of
progress, I guess.
Re: [RT] Views for readers
Posted by Vadim Gritsenko <va...@verizon.net>.
Sylvain Wallez wrote:
> Vadim Gritsenko wrote:
>
>> Sylvain Wallez wrote:
>>
>>> Vadim Gritsenko wrote:
>>>
>>>> Sylvain Wallez wrote:
>>>>
>>>>> Vadim Gritsenko wrote:
>>>>
>>>>
>>>>
>>
>> <snip/>
>>
>>>> Here is another wild (or not?) thought.
>>>>
>>>> All this discussion comes down to the requirement of generating
>>>> some XML out of the content usually served by the reader, if that's
>>>> possible (and it is possible for some of the types of the content),
>>>> in order to feed this XMLized content into the view. This generated
>>>> XML is somewhat "equivalent" to the binary representation for the
>>>> purpose of view building. So, I'm coming to the conclusion that some
>>>> types of readers can be paired with the generator producing
>>>> "equivalent", but XMLized, content. The best place to indicate such
>>>> pairing is the time when you declare a reader:
>>>>
>>>> <map:readers default="resource">
>>>> <map:reader name="resource"
>>>> src="org.apache.cocoon.reading.ResourceReader"/>
>>>> <map:reader name="html"
>>>> src="org.apache.cocoon.reading.ResourceReader">
>>>>
>>>> <generator-paired-to-this-reader>html</generator-paired-to-this-reader>
>>>>
>>>> </map:reader>
>>>> <map:reader name="msexcel"
>>>> src="org.apache.cocoon.reading.ResourceReader">
>>>>
>>>> <generator-paired-to-this-reader>poi-excel-generator</generator-paired-to-this-reader>
>>>>
>>>> </map:reader>
>>>> <map:reader name="pdf"
>>>> src="org.apache.cocoon.reading.ResourceReader">
>>>>
>>>> <generator-paired-to-this-reader>pdf-text-extractor-generator</generator-paired-to-this-reader>
>>>>
>>>> </map:reader>
>>>> </map:readers>
>>>
>>>
>>>
>>>
>>>
>>> I'm afraid this won't work :
>>
>>
>>
>>
>> Can you suggest some improvements so it does work? My goal is to have
>> as little impact on sitemap syntax as possible.
>>
>>
>>> - a generator specific to a given content-type is very unlikely to
>>> produce the document type expected by the view. We will most often
>>> need an additional transformation (e.g. the "xword2xdoc.xsl" that
>>> was in my example)
>>
>>
>>
>>
>> More wild suggestions.
>>
>> 1/ Do something with the views. Say, allow duplicate view names and
>> make them work as selector:
>>
>> <map:views>
>> <!-- works if ("when") reader -->
>> <map:view from-position="reader" name="content">
>> <map:transform src="wordml2content.xsl" label="content"/>
>> <map:serialize type="xml"/>
>> </map:view>
>> <!-- works if ("when") label -->
>> <map:view from-label="content" name="content">
>> <map:serialize type="xml"/>
>> </map:view>
>> <!-- works if no label ("otherwise") -->
>> <map:view from-position="first" name="content">
>> <map:serialize type="xml"/>
>> </map:view>
>> </map:views>
>
>
>
> Still the same problem I am desperately pointing out again and again: how
> can the from-position="reader" use different generators (i.e. parsers)
> depending on the binary content ?
I did not copy reader-to-generator association
(<generator-paired-to-this-reader/>) declared on top. Get the generator
from there.
>> 2/ Do something with the readers.
>
...
> This introduces sitemap snippets into a component manager
> configuration, which is not good at all.
Yep. Not good.
>> 3/ Alternative to 2:
>>
>> <map:readers default="resource">
>> <map:reader name="msword"
>> src="org.apache.cocoon.reading.ResourceReader">
>> <xmlizer-uri>cocoon://word-2-content/</xmlizer-uri>
>> </map:reader>
>> </map:readers>
>>
>> <map:views>
>> <map:view from-label="content" name="content">
>> <map:serialize type="xml"/>
>> </map:view>
>> </map:views>
>>
>> <map:pipelines>
>> ...
>> <map:read src="my.doc"/>
>> ...
>> <map:match pattern="word-2-content/*">
>> <map:generate type="msword" src="{1}"/>
>> <map:transform src="wordml2content.xsl" label="content"/>
>> <map:serialize type="xml"/>
>> </map:match>
>> </map:pipelines>
>
>
>
> Sounds better, but has the problem that it implies that every view
> should return xml content on "my.doc".
Yep. Unless you define one xmlizer URI per view... Awful!
> Or do we introduce a "label" attribute on <map:read> to define on
> which particular view the xmlizer-uri should be triggered ?
Possible.
>> I would not say that I like any of the suggestions above. The
>> cleanest way ATM is the usage of map:resource I suggested in other
>> email (I have yet to see your comment on it).
>
>
>
> Sorry, I have no particular comment on the use of resources, as it's
> mainly a refactoring of the action/matcher proposals.
But it solves the problem! And it is the cleanest solution (with minimal
impact) among all those discussed here.
>>> - views, through their associated labels, can be plugged at any
>>> point of the pipelines. Defining pair generators restricts views to
>>> be only from-label="start".
>>>
>>>> PS: Modifying sitemap syntax to allow reader/generator pairs with
>>>> some "unless" attributes looks awful to me.
>>>
>>>
>>>
>>> Doesn't seem so awful to me, since the reader should be executed
>>> "unless" certain conditions are met, which are that the specified
>>> label(s) correspond to the one at which the requested view should
>>> start.
>>
>>
>>
>> This "unless" attribute is nothing else than shortcut for
>> <map:match>. Given point on verbosity and given the obfuscated
>> result, I'm for verbosity.
>
>
>
> Not exactly: you can currently match on the view name (provided that
> the environment actually does rely on the "cocoon-view" parameter),
(Special "view" matcher is still possible)
> but you cannot match on the labels. And only labels are currently used
> in the <map:pipelines> section.
I don't understand this. What is "match on the labels" in this context?
>> PS Keep sitemap syntax clean! Say "No!" to woodo!
>
Should be "voodoo" above
> Funny. That's often me that says "too much magic kills the confidence".
Now it's my turn :)
> Let's stop this discussion for now. I have the feeling we won't reach
> consensus and will just come to some useless flame war.
I don't see an elegant solution to the reader/view problem right now.
And we always can make another flamefest later (are you planning a visit
to US? :)
Vadim
Re: [RT] Views for readers
Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Vadim Gritsenko wrote:
> Sylvain Wallez wrote:
>
>> Vadim Gritsenko wrote:
>>
>>> Sylvain Wallez wrote:
>>>
>>>> Vadim Gritsenko wrote:
>>>
>>>
>
> <snip/>
>
>>> Here is another wild (or not?) thought.
>>>
>>> All this discussion comes down to the requirement of generating some
>>> XML out of the content usually served by the reader, if that's
>>> possible (and it is possible for some of the types of the content),
>>> in order to feed this XMLized content into the view. This generated
>>> XML is somewhat "equivalent" to the binary representation for the
>>> purpose of view building. So, I'm coming to the conclusion that some
>>> types of readers can be paired with the generator producing
>>> "equivalent", but XMLized, content. The best place to indicate such
>>> pairing is the time when you declare a reader:
>>>
>>> <map:readers default="resource">
>>> <map:reader name="resource"
>>> src="org.apache.cocoon.reading.ResourceReader"/>
>>> <map:reader name="html"
>>> src="org.apache.cocoon.reading.ResourceReader">
>>>
>>> <generator-paired-to-this-reader>html</generator-paired-to-this-reader>
>>> </map:reader>
>>> <map:reader name="msexcel"
>>> src="org.apache.cocoon.reading.ResourceReader">
>>>
>>> <generator-paired-to-this-reader>poi-excel-generator</generator-paired-to-this-reader>
>>>
>>> </map:reader>
>>> <map:reader name="pdf"
>>> src="org.apache.cocoon.reading.ResourceReader">
>>>
>>> <generator-paired-to-this-reader>pdf-text-extractor-generator</generator-paired-to-this-reader>
>>>
>>> </map:reader>
>>> </map:readers>
>>
>>
>>
>>
>> I'm afraid this won't work :
>
>
>
> Can you suggest some improvements so it does work? My goal is to have
> as little impact on sitemap syntax as possible.
>
>
>> - a generator specific to a given content-type is very unlikely to
>> produce the document type expected by the view. We will most often
>> need an additional transformation (e.g. the "xword2xdoc.xsl" that was
>> in my example)
>
>
>
> More wild suggestions.
>
> 1/ Do something with the views. Say, allow duplicate view names and
> make them work as selector:
>
> <map:views>
> <!-- works if ("when") reader -->
> <map:view from-position="reader" name="content">
> <map:transform src="wordml2content.xsl" label="content"/>
> <map:serialize type="xml"/>
> </map:view>
> <!-- works if ("when") label -->
> <map:view from-label="content" name="content">
> <map:serialize type="xml"/>
> </map:view>
> <!-- works if no label ("otherwise") -->
> <map:view from-position="first" name="content">
> <map:serialize type="xml"/>
> </map:view>
> </map:views>
Still the same problem I am desperately pointing out again and again: how
can the from-position="reader" use different generators (i.e. parsers)
depending on the binary content ?
> 2/ Do something with the readers.
>
> <map:readers default="resource">
> <map:reader name="msword"
> src="org.apache.cocoon.reading.ResourceReader">
> <map:generate type="msword"/>
> <map:transform src="wordml2content.xsl"/>
> </map:reader>
> </map:readers>
This introduces sitemap snippets into a component manager configuration,
which is not good at all.
> 3/ Alternative to 2:
>
> <map:readers default="resource">
> <map:reader name="msword"
> src="org.apache.cocoon.reading.ResourceReader">
> <xmlizer-uri>cocoon://word-2-content/</xmlizer-uri>
> </map:reader>
> </map:readers>
>
> <map:views>
> <map:view from-label="content" name="content">
> <map:serialize type="xml"/>
> </map:view>
> </map:views>
>
> <map:pipelines>
> ...
> <map:read src="my.doc"/>
> ...
> <map:match pattern="word-2-content/*">
> <map:generate type="msword" src="{1}"/>
> <map:transform src="wordml2content.xsl" label="content"/>
> <map:serialize type="xml"/>
> </map:match>
> </map:pipelines>
Sounds better, but has the problem that it implies that every view
should return xml content on "my.doc". Or do we introduce a "label"
attribute on <map:read> to define on which particular view the
xmlizer-uri should be triggered ?
> I would not say that I like any of the suggestions above. The cleanest
> way ATM is the usage of map:resource I suggested in other email (I have
> yet to see your comment on it).
Sorry, I have no particular comment on the use of resources, as it's
mainly a refactoring of the action/matcher proposals.
>> - views, through their associated labels, can be plugged at any point
>> of the pipelines. Defining pair generators restricts views to be only
>> from-label="start".
>>
>>> PS: Modifying sitemap syntax to allow reader/generator pairs with
>>> some "unless" attributes looks awful to me.
>>
>>
>> Doesn't seem so awful to me, since the reader should be executed
>> "unless" certain conditions are met, which are that the specified
>> label(s) correspond to the one at which the requested view should start.
>
>
> This "unless" attribute is nothing else than shortcut for <map:match>.
> Given point on verbosity and given the obfuscated result, I'm for
> verbosity.
Not exactly: you can currently match on the view name (provided that the
environment actually does rely on the "cocoon-view" parameter), but you
cannot match on the labels. And only labels are currently used in the
<map:pipelines> section.
> PS Keep sitemap syntax clean! Say "No!" to woodo!
Funny. That's often me that says "too much magic kills the confidence".
Let's stop this discussion for now. I have the feeling we won't reach
consensus and will just come to some useless flame war.
Sylvain
Re: [RT] Views for readers
Posted by Stefano Mazzocchi <st...@apache.org>.
On Thursday, Aug 14, 2003, at 21:10 Europe/Rome, Vadim Gritsenko wrote:
> PS Keep sitemap syntax clean! Say "No!" to woodo!
Amen!
--
Stefano.
Re: [RT] Views for readers
Posted by Vadim Gritsenko <va...@verizon.net>.
Sylvain Wallez wrote:
> Vadim Gritsenko wrote:
>
>> Sylvain Wallez wrote:
>>
>>> Vadim Gritsenko wrote:
>>
<snip/>
>> Here is another wild (or not?) thought.
>>
>> All this discussion comes down to the requirement of generating some
>> XML out of the content usually served by the reader, if that's
>> possible (and it is possible for some of the types of the content),
>> in order to feed this XMLized content into the view. This generated
>> XML is somewhat "equivalent" to the binary representation for the
>> purpose of view building. So, I'm coming to the conclusion that some
>> types of readers can be paired with the generator producing
>> "equivalent", but XMLized, content. The best place to indicate such
>> pairing is the time when you declare a reader:
>>
>> <map:readers default="resource">
>> <map:reader name="resource"
>> src="org.apache.cocoon.reading.ResourceReader"/>
>> <map:reader name="html"
>> src="org.apache.cocoon.reading.ResourceReader">
>>
>> <generator-paired-to-this-reader>html</generator-paired-to-this-reader>
>> </map:reader>
>> <map:reader name="msexcel"
>> src="org.apache.cocoon.reading.ResourceReader">
>>
>> <generator-paired-to-this-reader>poi-excel-generator</generator-paired-to-this-reader>
>>
>> </map:reader>
>> <map:reader name="pdf"
>> src="org.apache.cocoon.reading.ResourceReader">
>>
>> <generator-paired-to-this-reader>pdf-text-extractor-generator</generator-paired-to-this-reader>
>>
>> </map:reader>
>> </map:readers>
>
>
>
> I'm afraid this won't work :
Can you suggest some improvements so it does work? My goal is to have as
little impact on sitemap syntax as possible.
> - a generator specific to a given content-type is very unlikely to
> produce the document type expected by the view. We will most often
> need an additional transformation (e.g. the "xword2xdoc.xsl" that was
> in my example)
More wild suggestions.
1/ Do something with the views. Say, allow duplicate view names and make
them work as selector:
<map:views>
<!-- works if ("when") reader -->
<map:view from-position="reader" name="content">
<map:transform src="wordml2content.xsl" label="content"/>
<map:serialize type="xml"/>
</map:view>
<!-- works if ("when") label -->
<map:view from-label="content" name="content">
<map:serialize type="xml"/>
</map:view>
<!-- works if no label ("otherwise") -->
<map:view from-position="first" name="content">
<map:serialize type="xml"/>
</map:view>
</map:views>
2/ Do something with the readers.
<map:readers default="resource">
<map:reader name="msword"
src="org.apache.cocoon.reading.ResourceReader">
<map:generate type="msword"/>
<map:transform src="wordml2content.xsl"/>
</map:reader>
</map:readers>
3/ Alternative to 2:
<map:readers default="resource">
<map:reader name="msword"
src="org.apache.cocoon.reading.ResourceReader">
<xmlizer-uri>cocoon://word-2-content/</xmlizer-uri>
</map:reader>
</map:readers>
<map:views>
<map:view from-label="content" name="content">
<map:serialize type="xml"/>
</map:view>
</map:views>
<map:pipelines>
...
<map:read src="my.doc"/>
...
<map:match pattern="word-2-content/*">
<map:generate type="msword" src="{1}"/>
<map:transform src="wordml2content.xsl" label="content"/>
<map:serialize type="xml"/>
</map:match>
</map:pipelines>
I would not say that I like any of the suggestions above. The cleanest
way ATM is the usage of map:resource I suggested in other email (I have
yet to see your comment on it).
> - views, through their associated labels, can be plugged at any point
> of the pipelines. Defining pair generators restricts views to be only
> from-label="start".
>
>> PS: Modifying sitemap syntax to allow reader/generator pairs with
>> some "unless" attributes looks awful to me.
>
>
>
> Doesn't seem so awful to me, since the reader should be executed
> "unless" certain conditions are met, which are that the specified
> label(s) correspond to the one at which the requested view should start.
This "unless" attribute is nothing else than shortcut for <map:match>.
Given point on verbosity and given the obfuscated result, I'm for verbosity.
PS Keep sitemap syntax clean! Say "No!" to woodo!
Vadim
Re: [RT] Views for readers
Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Vadim Gritsenko wrote:
> Sylvain Wallez wrote:
>
>> Vadim Gritsenko wrote:
>>
>>> Sylvain Wallez wrote:
>>
> <snip/>
>
>>>> Any other proposal or opinion on this subject before we start a vote ?
>>>
>>>
>>> Can't you just enable generators in map:view in case when view
>>> starts with reader?
>>
>>
>> No, since views "capture" the (XML) output at certain points of the
>> pipeline to provide a different formatting.
>
>
> In case of the reader, there is no (XML) output in the pipeline. It's
> special case, unless you want to introduce binary pipelines (and I
> hope you don't want to), so it would require special handling.
>
>> E.g. the processing for the "indexable-content" view
>
>
> Sidenote: It's called "content" -- the view which you use to build a
> site search index.
Picky sidenote : this is configurable using the <content-view-query>
config of the <lucene-xml-indexer> component ;-)
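For those who haven't looked, it would be something along these lines in cocoon.xconf. This is a sketch from memory, so treat the logger attribute and the exact placement as assumptions and check them against your version. The value is the query string the indexer appends to each crawled URI:

```xml
<lucene-xml-indexer logger="core.search.indexer">
  <!-- query string used to fetch the indexable view of each page -->
  <content-view-query>cocoon-view=content</content-view-query>
</lucene-xml-indexer>
```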
>> is the same for all URIs, be they XML pipelines or a single reader.
>>
>> So there's no way other than having a generator _before_ jumping to
>> the view, feeding that view with the kind of XML content it expects.
>
>
> Here is another wild (or not?) thought.
>
> All this discussion comes down to the requirement of generating some
> XML out of the content usually served by the reader, if that's
> possible (and it is possible for some of the types of the content), in
> order to feed this XMLized content into the view. This generated XML
> is somewhat "equivalent" to the binary representation for the purpose
> of view building. So, I'm coming to the conclusion that some types of
> readers can be paired with the generator producing "equivalent", but
> XMLized, content. The best place to indicate such pairing is the time
> when you declare a reader:
>
> <map:readers default="resource">
> <map:reader name="resource"
> src="org.apache.cocoon.reading.ResourceReader"/>
> <map:reader name="html"
> src="org.apache.cocoon.reading.ResourceReader">
>
> <generator-paired-to-this-reader>html</generator-paired-to-this-reader>
> </map:reader>
> <map:reader name="msexcel"
> src="org.apache.cocoon.reading.ResourceReader">
>
> <generator-paired-to-this-reader>poi-excel-generator</generator-paired-to-this-reader>
>
> </map:reader>
> <map:reader name="pdf" src="org.apache.cocoon.reading.ResourceReader">
>
> <generator-paired-to-this-reader>pdf-text-extractor-generator</generator-paired-to-this-reader>
>
> </map:reader>
> </map:readers>
I'm afraid this won't work :
- a generator specific to a given content-type is very unlikely to
produce the document type expected by the view. We will most often need
an additional transformation (e.g. the "xword2xdoc.xsl" that was in my
example)
- views, through their associated labels, can be plugged in at any point of
the pipelines. Defining paired generators restricts views to
from-label="start" only.
> PS: Modifying sitemap syntax to allow reader/generator pairs with some
> "unless" attributes looks awful to me.
Doesn't seem so awful to me, since the reader should be executed
"unless" certain conditions are met, which are that the specified
label(s) correspond to the one at which the requested view should start.
Sylvain
Re: [RT] Views for readers
Posted by Vadim Gritsenko <va...@verizon.net>.
Sylvain Wallez wrote:
> Vadim Gritsenko wrote:
>
>> Sylvain Wallez wrote:
>
<snip/>
>>> Any other proposal or opinion on this subject before we start a vote ?
>>
>>
>>
>> Can't you just enable generators in map:view in case when view starts
>> with reader?
>
>
>
> No, since views "capture" the (XML) output at certain points of the
> pipeline to provide a different formatting.
In the case of the reader, there is no (XML) output in the pipeline. It's a
special case, unless you want to introduce binary pipelines (and I hope
you don't want to), so it would require special handling.
> E.g. the processing for the "indexable-content" view
Sidenote: It's called "content" -- the view which you use to build a
site search index.
> is the same for all URIs, be them XML pipelines or a single reader.
>
> So there's no way other than having a generator _before_ jumping to
> the view, feeding that view with the kind of XML content it expects.
Here is another wild (or not?) thought.
All this discussion comes down to the requirement of generating some XML
out of the content usually served by the reader, if that's possible (and
it is possible for some of the types of the content), in order to feed
this XMLized content into the view. This generated XML is somewhat
"equivalent" to the binary representation for the purpose of view
building. So, I'm coming to the conclusion that some types of readers can
be paired with a generator producing "equivalent", but XMLized,
content. The best place to indicate such a pairing is where you
declare the reader:
<map:readers default="resource">
  <map:reader name="resource"
              src="org.apache.cocoon.reading.ResourceReader"/>
  <map:reader name="html" src="org.apache.cocoon.reading.ResourceReader">
    <generator-paired-to-this-reader>html</generator-paired-to-this-reader>
  </map:reader>
  <map:reader name="msexcel"
              src="org.apache.cocoon.reading.ResourceReader">
    <generator-paired-to-this-reader>poi-excel-generator</generator-paired-to-this-reader>
  </map:reader>
  <map:reader name="pdf" src="org.apache.cocoon.reading.ResourceReader">
    <generator-paired-to-this-reader>pdf-text-extractor-generator</generator-paired-to-this-reader>
  </map:reader>
</map:readers>
PS: Modifying sitemap syntax to allow reader/generator pairs with some
"unless" attributes looks awful to me.
Vadim
Re: [RT] Views for readers
Posted by Vadim Gritsenko <va...@verizon.net>.
Miles Elam wrote:
> Sylvain Wallez wrote:
>
>> Go back to first post of this thread, where (last paragraph) I
>> proposed something similar. The whole discussion is about how we
>> could have a syntax which doesn't introduce such verbosity in the
>> sitemap.
>
>
>
> Verbosity is not necessarily a bad thing. If it were, would any of us
> be using XML? ;-)
Good point.
<snip/>
>> Let's consider the MIDI example. Suppose we have a large collection
>> of karaoke files (MIDI supports embedded text that can be played on
>> screen while playing the music), and we want to index the text of
>> these songs for easy retrieval (along with some other meta-data).
>>
>> Here's a sitemap example, using the current syntax
>
<snip/>
>> And the proposed shorter one :
>>
>> <map:match pattern="*.mid">
>> <map:read src="{1}.mid" unless-label="content"/>
>> <map:generate type="midi" src="{1}.mid"/>
>> <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
>> <!-- should never come here -->
>> <map:serialize type="xml"/>
>> </map:match>
>
Two lines. What does it give except obfuscation? Given the point above
("Verbosity is not necessarily a bad thing" (c) Miles Elam), a more
readable and already supported syntax is:
<map:resource name="midi">
  <map:match type="view" pattern="content">
    <map:generate type="midi" src="{1}.mid"/>
    <map:transform src="xmidi2xdoc.xsl" label="content"/>
    <map:serialize type="xml"/>
  </map:match>
  <map:read mime-type="whatever/midi" src="{1}.mid"/>
</map:resource>
<map:match pattern="*.mid">
  <map:call resource="midi"/>
</map:match>
Moreover, the resource "midi" is reusable:
<map:match pattern="another/*.mid">
  <map:call resource="midi"/>
</map:match>
while the shorter example above is not.
> This breaks current convention that either a reader or a
> generator/transformer/serializer can act in a pipeline.
And, given this resource example, it does not break any sitemap
semantics which we have today.
> In the first example, if "content" isn't specified, the action returns
> null and the reader is invoked; As far as the pipeline logic is
> concerned, there is only the reader. Serializers are already known as
> universal exit points. To use the second, the convention must be
> broken and readers must become universal exit points.
>
> In other words,
>
> <map:match pattern="*.mid">
> <map:read src="{1}.mid"/> <!-- without the unless-label -->
> <map:generate type="midi" src="{1}.mid"/>
> <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
> <!-- should never come here -->
> <map:serialize type="xml"/>
> </map:match>
>
> must become valid for consistency. A reader becomes an exit point and
> the rest of a pipeline is, by default, ignored. Is this an intended
> consequence?
I feel strongly "-1" on this one.
Vadim
Re: [RT] Views for readers
Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Miles Elam wrote:
>
>
>
> Not according to the code, they're not. Check out
> AbstractProcessingPipeline.java. There are method bodies like:
>
> public void setGenerator (String role, String source, Parameters
> param, Parameters hintParam)
> throws ProcessingException {
> if (this.generator != null) {
> throw new ProcessingException ("Generator already set. You
> can only select one Generator (" + role + ")");
> }
> if (this.reader != null) {
> throw new ProcessingException ("Reader already set. You
> cannot use a reader and a generator for one pipeline.");
> }
> ...
>
> and
>
> public void setReader (String role, String source, Parameters
> param, String mimeType)
> throws ProcessingException {
> if (this.reader != null) {
> throw new ProcessingException ("Reader already set. You can
> only select one Reader (" + role + ")");
> }
> if (this.generator != null) {
> throw new ProcessingException ("Generator already set. You
> cannot use a reader and a generator for one pipeline.");
> }
> ...
>
>
> Either the policy was in effect when this file (and its subclasses)
> were made or someone put constraining statements in that serve no
> purpose. The file was last modified on August 6th of this year. If
> the policy has changed, no one told the code.
This has been there for a very long time. And has nothing to do with the
fact that readers and serializers end the execution of the sitemap :
check ReadNode and SerializeNode in o.a.c.components.treeprocessor.sitemap.
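The two mechanisms being contrasted here can be told apart with a small model. The sketch below is a toy, not Cocoon code: the names SitemapExitSketch, ToyPipeline, Statement and execute are made up for illustration, and it assumes, as described above, that a read or serialize statement ends sitemap execution while the pipeline separately refuses to hold both a reader and a generator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SitemapExitSketch {

    /** Toy pipeline (hypothetical): holds a generator or a reader, never both. */
    static final class ToyPipeline {
        private String generator;
        private String reader;

        void setGenerator(String name) {
            if (generator != null || reader != null) {
                throw new IllegalStateException("Generator or reader already set");
            }
            generator = name;
        }

        void setReader(String name) {
            if (reader != null || generator != null) {
                throw new IllegalStateException("Reader or generator already set");
            }
            reader = name;
        }
    }

    /** Toy sitemap statement (hypothetical): kind is read, generate, transform or serialize. */
    static final class Statement {
        final String kind;
        final String name;

        Statement(String kind, String name) {
            this.kind = kind;
            this.name = name;
        }
    }

    /** Runs statements in order and returns the kinds that were actually executed. */
    static List<String> execute(List<Statement> statements) {
        ToyPipeline pipeline = new ToyPipeline();
        List<String> executed = new ArrayList<>();
        for (Statement s : statements) {
            executed.add(s.kind);
            if (s.kind.equals("generate")) {
                pipeline.setGenerator(s.name);
            } else if (s.kind.equals("read")) {
                pipeline.setReader(s.name);
                return executed; // assumed: a reader ends sitemap execution
            } else if (s.kind.equals("serialize")) {
                return executed; // assumed: a serializer ends sitemap execution
            }
        }
        return executed;
    }

    public static void main(String[] args) {
        // Reader first: the statements after it are never visited.
        List<Statement> readFirst = Arrays.asList(
                new Statement("read", "song.mid"),
                new Statement("generate", "midi"),
                new Statement("transform", "xmidi2xdoc.xsl"),
                new Statement("serialize", "xml"));
        System.out.println(execute(readFirst));

        // Generator then reader: both are reached, so the pipeline check fires.
        List<Statement> generateThenRead = Arrays.asList(
                new Statement("generate", "midi"),
                new Statement("read", "song.mid"));
        try {
            execute(generateThenRead);
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In this model the exit rule and the "reader or generator, not both" rule are independent: the second only triggers when both kinds of statement are actually reached in one run.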
Sylvain
Re: [RT] Views for readers
Posted by Miles Elam <mi...@pcextremist.com>.
Sylvain Wallez wrote:
> Miles Elam wrote:
>
>> In other words, the pipeline is full of side effects and dependent
>> upon things happening behind the curtain (to use a Wizard of Oz
>> reference). You'd be right in that it adds to the confusion. I
>> agree with Vadim. This is obfuscation in exchange for two lines of
>> verboseness.
>
>
> Just some additional clarifications, "mon frère" !
I hope it wasn't taken the wrong way. I did not intend any offense.
> Yes, the pipeline is full of side effects, which can break pipelines
> at any point and continue somewhere else without this being explicitly
> visible in the pipeline construction statements.
>
> These side effects are called "views", and the way to define views is
> through labels.
Don't get me wrong. I see clearly the reason why views exist. I see
clearly why reader views are wanted. When working with XML data -- not
just text, but structured text -- getting at that data before it is
processed into a presentation format (such as viewing source, getting a
true content view, etc.) can prove invaluable.
> And even worse : labels can be placed on component definitions,
> meaning a clean pipeline with no label attribute at all is full of
> these side effects.
>
> So what you call obfuscation has been there *for years*. And
> everybody's happy with it.
When grabbing from the presentation format as a source, you are
comparing apples and oranges. Not only are there innumerable binary
formats out there being squeezed into a few reader implementations, but
they are not all desirable data. While you may want the data from a PDF
file, you may not bother with a PNG image because it may index "Created
with The Gimp" over and over.
Since putting in all binary format-to-generator mapping info seems out
of the question, all of the pipeline path must be specified in the
matcher -- hence the discussion surrounding readers and generators in
the same matcher. If everything is specified in the same matcher and
not truly orthogonal, as is the case for views currently, why add the
extra syntax for what amounts to a non-orthogonal if-else clause?
if (!content-view)
    read
else
    generate
    transform
    serialize
as opposed to
generate
    +---------- view-short-circuit! --+-> transform-x
transform-1                           +-> serialize
transform-2
serialize
There is a discontinuity there that makes me uncomfortable. This is not
an overt attachment to symmetry. This is seeing the same tool applied
to two (in my opinion) very different tasks. I am not a committer and
can't vote. But these are my thoughts on the matter. Take with as many
grains of salt as are necessary.
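The contrast between the two flows can be written out as a toy model. This is only a sketch: ViewFlowSketch and its methods are hypothetical names, the step names are those of the diagrams above, and it assumes the label sits on the generate step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ViewFlowSketch {

    /** First flow: an explicit if/else chooses between a reader and an XML pipeline. */
    static List<String> explicitBranch(boolean contentView) {
        if (!contentView) {
            return Arrays.asList("read");
        }
        return Arrays.asList("generate", "transform", "serialize");
    }

    /**
     * Second flow: a single pipeline. When a view is requested, processing
     * leaves the pipeline after the labelled step and continues with the
     * view's own steps.
     */
    static List<String> labelShortCircuit(List<String> pipeline, String labelledStep,
                                          List<String> viewSteps, boolean viewRequested) {
        List<String> executed = new ArrayList<>();
        for (String step : pipeline) {
            executed.add(step);
            if (viewRequested && step.equals(labelledStep)) {
                executed.addAll(viewSteps); // jump into the view
                return executed;
            }
        }
        return executed;
    }

    public static void main(String[] args) {
        List<String> pipeline =
                Arrays.asList("generate", "transform-1", "transform-2", "serialize");
        List<String> view = Arrays.asList("transform-x", "serialize");

        System.out.println(explicitBranch(false));
        System.out.println(explicitBranch(true));
        System.out.println(labelShortCircuit(pipeline, "generate", view, false));
        System.out.println(labelShortCircuit(pipeline, "generate", view, true));
    }
}
```

In the first flow the two branches share no steps at all, whereas in the second the view reuses everything up to the label, which is the discontinuity described above.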
- Miles Elam
Re: [RT] Views for readers
Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Miles Elam wrote:
> In other words, the pipeline is full of side effects and dependent
> upon things happening behind the curtain (to use a Wizard of Oz
> reference). You'd be right in that it adds to the confusion. I agree
> with Vadim. This is obfuscation in exchange for two lines of
> verboseness.
Just some additional clarifications, "mon frère" !
Yes, the pipeline is full of side effects, which can break pipelines at
any point and continue somewhere else without this being explicitly
visible in the pipeline construction statements.
These side effects are called "views", and the way to define views is
through labels.
And even worse : labels can be placed on component definitions, meaning
a clean pipeline with no label attribute at all is full of these side
effects.
So what you call obfuscation has been there *for years*. And everybody's
happy with it.
Sylvain
Re: [RT] Views for readers
Posted by Miles Elam <mi...@pcextremist.com>.
Sylvain Wallez wrote:
>> The functionality for all readers would obviously be the same: move
>> these bytes from here to there. But yes, the codified mapping I
>> think is important.
>
>
> Please read carefully : I wrote *generators* !! This isn't about
> moving bytes, but about producing an XML document.
Au contraire, mon frère, this is implemented with generators but it is
about pulling searchable info out of arbitrary binary data. The first
step to that goal is to standardize it -- therefore generators are
added. The issue is about *readers* and the custom formats they
encompass not being indexable.
>> You're mixing the <map:act> with a </map:match>, but I get the idea.
>
>
> Picky guy, eh ?
You know it. :)
> Readers already are universal exit points : once you encounter a
> reader, sitemap processing is terminated. <map:read> and
> <map:serialize> are like a "return" statement in Java.
Not according to the code, they're not. Check out
AbstractProcessingPipeline.java. There are method bodies like:
public void setGenerator (String role, String source, Parameters param,
                          Parameters hintParam)
throws ProcessingException {
    if (this.generator != null) {
        throw new ProcessingException ("Generator already set. You can only select one Generator (" + role + ")");
    }
    if (this.reader != null) {
        throw new ProcessingException ("Reader already set. You cannot use a reader and a generator for one pipeline.");
    }
    ...
and
public void setReader (String role, String source, Parameters param,
                       String mimeType)
throws ProcessingException {
    if (this.reader != null) {
        throw new ProcessingException ("Reader already set. You can only select one Reader (" + role + ")");
    }
    if (this.generator != null) {
        throw new ProcessingException ("Generator already set. You cannot use a reader and a generator for one pipeline.");
    }
    ...
Either the policy was in effect when this file (and its subclasses) were
made or someone put constraining statements in that serve no purpose.
The file was last modified on August 6th of this year. If the policy
has changed, no one told the code.
> No consequence : this is how the sitemap works today, and the above is
> valid, even if we can consider that the sitemap engine should be more
> strict and signal that there's some unreachable code.
I can't speak to validity, but this is NOT how it works today.
> To add more to the confusion, in both your and my example, we can even
> avoid writing the <map:serialize> statement. Since some additional
> filtering occurs beforehand (either through the action or through
> reader labels), this statement is never reached and is useless !
In other words, the pipeline is full of side effects and dependent upon
things happening behind the curtain (to use a Wizard of Oz reference).
You'd be right in that it adds to the confusion. I agree with Vadim.
This is obfuscation in exchange for two lines of verboseness.
- Miles Elam
Re: [RT] Views for readers
Posted by Vadim Gritsenko <va...@verizon.net>.
Sylvain Wallez wrote:
<snip/>
>> In other words,
>>
>> <map:match pattern="*.mid">
>> <map:read src="{1}.mid"/> <!-- without the unless-label -->
>> <map:generate type="midi" src="{1}.mid"/>
>> <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
>> <!-- should never come here -->
>> <map:serialize type="xml"/>
>> </map:match>
>>
>> must become valid for consistency. A reader becomes an exit point
>> and the rest of a pipeline is, by default, ignored. Is this an
>> intended consequence?
>
>
>
> No consequence : this is how the sitemap works today, and the above is
> valid,
No, that's not valid today. And if the current sitemap implementation does
not pass the conformance test, that does not mean that invalid
syntax has become valid. It just means that the current sitemap
implementation is not conformant.
PS The absence of an official conformance test suite does not make the point
above invalid. Here is an attempt at such a test:
http://cvs.apache.org/viewcvs.cgi/cocoon-2.0/src/webapp/mount/lint/sitemap.xmap?rev=1.1&content-type=text/vnd.viewcvs-markup
Vadim
Re: [RT] Views for readers
Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Miles Elam wrote:
> Sylvain Wallez wrote:
>
>> Go back to first post of this thread, where (last paragraph) I
>> proposed something similar. The whole discussion is about how we
>> could have a syntax which doesn't introduce such verbosity in the
>> sitemap.
>
>
> Verbosity is not necessarily a bad thing. If it were, would any of us
> be using XML? ;-)
Good point. However, the only verbosity currently added by views is the
"label" attribute. This proposal is about achieving the same low
verbosity for views with binary content.
>> As I explained in several replies, there's no equivalence between a
>> reader and generator able to parse a given binary format. There needs
>> to be some kind of adaptation/extraction before feeding the view.
>
>
> Yup.
>
>> And what you describe above as "a PDF reader, a Word reader, a
>> Postscript reader, etc." are IMO nothing more than _generators_, just
>> like the SWF and MIDI generators we already have.
>
>
> The functionality for all readers would obviously be the same: move
> these bytes from here to there. But yes, the codified mapping I think
> is important.
Please read carefully : I wrote *generators* !! This isn't about moving
bytes, but about producing an XML document.
>> Let's consider the MIDI example. Suppose we have a large collection
>> of karaoke files (MIDI supports embedded text that can be played on
>> screen while playing the music), and we want to index the text of
>> these songs for easy retrieval (along with some other meta-data).
>>
>> Here's a sitemap example, using the current syntax
>> <map:match pattern="*.mid"/>
>> <map:act type="catch-view" src="content">
>> <map:generate type="midi" src="{1}.mid"/>
>> <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
>> <!-- should never come here -->
>> <map:serialize type="xml"/>
>> </map:match>
>> <map:read src="{1}.mid"/>
>> </map:match>
>
>
>
> You're mixing the <map:act> with a </map:match>, but I get the idea.
Picky guy, eh ?
>> (the "content" view starts at the "content-label" label to clearly
>> distinguish the two notions).
>>
>> And the proposed shorter one :
>>
>> <map:match pattern="*.mid">
>> <map:read src="{1}.mid" unless-label="content"/>
>> <map:generate type="midi" src="{1}.mid"/>
>> <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
>> <!-- should never come here -->
>> <map:serialize type="xml"/>
>> </map:match>
>
>
>
> This breaks current convention that either a reader or a
> generator/transformer/serializer can act in a pipeline. In the first
> example, if "content" isn't specified, the action returns null and the
> reader is invoked; As far as the pipeline logic is concerned, there
> is only the reader. Serializers are already known as universal exit
> points. To use the second, the convention must be broken and readers
> must become universal exit points.
Readers already are universal exit points : once you encounter a reader,
sitemap processing is terminated. <map:read> and <map:serialize> are
like a "return" statement in Java.
> In other words,
>
> <map:match pattern="*.mid">
> <map:read src="{1}.mid"/> <!-- without the unless-label -->
> <map:generate type="midi" src="{1}.mid"/>
> <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
> <!-- should never come here -->
> <map:serialize type="xml"/>
> </map:match>
>
> must become valid for consistency. A reader becomes an exit point and
> the rest of a pipeline is, by default, ignored. Is this an intended
> consequence?
No consequence : this is how the sitemap works today, and the above is
valid, even if we can consider that the sitemap engine should be more
strict and signal that there's some unreachable code.
To add more to the confusion, in both your example and mine, we can even
avoid writing the <map:serialize> statement. Since some additional
filtering occurs beforehand (either through the action or through reader
labels), this statement is never reached and is useless !
Sylvain
Re: [RT] Views for readers
Posted by Miles Elam <mi...@pcextremist.com>.
Sylvain Wallez wrote:
> Go back to first post of this thread, where (last paragraph) I
> proposed something similar. The whole discussion is about how we could
> have a syntax which doesn't introduce such verbosity in the sitemap.
Verbosity is not necessarily a bad thing. If it were, would any of us
be using XML? ;-)
> As I explained in several replies, there's no equivalence between a
> reader and generator able to parse a given binary format. There needs
> to be some kind of adaptation/extraction before feeding the view.
Yup.
> And what you describe above as "a PDF reader, a Word reader, a
> Postscript reader, etc." are IMO nothing more than _generators_, just
> like the SWF and MIDI generators we already have.
The functionality for all readers would obviously be the same: move
these bytes from here to there. But yes, the codified mapping I think
is important.
> Let's consider the MIDI example. Suppose we have a large collection of
> karaoke files (MIDI supports embedded text that can be played on
> screen while playing the music), and we want to index the text of
> these songs for easy retrieval (along with some other meta-data).
>
> Here's a sitemap example, using the current syntax
> <map:match pattern="*.mid"/>
> <map:act type="catch-view" src="content">
> <map:generate type="midi" src="{1}.mid"/>
> <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
> <!-- should never come here -->
> <map:serialize type="xml"/>
> </map:match>
> <map:read src="{1}.mid"/>
> </map:match>
You're mixing the <map:act> with a </map:match>, but I get the idea.
> (the "content" view starts at the "content-label" label to clearly
> distinguish the two notions).
>
> And the proposed shorter one :
>
> <map:match pattern="*.mid">
> <map:read src="{1}.mid" unless-label="content"/>
> <map:generate type="midi" src="{1}.mid"/>
> <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
> <!-- should never come here -->
> <map:serialize type="xml"/>
> </map:match>
This breaks the current convention that either a reader or a
generator/transformer/serializer can act in a pipeline. In the first
example, if "content" isn't specified, the action returns null and the
reader is invoked. As far as the pipeline logic is concerned, there is
only the reader. Serializers are already known as universal exit
points. To use the second, the convention must be broken and readers
must become universal exit points.
In other words,
<map:match pattern="*.mid">
  <map:read src="{1}.mid"/> <!-- without the unless-label -->
  <map:generate type="midi" src="{1}.mid"/>
  <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
  <!-- should never come here -->
  <map:serialize type="xml"/>
</map:match>
must become valid for consistency. A reader becomes an exit point and
the rest of a pipeline is, by default, ignored. Is this an intended
consequence?
- Miles Elam
Re: [RT] Views for readers
Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Miles Elam wrote:
> Ummm... Quick question: What are the use cases for this that are not
> handled by existing methods? I mean, couldn't this be handled with an
> (as-yet unwritten) action?
>
> <map:match pattern="*.doc">
> <map:act type="catch-view">
> <map:parameter name="view-name" value="content"/>
> <map:generate type="word2xml" src="{../1}.doc"/>
> <!-- complete the pipeline -->
> </map:act>
> <map:read src="{1}.doc"/>
> </map:match>
Go back to the first post of this thread, where (last paragraph) I proposed
something similar. The whole discussion is about how we could have a
syntax which doesn't introduce such verbosity in the sitemap.
> Jeff mentioned getting metainformation from binary data for searching,
> but surely there are so many different types of binary data, a
> universal view seems rather heavy-handed. It works for search queries
> (barely, in my opinion). For content manipulation clients (like
> WebDAV), these clients can't pass the query string trigger for views.
> This seems to me to be a one-trick pony. To make views available for
> readers, it seems as though specificity is lost.
>
> The point of XML was specifically structured content, yes? Any
> conformant parser should be able to read any conformant file. Binary
> content has no such constraint. If both a reader and a generator are
> required in a matcher, I think some type of syntax that separates the
> two *visually* (not just conceptually) is necessary as a cue.
>
> Putting in binary options makes all content one step worse than your
> typical HTML web page: lack of intelligent structure without hope of
> enforcing a schema. Generators that read from Word (and other similar
> formats) have taken some time to come to fruition precisely because of
> their arbitrary nature (varying character set assumptions, embedded
> OLE objects, various content encoding blocks, etc.). Remember, XML
> (in this case as metadata) is just one representation of structure.
> The important thing (in my opinion) is preserving the structure. I
> don't see that happening with further intermingling of arbitrary
> binary data.
>
> I guess I'm in the camp that's glad that readers exist. Every time I
> have run into the dreaded error that comes from trying to load the
> output of a reader into the generator of another matcher, I have found
> a sitemap organization error. I guess I'm seeing the Cocoon version
> of "goto considered harmful." Sure it's flexible. Sure it's
> powerful. But will it impart more complexity and discomfort than it
> solves in actual practice?
>
> Hacking the view internals seems overkill (emphasis on kill). Inline
> with resource reader's role as "arbitrary, unorganized bit bucket with
> a MIME type," there is no universal way of delivering appropriate
> content. The method of getting content from a Word document is very
> different from the method of content gathering from a PDF document.
> Views, orthogonal access to similar resources (ie. XML resources),
> doesn't apply. "View source" on a text file is straightforward.
> "View source" on an XML file even more so. What is "View source" on
> reader content? You would have to assign a different view to each
> class of reader or put in some MIME type matching hack. Neither is
> less work or easier to grok than simply putting in an action or
> selector in the appropriate matchers I think.
>
> If this type of thing moves forward, I would rather see more
> specificity going into readers than twiddling with what comes out: a
> PDF reader, a Word reader, a Postscript reader, etc. In that case
> you're separating out by schema, by at least some form of contract.
> The alternative is equivalent to saying, "let's just make one class of
> transformer because all XML is alike and only three transformation
> options are available anyway."
As I explained in several replies, there's no equivalence between a
reader and generator able to parse a given binary format. There needs to
be some kind of adaptation/extraction before feeding the view.
And what you describe above as "a PDF reader, a Word reader, a
Postscript reader, etc." are IMO nothing more than _generators_, just
like the SWF and MIDI generators we already have.
Let's consider the MIDI example. Suppose we have a large collection of
karaoke files (MIDI supports embedded text that can be played on screen
while playing the music), and we want to index the text of these songs
for easy retrieval (along with some other meta-data).
Here's a sitemap example, using the current syntax
<map:match pattern="*.mid"/>
<map:act type="catch-view" src="content">
<map:generate type="midi" src="{1}.mid"/>
<map:transform src="xmidi2xdoc.xsl" label="content-label"/>
<!-- should never come here -->
<map:serialize type="xml"/>
</map:match>
<map:read src="{1}.mid"/>
</map:match>
(the "content" view starts at the "content-label" label to clearly
distinguish the two notions).
And the proposed shorter one :
<map:match pattern="*.mid">
  <map:read src="{1}.mid" unless-label="content"/>
  <map:generate type="midi" src="{1}.mid"/>
  <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
  <!-- should never come here -->
  <map:serialize type="xml"/>
</map:match>
Note also that the "catch-view" action is not an easy thing to do, as
the view is defined on the environment object which is theoretically not
visible to components.
Furthermore, it would be better to catch on labels, since several views
can be plugged on a given label (e.g. "content" & "pretty-content"). And
it would be impossible for the action to access this information.
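To illustrate why catching on labels covers several views at once, here is a small sketch. It is not Cocoon code: UnlessLabelSketch, readerRuns and the view table are hypothetical, and it assumes that each view is declared with the label it starts from and that the proposed unless-label attribute would name such a label.

```java
import java.util.HashMap;
import java.util.Map;

public class UnlessLabelSketch {

    /** Hypothetical view table: view name -> label at which the view starts. */
    static Map<String, String> exampleViews() {
        Map<String, String> views = new HashMap<>();
        views.put("content", "content-label");
        views.put("pretty-content", "content-label");
        views.put("links", "links-label");
        return views;
    }

    /**
     * Decides whether the reader runs: it does, unless the requested view
     * starts at the label named by the (proposed) unless-label attribute.
     */
    static boolean readerRuns(String unlessLabel, String requestedView,
                              Map<String, String> viewStartLabels) {
        if (requestedView == null) {
            return true; // normal request: serve the binary content
        }
        String startLabel = viewStartLabels.get(requestedView);
        return !unlessLabel.equals(startLabel);
    }

    public static void main(String[] args) {
        Map<String, String> views = exampleViews();
        System.out.println(readerRuns("content-label", null, views));
        System.out.println(readerRuns("content-label", "content", views));
        System.out.println(readerRuns("content-label", "pretty-content", views));
        System.out.println(readerRuns("content-label", "links", views));
    }
}
```

Because the decision is keyed on the label rather than on the view name, both "content" and "pretty-content" bypass the reader without either being named in the pipeline.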
> P.S. Sorry to start trouble, but I think someone had to mention it.
No trouble. Just lots of misunderstandings in this thread, I guess.
Sylvain
Re: [RT] Views for readers
Posted by Miles Elam <mi...@pcextremist.com>.
Vadim Gritsenko wrote:
>> Ummm... Quick question: What are the use cases for this that are
>> not handled by existing methods? I mean, couldn't this be handled
>> with an (as-yet unwritten) action?
>
>
> Matcher *does* exist:
Heh heh... learning something new every day.
- Miles Elam
Re: [RT] Views for readers
Posted by Vadim Gritsenko <va...@verizon.net>.
Miles Elam wrote:
> Ummm... Quick question: What are the use cases for this that are not
> handled by existing methods? I mean, couldn't this be handled with an
> (as-yet unwritten) action?
Matcher *does* exist:
> <map:match pattern="*.doc">
<map:match type="wildcard-request-parameter" pattern="content">
<map:parameter name="parameter-name" value="cocoon-view"/>
> <map:generate type="word2xml" src="{../1}.doc"/>
> <!-- complete the pipeline -->
</map:match>
> <map:read src="{1}.doc"/>
> </map:match>
<snip/>
Vadim
Re: [RT] Views for readers
Posted by Miles Elam <mi...@pcextremist.com>.
Ummm... Quick question: What are the use cases for this that are not
handled by existing methods? I mean, couldn't this be handled with an
(as-yet unwritten) action?
<map:match pattern="*.doc">
  <map:act type="catch-view">
    <map:parameter name="view-name" value="content"/>
    <map:generate type="word2xml" src="{../1}.doc"/>
    <!-- complete the pipeline -->
  </map:act>
  <map:read src="{1}.doc"/>
</map:match>
Jeff mentioned getting metainformation from binary data for searching,
but surely there are so many different types of binary data, a universal
view seems rather heavy-handed. It works for search queries (barely, in
my opinion). For content manipulation clients (like WebDAV), these
clients can't pass the query string trigger for views. This seems to me
to be a one-trick pony. To make views available for readers, it seems
as though specificity is lost.
The point of XML was specifically structured content, yes? Any
conformant parser should be able to read any conformant file. Binary
content has no such constraint. If both a reader and a generator are
required in a matcher, I think some type of syntax that separates the
two *visually* (not just conceptually) is necessary as a cue.
Putting in binary options makes all content one step worse than your
typical HTML web page: lack of intelligent structure without hope of
enforcing a schema. Generators that read from Word (and other similar
formats) have taken some time to come to fruition precisely because of
their arbitrary nature (varying character set assumptions, embedded OLE
objects, various content encoding blocks, etc.). Remember, XML (in
this case as metadata) is just one representation of structure. The
important thing (in my opinion) is preserving the structure. I don't
see that happening with further intermingling of arbitrary binary data.
I guess I'm in the camp that's glad that readers exist. Every time I
have run into the dreaded error that comes from trying to load the
output of a reader into the generator of another matcher, I have found a
sitemap organization error. I guess I'm seeing the Cocoon version of
"goto considered harmful." Sure it's flexible. Sure it's powerful.
But will it impart more complexity and discomfort than it solves in
actual practice?
Hacking the view internals seems overkill (emphasis on kill). In line
with the resource reader's role as an "arbitrary, unorganized bit bucket
with a MIME type," there is no universal way of delivering appropriate
content. The method of getting content from a Word document is very
different from the method of content gathering from a PDF document.
Views, as orthogonal access to similar resources (i.e. XML resources),
don't apply. "View source" on a text file is straightforward. "View
source" on an XML file even more so. What is "View source" on reader
content? You would have to assign a different view to each class of
reader or put in some MIME type matching hack. Neither is less work or
easier to grok than simply putting an action or selector in the
appropriate matchers, I think.
If this type of thing moves forward, I would rather see more specificity
going into readers than twiddling with what comes out: a PDF reader, a
Word reader, a Postscript reader, etc. In that case you're separating
out by schema, by at least some form of contract. The alternative is
equivalent to saying, "let's just make one class of transformer because
all XML is alike and only three transformation options are available
anyway."
- Miles Elam
P.S. Sorry to start trouble, but I think someone had to mention it.
Re: [RT] Views for readers
Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Vadim Gritsenko wrote:
> Sylvain Wallez wrote:
>
>> Bertrand Delacretaz wrote:
>>
>>> On Thursday, 14 Aug 2003, at 15:53 Europe/Zurich, Sylvain Wallez wrote:
>>>
>>>> ...But shouldn't we keep labels that are already used into
>>>> pipelines ? E.g :
>>>>
>>>> <map:read src="docs/{1}.doc" label="raw, xdoc"/>
>>>> <map:generate src="docs/{1}.doc" type="word2xml" label="raw"/>
>>>> <map:transform src="xword2xdoc.xsl" label="xdoc"/>
>>>
>>>
>>> If it's this way I'd prefer "unless-label" in map:read to make it
>>> clear.
>>>
>>> Or maybe
>>>
>>> <map:read src="docs/{1}.doc" unless-label="*"/>
>>>
>>> would do, meaning "use this unless any views are requested"
>>> (and * would be the only allowed value).
>>
<snip/>
>> Any other proposal or opinion on this subject before we start a vote ?
>
>
> Can't you just enable generators in map:view in case when view starts
> with reader?
No, since views "capture" the (XML) output at certain points of the
pipeline to provide a different formatting. E.g. the processing for the
"indexable-content" view is the same for all URIs, be they XML pipelines
or a single reader.
So there's no way other than having a generator _before_ jumping to the
view, feeding that view with the kind of XML content it expects.
Sylvain
Re: [RT] Views for readers
Posted by Vadim Gritsenko <va...@verizon.net>.
Sylvain Wallez wrote:
> Bertrand Delacretaz wrote:
>
>> On Thursday, 14 Aug 2003, at 15:53 Europe/Zurich, Sylvain Wallez wrote:
>>
>>> ...But shouldn't we keep labels that are already used into pipelines
>>> ? E.g :
>>>
>>> <map:read src="docs/{1}.doc" label="raw, xdoc"/>
>>> <map:generate src="docs/{1}.doc" type="word2xml" label="raw"/>
>>> <map:transform src="xword2xdoc.xsl" label="xdoc"/>
>>
>>
>>
>> If it's this way I'd prefer "unless-label" in map:read to make it clear.
>>
>> Or maybe
>>
>> <map:read src="docs/{1}.doc" unless-label="*"/>
>>
>> would do, meaning "use this unless any views are requested"
>> (and * would be the only allowed value).
>>
>>> Ah, and this is very easily implementable ;-)
>>
>>
>>
>> Quickquick, do it before the FS police hears us ;-)
>>
>> Seriously, I find this useful for indexing and other purposes
>> (getting meta-information about binary files, images, etc for example).
>
>
>
> Me too. But since this is a change in the sitemap syntax, we should have a
> vote on this.
>
> Any other proposal or opinion on this subject before we start a vote ?
Can't you just enable generators in map:view in the case where the view
starts with a reader?
Vadim
Re: [RT] Views for readers
Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Bertrand Delacretaz wrote:
> On Thursday, 14 Aug 2003, at 15:53 Europe/Zurich, Sylvain Wallez wrote:
>
>> ...But shouldn't we keep labels that are already used into pipelines
>> ? E.g :
>>
>> <map:read src="docs/{1}.doc" label="raw, xdoc"/>
>> <map:generate src="docs/{1}.doc" type="word2xml" label="raw"/>
>> <map:transform src="xword2xdoc.xsl" label="xdoc"/>
>
>
> If it's this way I'd prefer "unless-label" in map:read to make it clear.
>
> Or maybe
>
> <map:read src="docs/{1}.doc" unless-label="*"/>
>
> would do, meaning "use this unless any views are requested"
> (and * would be the only allowed value).
>
>> Ah, and this is very easily implementable ;-)
>
>
> Quickquick, do it before the FS police hears us ;-)
>
> Seriously, I find this useful for indexing and other purposes
> (getting meta-information about binary files, images, etc for example).
Me too. But since this is a change in the sitemap syntax, we should have a
vote on this.
Any other proposal or opinion on this subject before we start a vote ?
Sylvain
Re: [RT] Views for readers
Posted by Stefano Mazzocchi <st...@apache.org>.
On Thursday, Aug 14, 2003, at 16:02 Europe/Rome, Bertrand Delacretaz
wrote:
> On Thursday, 14 Aug 2003, at 15:53 Europe/Zurich, Sylvain Wallez wrote:
>
>> ...But shouldn't we keep labels that are already used into pipelines
>> ? E.g :
>>
>> <map:read src="docs/{1}.doc" label="raw, xdoc"/>
>> <map:generate src="docs/{1}.doc" type="word2xml" label="raw"/>
>> <map:transform src="xword2xdoc.xsl" label="xdoc"/>
>
> If it's this way I'd prefer "unless-label" in map:read to make it
> clear.
>
> Or maybe
>
> <map:read src="docs/{1}.doc" unless-label="*"/>
>
> would do, meaning "use this unless any views are requested"
> (and * would be the only allowed value).
>
>> Ah, and this is very easily implementable ;-)
>
> Quickquick, do it before the FS police hears us ;-)
Gotcha! I dislike having a map:read before a map:generate. Try again.
> Seriously, I find this useful for indexing and other purposes
> (getting meta-information about binary files, images, etc for
> example).
Wait, wait, wait, wait: this is exactly what JSR-170 is doing.
They are preparing for public review shortly, so please let's wait until
they are out before touching this at the sitemap level!!!
> --
Stefano.
Re: [RT] Views for readers
Posted by Bertrand Delacretaz <bd...@codeconsult.ch>.
On Thursday, 14 Aug 2003, at 15:53 Europe/Zurich, Sylvain Wallez wrote:
> ...But shouldn't we keep labels that are already used into pipelines ?
> E.g :
>
> <map:read src="docs/{1}.doc" label="raw, xdoc"/>
> <map:generate src="docs/{1}.doc" type="word2xml" label="raw"/>
> <map:transform src="xword2xdoc.xsl" label="xdoc"/>
If it's this way I'd prefer "unless-label" in map:read to make it clear.
Or maybe
<map:read src="docs/{1}.doc" unless-label="*"/>
would do, meaning "use this unless any views are requested"
(and * would be the only allowed value).
> Ah, and this is very easily implementable ;-)
Quickquick, do it before the FS police hears us ;-)
Seriously, I find this useful for indexing and other purposes (getting
meta-information about binary files, images, etc for example).
-Bertrand
Re: [RT] Views for readers
Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Bertrand Delacretaz wrote:
> On Thursday, 14 Aug 2003, at 15:24 Europe/Zurich, Sylvain Wallez wrote:
>
>> ...But what if we write it the other way around :
>> <map:read src="docs/{1}.doc">
>> <map:generate src="docs/{1}.doc" type="wordToXml" label="content"/>
>> </map:read>
>
>
> I find this more understandable (but dunno about implementation):
>
> <!-- if reader is executed, the rest is not -->
> <map:read src="docs/{1}.doc" unless-view="wordToXml"/>
> <map:generate src="docs/{1}.doc" type="wordToXml"/>
> <map:transform...
Interesting. This looks like a more compact notation for the
view-selector I was thinking of at first. We're leaving the RT world...
But shouldn't we keep labels that are already used in pipelines ? E.g :
<map:read src="docs/{1}.doc" label="raw, xdoc"/>
<map:generate src="docs/{1}.doc" type="word2xml" label="raw"/>
<map:transform src="xword2xdoc.xsl" label="xdoc"/>
The "label" on the reader would skip the reader if the requested view
corresponds to one of these labels. Now should this be named "label" or
"unless-label" ?
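The skip rule described above can be modelled in a few lines. This is only a sketch in Python, not Cocoon code: the `Step` class, `parse_labels` and `executed_steps` are hypothetical names, and it assumes the requested view is identified directly by the label it starts from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    kind: str                        # "read", "generate" or "transform"
    name: str
    labels: frozenset = frozenset()  # labels declared on the sitemap element


def parse_labels(attr):
    """Split a label attribute such as "raw, xdoc" into a set of names."""
    return frozenset(part.strip() for part in attr.split(",") if part.strip())


def executed_steps(steps, view_label=None):
    """Return the names of the components that would run, in order."""
    executed = []
    for step in steps:
        wanted = view_label is not None and view_label in step.labels
        if step.kind == "read":
            if wanted:
                continue             # the reader is skipped for this view
            executed.append(step.name)
            return executed          # a reader ends processing on its own
        executed.append(step.name)
        if wanted:
            executed.append("view:" + view_label)
            return executed          # processing jumps to the view here
    return executed


# The three sitemap lines from the example above, in the same order.
pipeline = [
    Step("read", "reader", parse_labels("raw, xdoc")),
    Step("generate", "word2xml", parse_labels("raw")),
    Step("transform", "xword2xdoc.xsl", parse_labels("xdoc")),
]
```

With no view requested, only the reader runs. With "raw" or "xdoc" requested, the reader is skipped and processing continues until the matching label is reached.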
Ah, and this is very easily implementable ;-)
Sylvain
Re: [RT] Views for readers
Posted by Bertrand Delacretaz <bd...@codeconsult.ch>.
On Thursday, 14 Aug 2003, at 15:24 Europe/Zurich, Sylvain Wallez wrote:
> ...But what if we write it the other way around :
> <map:read src="docs/{1}.doc">
> <map:generate src="docs/{1}.doc" type="wordToXml" label="content"/>
> </map:read>
I find this more understandable (but dunno about implementation):
<!-- if reader is executed, the rest is not -->
<map:read src="docs/{1}.doc" unless-view="wordToXml"/>
<map:generate src="docs/{1}.doc" type="wordToXml"/>
<map:transform...
-Bertrand
Re: [RT] Views for readers
Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Bertrand Delacretaz wrote:
> How about making it the other way round, by allowing Generators to
> read from Readers?
>
> <map:match pattern="*.doc" default-view="binary">
> <map:generator label="xml-content-for-indexing" type="wordToXml">
> <map:read src="word-documents/{1}.doc" label="binary" mime-type=.../>
> </map:generator>
> <map:serialize type="xml"/>
> </map:match>
Do you mean that the generator would be used if the
"xml-content-for-indexing" view is selected ? This doesn't fit with the
existing sitemap behaviour, since generators are _always_ added to the
pipeline.
But what if we write it the other way around :
<map:read src="docs/{1}.doc">
<map:generate src="docs/{1}.doc" type="wordToXml" label="content"/>
</map:read>
The meaning of the above is : if a view is requested, execute what's
_inside_ the <map:read>. If it builds a complete pipeline then return
its result, otherwise just perform the usual read operation.
> Is that RT-ish enough?
Mmmmh... not as wild as Nicola Ken's. Try again ;-P
Sylvain
Re: [RT] Views for readers
Posted by Bertrand Delacretaz <bd...@codeconsult.ch>.
How about making it the other way round, by allowing Generators to read
from Readers?
<map:match pattern="*.doc" default-view="binary">
  <map:generator label="xml-content-for-indexing" type="wordToXml">
    <map:read src="word-documents/{1}.doc" label="binary" mime-type=.../>
  </map:generator>
  <map:serialize type="xml"/>
</map:match>
Is that RT-ish enough?
-Bertrand
Re: [RT] Views for readers
Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Nicola Ken Barozzi wrote:
>
> Jeff Turner wrote, On 14/08/2003 14.17:
>
> ...
>
>> Isn't the problem there that a <map:read> is a whole little pipeline
>> unto itself? If it were broken into two atomic operations:
>>
>> <map:generate type="binary" src="foo.doc"/>
>> <map:serialize type="binary"/>
>>
>> then we could have a <map:view from-position="first"/> using a
>> content-aware pipeline, and everything would work.
>
>
> Well, why can't the view simply start from a reader?
>
> <map:read src="foo.doc"/>
Because a view finishes a partial XML pipeline, meaning it requires a
generator to be already present...
>> I have the feeling that handling non-XML content in Cocoon is Just
>> Wrong, and that <map:read> is just a hack. The fact that it doesn't
>> integrate
>> with Views is a symptom of this. In a theoretically pure world, we'd
>> either make Cocoon an XML-only framework and kill <map:read>, or make
>> Cocoon a generic data pipelining framework capable of handling and
>> transforming binary content.
>
>
> Well, it can be done easily by allowing more than one reader and by
> allowing readers in the xml pipeline.
>
> Some time back I had proposed the following to be possible (and got
> touted as the usual FS man)
>
> <map:read src="foo1.doc"/>
> <map:read type="stripstuff"/>
> <map:read type="otherfilter"/>
Mhhh... I guess "stripstuff" and "otherfilter" are actually
<map:transform-binary> and not <map:read> as they do have an input. Now
how do we "close" the pipeline ? Is there a <map:serialize-binary> ?
> And also:
>
> <map:read src="foo1.doc"/>
> <map:generate src="foo1.doc"/>
> <map:serialize src="foo1.doc"/>
> <map:read type="zip"/>
Wow! What's the result of this ??
> We can already do this BTW by using the Cocoon protocol, but it's
> such a hack!
Sounds interesting. Can you elaborate on the hack ?
Sylvain
Re: [RT] Views for readers
Posted by Nicola Ken Barozzi <ni...@apache.org>.
Jeff Turner wrote, On 14/08/2003 14.17:
...
> Isn't the problem there that a <map:read> is a whole little pipeline unto
> itself? If it were broken into two atomic operations:
>
> <map:generate type="binary" src="foo.doc"/>
> <map:serialize type="binary"/>
>
> then we could have a <map:view from-position="first"/> using a
> content-aware pipeline, and everything would work.
Well, why can't the view simply start from a reader?
<map:read src="foo.doc"/>
> I have the feeling that handling non-XML content in Cocoon is Just Wrong,
> and that <map:read> is just a hack. The fact that it doesn't integrate
> with Views is a symptom of this. In a theoretically pure world, we'd
> either make Cocoon an XML-only framework and kill <map:read>, or make
> Cocoon a generic data pipelining framework capable of handling and
> transforming binary content.
Well, it can be done easily by allowing more than one reader and by
allowing readers in the xml pipeline.
Some time back I had proposed the following to be possible (and got
touted as the usual FS man)
<map:read src="foo1.doc"/>
<map:read type="stripstuff"/>
<map:read type="otherfilter"/>
And also:
<map:read src="foo1.doc"/>
<map:generate src="foo1.doc"/>
<map:serialize src="foo1.doc"/>
<map:read type="zip"/>
We can already do this BTW by using the Cocoon protocol, but it's such
a hack!
> Well it's a RT after all.. ;)
*sigh*
If Cocoon had this capability and could be embedded more easily
*without* the sitemap, it would be a cool transformation library...
--
Nicola Ken Barozzi nicolaken@apache.org
- verba volant, scripta manent -
(discussions get forgotten, just code remains)
---------------------------------------------------------------------
Re: [RT] Views for readers
Posted by Jeff Turner <je...@apache.org>.
On Thu, Aug 14, 2003 at 01:41:55PM +0200, Sylvain Wallez wrote:
> Jeff Turner wrote:
...
> ><map:view name="indexablecontent" from-position="first">
> > <map:select type="xml-type">
> > <map:when test="docbook">
> > <map:transform src="docbook2whatever.xsl"/>
> > </map:when>
> > <map:when test="tei">
> > <map:transform src="tei2whatever.xsl"/>
> > </map:when>
> > <map:when test="msword">
> > <map:transform src="word2whatever.xsl"/>
> > </map:when>
> > </map:select>
> ></map:view>
> >
>
> Ah, ok, the "strongly typed pipelines" are a different wording for
> "content-aware selectors" !
Ah yes. Strange how the same concept can live two separate lives in
one's head ;) Like the same class in two classloaders.
> >So http://mycocoonsite.com/foo.doc?cocoon_view=indexablecontent would
> >return XML representing the content of the .doc file.
> >
> >I described the same thing in a mail with subject 'Type-aware Views (Re:
> >Link view goodness)'. Same need, different context, same proposed
> >solution.
> >
>
> Not exactly : the use case here is that we have a binary file which is
> normally sent as is to the browser using a reader. It is _not_ parsed as
> an XML stream. So we can't attach a view to these kinds of URLs since
> views provide a different _ending_ to a pipeline, meaning there must
> exist at least a generator and optionally one or more transformers at
> the point where processing is directed to the view.
>
> So even content-aware selectors don't solve this problem...
Isn't the problem there that a <map:read> is a whole little pipeline unto
itself? If it were broken into two atomic operations:
<map:generate type="binary" src="foo.doc"/>
<map:serialize type="binary"/>
then we could have a <map:view from-position="first"/> using a
content-aware pipeline, and everything would work.
I have the feeling that handling non-XML content in Cocoon is Just Wrong,
and that <map:read> is just a hack. The fact that it doesn't integrate
with Views is a symptom of this. In a theoretically pure world, we'd
either make Cocoon an XML-only framework and kill <map:read>, or make
Cocoon a generic data pipelining framework capable of handling and
transforming binary content.
Well it's a RT after all.. ;)
--Jeff
Re: [RT] Views for readers
Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Jeff Turner wrote:
>On Wed, Aug 13, 2003 at 12:02:04PM +0200, Sylvain Wallez wrote:
>
>
>>Frederic's question about search engine integration led me to
>>questioning myself at how Cocoon's Lucene integration could be able to
>>transparently index Word & PDF documents along with XML-produced documents.
>>
>>There exists some text-extraction libraries for Word & PDF (e.g.
>>http://www.textmining.org/). Now how can we integrate this as
>>transparently as possible in Cocoon's search functionality ?
>>
>>The Lucene indexer crawls a website and asks for a particular view
>>("content") which is used to fill the index. But Word and PDF documents
>>being binary files, they're handled by a <map:read> statement, which
>>does not handle views. On the other hand, this use case shows that
>>having views on binary content may make sense : the "normal" requests
>>just sends back the binary content, while a view can use a text/XML
>>extraction on these binary files.
>>
>>So the question is : how could views be plugged to readers ? I must say
>>that I don't have an answer, as views contain transformers and a
>>serializer, but no generator. So how could we express in the sitemap
>>that a particular view on a reader should "replace" that reader by a
>>particular generator ? Or should this go through some special readers
>>that could also act as generators ?
>>
>>Or maybe these are silly thoughts and we should use a <map:select>
>>directing to a <map:read> or <map:generate> depending on the view. But
>>this introduces explicit view management in the pipelines, which doesn't
>>seem nice to me.
>>
>>
>
>Solution: strongly typed pipelines! :)
>
>Imagine if, at each node in the sitemap, we knew what type of content we
>were dealing with (usually some flavour of XML). Then we could write a
>single view that behaves differently depending on the _type_ of data:
>
><map:view name="indexablecontent" from-position="first">
> <map:select type="xml-type">
> <map:when test="docbook">
> <map:transform src="docbook2whatever.xsl"/>
> </map:when>
> <map:when test="tei">
> <map:transform src="tei2whatever.xsl"/>
> </map:when>
> <map:when test="msword">
> <map:transform src="word2whatever.xsl"/>
> </map:when>
> </map:select>
></map:view>
>
Ah, ok, the "strongly typed pipelines" are a different wording for
"content-aware selectors" !
>So http://mycocoonsite.com/foo.doc?cocoon_view=indexablecontent would
>return XML representing the content of the .doc file.
>
>I described the same thing in a mail with subject 'Type-aware Views (Re:
>Link view goodness)'. Same need, different context, same proposed
>solution.
>
Not exactly : the use case here is that we have a binary file which is
normally sent as is to the browser using a reader. It is _not_ parsed as
an XML stream. So we can't attach a view to these kinds of URLs since
views provide a different _ending_ to a pipeline, meaning there must
exist at least a generator and optionally one or more transformers at
the point where processing is directed to the view.
So even content-aware selectors don't solve this problem...
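To make the constraint concrete, here is a toy model in Python (not Cocoon code). It assumes a view taken just before the serializer, and represents a pipeline as a list of (kind, source) pairs. The function `attach_view` and all file names are hypothetical.

```python
def attach_view(components, view_steps):
    """Swap a pipeline's ending for the view's own transformers and serializer."""
    kinds = [kind for kind, _ in components]
    if "generate" not in kinds:
        # A lone reader streams bytes: there are no XML events to hand over.
        raise ValueError("no generator upstream: nothing feeds XML to the view")
    # Keep the XML-producing head, drop the original serializer.
    head = [c for c in components if c[0] in ("generate", "transform")]
    return head + list(view_steps)


# Hypothetical pipelines: an ordinary XML page and a binary document.
xml_page = [
    ("generate", "page.xml"),
    ("transform", "page2html.xsl"),
    ("serialize", "html"),
]
binary_doc = [("read", "foo.doc")]

# Hypothetical view: one transformer followed by an XML serializer.
content_view = [("transform", "content.xsl"), ("serialize", "xml")]
```

Calling `attach_view(xml_page, content_view)` yields a complete pipeline, while `attach_view(binary_doc, content_view)` raises, which is the gap this thread is about.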
Sylvain
Re: [RT] Views for readers
Posted by Jeff Turner <je...@apache.org>.
On Wed, Aug 13, 2003 at 12:02:04PM +0200, Sylvain Wallez wrote:
> Frederic's question about search engine integration led me to
> questioning myself at how Cocoon's Lucene integration could be able to
> transparently index Word & PDF documents along with XML-produced documents.
>
> There exists some text-extraction libraries for Word & PDF (e.g.
> http://www.textmining.org/). Now how can we integrate this as
> transparently as possible in Cocoon's search functionality ?
>
> The Lucene indexer crawls a website and asks for a particular view
> ("content") which is used to fill the index. But Word and PDF documents
> being binary files, they're handled by a <map:read> statement, which
> does not handle views. On the other hand, this use case shows that
> having views on binary content may make sense : the "normal" requests
> just sends back the binary content, while a view can use a text/XML
> extraction on these binary files.
>
> So the question is : how could views be plugged to readers ? I must say
> that I don't have an answer, as views contain transformers and a
> serializer, but no generator. So how could we express in the sitemap
> that a particular view on a reader should "replace" that reader by a
> particular generator ? Or should this go through some special readers
> that could also act as generators ?
>
> Or maybe these are silly thoughts and we should use a <map:select>
> directing to a <map:read> or <map:generate> depending on the view. But
> this introduces explicit view management in the pipelines, which doesn't
> seem nice to me.
Solution: strongly typed pipelines! :)
Imagine if, at each node in the sitemap, we knew what type of content we
were dealing with (usually some flavour of XML). Then we could write a
single view that behaves differently depending on the _type_ of data:
<map:view name="indexablecontent" from-position="first">
  <map:select type="xml-type">
    <map:when test="docbook">
      <map:transform src="docbook2whatever.xsl"/>
    </map:when>
    <map:when test="tei">
      <map:transform src="tei2whatever.xsl"/>
    </map:when>
    <map:when test="msword">
      <map:transform src="word2whatever.xsl"/>
    </map:when>
  </map:select>
</map:view>
So http://mycocoonsite.com/foo.doc?cocoon_view=indexablecontent would
return XML representing the content of the .doc file.
I described the same thing in a mail with subject 'Type-aware Views (Re:
Link view goodness)'. Same need, different context, same proposed
solution.
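Stripped of sitemap syntax, the selection inside that view amounts to a lookup from content type to stylesheet. A minimal Python sketch, reusing the stylesheet names from the example above; the function name is hypothetical, and how the type of the incoming content would be detected is left open here:

```python
# Content type -> stylesheet used by the "indexablecontent" view.
VIEW_STYLESHEETS = {
    "docbook": "docbook2whatever.xsl",
    "tei": "tei2whatever.xsl",
    "msword": "word2whatever.xsl",
}


def stylesheet_for(xml_type):
    """Pick the view's stylesheet for a content type, or None if unknown."""
    return VIEW_STYLESHEETS.get(xml_type)
```

An unknown type falls through with no transform, much like a select without an otherwise branch.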
--Jeff
Re: [RT] Views for readers
Posted by Sam Coward <sa...@atnet.net.au>.
Hmm,
> Frederic's question about search engine integration led me to
> questioning myself at how Cocoon's Lucene integration could be able to
> transparently index Word & PDF documents along with XML-produced
> documents.
I have been wondering that too. At my company, we put together a simple
web management tool to put small collections of documents into a web
frame for a client. Pretty useless, but it's what he wanted.
At the time I had thought it may be possible to just improve Lucene so
it could understand binary files by introducing mime-type triggerable
filter modules that converted to text on the input stream. After all, if
the text were only going to be used for indexing, it wouldn't matter if
the text wasn't available within Cocoon itself. In any case he's happy
with what he has and we're happily doing other stuff.
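The idea of MIME-type-triggered filters could look roughly like the Python sketch below. This is not Lucene's API: the registry, the decorator and the function names are all hypothetical, and the Word filter is only a placeholder where a real text-extraction library would be called.

```python
# Registry of text filters, keyed by MIME type.
EXTRACTORS = {}


def extractor(mime_type):
    """Register a function that turns raw bytes of one MIME type into text."""
    def register(func):
        EXTRACTORS[mime_type] = func
        return func
    return register


@extractor("text/plain")
def plain_text(data):
    # Plain text only needs decoding; undecodable bytes are replaced.
    return data.decode("utf-8", errors="replace")


@extractor("application/msword")
def word_text(data):
    # Placeholder: a real filter would call a Word text-extraction library.
    raise NotImplementedError("plug a Word extractor in here")


def text_for_index(mime_type, data):
    """Return indexable text, or None when no filter handles the type."""
    func = EXTRACTORS.get(mime_type)
    return func(data) if func else None
```

The indexer would call `text_for_index` with the MIME type of each fetched document and skip anything that comes back as None.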
Perhaps if the individual extractors are part of specialised readers for
specific types of documents, then you could configure the label for the
XML they return? That would allow for the duality of that behaviour to
be mostly concealed and managed from within Cocoon with little effect to
the sitemap.
I personally find it tempting to think that it may be possible to rip
out XML from any of these formats, and do with it as we wish,
particularly when I saw that programs like catdoc could recognize the
tables even from Word 2k documents. But I often find myself thinking
back against that, and that maybe I should represent all content (even
document content) semantically in XML and let rendering technologies
(PDFSerializer, POI) handle binary output, and perhaps leverage document
importers that map those documents back to XML (they all seem to be
proprietary, big buck solutions from what I see currently, though). In
any case, it does seem that is certainly a ways off in the future *sigh*
Hmm, an OCR extractor would be way cool for faxes too!
Just my 2c; I never say anything most of the time, anyway.
Sam