Posted to dev@cocoon.apache.org by Sylvain Wallez <sy...@anyware-tech.com> on 2003/08/13 12:02:04 UTC

[RT] Views for readers

Frederic's question about search engine integration led me to wonder 
how Cocoon's Lucene integration could transparently index Word & PDF 
documents along with XML-produced documents.

There are some text-extraction libraries for Word & PDF (e.g. 
http://www.textmining.org/). Now how can we integrate this as 
transparently as possible in Cocoon's search functionality?

The Lucene indexer crawls a website and asks for a particular view 
("content") which is used to fill the index. But since Word and PDF 
documents are binary files, they're handled by a <map:read> statement, 
which does not handle views. On the other hand, this use case shows that 
having views on binary content may make sense: a "normal" request 
just sends back the binary content, while a view can use a text/XML 
extraction on these binary files.
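
For reference, here is a minimal sketch of how a view attaches to an XML 
pipeline but not to a reader. The patterns, file paths and stylesheet name 
are assumptions made up for illustration; the view would be requested by 
adding cocoon-view=content to the URL:

```xml
<map:sitemap xmlns:map="http://apache.org/cocoon/sitemap/1.0">
  <!-- component declarations omitted from this fragment -->
  <map:views>
    <!-- "content" view: takes over right after the generator
         and serializes what it gets as XML -->
    <map:view name="content" from-position="first">
      <map:serialize type="xml"/>
    </map:view>
  </map:views>

  <map:pipelines>
    <map:pipeline>
      <!-- hypothetical paths and stylesheet, for illustration only -->
      <map:match pattern="*.html">
        <map:generate src="docs/{1}.xml"/>
        <map:transform src="stylesheets/doc2html.xsl"/>
        <map:serialize type="html"/>
      </map:match>

      <!-- a reader has no generator, so the view has nothing to branch from -->
      <map:match pattern="*.doc">
        <map:read src="docs/{1}.doc" mime-type="application/msword"/>
      </map:match>
    </map:pipeline>
  </map:pipelines>
</map:sitemap>
```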

So the question is: how could views be plugged into readers? I must say 
that I don't have an answer, as views contain transformers and a 
serializer, but no generator. So how could we express in the sitemap 
that a particular view on a reader should "replace" that reader with a 
particular generator? Or should this go through some special readers 
that could also act as generators?

Or maybe these are silly thoughts and we should use a <map:select> 
directing to a <map:read> or <map:generate> depending on the view. But 
this introduces explicit view management in the pipelines, which doesn't 
seem nice to me.
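
A rough sketch of what that explicit selection might look like. It assumes 
a "request-parameter" selector is declared in the sitemap components and 
that the view is requested through the "cocoon-view" request parameter; 
the "wordToXml" generator and the file paths are hypothetical:

```xml
<map:match pattern="*.doc">
  <map:select type="request-parameter">
    <map:parameter name="parameter-name" value="cocoon-view"/>
    <!-- the "content" view was requested: produce XML instead of bytes -->
    <map:when test="content">
      <map:generate type="wordToXml" src="docs/{1}.doc"/>
      <map:serialize type="xml"/>
    </map:when>
    <!-- normal request: send back the binary file untouched -->
    <map:otherwise>
      <map:read src="docs/{1}.doc" mime-type="application/msword"/>
    </map:otherwise>
  </map:select>
</map:match>
```

Every match that uses a reader would need the same select, which is the 
explicit view management mentioned above.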

Any thoughts ?

Sylvain

-- 
Sylvain Wallez                                  Anyware Technologies
http://www.apache.org/~sylvain           http://www.anyware-tech.com
{ XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects }
Orixo, the opensource XML business alliance  -  http://www.orixo.com



Re: [RT] Views for readers

Posted by Stefano Mazzocchi <st...@apache.org>.
On Thursday, Aug 14, 2003, at 14:22 Europe/Rome, Sylvain Wallez wrote:

> Jeff Turner wrote:
>
> <snip/>
>
>> Isn't the problem there that a <map:read> is a whole little pipeline 
>> unto itself?  If it were broken into two atomic operations:
>>
>> <map:generate type="binary" src="foo.doc"/>
>> <map:serialize type="binary"/>
>>
>> then we could have a <map:view from-position="first"/> using a 
>> content-aware pipeline, and everything would work.
>>
>> I have the feeling that handling non-XML content in Cocoon is Just 
>> Wrong, and that <map:read> is just a hack.  The fact that it doesn't 
>> integrate with Views is a symptom of this.  In a theoretically pure 
>> world, we'd either make Cocoon an XML-only framework and kill 
>> <map:read>, or make Cocoon a generic data pipelining framework 
>> capable of handling and transforming binary content.
>>
>> Well it's a RT after all.. ;)
>>
>
> Content-aware and binary pipelines in the same post? Wow! Yes, it's 
> definitely a RT ;-P

I am against both content-aware selection and binary pipelines.

I still have to see a need for them that cannot be solved with 
machinery already in place or with the newly proposed RequestFactories.

--
Stefano.


Re: [RT] Views for readers

Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Jeff Turner wrote:

<snip/>

>Isn't the problem there that a <map:read> is a whole little pipeline unto itself?  If it were broken into two atomic operations:
>
><map:generate type="binary" src="foo.doc"/>
><map:serialize type="binary"/>
>
>then we could have a <map:view from-position="first"/> using a content-aware pipeline, and everything would work.
>
>I have the feeling that handling non-XML content in Cocoon is Just Wrong, and that <map:read> is just a hack.  The fact that it doesn't integrate with Views is a symptom of this.  In a theoretically pure world, we'd either make Cocoon an XML-only framework and kill <map:read>, or make Cocoon a generic data pipelining framework capable of handling and transforming binary content.
>
>Well it's a RT after all.. ;)
>

Content-aware and binary pipelines in the same post? Wow! Yes, it's 
definitely a RT ;-P

Sylvain




Re: [RT] Views for readers

Posted by Nicola Ken Barozzi <ni...@apache.org>.
Sylvain Wallez wrote, On 14/08/2003 14.30:
> Nicola Ken Barozzi wrote:
> 
>>
>> Jeff Turner wrote, On 14/08/2003 14.17:
>>
>> ...
>>
>>> Isn't the problem there that a <map:read> is a whole little pipeline 
>>> unto
>>> itself?  If it were broken into two atomic operations:
>>>
>>> <map:generate type="binary" src="foo.doc"/>
>>> <map:serialize type="binary"/>
>>>
>>> then we could have a <map:view from-position="first"/> using a
>>> content-aware pipeline, and everything would work.
>>
>> Well, why can't the view simply start from a reader?
>>
>>  <map:read src="foo.doc"/> 
> 
> Because a view finishes a partial XML pipeline, meaning it requires a 
> generator to be already present...

That's because of how we define a view now ;-)
If we just had pipelines that handle both binary and XML data, the view 
would finish a partial pipeline, in this case starting from binary.

>>> I have the feeling that handling non-XML content in Cocoon is Just 
>>> Wrong,
>>> and that <map:read> is just a hack.  The fact that it doesn't integrate
>>> with Views is a symptom of this.  In a theoretically pure world, we'd
>>> either make Cocoon an XML-only framework and kill <map:read>, or make
>>> Cocoon a generic data pipelining framework capable of handling and
>>> transforming binary content.
>>
>> Well, it can be done easily by allowing more than one reader and by 
>> allowing readers in the xml pipeline.
>>
>> Some time back I had proposed the following to be possible (and got 
>> touted as the usual FS man)
>>
>>  <map:read src="foo1.doc"/>
>>  <map:read type="stripstuff"/>
>>  <map:read type="otherfilter"/> 
> 
> Mhhh... I guess "stripstuff" and "otherfilter" are actually 
> <map:transform-binary> and not <map:read> as they do have an input. Now 
> how do we "close" the pipeline ? Is there a <map:serialize-binary> ?

Since streams are just streams, they don't need to be adapted like XML, 
so there is really no notion of Generator or Serializer, only of 
filter. So the reader is just a filter: when it sits in the middle of a 
pipeline, it is simply given a stream and has to output a stream. So 
there is no need to open, and no need to close.

>> And also:
>>
>>  <map:read src="foo1.doc"/>
>>  <map:generate src="foo1.doc"/>
>>  <map:serialize src="foo1.doc"/>
>>  <map:read type="zip"/> 
> 
> 
> Wow! What's the result of this ??

Oops, a bit too quick.

<!-- remove encryption or do other stream preprocessing -->
   <map:read type="decrypt" src="foo1.doc"/>
<!-- normal generation but from the previous reader output -->
   <map:generate type="doc2xml"/>
<!-- any transforms -->
<!-- give back html -->
   <map:serialize type="html"/>
<!-- zip that result so that it takes less bandwidth -->
   <map:read type="zip"/>

>> We can already do this BTW by using the Cocoon protocol, but it's 
>> such a hack! 
> 
> Sounds interesting. Can you elaborate on the hack ?

<map:match pattern="mypage.html">
   <map:read src="internal/mypage.html" type="zip"/>
</map:match>

<map:match pattern="internal/mypage.html">
   <!-- generate, transform, serialize... -->
</map:match>
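
Spelled out a little more, the hack might look like the sketch below. It 
assumes a "zip" reader exists (it is hypothetical, as above) and uses the 
cocoon:/ pseudo-protocol so that the reader's source is the output of 
another pipeline in the same sitemap; the file paths and stylesheet name 
are made up:

```xml
<map:match pattern="mypage.html">
  <!-- the reader's source is the serialized output of the pipeline below -->
  <map:read type="zip" src="cocoon:/internal/mypage.html"/>
</map:match>

<map:match pattern="internal/mypage.html">
  <map:generate src="docs/mypage.xml"/>
  <map:transform src="stylesheets/page2html.xsl"/>
  <map:serialize type="html"/>
</map:match>
```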

BTW, you may be interested in my RT about aspected pipeline snippets. 
Basically it would make it possible to insert pipeline components into 
all pipelines according to certain rules.

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------



Re: [RT] Views for readers

Posted by Tony Collen <co...@umn.edu>.
Upayavira wrote:
> On 14 Aug 2003 at 15:34, Bertrand Delacretaz wrote:
> 
> 
>>I find this more understandable (but dunno about implementation):
>>
>><!-- if reader is executed, the rest is not -->
>><map:read src="docs/{1}.doc" unless-view="wordToXml"/>
>><map:generate src="docs/{1}.doc" type="wordToXml"/>
>><map:transform...
> 
> 
> Simplifying further:
>   <map:read src="docs/{1}.doc" view-generator="wordToXml"/>
> 
> Surely that'd do it?

This might be better, because what happens when someone comes along and does this:

<map:read src="docs/{1}.doc" unless-view="wordToXml"/>
<map:generate src="docs/{2}.doc" type="wordToXml"/>
....

Then the same request represents two different "sources", which could be either confusing or very 
useful; I don't fully understand all the implications.

Just tossing my $0.02 in... it's early and I'm tired :)



Tony


Re: [RT] Views for readers

Posted by Upayavira <uv...@upaya.co.uk>.
On 14 Aug 2003 at 15:34, Bertrand Delacretaz wrote:

> I find this more understandable (but dunno about implementation):
> 
> <!-- if reader is executed, the rest is not -->
> <map:read src="docs/{1}.doc" unless-view="wordToXml"/>
> <map:generate src="docs/{1}.doc" type="wordToXml"/>
> <map:transform...

Simplifying further:
  <map:read src="docs/{1}.doc" view-generator="wordToXml"/>

Surely that'd do it?

Regards, Upayavira


Re: [RT] Views for readers

Posted by Stefano Mazzocchi <st...@apache.org>.
On Thursday, Aug 14, 2003, at 19:07 Europe/Rome, Miles Elam wrote:

> Vadim Gritsenko wrote:
>
>> Here is another wild (or not?) thought.
>
>
> Not so wild to me.
>
>> All this discussion comes down to the requirement of generating some 
>> XML out of the content usually served by the reader, if that's 
>> possible (and it is possible for some of the types of the content), 
>> in order to feed this XMLized content into the view. This generated 
>> XML is somewhat "equivalent" to the binary representation for the 
>> purpose of view building. So, I'm coming to the conclusion that some 
>> types of readers can be paired with the generator producing 
>> "equivalent", but XMLized, content. The best place to indicate such 
>> pairing is the time when you declare a reader:
>
>
> <snip idea="interesting"/>
>
> The syntax looks a bit ugly to me, but the idea seems much more sane 
> to me.
>
>> PS: Modifying sitemap syntax to allow reader/generator pairs with 
>> some "unless" attributes looks awful to me.
>
>
> Complete agreement.  One of the reasons for the sitemap (*the* 
> reason?) is for the simple and easy management of a site.  Some recent 
> proposals seem to be pushing in the direction of Apache HTTPd's 
> mod_rewrite;  A lot of flexibility by adding "just one more construct."
>
> From the mod_rewrite page:
>
>    "The great thing about mod_rewrite is it gives you all the
>    configurability and flexibility of Sendmail. The downside to
>    mod_rewrite is that it gives you all the configurability and
>    flexibility of Sendmail."
>
>    -- Brian Behlendorf
>    Apache Group
>
>    "Despite the tons of examples and docs, mod_rewrite is voodoo.
>    Damned cool voodoo, but still voodoo."
>
>    -- Brian Moore
>    bem@news.cmc.net
>
> It'd be a shame if the sitemap became a cousin to mod_rewrite despite 
> the cool voodoo.

I can hardly agree more!

>
> - Miles Elam
>
>
> P.S.  I shudder to think of what will happen to search index creation 
> times when multi-megabyte Word documents and the like are sent down 
> the pipe.  The parsers, however efficient they may turn out to be, 
> will still have to contend with seemingly endless streams of seemingly 
> pointless formatting cruft.  I'm sure we've all seen 10MB files that 
> would be <100K in proper HTML.  Ah well...'tis the cost of 
> progress, I guess.

Cocoon is not about binary content and should *NOT* touch it. Readers 
were implemented as helpers. Multi-views for binary files belong to the 
repository level, not to the publishing level!!!

I haven't read all the remaining email (300 more to go after 5 days 
offline), but I strongly hope you haven't implemented this or I'll scream!!!

--
Stefano.


Re: [RT] Views for readers

Posted by Miles Elam <mi...@pcextremist.com>.
Vadim Gritsenko wrote:

> Here is another wild (or not?) thought. 


Not so wild to me.

> All this discussion comes down to the requirement of generating some 
> XML out of the content usually served by the reader, if that's 
> possible (and it is possible for some of the types of the content), in 
> order to feed this XMLized content into the view. This generated XML 
> is somewhat "equivalent" to the binary representation for the purpose 
> of view building. So, I'm coming to the conclusion that some types of 
> readers can be paired with the generator producing "equivalent", but 
> XMLized, content. The best place to indicate such pairing is the time 
> when you declare a reader: 


<snip idea="interesting"/>

The syntax looks a bit ugly to me, but the idea seems much more sane to me.

> PS: Modifying sitemap syntax to allow reader/generator pairs with some 
> "unless" attributes looks awful to me.


Complete agreement.  One of the reasons for the sitemap (*the* reason?) 
is for the simple and easy management of a site.  Some recent proposals 
seem to be pushing in the direction of Apache HTTPd's mod_rewrite;  A 
lot of flexibility by adding "just one more construct."

 From the mod_rewrite page:

    "The great thing about mod_rewrite is it gives you all the
    configurability and flexibility of Sendmail. The downside to
    mod_rewrite is that it gives you all the configurability and
    flexibility of Sendmail."

    -- Brian Behlendorf
    Apache Group

    "Despite the tons of examples and docs, mod_rewrite is voodoo.
    Damned cool voodoo, but still voodoo."

    -- Brian Moore
    bem@news.cmc.net

It'd be a shame if the sitemap became a cousin to mod_rewrite despite 
the cool voodoo.

- Miles Elam


P.S.  I shudder to think of what will happen to search index creation 
times when multi-megabyte Word documents and the like are sent down the 
pipe.  The parsers, however efficient they may turn out to be, will 
still have to contend with seemingly endless streams of seemingly 
pointless formatting cruft.  I'm sure we've all seen 10MB files that 
would be <100K in proper HTML.  Ah well...'tis the cost of 
progress, I guess.



Re: [RT] Views for readers

Posted by Vadim Gritsenko <va...@verizon.net>.
Sylvain Wallez wrote:

> Vadim Gritsenko wrote:
>
>> Sylvain Wallez wrote:
>>
>>> Vadim Gritsenko wrote:
>>>
>>>> Sylvain Wallez wrote:
>>>>
>>>>> Vadim Gritsenko wrote: 
>>>>
>>>>
>>>>
>>
>> <snip/>
>>
>>>> Here is another wild (or not?) thought.
>>>>
>>>> All this discussion comes down to the requirement of generating 
>>>> some XML out of the content usually served by the reader, if that's 
>>>> possible (and it is possible for some of the types of the content), 
>>>> in order to feed this XMLized content into the view. This generated 
>>>> XML is somewhat "equivalent" to the binary representation for the 
>>>> purpose of view building. So, I'm coming to the conclusion that some 
>>>> types of readers can be paired with the generator producing 
>>>> "equivalent", but XMLized, content. The best place to indicate such 
>>>> pairing is the time when you declare a reader:
>>>>
>>>>  <map:readers default="resource">
>>>>    <map:reader name="resource"
>>>>                src="org.apache.cocoon.reading.ResourceReader"/>
>>>>    <map:reader name="html"
>>>>                src="org.apache.cocoon.reading.ResourceReader">
>>>>      <generator-paired-to-this-reader>html</generator-paired-to-this-reader>
>>>>    </map:reader>
>>>>    <map:reader name="msexcel"
>>>>                src="org.apache.cocoon.reading.ResourceReader">
>>>>      <generator-paired-to-this-reader>poi-excel-generator</generator-paired-to-this-reader>
>>>>    </map:reader>
>>>>    <map:reader name="pdf"
>>>>                src="org.apache.cocoon.reading.ResourceReader">
>>>>      <generator-paired-to-this-reader>pdf-text-extractor-generator</generator-paired-to-this-reader>
>>>>    </map:reader>
>>>>  </map:readers>
>>>
>>>
>>>
>>>
>>>
>>> I'm afraid this won't work :
>>
>>
>>
>>
>> Can you suggest some improvements so it does work? My goal is to have 
>> as little impact on sitemap syntax as possible.
>>
>>
>>> - a generator specific to a given content-type is very unlikely to 
>>> produce the document type expected by the view. We will most often 
>>> need an additional transformation (e.g. the "xword2xdoc.xsl" that 
>>> was in my example)
>>
>>
>>
>>
>> More wild suggestions.
>>
>> 1/ Do something with the views. Say, allow duplicate view names and 
>> make them work as selector:
>>
>>  <map:views>
>>    <!-- works if ("when") reader -->
>>    <map:view from-position="reader" name="content">
>>      <map:transform src="wordml2content.xsl" label="content"/>
>>      <map:serialize type="xml"/>
>>    </map:view>
>>    <!-- works if ("when") label -->
>>    <map:view from-label="content" name="content">
>>      <map:serialize type="xml"/>
>>    </map:view>
>>    <!-- works if no label ("otherwise") -->
>>    <map:view from-position="first" name="content">
>>      <map:serialize type="xml"/>
>>    </map:view>
>>  </map:views> 
>
>
>
> Still the same problem I keep desperately pointing out again and again: 
> how can from-position="reader" use different generators (i.e. parsers) 
> depending on the binary content?


I did not copy reader-to-generator association 
(<generator-paired-to-this-reader/>) declared on top. Get the generator 
from there.


>> 2/ Do something with the readers.
>
...

> This introduces sitemap snippets into a component manager 
> configuration, which is not good at all.


Yep. Not good.


>> 3/ Alternative to 2:
>>
>>  <map:readers default="resource">
>>    <map:reader name="msword"
>>                src="org.apache.cocoon.reading.ResourceReader">
>>      <xmlizer-uri>cocoon://word-2-content/</xmlizer-uri>
>>    </map:reader>
>>  </map:readers>
>>
>>  <map:views>
>>    <map:view from-label="content" name="content">
>>      <map:serialize type="xml"/>
>>    </map:view>
>>  </map:views>
>>
>>  <map:pipelines>
>>    ...
>>    <map:read src="my.doc"/>
>>    ...
>>    <map:match pattern="word-2-content/*">
>>      <map:generate type="msword" src="{1}"/>
>>      <map:transform src="wordml2content.xsl" label="content"/>
>>      <map:serialize type="xml"/>
>>    </map:match>
>>  </map:pipelines>
>
>
>
> Sounds better, but has the problem that it implies that every view 
> should return xml content on "my.doc".


Yep. Unless you define one xmlizer URI per view... Awful!


> Or do we introduce a "label" attribute on <map:read> to define on 
> which particular view the xmlizer-uri should be triggered?


Possible.


>> I would not say that I like any of the suggestions above. The 
>> cleanest way ATM is the usage of map:resource I suggested in other 
>> email (I have yet to see your comment on it). 
>
>
>
> Sorry, I have no particular comment on the use of resources, as it's 
> mainly a refactoring of the action/matcher proposals.


But it solves the problem! And it is the cleanest solution (with minimal 
impact) among all those discussed here.


>>> - views, through their associated labels, can be plugged at any 
>>> point of the pipelines. Defining pair generators restricts views to 
>>> be only from-label="start".
>>>
>>>> PS: Modifying sitemap syntax to allow reader/generator pairs with 
>>>> some "unless" attributes looks awful to me. 
>>>
>>>
>>>
>>> Doesn't seem so awful to me, since the reader should be executed 
>>> "unless" certain conditions are met, which are that the specified 
>>> label(s) correspond to the one at which the requested view should 
>>> start. 
>>
>>
>>
>> This "unless" attribute is nothing other than a shortcut for 
>> <map:match>. Given the choice between verbosity and an obfuscated 
>> result, I'm for verbosity.
>
>
>
> Not exactly: you can currently match on the view name (provided that 
> the environment actually does rely on the "cocoon-view" parameter),


(Special "view" matcher is still possible)


> but you cannot match on the labels. And only labels are currently used 
> in the <map:pipelines> section.


I don't understand this. What is "match on the labels" in this context?


>> PS Keep sitemap syntax clean! Say "No!" to woodo!
>

Should be "voodoo" above


> Funny. That's often me that says "too much magic kills the confidence".


Now it's my turn :)


> Let's stop this discussion for now. I have the feeling we won't reach 
> consensus and will just come to some useless flame war. 


I don't see an elegant solution to the reader/view problem right now. 
And we can always have another flamefest later (are you planning a visit 
to the US? :)

Vadim



Re: [RT] Views for readers

Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Vadim Gritsenko wrote:

> Sylvain Wallez wrote:
>
>> Vadim Gritsenko wrote:
>>
>>> Sylvain Wallez wrote:
>>>
>>>> Vadim Gritsenko wrote: 
>>>
>>>
>
> <snip/>
>
>>> Here is another wild (or not?) thought.
>>>
>>> All this discussion comes down to the requirement of generating some 
>>> XML out of the content usually served by the reader, if that's 
>>> possible (and it is possible for some of the types of the content), 
>>> in order to feed this XMLized content into the view. This generated 
>>> XML is somewhat "equivalent" to the binary representation for the 
>>> purpose of view building. So, I'm coming to the conclusion that some 
>>> types of readers can be paired with the generator producing 
>>> "equivalent", but XMLized, content. The best place to indicate such 
>>> pairing is the time when you declare a reader:
>>>
>>>  <map:readers default="resource">
>>>    <map:reader name="resource"
>>>                src="org.apache.cocoon.reading.ResourceReader"/>
>>>    <map:reader name="html"
>>>                src="org.apache.cocoon.reading.ResourceReader">
>>>      <generator-paired-to-this-reader>html</generator-paired-to-this-reader>
>>>    </map:reader>
>>>    <map:reader name="msexcel"
>>>                src="org.apache.cocoon.reading.ResourceReader">
>>>      <generator-paired-to-this-reader>poi-excel-generator</generator-paired-to-this-reader>
>>>    </map:reader>
>>>    <map:reader name="pdf"
>>>                src="org.apache.cocoon.reading.ResourceReader">
>>>      <generator-paired-to-this-reader>pdf-text-extractor-generator</generator-paired-to-this-reader>
>>>    </map:reader>
>>>  </map:readers>
>>
>>
>>
>>
>> I'm afraid this won't work :
>
>
>
> Can you suggest some improvements so it does work? My goal is to have 
> as little impact on sitemap syntax as possible.
>
>
>> - a generator specific to a given content-type is very unlikely to 
>> produce the document type expected by the view. We will most often 
>> need an additional transformation (e.g. the "xword2xdoc.xsl" that was 
>> in my example)
>
>
>
> More wild suggestions.
>
> 1/ Do something with the views. Say, allow duplicate view names and 
> make them work as selector:
>
>  <map:views>
>    <!-- works if ("when") reader -->
>    <map:view from-position="reader" name="content">
>      <map:transform src="wordml2content.xsl" label="content"/>
>      <map:serialize type="xml"/>
>    </map:view>
>    <!-- works if ("when") label -->
>    <map:view from-label="content" name="content">
>      <map:serialize type="xml"/>
>    </map:view>
>    <!-- works if no label ("otherwise") -->
>    <map:view from-position="first" name="content">
>      <map:serialize type="xml"/>
>    </map:view>
>  </map:views> 


Still the same problem I keep desperately pointing out again and again: 
how can from-position="reader" use different generators (i.e. parsers) 
depending on the binary content?

> 2/ Do something with the readers.
>
>  <map:readers default="resource">
>    <map:reader name="msword"
>                src="org.apache.cocoon.reading.ResourceReader">
>      <map:generate type="msword"/>
>      <map:transform src="wordml2content.xsl"/>
>    </map:reader>
>  </map:readers>


This introduces sitemap snippets into a component manager configuration, 
which is not good at all.

> 3/ Alternative to 2:
>
>  <map:readers default="resource">
>    <map:reader name="msword"
>                src="org.apache.cocoon.reading.ResourceReader">
>      <xmlizer-uri>cocoon://word-2-content/</xmlizer-uri>
>    </map:reader>
>  </map:readers>
>
>  <map:views>
>    <map:view from-label="content" name="content">
>      <map:serialize type="xml"/>
>    </map:view>
>  </map:views>
>
>  <map:pipelines>
>    ...
>    <map:read src="my.doc"/>
>    ...
>    <map:match pattern="word-2-content/*">
>      <map:generate type="msword" src="{1}"/>
>      <map:transform src="wordml2content.xsl" label="content"/>
>      <map:serialize type="xml"/>
>    </map:match>
>  </map:pipelines>


Sounds better, but has the problem that it implies that every view 
should return XML content on "my.doc". Or do we introduce a "label" 
attribute on <map:read> to define on which particular view the 
xmlizer-uri should be triggered?

> I would not say that I like any of the suggestions above. The cleanest 
> way ATM is the usage of map:resource I suggested in other email (I have 
> yet to see your comment on it). 


Sorry, I have no particular comment on the use of resources, as it's 
mainly a refactoring of the action/matcher proposals.

>> - views, through their associated labels, can be plugged at any point 
>> of the pipelines. Defining pair generators restricts views to be only 
>> from-label="start".
>>
>>> PS: Modifying sitemap syntax to allow reader/generator pairs with 
>>> some "unless" attributes looks awful to me. 
>>
>>
>> Doesn't seem so awful to me, since the reader should be executed 
>> "unless" certain conditions are met, which are that the specified 
>> label(s) correspond to the one at which the requested view should start. 
>
>
> This "unless" attribute is nothing other than a shortcut for <map:match>. 
> Given the choice between verbosity and an obfuscated result, I'm for 
> verbosity.


Not exactly: you can currently match on the view name (provided that the 
environment actually does rely on the "cocoon-view" parameter), but you 
cannot match on the labels. And only labels are currently used in the 
<map:pipelines> section.

> PS Keep sitemap syntax clean! Say "No!" to woodo! 


Funny. That's often me that says "too much magic kills the confidence".

Let's stop this discussion for now. I have the feeling we won't reach 
consensus and will just come to some useless flame war.

Sylvain




Re: [RT] Views for readers

Posted by Stefano Mazzocchi <st...@apache.org>.
On Thursday, Aug 14, 2003, at 21:10 Europe/Rome, Vadim Gritsenko wrote:

> PS Keep sitemap syntax clean! Say "No!" to woodo!

Amen!

--
Stefano.


Re: [RT] Views for readers

Posted by Vadim Gritsenko <va...@verizon.net>.
Sylvain Wallez wrote:

> Vadim Gritsenko wrote:
>
>> Sylvain Wallez wrote:
>>
>>> Vadim Gritsenko wrote: 
>>

<snip/>

>> Here is another wild (or not?) thought.
>>
>> All this discussion comes down to the requirement of generating some 
>> XML out of the content usually served by the reader, if that's 
>> possible (and it is possible for some of the types of the content), 
>> in order to feed this XMLized content into the view. This generated 
>> XML is somewhat "equivalent" to the binary representation for the 
>> purpose of view building. So, I'm coming to the conclusion that some 
>> types of readers can be paired with the generator producing 
>> "equivalent", but XMLized, content. The best place to indicate such 
>> pairing is the time when you declare a reader:
>>
>>  <map:readers default="resource">
>>    <map:reader name="resource"
>>                src="org.apache.cocoon.reading.ResourceReader"/>
>>    <map:reader name="html"
>>                src="org.apache.cocoon.reading.ResourceReader">
>>      <generator-paired-to-this-reader>html</generator-paired-to-this-reader>
>>    </map:reader>
>>    <map:reader name="msexcel"
>>                src="org.apache.cocoon.reading.ResourceReader">
>>      <generator-paired-to-this-reader>poi-excel-generator</generator-paired-to-this-reader>
>>    </map:reader>
>>    <map:reader name="pdf"
>>                src="org.apache.cocoon.reading.ResourceReader">
>>      <generator-paired-to-this-reader>pdf-text-extractor-generator</generator-paired-to-this-reader>
>>    </map:reader>
>>  </map:readers>
>
>
>
> I'm afraid this won't work :


Can you suggest some improvements so it does work? My goal is to have as 
little impact on sitemap syntax as possible.


> - a generator specific to a given content-type is very unlikely to 
> produce the document type expected by the view. We will most often 
> need an additional transformation (e.g. the "xword2xdoc.xsl" that was 
> in my example)


More wild suggestions.

1/ Do something with the views. Say, allow duplicate view names and make 
them work as selector:

  <map:views>
    <!-- works if ("when") reader -->
    <map:view from-position="reader" name="content">
      <map:transform src="wordml2content.xsl" label="content"/>
      <map:serialize type="xml"/>
    </map:view>
    <!-- works if ("when") label -->
    <map:view from-label="content" name="content">
      <map:serialize type="xml"/>
    </map:view>
    <!-- works if no label ("otherwise") -->
    <map:view from-position="first" name="content">
      <map:serialize type="xml"/>
    </map:view>
  </map:views>

2/ Do something with the readers.

  <map:readers default="resource">
    <map:reader name="msword"
                src="org.apache.cocoon.reading.ResourceReader">
      <map:generate type="msword"/>
      <map:transform src="wordml2content.xsl"/>
    </map:reader>
  </map:readers>

3/ Alternative to 2:

  <map:readers default="resource">
    <map:reader name="msword"
                src="org.apache.cocoon.reading.ResourceReader">
      <xmlizer-uri>cocoon://word-2-content/</xmlizer-uri>
    </map:reader>
  </map:readers>

  <map:views>
    <map:view from-label="content" name="content">
      <map:serialize type="xml"/>
    </map:view>
  </map:views>

  <map:pipelines>
    ...
    <map:read src="my.doc"/>
    ...
    <map:match pattern="word-2-content/*">
      <map:generate type="msword" src="{1}"/>
      <map:transform src="wordml2content.xsl" label="content"/>
      <map:serialize type="xml"/>
    </map:match>
  </map:pipelines>

I would not say that I like any of the suggestions above. The cleanest 
way ATM is the usage of map:resource I suggested in other email (I have 
yet to see your comment on it).


> - views, through their associated labels, can be plugged at any point 
> of the pipelines. Defining pair generators restricts views to be only 
> from-label="start".
>
>> PS: Modifying sitemap syntax to allow reader/generator pairs with 
>> some "unless" attributes looks awful to me. 
>
>
>
> Doesn't seem so awful to me, since the reader should be executed 
> "unless" certain conditions are met, which are that the specified 
> label(s) correspond to the one at which the requested view should start. 


This "unless" attribute is nothing other than a shortcut for <map:match>. 
Given the choice between verbosity and an obfuscated result, I'm for verbosity.


PS Keep sitemap syntax clean! Say "No!" to woodo!

Vadim



Re: [RT] Views for readers

Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Vadim Gritsenko wrote:

> Sylvain Wallez wrote:
>
>> Vadim Gritsenko wrote:
>>
>>> Sylvain Wallez wrote:
>>
> <snip/>
>
>>>> Any other proposal or opinion on this subject before we start a vote ? 
>>>
>>>
>>> Can't you just enable generators in map:view in case when view 
>>> starts with reader? 
>>
>>
>> No, since views "capture" the (XML) output at certain points of the 
>> pipeline to provide a different formatting.
>
>
> In case of the reader, there is no (XML) output in the pipeline. It's 
> special case, unless you want to introduce binary pipelines (and I 
> hope you don't want to), so it would require special handling.
>
>> E.g. the processing for the "indexable-content" view
>
>
> Sidenote: It's called "content" -- the view which you use to build a 
> site search index. 


Picky sidenote : this is configurable using the <content-view-query> 
config of the <lucene-xml-indexer> component ;-)

>> is the same for all URIs, be them XML pipelines or a single reader.
>>
>> So there's no way other than having a generator _before_ jumping to 
>> the view, feeding that view with the kind of XML content it expects.
>
>
> Here is another wild (or not?) thought.
>
> All this discussion comes down to the requirement of generating some 
> XML out of the content usually served by the reader, if that's 
> possible (and it is possible for some of the types of the content), in 
> order to feed this XMLized content into the view. This generated XML 
> is somewhat "equivalent" to the binary representation for the purpose 
> of view building. So, I'm coming to the conclusion that some types of 
> readers can be paired with the generator producing "equivalent", but 
> XMLized, content. The best place to indicate such pairing is the time 
> when you declare a reader:
>
>  <map:readers default="resource">
>    <map:reader name="resource" 
> src="org.apache.cocoon.reading.ResourceReader"/>
>    <map:reader name="html" 
> src="org.apache.cocoon.reading.ResourceReader">
>      
> <generator-paired-to-this-reader>html</generator-paired-to-this-reader>
>    </map:reader>
>    <map:reader name="msexcel" 
> src="org.apache.cocoon.reading.ResourceReader">
>      
> <generator-paired-to-this-reader>poi-excel-generator</generator-paired-to-this-reader> 
>
>    </map:reader>
>    <map:reader name="pdf" src="org.apache.cocoon.reading.ResourceReader">
>      
> <generator-paired-to-this-reader>pdf-text-extractor-generator</generator-paired-to-this-reader> 
>
>    </map:reader>
>  </map:readers> 


I'm afraid this won't work :

- a generator specific to a given content-type is very unlikely to 
produce the document type expected by the view. We will most often need 
an additional transformation (e.g. the "xword2xdoc.xsl" that was in my 
example)

- views, through their associated labels, can be plugged at any point of 
the pipelines. Defining pair generators restricts views to be only 
from-label="start".

> PS: Modifying sitemap syntax to allow reader/generator pairs with some 
> "unless" attributes looks awful to me. 


Doesn't seem so awful to me, since the reader should be executed 
"unless" certain conditions are met, which are that the specified 
label(s) correspond to the one at which the requested view should start.

Sylvain

-- 
Sylvain Wallez                                  Anyware Technologies
http://www.apache.org/~sylvain           http://www.anyware-tech.com
{ XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects }
Orixo, the opensource XML business alliance  -  http://www.orixo.com



Re: [RT] Views for readers

Posted by Vadim Gritsenko <va...@verizon.net>.
Sylvain Wallez wrote:

> Vadim Gritsenko wrote:
>
>> Sylvain Wallez wrote:
>

<snip/>

>>> Any other proposal or opinion on this subject before we start a vote ? 
>>
>>
>>
>> Can't you just enable generators in map:view in case when view starts 
>> with reader? 
>
>
>
> No, since views "capture" the (XML) output at certain points of the 
> pipeline to provide a different formatting.


In case of the reader, there is no (XML) output in the pipeline. It's 
special case, unless you want to introduce binary pipelines (and I hope 
you don't want to), so it would require special handling.


> E.g. the processing for the "indexable-content" view


Sidenote: It's called "content" -- the view which you use to build a 
site search index.


> is the same for all URIs, be them XML pipelines or a single reader.
>
> So there's no way other than having a generator _before_ jumping to 
> the view, feeding that view with the kind of XML content it expects.


Here is another wild (or not?) thought.

All this discussion comes down to the requirement of generating some XML 
out of the content usually served by the reader, if that's possible (and 
it is possible for some of the types of the content), in order to feed 
this XMLized content into the view. This generated XML is somewhat 
"equivalent" to the binary representation for the purpose of view 
building. So, I'm coming to the conclusion that some types of readers can 
be paired with the generator producing "equivalent", but XMLized, 
content. The best place to indicate such pairing is the time when you 
declare a reader:

  <map:readers default="resource">
    <map:reader name="resource" 
src="org.apache.cocoon.reading.ResourceReader"/>
    <map:reader name="html" src="org.apache.cocoon.reading.ResourceReader">
      
<generator-paired-to-this-reader>html</generator-paired-to-this-reader>
    </map:reader>
    <map:reader name="msexcel" 
src="org.apache.cocoon.reading.ResourceReader">
      
<generator-paired-to-this-reader>poi-excel-generator</generator-paired-to-this-reader>
    </map:reader>
    <map:reader name="pdf" src="org.apache.cocoon.reading.ResourceReader">
      
<generator-paired-to-this-reader>pdf-text-extractor-generator</generator-paired-to-this-reader>
    </map:reader>
  </map:readers>


PS: Modifying sitemap syntax to allow reader/generator pairs with some 
"unless" attributes looks awful to me.

Vadim



Re: [RT] Views for readers

Posted by Vadim Gritsenko <va...@verizon.net>.
Miles Elam wrote:

> Sylvain Wallez wrote:
>
>> Go back to first post of this thread, where (last paragraph) I 
>> proposed something similar. The whole discussion is about how we 
>> could have a syntax which doesn't introduce such verbosity in the 
>> sitemap. 
>
>
>
> Verbosity is not necessarily a bad thing.  If it were, would any of us 
> be using XML?  ;-) 


Good point.

<snip/>


>> Let's consider the MIDI example. Suppose we have a large collection 
>> of karaoke files (MIDI supports embedded text that can be played on 
>> screen while playing the music), and we want to index the text of 
>> these songs for easy retrieval (along with some other meta-data).
>>
>> Here's a sitemap example, using the current syntax 
>

<snip/>

>> And the proposed shorter one :
>>
>> <map:match pattern="*.mid">
>>  <map:read src="{1}.mid" unless-label="content"/>
>>  <map:generate type="midi" src="{1}.mid"/>
>>  <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
>>  <!-- should never come here -->
>>  <map:serialize type="xml"/>
>> </map:match>
>

Two lines. What does it give except obfuscation? Given the point above 
("Verbosity is not necessarily a bad thing" (c) Miles Elam), a more 
readable and already supported syntax is:

 <map:resource name="midi">
  <map:match type="view" pattern="content">
    <map:generate type="midi" src="{1}.mid"/>
    <map:transform src="xmidi2xdoc.xsl" label="content"/>
    <map:serialize type="xml"/>
  </map:match>
  <map:read mime-type="whatever/midi" src="{1}.mid"/>
 </map:resource>

 <map:match pattern="*.mid">
  <map:call resource="midi"/>
 </map:match>

Moreover! Resource "midi" is reusable:

 <map:match pattern="another/*.mid">
  <map:call resource="midi"/>
 </map:match>

, while the example above is not.



> This breaks current convention that either a reader or a 
> generator/transformer/serializer can act in a pipeline.


And, given this resource example, it does not break any sitemap 
semantics which we have today.



> In the first example, if "content" isn't specified, the action returns 
> null and the reader is invoked;  As far as the pipeline logic is 
> concerned, there is only the reader.  Serializers are already known as 
> universal exit points.  To use the second, the convention must be 
> broken and readers must become universal exit points.
>
> In other words,
>
> <map:match pattern="*.mid">
> <map:read src="{1}.mid"/> <!-- without the unless-label -->
> <map:generate type="midi" src="{1}.mid"/>
> <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
> <!-- should never come here -->
> <map:serialize type="xml"/>
> </map:match>
>
> must become valid for consistency.  A reader becomes an exit point and 
> the rest of a pipeline is, by default, ignored.  Is this an intended 
> consequence?


I feel strongly "-1" on this one.

Vadim



Re: [RT] Views for readers

Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Miles Elam wrote:

>
>
>
> Not according to the code, they're not.  Check out 
> AbstractProcessingPipeline.java.  There are method bodies like:
>
>    public void setGenerator (String role, String source, Parameters 
> param, Parameters hintParam)
>    throws ProcessingException {
>        if (this.generator != null) {
>            throw new ProcessingException ("Generator already set. You 
> can only select one Generator (" + role + ")");
>        }
>        if (this.reader != null) {
>            throw new ProcessingException ("Reader already set. You 
> cannot use a reader and a generator for one pipeline.");
>        }
>    ...
>
> and
>
>    public void setReader (String role, String source, Parameters 
> param, String mimeType)
>    throws ProcessingException {
>        if (this.reader != null) {
>            throw new ProcessingException ("Reader already set. You can 
> only select one Reader (" + role + ")");
>        }
>        if (this.generator != null) {
>            throw new ProcessingException ("Generator already set. You 
> cannot use a reader and a generator for one pipeline.");
>        }
>    ...
>
>
> Either the policy was in effect when this file (and its subclasses) 
> were made or someone put constraining statements in that serve no 
> purpose.  The file was last modified on August 6th of this year.  If 
> the policy has changed, no one told the code.


This has been there for a very long time, and it has nothing to do with the 
fact that readers and serializers end the execution of the sitemap : 
check ReadNode and SerializeNode in o.a.c.components.treeprocessor.sitemap.

Sylvain

-- 
Sylvain Wallez                                  Anyware Technologies
http://www.apache.org/~sylvain           http://www.anyware-tech.com
{ XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects }
Orixo, the opensource XML business alliance  -  http://www.orixo.com



Re: [RT] Views for readers

Posted by Miles Elam <mi...@pcextremist.com>.
Sylvain Wallez wrote:

> Miles Elam wrote:
>
>> In other words, the pipeline is full of side effects and dependent 
>> upon things happening behind the curtain (to use a Wizard of Oz 
>> reference).  You'd be right in that it adds to the confusion.  I 
>> agree with Vadim.  This is obfuscation in exchange for two lines of 
>> verboseness.
>
>
> Just some additional precisions, "mon frère" ! 


I hope it wasn't taken the wrong way.  I did not intend any offense.

> Yes, the pipeline is full of side effects, which can break pipelines 
> at any point and continue somewhere else without this being explicitly 
> visible in the pipeline construction statements.
>
> These side effects are called "views", and the way to define views is 
> through labels. 


Don't get me wrong.  I see clearly the reason why views exist.  I see 
clearly why reader views are wanted.  When working with XML data -- not 
just text, but structured text -- getting at that data before it is 
processed into a presentation format (such as viewing source, getting a 
true content view, etc.) can prove invaluable.

> And even worse : labels can be placed on component definitions, 
> meaning a clean pipeline with no label attribute at all is full of 
> these side effects.
>
> So what you call obfuscation has been there *for years*. And 
> everybody's happy with it. 


When grabbing from the presentation format as a source, you are 
comparing apples and oranges.  Not only are there innumerable binary 
formats out there being squeezed into a few reader implementations, but 
they are not all desirable data.  While you may want the data from a PDF 
file, you may not bother with a PNG image because it may index "Created 
with The Gimp" over and over.

Since putting in all binary format-to-generator mapping info seems out 
of the question, all of the pipeline path must be specified in the 
matcher -- hence the discussion surrounding readers and generators in 
the same matcher.  If everything is specified in the same matcher and 
not truly orthogonal, as is the case for views currently, why add the 
extra syntax for what amounts to a non-orthogonal if-else clause?

if (!content-view)
    read
else
    generate
    transform
    serialize

as opposed to

generate
   +---------- view-short-circuit! --+-> transform-x
transform-1                          +-> serialize
transform-2
serialize


There is a discontinuity there that makes me uncomfortable.  This is not 
an overt attachment to symmetry.  This is seeing the same tool applied 
to two (in my opinion) very different tasks.  I am not a committer and 
can't vote.  But these are my thoughts on the matter.  Take with as many 
grains of salt as are necessary.

- Miles Elam



Re: [RT] Views for readers

Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Miles Elam wrote:

> In other words, the pipeline is full of side effects and dependent 
> upon things happening behind the curtain (to use a Wizard of Oz 
> reference).  You'd be right in that it adds to the confusion.  I agree 
> with Vadim.  This is obfuscation in exchange for two lines of 
> verboseness.


Just some additional precisions, "mon frère" !

Yes, the pipeline is full of side effects, which can break pipelines at 
any point and continue somewhere else without this being explicitly 
visible in the pipeline construction statements.

These side effects are called "views", and the way to define views is 
through labels.

And even worse : labels can be placed on component definitions, meaning 
a clean pipeline with no label attribute at all is full of these side 
effects.

So what you call obfuscation has been there *for years*. And everybody's 
happy with it.

Sylvain

-- 
Sylvain Wallez                                  Anyware Technologies
http://www.apache.org/~sylvain           http://www.anyware-tech.com
{ XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects }
Orixo, the opensource XML business alliance  -  http://www.orixo.com



Re: [RT] Views for readers

Posted by Miles Elam <mi...@pcextremist.com>.
Sylvain Wallez wrote:

>> The functionality for all readers would obviously be the same: move 
>> these bytes from here to there.  But yes, the codified mapping I 
>> think is important.
>
>
> Please read carefully : I wrote *generators* !! This isn't about 
> moving bytes, but about producing an XML document. 


Au contraire mon frère, this is implemented with generators but it is 
about pulling searchable info out of arbitrary binary data.  The first 
step to that goal is to standardize it -- therefore generators are 
added.  The issue is about *readers* and the custom formats they 
encompass not being indexable.

>> You're mixing the <map:act> with a </map:match>, but I get the idea. 
>
>
> Picky guy, eh ? 


You know it.  :)

> Readers already are universal exit points : once you encounter a 
> reader, sitemap processing is terminated. <map:read> and 
> <map:serialize> are like a "return" statement in Java. 


Not according to the code, they're not.  Check out 
AbstractProcessingPipeline.java.  There are method bodies like:

    public void setGenerator (String role, String source, Parameters 
param, Parameters hintParam)
    throws ProcessingException {
        if (this.generator != null) {
            throw new ProcessingException ("Generator already set. You 
can only select one Generator (" + role + ")");
        }
        if (this.reader != null) {
            throw new ProcessingException ("Reader already set. You 
cannot use a reader and a generator for one pipeline.");
        }
    ...

and

    public void setReader (String role, String source, Parameters param, 
String mimeType)
    throws ProcessingException {
        if (this.reader != null) {
            throw new ProcessingException ("Reader already set. You can 
only select one Reader (" + role + ")");
        }
        if (this.generator != null) {
            throw new ProcessingException ("Generator already set. You 
cannot use a reader and a generator for one pipeline.");
        }
    ...


Either the policy was in effect when this file (and its subclasses) were 
made or someone put constraining statements in that serve no purpose.  
The file was last modified on August 6th of this year.  If the policy 
has changed, no one told the code.
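
The mutual exclusion those two methods enforce can be sketched in a few 
lines of Java. This is only a toy model under stated assumptions: 
ToyPipeline and its String-only setters are hypothetical and are not 
Cocoon's AbstractProcessingPipeline, and IllegalStateException stands in 
for ProcessingException.

```java
// Toy model only: ToyPipeline is hypothetical, not Cocoon's AbstractProcessingPipeline.
final class ToyPipeline {
    private String generator; // role of the generator, null until set
    private String reader;    // role of the reader, null until set

    void setGenerator(String role) {
        if (generator != null) {
            throw new IllegalStateException("Generator already set (" + role + ")");
        }
        if (reader != null) {
            throw new IllegalStateException("Reader already set, cannot add generator " + role);
        }
        generator = role;
    }

    void setReader(String role) {
        if (reader != null) {
            throw new IllegalStateException("Reader already set (" + role + ")");
        }
        if (generator != null) {
            throw new IllegalStateException("Generator already set, cannot add reader " + role);
        }
        reader = role;
    }

    public static void main(String[] args) {
        ToyPipeline pipeline = new ToyPipeline();
        pipeline.setReader("resource");
        try {
            pipeline.setGenerator("midi"); // second entry point on the same pipeline
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In this sketch the check only fires if both setters are actually called 
on the same pipeline object; it says nothing about whether sitemap 
processing ever reaches the second statement.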

> No consequence : this is how the sitemap works today, and the above is 
> valid, even if we can consider that the sitemap engine should be more 
> strict and signal that there's some unreachable code. 


I can't speak to validity, but this is NOT how it works today.

> To add more to the confusion, in both your and my example, we can even 
> avoid writing the <map:serialize> statement. Since some additional 
> filtering occurs beforehand (either through the action or through 
> reader labels), this statement is never reached and is useless ! 


In other words, the pipeline is full of side effects and dependent upon 
things happening behind the curtain (to use a Wizard of Oz reference).  
You'd be right in that it adds to the confusion.  I agree with Vadim.  
This is obfuscation in exchange for two lines of verboseness.

- Miles Elam



Re: [RT] Views for readers

Posted by Vadim Gritsenko <va...@verizon.net>.
Sylvain Wallez wrote:

<snip/>

>> In other words,
>>
>> <map:match pattern="*.mid">
>> <map:read src="{1}.mid"/> <!-- without the unless-label -->
>> <map:generate type="midi" src="{1}.mid"/>
>> <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
>> <!-- should never come here -->
>> <map:serialize type="xml"/>
>> </map:match>
>>
>> must become valid for consistency.  A reader becomes an exit point 
>> and the rest of a pipeline is, by default, ignored.  Is this an 
>> intended consequence?
>
>
>
> No consequence : this is how the sitemap works today, and the above is 
> valid,


No, that's not valid today. And if the current sitemap implementation does 
not pass the conformance test, that does not indicate that invalid 
syntax has become valid. It just indicates that the current sitemap 
implementation is not conformant.

PS Absence of an official conformance test suite does not make the point 
above invalid. Here is an attempt at the test:
  
http://cvs.apache.org/viewcvs.cgi/cocoon-2.0/src/webapp/mount/lint/sitemap.xmap?rev=1.1&content-type=text/vnd.viewcvs-markup

Vadim



Re: [RT] Views for readers

Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Miles Elam wrote:

> Sylvain Wallez wrote:
>
>> Go back to first post of this thread, where (last paragraph) I 
>> proposed something similar. The whole discussion is about how we 
>> could have a syntax which doesn't introduce such verbosity in the 
>> sitemap. 
>
>
> Verbosity is not necessarily a bad thing.  If it were, would any of us 
> be using XML?  ;-) 


Good point. However, the only verbosity currently added by views is the 
"label" attribute. This proposal is about achieving the same low 
verbosity for views with binary content.

>> As I explained in several replies, there's no equivalence between a 
>> reader and generator able to parse a given binary format. There needs 
>> to be some kind of adaptation/extraction before feeding the view. 
>
>
> Yup.
>
>> And what you describe above as "a PDF reader, a Word reader, a 
>> Postscript reader, etc." are IMO nothing more than _generators_, just 
>> like the SWF and MIDI generators we already have. 
>
>
> The functionality for all readers would obviously be the same: move 
> these bytes from here to there.  But yes, the codified mapping I think 
> is important.


Please read carefully : I wrote *generators* !! This isn't about moving 
bytes, but about producing an XML document.

>> Let's consider the MIDI example. Suppose we have a large collection 
>> of karaoke files (MIDI supports embedded text that can be played on 
>> screen while playing the music), and we want to index the text of 
>> these songs for easy retrieval (along with some other meta-data).
>>
>> Here's a sitemap example, using the current syntax
>> <map:match pattern="*.mid"/>
>>  <map:act type="catch-view" src="content">
>>    <map:generate type="midi" src="{1}.mid"/>
>>    <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
>>    <!-- should never come here -->
>>    <map:serialize type="xml"/>
>>  </map:match>
>>  <map:read src="{1}.mid"/>
>> </map:match> 
>
>
>
> You're mixing the <map:act> with a </map:match>, but I get the idea. 


Picky guy, eh ?

>> (the "content" view starts at the "content-label" label to clearly 
>> distinguish the two notions).
>>
>> And the proposed shorter one :
>>
>> <map:match pattern="*.mid">
>>  <map:read src="{1}.mid" unless-label="content"/>
>>  <map:generate type="midi" src="{1}.mid"/>
>>  <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
>>  <!-- should never come here -->
>>  <map:serialize type="xml"/>
>> </map:match> 
>
>
>
> This breaks current convention that either a reader or a 
> generator/transformer/serializer can act in a pipeline.  In the first 
> example, if "content" isn't specified, the action returns null and the 
> reader is invoked;  As far as the pipeline logic is concerned, there 
> is only the reader.  Serializers are already known as universal exit 
> points.  To use the second, the convention must be broken and readers 
> must become universal exit points. 


Readers already are universal exit points : once you encounter a reader, 
sitemap processing is terminated. <map:read> and <map:serialize> are 
like a "return" statement in Java.
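
The "return statement" analogy can be sketched in Java as follows. This 
is a toy model, not Cocoon's treeprocessor: ToySitemap, Node and step are 
hypothetical names, and a statement is assumed to return true when it has 
produced the response.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: hypothetical names, not Cocoon's treeprocessor classes.
final class ToySitemap {
    /** One sitemap statement; returns true when it has produced the response. */
    interface Node {
        boolean invoke(List<String> trace);
    }

    /** Builds a statement that records its name and is terminal or not. */
    static Node step(String name, boolean terminal) {
        return trace -> {
            trace.add(name);
            return terminal;
        };
    }

    /** Runs statements in order and stops at the first terminal one. */
    static List<String> run(List<Node> nodes) {
        List<String> trace = new ArrayList<>();
        for (Node node : nodes) {
            if (node.invoke(trace)) {
                break; // behaves like "return": later statements are never reached
            }
        }
        return trace;
    }

    public static void main(String[] args) {
        List<Node> nodes = Arrays.asList(
                step("read", true),
                step("generate", false),
                step("transform", false),
                step("serialize", true));
        System.out.println(run(nodes)); // prints [read]
    }
}
```

With the reader first, the generate/transform/serialize statements are 
unreachable in this model, which is the point under discussion.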

> In other words,
>
> <map:match pattern="*.mid">
> <map:read src="{1}.mid"/> <!-- without the unless-label -->
> <map:generate type="midi" src="{1}.mid"/>
> <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
> <!-- should never come here -->
> <map:serialize type="xml"/>
> </map:match>
>
> must become valid for consistency.  A reader becomes an exit point and 
> the rest of a pipeline is, by default, ignored.  Is this an intended 
> consequence?


No consequence : this is how the sitemap works today, and the above is 
valid, even if we can consider that the sitemap engine should be more 
strict and signal that there's some unreachable code.

To add more to the confusion, in both your and my example, we can even 
avoid writing the <map:serialize> statement. Since some additional 
filtering occurs beforehand (either through the action or through reader 
labels), this statement is never reached and is useless !

Sylvain

-- 
Sylvain Wallez                                  Anyware Technologies
http://www.apache.org/~sylvain           http://www.anyware-tech.com
{ XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects }
Orixo, the opensource XML business alliance  -  http://www.orixo.com



Re: [RT] Views for readers

Posted by Miles Elam <mi...@pcextremist.com>.
Sylvain Wallez wrote:

> Go back to first post of this thread, where (last paragraph) I 
> proposed something similar. The whole discussion is about how we could 
> have a syntax which doesn't introduce such verbosity in the sitemap. 


Verbosity is not necessarily a bad thing.  If it were, would any of us 
be using XML?  ;-)

> As I explained in several replies, there's no equivalence between a 
> reader and generator able to parse a given binary format. There needs 
> to be some kind of adaptation/extraction before feeding the view. 


Yup.

> And what you describe above as "a PDF reader, a Word reader, a 
> Postscript reader, etc." are IMO nothing more than _generators_, just 
> like the SWF and MIDI generators we already have. 


The functionality for all readers would obviously be the same: move 
these bytes from here to there.  But yes, the codified mapping I think 
is important.

> Let's consider the MIDI example. Suppose we have a large collection of 
> karaoke files (MIDI supports embedded text that can be played on 
> screen while playing the music), and we want to index the text of 
> these songs for easy retrieval (along with some other meta-data).
>
> Here's a sitemap example, using the current syntax
> <map:match pattern="*.mid"/>
>  <map:act type="catch-view" src="content">
>    <map:generate type="midi" src="{1}.mid"/>
>    <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
>    <!-- should never come here -->
>    <map:serialize type="xml"/>
>  </map:match>
>  <map:read src="{1}.mid"/>
> </map:match> 


You're mixing the <map:act> with a </map:match>, but I get the idea.

> (the "content" view starts at the "content-label" label to clearly 
> distinguish the two notions).
>
> And the proposed shorter one :
>
> <map:match pattern="*.mid">
>  <map:read src="{1}.mid" unless-label="content"/>
>  <map:generate type="midi" src="{1}.mid"/>
>  <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
>  <!-- should never come here -->
>  <map:serialize type="xml"/>
> </map:match> 


This breaks current convention that either a reader or a 
generator/transformer/serializer can act in a pipeline.  In the first 
example, if "content" isn't specified, the action returns null and the 
reader is invoked. As far as the pipeline logic is concerned, there is 
only the reader.  Serializers are already known as universal exit 
points.  To use the second, the convention must be broken and readers 
must become universal exit points.

In other words,

<map:match pattern="*.mid">
 <map:read src="{1}.mid"/> <!-- without the unless-label -->
 <map:generate type="midi" src="{1}.mid"/>
 <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
 <!-- should never come here -->
 <map:serialize type="xml"/>
</map:match>

must become valid for consistency.  A reader becomes an exit point and 
the rest of a pipeline is, by default, ignored.  Is this an intended 
consequence?

- Miles Elam



Re: [RT] Views for readers

Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Miles Elam wrote:

> Ummm...  Quick question:  What are the use cases for this that are not 
> handled by existing methods?  I mean, couldn't this be handled with an 
> (as-yet unwritten) action?
>
> <map:match pattern="*.doc">
>  <map:act type="catch-view">
>    <map:parameter name="view-name" value="content"/>
>    <map:generate type="word2xml" src="{../1}.doc"/>
>    <!-- complete the pipeline -->
>  </map:act>
>  <map:read src="{1}.doc"/>
> </map:match>


Go back to first post of this thread, where (last paragraph) I proposed 
something similar. The whole discussion is about how we could have a 
syntax which doesn't introduce such verbosity in the sitemap.

> Jeff mentioned getting metainformation from binary data for searching, 
> but surely there are so many different types of binary data, a 
> universal view seems rather heavy-handed.  It works for search queries 
> (barely, in my opinion).  For content manipulation clients (like 
> WebDAV), these clients can't pass the query string trigger for views.  
> This seems to me to be a one-trick pony.  To make views available for 
> readers, it seems as though specificity is lost.
>
> The point of XML was specifically structured content, yes?  Any 
> conformant parser should be able to read any conformant file.  Binary 
> content has no such constraint.  If both a reader and a generator are 
> required in a matcher, I think some type of syntax that separates the 
> two *visually* (not just conceptually) is necessary as a cue.
>
> Putting in binary options makes all content one step worse than your 
> typical HTML web page: lack of intelligent structure without hope of 
> enforcing a schema.  Generators that read from Word (and other similar 
> formats) have taken some time to come to fruition precisely because of 
> their arbitrary nature (varying character set assumptions, embedded 
> OLE objects, various content encoding blocks, etc.).   Remember, XML 
> (in this case as metadata) is just one representation of structure.  
> The important thing (in my opinion) is preserving the structure.  I 
> don't see that happening with further intermingling of arbitrary 
> binary data.
>
> I guess I'm in the camp that's glad that readers exist.  Every time I 
> have run into the dreaded error that comes from trying to load the 
> output of a reader into the generator of another matcher, I have found 
> a sitemap organization error.  I guess I'm seeing the Cocoon version 
> of "goto considered harmful."  Sure it's flexible.  Sure it's 
> powerful.  But will it impart more complexity and discomfort than it 
> solves in actual practice?
>
> Hacking the view internals seems overkill (emphasis on kill).  Inline 
> with resource reader's role as "arbitrary, unorganized bit bucket with 
> a MIME type," there is no universal way of delivering appropriate 
> content.  The method of getting content from a Word document is very 
> different from the method of content gathering from a PDF document.  
> Views, orthogonal access to similar resources (ie. XML resources), 
> doesn't apply.  "View source" on a text file is straightforward.  
> "View source" on an XML file even more so.  What is "View source" on 
> reader content?  You would have to assign a different view to each 
> class of reader or put in some MIME type matching hack.  Neither is 
> less work or easier to grok than simply putting in an action or 
> selector in the appropriate matchers I think.
>
> If this type of thing moves forward, I would rather see more 
> specificity going into readers than twiddling with what comes out: a 
> PDF reader, a Word reader, a Postscript reader, etc.  In that case 
> you're separating out by schema, by at least some form of contract.  
> The alternative is equivalent to saying, "let's just make one class of 
> transformer because all XML is alike and only three transformation 
> options are available anyway."


As I explained in several replies, there's no equivalence between a 
reader and generator able to parse a given binary format. There needs to 
be some kind of adaptation/extraction before feeding the view.

And what you describe above as "a PDF reader, a Word reader, a 
Postscript reader, etc." are IMO nothing more than _generators_, just 
like the SWF and MIDI generators we already have.

Let's consider the MIDI example. Suppose we have a large collection of 
karaoke files (MIDI supports embedded text that can be played on screen 
while playing the music), and we want to index the text of these songs 
for easy retrieval (along with some other meta-data).

Here's a sitemap example, using the current syntax
<map:match pattern="*.mid">
  <map:act type="catch-view" src="content">
    <map:generate type="midi" src="{1}.mid"/>
    <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
    <!-- should never come here -->
    <map:serialize type="xml"/>
  </map:match>
  <map:read src="{1}.mid"/>
</map:match>

(the "content" view starts at the "content-label" label to clearly 
distinguish the two notions).

And the proposed shorter one :

<map:match pattern="*.mid">
  <map:read src="{1}.mid" unless-label="content"/>
  <map:generate type="midi" src="{1}.mid"/>
  <map:transform src="xmidi2xdoc.xsl" label="content-label"/>
  <!-- should never come here -->
  <map:serialize type="xml"/>
</map:match>

Note also that the "catch-view" action is not an easy thing to do, as 
the view is defined on the environment object which is theoretically not 
visible to components.

Furthermore, it would be better to catch on labels, since several views 
can be plugged on a given label (e.g. "content" & "pretty-content"). And 
it would be impossible for the action to access this information.
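
To illustrate that last point, here is a minimal sketch of two views 
plugged on the same label, assuming the usual <map:views> declaration. 
The view and label names are the ones used above; the "xdoc2html.xsl" 
stylesheet name is made up for the example.

<map:views>
  <!-- both views start where the pipeline carries label="content-label" -->
  <map:view name="content" from-label="content-label">
    <map:serialize type="xml"/>
  </map:view>
  <map:view name="pretty-content" from-label="content-label">
    <!-- hypothetical stylesheet, for illustration only -->
    <map:transform src="xdoc2html.xsl"/>
    <map:serialize type="html"/>
  </map:view>
</map:views>

A request for either view leaves the pipeline at the same label, so a 
test made on the label covers both, whereas a test on the view name 
would have to enumerate them.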

> P.S. Sorry to start trouble, but I think someone had to mention it. 


No trouble. Just lots of misunderstandings in this thread, I guess.

Sylvain




Re: [RT] Views for readers

Posted by Miles Elam <mi...@pcextremist.com>.
Vadim Gritsenko wrote:

>> Ummm...  Quick question:  What are the use cases for this that are 
>> not handled by existing methods?  I mean, couldn't this be handled 
>> with an (as-yet unwritten) action?
>
>
> Matcher *does* exist:


Heh heh...  learning something new every day.

- Miles Elam


Re: [RT] Views for readers

Posted by Vadim Gritsenko <va...@verizon.net>.
Miles Elam wrote:

> Ummm...  Quick question:  What are the use cases for this that are not 
> handled by existing methods?  I mean, couldn't this be handled with an 
> (as-yet unwritten) action?


Matcher *does* exist:


> <map:match pattern="*.doc">


<map:match type="wildcard-request-parameter" pattern="content">
  <map:parameter name="parameter-name" value="cocoon-view"/>

>    <map:generate type="word2xml" src="{../1}.doc"/>
>    <!-- complete the pipeline -->


</map:match>


>  <map:read src="{1}.doc"/>
> </map:match> 
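
Read together with the quoted lines, the complete matcher would look 
roughly like this. It is only a sketch, assuming a matcher is declared 
in the sitemap under the name "wildcard-request-parameter"; the nested 
matcher has no wildcard of its own, hence {../1} to reach the outer one.

<map:match pattern="*.doc">
  <!-- taken only when the request carries cocoon-view=content -->
  <map:match type="wildcard-request-parameter" pattern="content">
    <map:parameter name="parameter-name" value="cocoon-view"/>
    <map:generate type="word2xml" src="{../1}.doc"/>
    <!-- complete the pipeline -->
  </map:match>
  <!-- normal requests fall through to the reader -->
  <map:read src="{1}.doc"/>
</map:match>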


<snip/>

Vadim



Re: [RT] Views for readers

Posted by Miles Elam <mi...@pcextremist.com>.
Ummm...  Quick question:  What are the use cases for this that are not 
handled by existing methods?  I mean, couldn't this be handled with an 
(as-yet unwritten) action?

<map:match pattern="*.doc">
  <map:act type="catch-view">
    <map:parameter name="view-name" value="content"/>
    <map:generate type="word2xml" src="{../1}.doc"/>
    <!-- complete the pipeline -->
  </map:act>
  <map:read src="{1}.doc"/>
</map:match>

Jeff mentioned getting metainformation from binary data for searching, 
but surely there are so many different types of binary data, a universal 
view seems rather heavy-handed.  It works for search queries (barely, in 
my opinion).  For content manipulation clients (like WebDAV), these 
clients can't pass the query string trigger for views.  This seems to me 
to be a one-trick pony.  To make views available for readers, it seems 
as though specificity is lost.

The point of XML was specifically structured content, yes?  Any 
conformant parser should be able to read any conformant file.  Binary 
content has no such constraint.  If both a reader and a generator are 
required in a matcher, I think some type of syntax that separates the 
two *visually* (not just conceptually) is necessary as a cue.

Putting in binary options makes all content one step worse than your 
typical HTML web page: lack of intelligent structure without hope of 
enforcing a schema.  Generators that read from Word (and other similar 
formats) have taken some time to come to fruition precisely because of 
their arbitrary nature (varying character set assumptions, embedded OLE 
objects, various content encoding blocks, etc.).   Remember, XML (in 
this case as metadata) is just one representation of structure.  The 
important thing (in my opinion) is preserving the structure.  I don't 
see that happening with further intermingling of arbitrary binary data.

I guess I'm in the camp that's glad that readers exist.  Every time I 
have run into the dreaded error that comes from trying to load the 
output of a reader into the generator of another matcher, I have found a 
sitemap organization error.  I guess I'm seeing the Cocoon version of 
"goto considered harmful."  Sure it's flexible.  Sure it's powerful.  
But will it impart more complexity and discomfort than it solves in 
actual practice?

Hacking the view internals seems overkill (emphasis on kill).  In line 
with the resource reader's role as "arbitrary, unorganized bit bucket with 
a MIME type," there is no universal way of delivering appropriate 
content.  The method of getting content from a Word document is very 
different from the method of content gathering from a PDF document.  
Views, as orthogonal access to similar resources (i.e. XML resources), 
don't apply.  "View source" on a text file is straightforward.  "View 
source" on an XML file even more so.  What is "View source" on reader 
content?  You would have to assign a different view to each class of 
reader or put in some MIME type matching hack.  Neither is less work or 
easier to grok than simply putting an action or selector in the 
appropriate matchers, I think.

If this type of thing moves forward, I would rather see more specificity 
going into readers than twiddling with what comes out: a PDF reader, a 
Word reader, a Postscript reader, etc.  In that case you're separating 
out by schema, by at least some form of contract.  The alternative is 
equivalent to saying, "let's just make one class of transformer because 
all XML is alike and only three transformation options are available 
anyway."

- Miles Elam

P.S. Sorry to start trouble, but I think someone had to mention it.



Re: [RT] Views for readers

Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Vadim Gritsenko wrote:

> Sylvain Wallez wrote:
>
>> Bertrand Delacretaz wrote:
>>
>>> On Thursday, 14 Aug 2003, at 15:53 Europe/Zurich, Sylvain Wallez wrote:
>>>
>>>> ...But shouldn't we keep labels that are already used in 
>>>> pipelines? E.g.:
>>>>
>>>> <map:read src="docs/{1}.doc" label="raw, xdoc"/>
>>>> <map:generate src="docs/{1}.doc" type="word2xml" label="raw"/>
>>>> <map:transform src="xword2xdoc.xsl" label="xdoc"/>
>>>
>>>
>>> If it's this way I'd prefer "unless-label" in map:read to make it 
>>> clear.
>>>
>>> Or maybe
>>>
>>>   <map:read src="docs/{1}.doc" unless-label="*"/>
>>>
>>> would do, meaning "use this unless any views are requested"
>>> (and * would be the only allowed value).
>>
<snip/>

>> Any other proposal or opinion on this subject before we start a vote ? 
>
>
> Can't you just enable generators in map:view when the view starts 
> with a reader? 


No, since views "capture" the (XML) output at certain points of the 
pipeline to provide a different formatting. E.g. the processing for the 
"indexable-content" view is the same for all URIs, be they XML pipelines 
or a single reader.

So there's no way other than having a generator _before_ jumping to the 
view, feeding that view with the kind of XML content it expects.

Sylvain




Re: [RT] Views for readers

Posted by Vadim Gritsenko <va...@verizon.net>.
Sylvain Wallez wrote:

> Bertrand Delacretaz wrote:
>
>> On Thursday, 14 Aug 2003, at 15:53 Europe/Zurich, Sylvain Wallez wrote:
>>
>>> ...But shouldn't we keep labels that are already used in pipelines? 
>>> E.g.:
>>>
>>> <map:read src="docs/{1}.doc" label="raw, xdoc"/>
>>> <map:generate src="docs/{1}.doc" type="word2xml" label="raw"/>
>>> <map:transform src="xword2xdoc.xsl" label="xdoc"/>
>>
>>
>>
>> If it's this way I'd prefer "unless-label" in map:read to make it clear.
>>
>> Or maybe
>>
>>   <map:read src="docs/{1}.doc" unless-label="*"/>
>>
>> would do, meaning "use this unless any views are requested"
>> (and * would be the only allowed value).
>>
>>> Ah, and this is very easily implementable ;-)
>>
>>
>>
>> Quickquick, do it before the FS police hears us ;-)
>>
>> Seriously, I find this useful for indexing and other purposes 
>> (getting meta-information about binary files, images, etc for example). 
>
>
>
> Me too. But since this is a change in the sitemap syntax, we should 
> have a vote on this.
>
> Any other proposal or opinion on this subject before we start a vote ? 


Can't you just enable generators in map:view when the view starts 
with a reader?

Vadim



Re: [RT] Views for readers

Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Bertrand Delacretaz wrote:

> On Thursday, 14 Aug 2003, at 15:53 Europe/Zurich, Sylvain Wallez wrote:
>
>> ...But shouldn't we keep labels that are already used in pipelines? 
>> E.g.:
>>
>> <map:read src="docs/{1}.doc" label="raw, xdoc"/>
>> <map:generate src="docs/{1}.doc" type="word2xml" label="raw"/>
>> <map:transform src="xword2xdoc.xsl" label="xdoc"/>
>
>
> If it's this way I'd prefer "unless-label" in map:read to make it clear.
>
> Or maybe
>
>   <map:read src="docs/{1}.doc" unless-label="*"/>
>
> would do, meaning "use this unless any views are requested"
> (and * would be the only allowed value).
>
>> Ah, and this is very easily implementable ;-)
>
>
> Quickquick, do it before the FS police hears us ;-)
>
> Seriously, I find this useful for indexing and other purposes 
> (getting meta-information about binary files, images, etc for example). 


Me too. But since this is a change in the sitemap syntax, we should 
have a vote on this.

Any other proposal or opinion on this subject before we start a vote ?

Sylvain




Re: [RT] Views for readers

Posted by Stefano Mazzocchi <st...@apache.org>.
On Thursday, Aug 14, 2003, at 16:02 Europe/Rome, Bertrand Delacretaz 
wrote:

> On Thursday, 14 Aug 2003, at 15:53 Europe/Zurich, Sylvain Wallez wrote:
>
>> ...But shouldn't we keep labels that are already used in pipelines? 
>> E.g.:
>>
>> <map:read src="docs/{1}.doc" label="raw, xdoc"/>
>> <map:generate src="docs/{1}.doc" type="word2xml" label="raw"/>
>> <map:transform src="xword2xdoc.xsl" label="xdoc"/>
>
> If it's this way I'd prefer "unless-label" in map:read to make it 
> clear.
>
> Or maybe
>
>   <map:read src="docs/{1}.doc" unless-label="*"/>
>
> would do, meaning "use this unless any views are requested"
> (and * would be the only allowed value).
>
>> Ah, and this is very easily implementable ;-)
>
> Quickquick, do it before the FS police hears us ;-)

Gotcha! I dislike having a map:read before a map:generate. Try again.

> Seriously, I find this useful for indexing and other purposes 
> (getting meta-information about binary files, images, etc for 
> example).

wait wait wait wait this is exactly what JSR-170 is doing.

They are preparing for public review shortly, please let's wait until 
they are out before touching this at the sitemap level!!!

> --
Stefano.


Re: [RT] Views for readers

Posted by Bertrand Delacretaz <bd...@codeconsult.ch>.
On Thursday, 14 Aug 2003, at 15:53 Europe/Zurich, Sylvain Wallez wrote:

> ...But shouldn't we keep labels that are already used in pipelines? 
> E.g.:
>
> <map:read src="docs/{1}.doc" label="raw, xdoc"/>
> <map:generate src="docs/{1}.doc" type="word2xml" label="raw"/>
> <map:transform src="xword2xdoc.xsl" label="xdoc"/>

If it's this way I'd prefer "unless-label" in map:read to make it clear.

Or maybe

   <map:read src="docs/{1}.doc" unless-label="*"/>

would do, meaning "use this unless any views are requested"
(and * would be the only allowed value).

> Ah, and this is very easily implementable ;-)

Quickquick, do it before the FS police hears us ;-)

Seriously, I find this useful for indexing and other purposes (getting 
meta-information about binary files, images, etc for example).

-Bertrand

Re: [RT] Views for readers

Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Bertrand Delacretaz wrote:

> On Thursday, 14 Aug 2003, at 15:24 Europe/Zurich, Sylvain Wallez wrote:
>
>> ...But what if we write it the other way around :
>> <map:read src="docs/{1}.doc">
>>  <map:generate src="docs/{1}.doc" type="wordToXml" label="content"/>
>> </map:read>
>
>
> I find this more understandable (but dunno about implementation):
>
> <!-- if reader is executed, the rest is not -->
> <map:read src="docs/{1}.doc" unless-view="wordToXml"/>
> <map:generate src="docs/{1}.doc" type="wordToXml"/>
> <map:transform... 


Interesting. This looks like a more compact notation for the 
view-selector I was thinking of at first. We're leaving the RT world...

But shouldn't we keep labels that are already used in pipelines? E.g.:

<map:read src="docs/{1}.doc" label="raw, xdoc"/>
<map:generate src="docs/{1}.doc" type="word2xml" label="raw"/>
<map:transform src="xword2xdoc.xsl" label="xdoc"/>

The "label" on the reader would skip the reader if the requested view 
corresponds to one of these labels. Now should this be named "label" or 
"unless-label" ?

Ah, and this is very easily implementable ;-)

Sylvain




Re: [RT] Views for readers

Posted by Bertrand Delacretaz <bd...@codeconsult.ch>.
On Thursday, 14 Aug 2003, at 15:24 Europe/Zurich, Sylvain Wallez wrote:

> ...But what if we write it the other way around :
> <map:read src="docs/{1}.doc">
>  <map:generate src="docs/{1}.doc" type="wordToXml" label="content"/>
> </map:read>

I find this more understandable (but dunno about implementation):

<!-- if reader is executed, the rest is not -->
<map:read src="docs/{1}.doc" unless-view="wordToXml"/>
<map:generate src="docs/{1}.doc" type="wordToXml"/>
<map:transform...

-Bertrand


Re: [RT] Views for readers

Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Bertrand Delacretaz wrote:

> How about making it the other way round, by allowing Generators to 
> read from Readers?
>
> <map:match pattern="*.doc" default-view="binary">
>   <map:generator label="xml-content-for-indexing" type="wordToXml">
>     <map:read src="word-documents/{1}.doc" label="binary" mime-type=.../>
>   </map:generator>
>   <map:serialize type="xml"/>
> </map:match> 


Do you mean that the generator would be used if the 
"xml-content-for-indexing" view is selected ? This doesn't fit with the 
existing sitemap behaviour, since generators are _always_ added to the 
pipeline.

But what if we write it the other way around :
<map:read src="docs/{1}.doc">
  <map:generate src="docs/{1}.doc" type="wordToXml" label="content"/>
</map:read>

The meaning of the above is : if a view is requested, execute what's 
_inside_ the <map:read>. If it builds a complete pipeline then return 
its result, otherwise just perform the usual read operation.

> Is that RT-ish enough? 


Mmmmh... not as wild as Nicola Ken's. Try again ;-P

Sylvain




Re: [RT] Views for readers

Posted by Bertrand Delacretaz <bd...@codeconsult.ch>.
How about making it the other way round, by allowing Generators to read 
from Readers?

<map:match pattern="*.doc" default-view="binary">
   <map:generator label="xml-content-for-indexing" type="wordToXml">
     <map:read src="word-documents/{1}.doc" label="binary" mime-type=.../>
   </map:generator>
   <map:serialize type="xml"/>
</map:match>

Is that RT-ish enough?

-Bertrand


Re: [RT] Views for readers

Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Nicola Ken Barozzi wrote:

>
> Jeff Turner wrote, On 14/08/2003 14.17:
>
> ...
>
>> Isn't the problem there that a <map:read> is a whole little pipeline 
>> unto itself?  If it were broken into two atomic operations:
>>
>> <map:generate type="binary" src="foo.doc"/>
>> <map:serialize type="binary"/>
>>
>> then we could have a <map:view from-position="first"/> using a
>> content-aware pipeline, and everything would work.
>
>
> Well, why can't the view simply start from a reader?
>
>  <map:read src="foo.doc"/> 


Because a view finishes a partial XML pipeline, meaning it requires a 
generator to be already present...

>> I have the feeling that handling non-XML content in Cocoon is Just Wrong,
>> and that <map:read> is just a hack.  The fact that it doesn't integrate
>> with Views is a symptom of this.  In a theoretically pure world, we'd
>> either make Cocoon an XML-only framework and kill <map:read>, or make
>> Cocoon a generic data pipelining framework capable of handling and
>> transforming binary content.
>
>
> Well, it can be done easily by allowing more than one reader and by 
> allowing readers in the xml pipeline.
>
> Some time back I had proposed the following to be possible (and got 
> touted as the usual FS man)
>
>  <map:read src="foo1.doc"/>
>  <map:read type="stripstuff"/>
>  <map:read type="otherfilter"/> 


Mhhh... I guess "stripstuff" and "otherfilter" are actually 
<map:transform-binary> and not <map:read> as they do have an input. Now 
how do we "close" the pipeline ? Is there a <map:serialize-binary> ?

> And also:
>
>  <map:read src="foo1.doc"/>
>  <map:generate src="foo1.doc"/>
>  <map:serialize src="foo1.doc"/>
>  <map:read type="zip"/> 


Wow! What's the result of this ??

> We can already do this BTW by using the Cocoon protocol, but it's 
> such a hack! 


Sounds interesting. Can you elaborate on the hack ?

Sylvain




Re: [RT] Views for readers

Posted by Nicola Ken Barozzi <ni...@apache.org>.
Jeff Turner wrote, On 14/08/2003 14.17:

...
> Isn't the problem there that a <map:read> is a whole little pipeline unto
> itself?  If it were broken into two atomic operations:
> 
> <map:generate type="binary" src="foo.doc"/>
> <map:serialize type="binary"/>
> 
> then we could have a <map:view from-position="first"/> using a
> content-aware pipeline, and everything would work.

Well, why can't the view simply start from a reader?

  <map:read src="foo.doc"/>

> I have the feeling that handling non-XML content in Cocoon is Just Wrong,
> and that <map:read> is just a hack.  The fact that it doesn't integrate
> with Views is a symptom of this.  In a theoretically pure world, we'd
> either make Cocoon an XML-only framework and kill <map:read>, or make
> Cocoon a generic data pipelining framework capable of handling and
> transforming binary content.

Well, it can be done easily by allowing more than one reader and by 
allowing readers in the xml pipeline.

Some time back I had proposed the following to be possible (and got 
touted as the usual FS man)

  <map:read src="foo1.doc"/>
  <map:read type="stripstuff"/>
  <map:read type="otherfilter"/>

And also:

  <map:read src="foo1.doc"/>
  <map:generate src="foo1.doc"/>
  <map:serialize src="foo1.doc"/>
  <map:read type="zip"/>

We can already do this BTW by using the Cocoon protocol, but it's such 
a hack!

> Well it's a RT after all.. ;)

*sigh*

If Cocoon had this capability and could be embedded more easily 
*without* the sitemap, it would be a cool transformation library...

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------



Re: [RT] Views for readers

Posted by Jeff Turner <je...@apache.org>.
On Thu, Aug 14, 2003 at 01:41:55PM +0200, Sylvain Wallez wrote:
> Jeff Turner wrote:
...
> ><map:view name="indexablecontent" from-position="first">
> > <map:select type="xml-type">
> >   <map:when test="docbook">
> >     <map:transform src="docbook2whatever.xsl"/>
> >   </map:when>
> >   <map:when test="tei">
> >     <map:transform src="tei2whatever.xsl"/>
> >   </map:when>
> >   <map:when test="msword">
> >     <map:transform src="word2whatever.xsl"/>
> >   </map:when>
> > </map:select>
> ></map:view>
> >
> 
> Ah, ok, the "strongly typed pipelines" are a different wording for 
> "content-aware selectors" !

Ah yes.  Strange how the same concept can live two separate lives in
one's head ;)  Like the same class in two classloaders.

> >So http://mycocoonsite.com/foo.doc?cocoon-view=indexablecontent would
> >return XML representing the content of the .doc file.
> >
> >I described the same thing in a mail with subject 'Type-aware Views (Re:
> >Link view goodness)'.  Same need, different context, same proposed
> >solution.
> >
> 
> Not exactly: the use case here is that we have a binary file which is 
> normally sent as is to the browser using a reader. It is _not_ parsed as 
> an XML stream. So we can't attach a view to these kinds of URLs since 
> views provide a different _ending_ to a pipeline, meaning there must 
> exist at least a generator and optionally one or more transformers at 
> the point where processing is directed to the view.
> 
> So even content-aware selectors don't solve this problem...

Isn't the problem there that a <map:read> is a whole little pipeline unto
itself?  If it were broken into two atomic operations:

<map:generate type="binary" src="foo.doc"/>
<map:serialize type="binary"/>

then we could have a <map:view from-position="first"/> using a
content-aware pipeline, and everything would work.

I have the feeling that handling non-XML content in Cocoon is Just Wrong,
and that <map:read> is just a hack.  The fact that it doesn't integrate
with Views is a symptom of this.  In a theoretically pure world, we'd
either make Cocoon an XML-only framework and kill <map:read>, or make
Cocoon a generic data pipelining framework capable of handling and
transforming binary content.

Well it's a RT after all.. ;)

--Jeff

> Sylvain

Re: [RT] Views for readers

Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Jeff Turner wrote:

>On Wed, Aug 13, 2003 at 12:02:04PM +0200, Sylvain Wallez wrote:
>  
>
>>Frederic's question about search engine integration led me to 
>>questioning myself at how Cocoon's Lucene integration could be able to 
>>transparently index Word & PDF documents along with XML-produced documents.
>>
>>There exist some text-extraction libraries for Word & PDF (e.g. 
>>http://www.textmining.org/). Now how can we integrate this as 
>>transparently as possible in Cocoon's search functionality?
>>
>>The Lucene indexer crawls a website and asks for a particular view 
>>("content") which is used to fill the index. But Word and PDF documents 
>>being binary files, they're handled by a <map:read> statement, which 
>>does not handle views. On the other hand, this use case shows that 
>>having views on binary content may make sense : the "normal" requests 
>>just sends back the binary content, while a view can use a text/XML 
>>extraction on these binary files.
>>
>>So the question is : how could views be plugged to readers ? I must say 
>>that I don't have an answer, as views contain transformers and a 
>>serializer, but no generator. So how could we express in the sitemap 
>>that a particular view on a reader should "replace" that reader by a 
>>particular generator ? Or should this go through some special readers 
>>that could also act as generators ?
>>
>>Or maybe these are silly thoughts and we should use a <map:select> 
>>directing to a <map:read> or <map:generate> depending on the view. But 
>>this introduces explicit view management in the pipelines, which doesn't 
>>seem nice to me.
>>    
>>
>
>Solution: strongly typed pipelines! :)
>
>Imagine if, at each node in the sitemap, we knew what type of content we
>were dealing with (usually some flavour of XML).  Then we could write a
>single view that behaves differently depending on the _type_ of data:
>
><map:view name="indexablecontent" from-position="first">
>  <map:select type="xml-type">
>    <map:when test="docbook">
>      <map:transform src="docbook2whatever.xsl"/>
>    </map:when>
>    <map:when test="tei">
>      <map:transform src="tei2whatever.xsl"/>
>    </map:when>
>    <map:when test="msword">
>      <map:transform src="word2whatever.xsl"/>
>    </map:when>
>  </map:select>
></map:view>
>

Ah, ok, the "strongly typed pipelines" are a different wording for 
"content-aware selectors" !

>So http://mycocoonsite.com/foo.doc?cocoon-view=indexablecontent would
>return XML representing the content of the .doc file.
>
>I described the same thing in a mail with subject 'Type-aware Views (Re:
>Link view goodness)'.  Same need, different context, same proposed
>solution.
>

Not exactly: the use case here is that we have a binary file which is 
normally sent as is to the browser using a reader. It is _not_ parsed as 
an XML stream. So we can't attach a view to these kinds of URLs since 
views provide a different _ending_ to a pipeline, meaning there must 
exist at least a generator and optionally one or more transformers at 
the point where processing is directed to the view.

So even content-aware selectors don't solve this problem...

Sylvain




Re: [RT] Views for readers

Posted by Jeff Turner <je...@apache.org>.
On Wed, Aug 13, 2003 at 12:02:04PM +0200, Sylvain Wallez wrote:
> Frederic's question about search engine integration led me to 
> questioning myself at how Cocoon's Lucene integration could be able to 
> transparently index Word & PDF documents along with XML-produced documents.
> 
> There exist some text-extraction libraries for Word & PDF (e.g. 
> http://www.textmining.org/). Now how can we integrate this as 
> transparently as possible in Cocoon's search functionality?
> 
> The Lucene indexer crawls a website and asks for a particular view 
> ("content") which is used to fill the index. But Word and PDF documents 
> being binary files, they're handled by a <map:read> statement, which 
> does not handle views. On the other hand, this use case shows that 
> having views on binary content may make sense : the "normal" requests 
> just sends back the binary content, while a view can use a text/XML 
> extraction on these binary files.
> 
> So the question is : how could views be plugged to readers ? I must say 
> that I don't have an answer, as views contain transformers and a 
> serializer, but no generator. So how could we express in the sitemap 
> that a particular view on a reader should "replace" that reader by a 
> particular generator ? Or should this go through some special readers 
> that could also act as generators ?
> 
> Or maybe these are silly thoughts and we should use a <map:select> 
> directing to a <map:read> or <map:generate> depending on the view. But 
> this introduces explicit view management in the pipelines, which doesn't 
> seem nice to me.

Solution: strongly typed pipelines! :)

Imagine if, at each node in the sitemap, we knew what type of content we
were dealing with (usually some flavour of XML).  Then we could write a
single view that behaves differently depending on the _type_ of data:

<map:view name="indexablecontent" from-position="first">
  <map:select type="xml-type">
    <map:when test="docbook">
      <map:transform src="docbook2whatever.xsl"/>
    </map:when>
    <map:when test="tei">
      <map:transform src="tei2whatever.xsl"/>
    </map:when>
    <map:when test="msword">
      <map:transform src="word2whatever.xsl"/>
    </map:when>
  </map:select>
</map:view>

So http://mycocoonsite.com/foo.doc?cocoon-view=indexablecontent would
return XML representing the content of the .doc file.

I described the same thing in a mail with subject 'Type-aware Views (Re:
Link view goodness)'.  Same need, different context, same proposed
solution.


--Jeff


> Any thoughts ?
> 
> Sylvain

Re: [RT] Views for readers

Posted by Sam Coward <sa...@atnet.net.au>.
Hmm,

> Frederic's question about search engine integration led me to 
> questioning myself at how Cocoon's Lucene integration could be able to 
> transparently index Word & PDF documents along with XML-produced 
> documents.

I have been wondering that too. At my company, we put together a simple 
web management tool to put small collections of documents into a web 
frame for a client. Pretty useless, but it's what he wanted.

At the time I had thought it may be possible to just improve Lucene so 
it could understand binary files by introducing mime-type triggerable 
filter modules that converted to text on the input stream. After all, if 
the text were only going to be used for indexing, it wouldn't matter if 
the text wasn't available within Cocoon itself. In any case he's happy 
with what he has and we're happily doing other stuff.

Perhaps if the individual extractors are part of specialised readers for 
specific types of documents, then you could configure the label for the 
XML they return? That would allow for the duality of that behaviour to 
be mostly concealed and managed from within Cocoon with little effect to 
the sitemap.
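
As a purely hypothetical sketch of that idea (neither a "word" reader 
type nor a label attribute on <map:read> is known to exist in the 
sitemap; both are assumptions for illustration), it could look like:

<map:match pattern="*.doc">
  <!-- hypothetical: serves the raw bytes on a normal request, and
       offers extracted XML at the "content" label when a view asks -->
  <map:read type="word" src="{1}.doc" label="content"/>
</map:match>

The sitemap stays a one-liner per document type, and the choice between 
binary output and extracted XML is hidden inside the reader.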

I personally find it tempting to think that it may be possible to rip 
out XML from any of these formats, and do with it as we wish, 
particularly when I saw that programs like catdoc could recognize the 
tables even from Word 2k documents. But I often find myself pushing 
back against that, thinking that maybe I should represent all content 
(even document content) semantically in XML and let rendering 
technologies (PDFSerializer, POI) handle binary output, and perhaps 
leverage document importers that map those documents back to XML (they 
all seem to be proprietary, big buck solutions from what I see 
currently, though). In any case, that does seem to be a ways off in the 
future *sigh*

Hmm, an OCR extractor would be way cool for faxes too!

Just my 2c; I never say anything most of the time, anyway
Sam