Posted to dev@forrest.apache.org by Jeff Turner <je...@apache.org> on 2002/12/10 17:24:30 UTC

file: implemented (Re: cvs commit: ...)

On Tue, Dec 10, 2002 at 11:56:20AM -0000, jefft@apache.org wrote:
> jefft       2002/12/10 03:56:20
> 
>   Modified:    .        build.xml status.xml
>                src/resources/conf sitemap.xmap
>                src/resources/forrest-shbat forrest.build.xml
>                src/resources/fresh-site/src/documentation/content/xdocs
>                         sample.xml
>                src/resources/library/xslt filterlinks.xsl
>   Added:       src/resources/forrest-shbat/tasks/org/apache/forrest
>                         UncopiedFileSelector.java
>                src/resources/fresh-site/src/documentation/content hello.pdf
>                src/resources/library/xslt filterlinks-html.xsl
>                         linkutils.xsl
>   Log:
>   Add special handling of links that start with 'file:'.
>   These links are:
>    1) Not passed on to Cocoon (see filterlinks.xsl)
>    2) The 'file:' prefix is stripped from the HTML (see filterlinks-html.xsl)
>    3) All file: links encountered during crawling are recorded in a file,
>    'unprocessed-files.txt' (see filterlinks.xsl and linkutils.xsl)
>    4) After running Cocoon, forrest.build.xml copies all files listed in
>    unprocessed-files.txt to build/site/ manually.  This is achieved with a custom
>    selector, UncopiedFileSelector.java


Btw, the seed webapp includes an example of this.  The file
src/documentation/content/hello.pdf is linked to in sample.xml, with
<link href="file:hello.pdf">.
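Mechanically, the rewrite in step 2 of the commit log is just a prefix strip. A minimal sketch in Java (the class and method names are hypothetical, not taken from filterlinks-html.xsl):

```java
/**
 * Hypothetical sketch of step 2 of the commit log: stripping the
 * 'file:' scheme from a link before it is written to the HTML.
 */
public class FileLinkStripper {
    /** Returns the href with a leading "file:" prefix removed, if present. */
    public static String strip(String href) {
        return href.startsWith("file:")
                ? href.substring("file:".length())
                : href;
    }
}
```

So `file:hello.pdf` in the source becomes a plain `hello.pdf` link in the output, while ordinary hrefs pass through untouched.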

The implementation is pretty ugly.  I found that appending to an external
file from XSLT is a royal PITA.  It currently works with a Xalan
<redirect:write> extension.  While XSLT 2.0 (Saxon) implements an
equivalent <xsl:result-document>, it can't append to an existing file.
I'm currently trying to write a Transformer that records 'file:' links to
a WriteableSource, to replace all this hacky XSLT.
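A rough sketch of what such a Transformer could look like as a SAX filter, with a plain java.io.Writer standing in for the WriteableSource (class name and details are hypothetical):

```java
import java.io.IOException;
import java.io.Writer;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

/**
 * Hypothetical sketch: a SAX filter that records every href beginning
 * with "file:" to an external list (here a plain Writer, standing in
 * for a WriteableSource), instead of appending from XSLT via
 * redirect:write.
 */
public class FileLinkRecorder extends XMLFilterImpl {
    private final Writer out;

    public FileLinkRecorder(Writer out) {
        this.out = out;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        String href = atts.getValue("href");
        if (href != null && href.startsWith("file:")) {
            try {
                // Record the path with the scheme stripped, one per
                // line, mirroring the unprocessed-files.txt format.
                out.write(href.substring("file:".length()));
                out.write('\n');
            } catch (IOException e) {
                throw new SAXException(e);
            }
        }
        super.startElement(uri, localName, qName, atts);
    }
}
```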


--Jeff

Re: file: implemented (Re: cvs commit: ...)

Posted by Bruno Dumon <br...@outerthought.org>.
On Wed, 2002-12-11 at 10:01, Steven Noels wrote:
[...]
> Jeff is demonstrating to us that Cocoon _has_ certain areas where it can 
> be difficult to apply, just as we tried to do with CAPs. There has
> been quite violent disagreement about the idea of content-aware
> pipelines a long time ago. Now however, it is proposed as being a
> solution for this particular problem.

Hmmm, I'm probably nitpicking again, but I think we should stop calling
this CAP, since the current solution isn't CAP. What I understand as CAP
is some kind of selector that sits in the SAX-pipeline and chooses
another pipeline based on the content of the SAX-events. There's still
violent disagreement about content-aware pipelines, since they were
rejected from Cocoon.

The current SourceTypeAction is just an action that assigns a type to a
file based on metadata of the file. This metadata happens to be in the
file itself, but that doesn't change a thing. The pipeline is selected
based on metadata about the file, not based on the content of the file.
Thus it is not at all that much different from selecting a pipeline
based on file-extensions.

>  So opinions might change over time. Mind you, however, that the current
> SourceTypeAction is based on XML & SAX, and would not help for 'binary'
> documents.

Not based on SAX, but on pull-parsing.

-- 
Bruno Dumon                             http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
bruno@outerthought.org


Re: file: implemented (Re: cvs commit: ...)

Posted by Steven Noels <st...@outerthought.org>.
Sylvain Wallez wrote:

> And maybe some more will come after this:
> http://www.anyware-tech.com/blogs/sylvain/archives/000009.html ;-)

Seen it already... looks like my comment disappeared somehow?

http://radio.weblogs.com/0103539/2002/12/11.html#a102

> I don't consider SWT and writeable sources equivalent. Yes, SWT relies on 
> writeable sources, but writeable sources have a wider usage range, 
> including in "traditional" code. They simply allow you to get an 
> OutputStream on whatever you want and write whatever you want in it.

Yes, sure, sorry about creating that confusion!

</Steven>
-- 
Steven Noels                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at              http://radio.weblogs.com/0103539/
stevenn at outerthought.org                stevenn at apache.org


Re: file: implemented (Re: cvs commit: ...)

Posted by Sylvain Wallez <sy...@anyware-tech.com>.
Steven Noels wrote:

<snip/>

> (wow - I should forward this to Sylvain: it must be the first time I 
> make favourable comments on the SWT/writeable sources ;-)


And maybe some more will come after this:
http://www.anyware-tech.com/blogs/sylvain/archives/000009.html ;-)

I don't consider SWT and writeable sources equivalent. Yes, SWT relies on 
writeable sources, but writeable sources have a wider usage range, 
including in "traditional" code. They simply allow you to get an 
OutputStream on whatever you want and write whatever you want in it.

Sylvain

-- 
Sylvain Wallez                                  Anyware Technologies
http://www.apache.org/~sylvain           http://www.anyware-tech.com
{ XML, Java, Cocoon, OpenSource }*{ Training, Consulting, Projects }



Re: file: implemented (Re: cvs commit: ...)

Posted by Steven Noels <st...@outerthought.org>.
Nicola Ken Barozzi wrote:

> We must also remember that our goal is also not to copy anything but 
> to have it processed by Cocoon in a "natural" manner.

Cocoon is based on streaming SAX events across a pipeline, selected by a
switchboard named 'sitemap' based on the request environment. Even a
map:reader can be considered a hack in that respect, yet it isn't
regarded as such, since not everything in the world is SAXable.

Jeff is demonstrating to us that Cocoon _has_ certain areas where it can 
be difficult to apply, just as we tried to do with CAPs. There has
been quite violent disagreement about the idea of content-aware
pipelines a long time ago. Now however, it is proposed as being a
solution for this particular problem. So opinions might change over
time. Mind you, however, that the current SourceTypeAction is based on XML
& SAX, and would not help for 'binary' documents.

I would prefer proper work to be done on specific Cocoon components to
solve this problem and get rid of the ugly XSLT hack, rather than trying
to fit it into the existing way of thinking, which will make the sitemap
overly complex and do wizardry an end-user will not understand. It's
sometimes better to let the user decide (making a 'scheme' explicit) so
that he knows what he will get, than to expose behaviour which is not
obvious at first sight.

Trying to fit the linking/rendition (copying) of non-XML resources into
the process configuration model of an XML-based system (the sitemap)
seems like the ultimate hack to me. We'd better have this done in parallel
using a SWT/copying-component as Jeff is suggesting, IMHO.

(wow - I should forward this to Sylvain: it must be the first time I
make favourable comments on the SWT/writeable sources ;-)

</Steven>
-- 
Steven Noels                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at              http://radio.weblogs.com/0103539/
stevenn at outerthought.org                stevenn at apache.org


Re: file: implemented (Re: cvs commit: ...)

Posted by Vadim Gritsenko <va...@verizon.net>.
Robert Koberg wrote:

>> -----Original Message-----
>> From: Vadim Gritsenko [mailto:vadim.gritsenko@verizon.net]
>> Sent: Friday, December 13, 2002 3:39 PM
>>
>>> Have you seen this:
>>> http://stx.sourceforge.net/documents/spec-stx-20021101.html
>>
>> Went through cocoon-dev half a year or more ago. It could be a base for
>> XSP transformers (XSPT?) syntax...
>
> You don't see it as a replacement for XSL?

Yes, it can replace lots of simple stylesheets and should give better 
performance, and it should not be hard to integrate STX into Cocoon. But 
back then I was puzzled by dynamic transformers, and the missing piece 
was a good syntax.

Vadim

<snip/>



RE: file: implemented (Re: cvs commit: ...)

Posted by Robert Koberg <ro...@koberg.com>.
> -----Original Message-----
> From: Vadim Gritsenko [mailto:vadim.gritsenko@verizon.net]
> Sent: Friday, December 13, 2002 3:39 PM

> >
> >Have you seen this:
> >http://stx.sourceforge.net/documents/spec-stx-20021101.html
> >  
> >
> 
> Went through cocoon-dev half a year or more ago. It could be a base for 
> XSP transformers (XSPT?) syntax...
> 

You don't see it as a replacement for XSL?

-Rob



> Vadim
> 
> 
> 
> >Abstract
> >STX is an XML-based language for transforming XML documents into other XML
> >documents without building a tree in memory. An STX processor 
> transforms one or
> >more source streams of SAX2 events according to rules given in an 
> XML document
> >called STX stylesheet and generates one or more result SAX2 streams. Each
> >incoming event invokes one or more rules, that can e.g. emit events to the
> >result stream or access to a working storage.



Re: file: implemented (Re: cvs commit: ...)

Posted by Vadim Gritsenko <va...@verizon.net>.
Robert Koberg wrote:

>Hi,
>
>Have you seen this:
>http://stx.sourceforge.net/documents/spec-stx-20021101.html
>  
>

Went through cocoon-dev half a year or more ago. It could be a base for 
XSP transformers (XSPT?) syntax...

Vadim



>Abstract
>STX is an XML-based language for transforming XML documents into other XML
>documents without building a tree in memory. An STX processor transforms one or
>more source streams of SAX2 events according to rules given in an XML document
>called STX stylesheet and generates one or more result SAX2 streams. Each
>incoming event invokes one or more rules, that can e.g. emit events to the
>result stream or access to a working storage.
>
>-----
>just subscribed to another list...
>
>best,
>-Rob
>



RE: file: implemented (Re: cvs commit: ...)

Posted by Robert Koberg <ro...@koberg.com>.
Hi,

Have you seen this:
http://stx.sourceforge.net/documents/spec-stx-20021101.html


Abstract
STX is an XML-based language for transforming XML documents into other XML
documents without building a tree in memory. An STX processor transforms one or
more source streams of SAX2 events according to rules given in an XML document
called STX stylesheet and generates one or more result SAX2 streams. Each
incoming event invokes one or more rules, that can e.g. emit events to the
result stream or access to a working storage.
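Not STX itself, but the streaming idea the abstract describes can be illustrated with a plain SAX filter: each event is rewritten as it passes through, with no tree built in memory (a hypothetical example, not taken from the STX spec):

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

/**
 * Illustration of stream-based transformation: a single rule that
 * rewrites <para> elements to <p> on the fly, event by event, never
 * holding the whole document in memory.
 */
public class StreamingRename extends XMLFilterImpl {
    private String map(String name) {
        return "para".equals(name) ? "p" : name;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        super.startElement(uri, map(localName), map(qName), atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        super.endElement(uri, map(localName), map(qName));
    }
}
```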

-----
just subscribed to another list...

best,
-Rob



Re: file: implemented (Re: cvs commit: ...)

Posted by Stefano Mazzocchi <st...@apache.org>.
Steven Noels wrote:
> Stefano Mazzocchi wrote:
> 
>> If you go the wget path you have to implement a link parser and 
>> translator for *every* hypertext-capable binary files our serializers 
>> can come up with.
> 
> 
> Still, since Cocoon pipelines are XML-only, I wonder how and where we 
> could plug in a CSS parser that feeds image requests back into the 
> process method. Should we have 'intelligent readers' then, augmented 
> with some sort of LinkSerializer?

What about this:

  <map:match pattern="*.css">
   <map:read src="./styles/*.css" mime-type="text/css"/>
   <map:read label="link" type="css-link-parser" src="./styles/*.css"/>
  </map:match>

where 'css-link-parser' is a link emitting reader that uses a CSS parser.

wouldn't that fit your needs without sacrificing coherence?
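The core job of such a reader, extracting url(...) references from a stylesheet, might look like this (an illustrative sketch, not actual Cocoon code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Illustrative sketch of the core of a 'css-link-parser': pull
 * url(...) references out of a stylesheet so a crawler can follow
 * images and @import targets linked from CSS.
 */
public class CssLinkParser {
    // Matches url(ref), url('ref') and url("ref").
    private static final Pattern URL_REF =
            Pattern.compile("url\\(\\s*['\"]?([^'\")\\s]+)['\"]?\\s*\\)");

    public static List<String> links(String css) {
        List<String> found = new ArrayList<>();
        Matcher m = URL_REF.matcher(css);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }
}
```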

-- 
Stefano Mazzocchi                               <st...@apache.org>
--------------------------------------------------------------------



Re: file: implemented (Re: cvs commit: ...)

Posted by Steven Noels <st...@outerthought.org>.
Stefano Mazzocchi wrote:

> If you go the wget path you have to implement a link parser and 
> translator for *every* hypertext-capable binary files our serializers 
> can come up with.

Still, since Cocoon pipelines are XML-only, I wonder how and where we 
could plug in a CSS parser that feeds image requests back into the 
process method. Should we have 'intelligent readers' then, augmented 
with some sort of LinkSerializer?

And what about other formats?

</Steven>
-- 
Steven Noels                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at              http://radio.weblogs.com/0103539/
stevenn at outerthought.org                stevenn at apache.org


Crawlers (Re: file: implemented)

Posted by Jeff Turner <je...@apache.org>.
On Fri, Dec 13, 2002 at 09:53:07AM +0000, Andrew Savory wrote:
> 
> On Fri, 13 Dec 2002, Jeff Turner wrote:
> 
> > Because in the long run,  I would prefer to develop a separate wget-like
> > tool with cocoon-view hacks added to it, than to develop the CLI into a
> > full-blown threaded crawler.  Why?  Because a separate tool has a _much_
> > larger audience, so will evolve faster.  Yes, a Cocoon CLI may be more
> > elegant, but a separate tool can grow geometrically while the CLI grows
> > linearly.
> 
> I can see some serious advantages to splitting the crawler from the CLI:
> when the crawler is there, it would be fantastic to add a "precacher"
> using the crawler (go hit my entire site, including internal cocoon-views)
> rather than the "traditional" approach of running wget on a site. I
> suspect various other things that rely on crawling (such as search
> implementations like the Lucene code) would benefit from the speed
> increase of a dedicated crawler, too.

Yes, in fact the only decent threaded Java crawler I've found so far is
in Lucene's sandbox:

http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html

Reading that overview shows what a tricky business it is to write a
_good_ crawler.  Trying to evolve the Cocoon CLI to this level of
sophistication seems.. silly :)  I would rather start with this good,
external implementation, and add any Cocoon-specific hacks required.

> I think it would be best done as part of Cocoon rather than Forrest though
> (or am I missing the point *again*? ;-), as there are more ways it would
> be used there.

As a general-purpose tool, I think it should be developed outside of both
Cocoon and Forrest, to attract the greatest possible number of
users/developers.

--Jeff

> Andrew.
> 
> -- 
> Andrew Savory                                Email: andrew@luminas.co.uk
> Managing Director                              Tel:  +44 (0)870 741 6658
> Luminas Internet Applications                  Fax:  +44 (0)700 598 1135
> This is not an official statement or order.    Web:    www.luminas.co.uk
> 

Re: file: implemented (Re: cvs commit: ...)

Posted by Jeff Turner <je...@apache.org>.
On Thu, Dec 12, 2002 at 11:48:49PM -0800, Stefano Mazzocchi wrote:
> Jeff Turner wrote:
...
> >  "To rave in violent, high-sounding, or extravagant language, without
> >  dignity of thought"
> >
> >Please remember the context; Nicola was suggesting an implementation of a
> >new feature (schemes) that would tie Forrest even tighter to the CLI.  If
> >it helps, more context is that I was writing at 3am after a day's
> >fighting with Transformers :P
> 
> I know I shouldn't (and I'm getting year after year better on that) but 
> when somebody says that I did a "blindingly stupid" thing, I tend to get 
> pissed no matter what their context is :-Prrr

:) Sorry.  See above definition of 'rant'.

...
> >Or just hack it to support cocoon-view=links when it becomes necessary.
> 
> FYI, the Cocoon CLI uses link views for both GET and POST. The GET part 
> is to retrieve the list of hyperlinks that depart from that resource; the 
> POST request is to send the list of "translated links" that Cocoon must 
> translate right before serializing.
> 
> If you decouple the CLI from Cocoon, that POST view must be made public, 
> and this can create a *major* security hole, basically allowing anybody 
> to come up with a page with links translated with client-injected 
> information! Which is cross-site scripting attacks for dummies!

The intention is to run a Cocoon webapp locally, only for as long as it
takes the crawler to do its job.  Security isn't an issue.

...
> >Yes, of course it's more elegant.  But _practically_, it is slow and full
> >of bugs which no-one has volunteered to fix, and Forrest is suffering
> >because of this.
> >
> >Now why don't I stop whining, get in there and fix it?
> >
> >Because in the long run,  I would prefer to develop a separate wget-like
> >tool with cocoon-view hacks added to it, than to develop the CLI into a
> >full-blown threaded crawler.  Why?  Because a separate tool has a _much_
> >larger audience, so will evolve faster.  Yes, a Cocoon CLI may be more
> >elegant, but a separate tool can grow geometrically while the CLI grows
> >linearly.
> 
> Hey, know what? you'd have my full support if you took some of the CLI 
> code out of Cocoon and made it part of Forrest. (not all of it, some XSP 
> precompilation technology uses it) because I agree with you: the wrong 
> community is currently maintaining that code.

There are plenty of Cocoon committers here, so I don't think moving the
code achieves much.

> [BTW, to give you context, I'm writing this while Jon (Stevens) saw me 
> replying to this and now he's going around the house saying 'anakia 
> rulez', 'dvsl is the way to go', 'you have to figure out a way to beat 
> anakia's speed or you're doomed'... gotta love open source! :)]

:) Remind him that Forrest allows edited docs to be immediately viewed
with a live Cocoon.  So Forrest's edit/view cycle is inherently faster
than Anakia's edit/compile/view cycle, just as Anakia's CLI is inherently
faster than Cocoon's.


--Jeff


Re: file: implemented (Re: cvs commit: ...)

Posted by Keiron Liddle <ke...@aftexsw.com>.
> Now why don't I stop whining, get in there and fix it?
> 
> Because in the long run,  I would prefer to develop a separate wget-like
> tool with cocoon-view hacks added to it, than to develop the CLI into a
> full-blown threaded crawler.  Why?  Because a separate tool has a _much_
> larger audience, so will evolve faster.  Yes, a Cocoon CLI may be more
> elegant, but a separate tool can grow geometrically while the CLI grows
> linearly.

If both are doing the same thing, then why not let one "project" do both things?

Surely the basic concept is the same:
- get starting url
- parse for links
- continue on links

So would it be so impossible to have interfaces that the Cocoon CLI
implements to do it directly, while a standalone wget-like tool could
have its own way of handling it?
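That shared loop can be sketched as a small, transport-agnostic method; fetching and link parsing sit behind a function so either backend could supply them (all names here are illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/**
 * Sketch of the basic crawl loop: get a starting URL, parse it for
 * links, continue on those links. The fetch/parse step is abstracted
 * so the same loop could be backed by the Cocoon CLI directly or by a
 * standalone wget-like tool.
 */
public class Crawler {
    /** Breadth-first crawl from a start URL; returns every URL visited. */
    public static Set<String> crawl(String start,
                                    Function<String, List<String>> fetchLinks) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            if (!visited.add(url)) {
                continue; // already processed this page
            }
            queue.addAll(fetchLinks.apply(url)); // follow outgoing links
        }
        return visited;
    }
}
```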




Re: file: implemented (Re: cvs commit: ...)

Posted by Nicola Ken Barozzi <ni...@apache.org>.

Stefano Mazzocchi wrote:
[...]
> [BTW, to give you context, I'm writing this while Jon (Stevens) saw me 
> replying to this and now he's going around the house saying 'anakia 
> rulez', 'dvsl is the way to go', 'you have to figure out a way to beat 
> anakia's speed or you're doomed'... gotta love open source! :)]

Then tell him that Jason just dumped DVSL in Maven to use something else 
based on Jelly. :-PPPPP

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------


Re: file: implemented (Re: cvs commit: ...)

Posted by Stefano Mazzocchi <st...@apache.org>.
Jeff Turner wrote:

>>>>><rant>
>>>>>The CLI is evil and should have been drowned at birth.  The Cocoon CLI
>>>>>can best be described as a crappy 'wget' implementation tacked onto the
>>>>>side of Cocoon.  It is slow as hell, full of bugs (eg css images) and
>>>>>practically unmaintained.  Rewriting wget in a corner of Cocoon was a
>>>>>blindingly stupid thing to do, and I am not about to waste my time fixing
>>>>>its bugs.  I would rather find a _real_ wget implementation in Java, that
>>>>>can handle CSS and doesn't do screwy things with filenames, and IF
>>>>>invoking Cocoon through the HTTP interface proves too slow (unlikely),
>>>>>then I'd wrap Cocoon in an Avalon block and feed it URLs passed over RMI.
>>>>></rant>
>>>>
>>>>Jeff, tell me, are you aware of how *exactly* the Cocoon CLI works?
>>>
>>>
>>>No.  <rant> should be <uninformed rant>.
>>
>>When I talk about something I don't know, I tend to ask questions first, 
>>then express my opinions. But that's me.
> 
> 
> I was not talking about something I don't know: I was _ranting_ about
> something whose code I am fairly familiar with, and of which I have 4
> months of painful experience.  The <rant> tags are a hint that what
> follows is not a carefully reasoned critique.  Websters defines 'rant'
> as:
> 
>   "To rave in violent, high-sounding, or extravagant language, without
>   dignity of thought"
> 
> Please remember the context; Nicola was suggesting an implementation of a
> new feature (schemes) that would tie Forrest even tighter to the CLI.  If
> it helps, more context is that I was writing at 3am after a day's
> fighting with Transformers :P

I know I shouldn't (and I'm getting year after year better on that) but 
when somebody says that I did a "blindingly stupid" thing, I tend to get 
pissed no matter what their context is :-Prrr

>>The Cocoon CLI extensively uses the cocoon-view to do two major things:
>>
>> 1) obtaining links
>> 2) pushing back translated links
>>
>>Cocoon CLI does link translation but it's Cocoon *ITSELF* that places 
>>them in the right position and this happens *before* things get serialized.
>>
>>If you go the wget path you have to implement a link parser and 
>>translator for *every* hypertext-capable binary files our serializers 
>>can come up with.
> 
> 
> Or just hack it to support cocoon-view=links when it becomes necessary.

FYI, the Cocoon CLI uses link views for both GET and POST. The GET part 
is to retrieve the list of hyperlinks that depart from that resource; the 
POST request is to send the list of "translated links" that Cocoon must 
translate right before serializing.

If you decouple the CLI from Cocoon, that POST view must be made public, 
and this can create a *major* security hole, basically allowing anybody 
to come up with a page with links translated with client-injected 
information! Which is cross-site scripting attacks for dummies!

Believe me, dude, I've thought about this so much when I designed the 
CLI that my head hurt, and when I tried to discuss it on the mailing list 
*nobody* cared (at that point, I think only a few people even 
*understood* what a cocoon view was supposed to be).

But nothing is carved in stone and I don't care what solution we (in 
forrest) can come up with.

>>On the other hand, by implementing a Cocoon-aware CLI, we are gaining 
>>insights from the actual semantic content of the data and we can 
>>manipulate it when it's *still* semantically meaningful (thus easier to 
>>process).
> 
> 
> cocoon-view=links returns links from the decidedly unsemantic HTML, in
> order to get things like skin images.

??? The hyperlink semantics in HTML are the one thing that is 
semantically carved in stone on the web. Otherwise, there wouldn't be 
any Google out there.

>>Don't know about others, but I think it's a much more elegant (and 
>>code-wise cheaper) solution than a semantically-unaware wget-like one.
> 
> 
> Yes, of course it's more elegant.  But _practically_, it is slow and full
> of bugs which no-one has volunteered to fix, and Forrest is suffering
> because of this.
>
> Now why don't I stop whining, get in there and fix it?
> 
> Because in the long run,  I would prefer to develop a separate wget-like
> tool with cocoon-view hacks added to it, than to develop the CLI into a
> full-blown threaded crawler.  Why?  Because a separate tool has a _much_
> larger audience, so will evolve faster.  Yes, a Cocoon CLI may be more
> elegant, but a separate tool can grow geometrically while the CLI grows
> linearly.

Hey, know what? you'd have my full support if you took some of the CLI 
code out of Cocoon and made it part of Forrest. (not all of it, some XSP 
precompilation technology uses it) because I agree with you: the wrong 
community is currently maintaining that code.

[BTW, to give you context, I'm writing this while Jon (Stevens) saw me 
replying to this and now he's going around the house saying 'anakia 
rulez', 'dvsl is the way to go', 'you have to figure out a way to beat 
anakia's speed or you're doomed'... gotta love open source! :)]

-- 
Stefano Mazzocchi                               <st...@apache.org>
--------------------------------------------------------------------



Re: file: implemented (Re: cvs commit: ...)

Posted by Nicola Ken Barozzi <ni...@apache.org>.
Andrew Savory wrote:
> On Fri, 13 Dec 2002, Jeff Turner wrote:
> 
> 
>>Because in the long run,  I would prefer to develop a separate wget-like
>>tool with cocoon-view hacks added to it, than to develop the CLI into a
>>full-blown threaded crawler.  Why?  Because a separate tool has a _much_
>>larger audience, so will evolve faster.  Yes, a Cocoon CLI may be more
>>elegant, but a separate tool can grow geometrically while the CLI grows
>>linearly.
> 
> 
> I can see some serious advantages to splitting the crawler from the CLI:
> when the crawler is there, it would be fantastic to add a "precacher"
> using the crawler (go hit my entire site, including internal cocoon-views)
> rather than the "traditional" approach of running wget on a site. I
> suspect various other things that rely on crawling (such as search
> implementations like the Lucene code) would benefit from the speed
> increase of a dedicated crawler, too.
> 
> I think it would be best done as part of Cocoon rather than Forrest though
> (or am I missing the point *again*? ;-), as there are more ways it would
> be used there.

In Cocoon CVS, there is a scratchpad effort to decouple the crawling 
from the CLI, and an Ant task that can also use that crawler.

So yes, the crawler will most probably be independent of the CLI.

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------


Re: file: implemented (Re: cvs commit: ...)

Posted by Andrew Savory <an...@luminas.co.uk>.
On Fri, 13 Dec 2002, Jeff Turner wrote:

> Because in the long run,  I would prefer to develop a separate wget-like
> tool with cocoon-view hacks added to it, than to develop the CLI into a
> full-blown threaded crawler.  Why?  Because a separate tool has a _much_
> larger audience, so will evolve faster.  Yes, a Cocoon CLI may be more
> elegant, but a separate tool can grow geometrically while the CLI grows
> linearly.

I can see some serious advantages to splitting the crawler from the CLI:
when the crawler is there, it would be fantastic to add a "precacher"
using the crawler (go hit my entire site, including internal cocoon-views)
rather than the "traditional" approach of running wget on a site. I
suspect various other things that rely on crawling (such as search
implementations like the Lucene code) would benefit from the speed
increase of a dedicated crawler, too.

I think it would be best done as part of Cocoon rather than Forrest though
(or am I missing the point *again*? ;-), as there are more ways it would
be used there.


Andrew.

-- 
Andrew Savory                                Email: andrew@luminas.co.uk
Managing Director                              Tel:  +44 (0)870 741 6658
Luminas Internet Applications                  Fax:  +44 (0)700 598 1135
This is not an official statement or order.    Web:    www.luminas.co.uk


Re: file: implemented (Re: cvs commit: ...)

Posted by Jeff Turner <je...@apache.org>.
On Thu, Dec 12, 2002 at 11:10:29AM -0800, Stefano Mazzocchi wrote:
> Jeff Turner wrote:
> >On Thu, Dec 12, 2002 at 12:13:06AM -0800, Stefano Mazzocchi wrote:
> >
> >>Jeff Turner wrote:
> >>
> >>
> >>><rant>
> >>>The CLI is evil and should have been drowned at birth.  The Cocoon CLI
> >>>can best be described as a crappy 'wget' implementation tacked onto the
> >>>side of Cocoon.  It is slow as hell, full of bugs (eg css images) and
> >>>practically unmaintained.  Rewriting wget in a corner of Cocoon was a
> >>>blindingly stupid thing to do, and I am not about to waste my time fixing
> >>>its bugs.  I would rather find a _real_ wget implementation in Java, that
> >>>can handle CSS and doesn't do screwy things with filenames, and IF
> >>>invoking Cocoon through the HTTP interface proves too slow (unlikely),
> >>>then I'd wrap Cocoon in an Avalon block and feed it URLs passed over RMI.
> >>></rant>
> >>
> >>Jeff, tell me, are you aware of how *exactly* the Cocoon CLI works?
> >
> >
> >No.  <rant> should be <uninformed rant>.
> 
> When I talk about something I don't know, I tend to ask questions first, 
> then express my opinions. But that's me.

I was not talking about something I don't know: I was _ranting_ about
something whose code I am fairly familiar with, and of which I have 4
months of painful experience.  The <rant> tags are a hint that what
follows is not a carefully reasoned critique.  Websters defines 'rant'
as:

  "To rave in violent, high-sounding, or extravagant language, without
  dignity of thought"

Please remember the context; Nicola was suggesting an implementation of a
new feature (schemes) that would tie Forrest even tighter to the CLI.  If
it helps, more context is that I was writing at 3am after a day's
fighting with Transformers :P

..
> The Cocoon CLI extensively uses the cocoon-view to do two major things:
> 
>  1) obtaining links
>  2) pushing back translated links
> 
> Cocoon CLI does link translation but it's Cocoon *ITSELF* that places 
> them in the right position and this happens *before* things get serialized.
> 
> If you go the wget path you have to implement a link parser and 
> translator for *every* hypertext-capable binary files our serializers 
> can come up with.

Or just hack it to support cocoon-view=links when it becomes necessary.

> On the other hand, by implementing a Cocoon-aware CLI, we are gaining 
> insights from the actual semantic content of the data and we can 
> manipulate it when it's *still* semantically meaningful (thus easier to 
> process).

cocoon-view=links returns links from the decidedly unsemantic HTML, in
order to get things like skin images.

> Don't know about others, but I think it's a much more elegant (and 
> code-wise cheaper) solution than a semantically-unaware wget-like one.

Yes, of course it's more elegant.  But _practically_, it is slow and full
of bugs which no-one has volunteered to fix, and Forrest is suffering
because of this.

Now why don't I stop whining, get in there and fix it?

Because in the long run,  I would prefer to develop a separate wget-like
tool with cocoon-view hacks added to it, than to develop the CLI into a
full-blown threaded crawler.  Why?  Because a separate tool has a _much_
larger audience, so will evolve faster.  Yes, a Cocoon CLI may be more
elegant, but a separate tool can grow geometrically while the CLI grows
linearly.


--Jeff

> -- 
> Stefano Mazzocchi                               <st...@apache.org>
> --------------------------------------------------------------------
> 
> 

Re: file: implemented (Re: cvs commit: ...)

Posted by Stefano Mazzocchi <st...@apache.org>.
Jeff Turner wrote:
> On Thu, Dec 12, 2002 at 12:13:06AM -0800, Stefano Mazzocchi wrote:
> 
>>Jeff Turner wrote:
>>
>>
>>><rant>
>>>The CLI is evil and should have been drowned at birth.  The Cocoon CLI
>>>can best be described as a crappy 'wget' implementation tacked onto the
>>>side of Cocoon.  It is slow as hell, full of bugs (eg css images) and
>>>practically unmaintained.  Rewriting wget in a corner of Cocoon was a
>>>blindingly stupid thing to do, and I am not about to waste my time fixing
>>>its bugs.  I would rather find a _real_ wget implementation in Java, that
>>>can handle CSS and doesn't do screwy things with filenames, and IF
>>>invoking Cocoon through the HTTP interface proves too slow (unlikely),
>>>then I'd wrap Cocoon in an Avalon block and feed it URLs passed over RMI.
>>></rant>
>>
>>Jeff, tell me, are you aware of how *exactly* the Cocoon CLI works?
> 
> 
> No.  <rant> should be <uninformed rant>.

When I talk about something I don't know, I tend to ask questions first, 
then express my opinions. But that's me.

> Still, can you tell me why Cocoon + lightweight HTTP server + a threaded
> crawler like:
> 
> http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html
> 
> won't be a zillion times faster?  And have a healthier user community,
> because it is sufficiently general to interest multiple parties.

Extracted from o.a.c.Main.java

     /**
      * Processes the given URI and return all links. The algorithm is
      * the following:
      *
      * <ul>
      *  <li>file name for the URI is generated. URI MIME type is checked
      *      for consistency with the URI and, if the extension is
      *      inconsistent or absent, the file name is changed</li>
      *  <li>the link view of the given URI is called and the file names
      *      for linked resources are generated and stored.</li>
      *  <li>for each link, absolute file name is translated to relative
      *      path.</li>
      *  <li>after the complete list of links is translated, the
      *      link-translating view of the resource is called to obtain a
      *      link-translated version of the resource with the given link
      *      map</li>
      *  <li>list of absolute URI is returned, for every URI which is
      *      not yet present in list of all translated URIs</li>
      * </ul>
      * @param uri a <code>String</code> URI to process
      * @return a <code>Collection</code> containing all links found
      * @exception Exception if an error occurs
      */
public Collection processURI(String uri) throws Exception {

The Cocoon CLI extensively uses the cocoon-view to do two major things:

  1) obtaining links
  2) pushing back translated links

Cocoon CLI does link translation but it's Cocoon *ITSELF* that places 
them in the right position and this happens *before* things get serialized.
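The "absolute file name is translated to relative path" step from the
javadoc above is easy to picture with a standalone sketch (hypothetical
code, not Cocoon's actual implementation; it assumes every link target
is expressed relative to the site root):

```java
// Hypothetical sketch of the "absolute file name is translated to
// relative path" step; NOT Cocoon's actual code.  Assumes every link
// target is expressed relative to the site root.
public class LinkTranslator {

    // Translate a site-root-relative target into a path relative to
    // the page containing the link, by climbing out of the source
    // page's directory one "../" per level.
    public static String toRelative(String fromPage, String toTarget) {
        int slash = fromPage.lastIndexOf('/');
        String fromDir = (slash >= 0) ? fromPage.substring(0, slash + 1) : "";

        StringBuilder rel = new StringBuilder();
        for (int i = 0; i < fromDir.length(); i++) {
            if (fromDir.charAt(i) == '/') {
                rel.append("../");
            }
        }
        return rel.append(toTarget).toString();
    }
}
```

The point is that Cocoon knows which output file each page lands in, so
it can do this rewriting itself before serialization.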

If you go the wget path you have to implement a link parser and 
translator for *every* hypertext-capable binary file our serializers 
can come up with.

On the other hand, by implementing a Cocoon-aware CLI, we are gaining 
insights from the actual semantic content of the data and we can 
manipulate it when it's *still* semantically meaningful (thus easier to 
process).

Don't know about others, but I think it's a much more elegant (and 
code-wise cheaper) solution than a semantically-unaware wget-like one.

But again, that's me.

-- 
Stefano Mazzocchi                               <st...@apache.org>
--------------------------------------------------------------------



Re: file: implemented (Re: cvs commit: ...)

Posted by Jeff Turner <je...@apache.org>.
On Thu, Dec 12, 2002 at 12:13:06AM -0800, Stefano Mazzocchi wrote:
> Jeff Turner wrote:
> 
> ><rant>
> >The CLI is evil and should have been drowned at birth.  The Cocoon CLI
> >can best be described as a crappy 'wget' implementation tacked onto the
> >side of Cocoon.  It is slow as hell, full of bugs (eg css images) and
> >practically unmaintained.  Rewriting wget in a corner of Cocoon was a
> >blindingly stupid thing to do, and I am not about to waste my time fixing
> >its bugs.  I would rather find a _real_ wget implementation in Java, that
> >can handle CSS and doesn't do screwy things with filenames, and IF
> >invoking Cocoon through the HTTP interface proves too slow (unlikely),
> >then I'd wrap Cocoon in an Avalon block and feed it URLs passed over RMI.
> ></rant>
> 
> Jeff, tell me, are you aware of how *exactly* the Cocoon CLI works?

No.  <rant> should be <uninformed rant>.

Still, can you tell me why Cocoon + lightweight HTTP server + a threaded
crawler like:

http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html

won't be a zillion times faster?  And have a healthier user community,
because it is sufficiently general to interest multiple parties.


--Jeff

Re: file: implemented (Re: cvs commit: ...)

Posted by Stefano Mazzocchi <st...@apache.org>.
Jeff Turner wrote:

> <rant>
> The CLI is evil and should have been drowned at birth.  The Cocoon CLI
> can best be described as a crappy 'wget' implementation tacked onto the
> side of Cocoon.  It is slow as hell, full of bugs (eg css images) and
> practically unmaintained.  Rewriting wget in a corner of Cocoon was a
> blindingly stupid thing to do, and I am not about to waste my time fixing
> its bugs.  I would rather find a _real_ wget implementation in Java, that
> can handle CSS and doesn't do screwy things with filenames, and IF
> invoking Cocoon through the HTTP interface proves too slow (unlikely),
> then I'd wrap Cocoon in an Avalon block and feed it URLs passed over RMI.
> </rant>

Jeff, tell me, are you aware of how *exactly* the Cocoon CLI works?

-- 
Stefano Mazzocchi                               <st...@apache.org>
--------------------------------------------------------------------



Re: file: implemented (Re: cvs commit: ...)

Posted by Jeff Turner <je...@apache.org>.
On Wed, Dec 11, 2002 at 02:55:10PM +0100, Nicola Ken Barozzi wrote:
> 
> Jeff Turner wrote:
> >On Wed, Dec 11, 2002 at 12:32:01AM +0100, Nicola Ken Barozzi wrote:
> [...]
> >>Sorry if I'm a bit strong on it, but it creates more confusion and 
> >>convolution.
> >>
> >>If we make links not be traversed in the definition,
> >
> >Who suggested that?  All links are traversed _except_ for those with
> >specific schemes like 'file:' or 'javadoc:', which are handled specially.
> >That is why I said the implied scheme is 'cocoon:'.
> 
> I mean if we define that a naming scheme blocks the traversing.

The filterlink.xsl stylesheet strips out links starting with 'file:', so
when Cocoon does a ?cocoon-view=links, the file: links aren't there.
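The stripping itself is the easy part; a minimal sketch of what such a
filter template could look like (hypothetical, not the actual
filterlinks.xsl):

```xml
<!-- Hypothetical sketch of a link filter: drop any <link> whose href
     uses the file: scheme, copy everything else through unchanged.
     Not the actual filterlinks.xsl. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- identity template: copy nodes as-is by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- swallow links whose href starts with 'file:' -->
  <xsl:template match="link[starts-with(@href, 'file:')]"/>

</xsl:stylesheet>
```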

From that, I have a hard time seeing how to deduce:

> >>we are opening wide a door to abuse and failing of link checking.

<snip red herring on http: link checking>

> >>My proposed solution is to
> >>1) make Cocoon use resource-exists to see if it has to process it or not
> >
> >
> >We already have a resource-exists check.  The problem is that the CLI
> >adds a '.html' to the URLs of static resource, because it doesn't know
> >that they are static.  
> 
> It's because it doesn't know the MimeType.

Why does MIME type info matter?  What if a link has no easily guessable
MIME type, like <link href="foo.obj">?

> Hence we need the MimeTypeAction and fix the CLI so that it doesn't add 
> the "default" html to unknown mimetypes.

I think you'd have to remove the "append .html" behaviour altogether.

> It's a CLI bug and a missing feature, let's fix those instead of going 
> round them.

Let's fix the CLI...

<rant>
The CLI is evil and should have been drowned at birth.  The Cocoon CLI
can best be described as a crappy 'wget' implementation tacked onto the
side of Cocoon.  It is slow as hell, full of bugs (eg css images) and
practically unmaintained.  Rewriting wget in a corner of Cocoon was a
blindingly stupid thing to do, and I am not about to waste my time fixing
its bugs.  I would rather find a _real_ wget implementation in Java, that
can handle CSS and doesn't do screwy things with filenames, and IF
invoking Cocoon through the HTTP interface proves too slow (unlikely),
then I'd wrap Cocoon in an Avalon block and feed it URLs passed over RMI.
</rant>

Any enlightenment you can provide as to why the CLI _doesn't_ suck in
concept and implementation would be gratefully received.

But that is a side issue.  Yes, getting the CLI to stop appending '.html'
would fix the immediate problem.  But then we will come to implement a
'javadoc:' protocol, which will trigger further CLI problems.

> >To fix this, the CLI would need access to the
> >xlink:role attribute.  Every new scheme would require a Cocoon CLI hack.
> >
> >If we're going to have a proliferation of schemes, like 'site:',
> >'person:', 'javadoc:', 'mailinglist:', etc etc, doesn't it make sense to
> >have 'file:' as well?  And deal with all of them in Forrest's sitemap,
> >rather than the Cocoon CLI?  For instance, we _could_ hack the CLI to
> >ignore links starting with 'javadoc:', but isn't it easier to prevent
> >them being passed to Cocoon in the first place?  Then all support for the
> >'javadoc:' scheme is within Forrest XSLTs.
> 
> No no no, what has this to do with ignoring schemes?

Cocoon can't handle these schemes, so we must either:

1) Hack the CLI to ignore them and rewrite the HTML
2) Edit filterlinks.xsl to prevent Cocoon from even seeing them.

> You are mixing concerns, I will make a new mail on this to try and
> unroll the loops.

Please do.


--Jeff

Re: file: implemented (Re: cvs commit: ...)

Posted by Nicola Ken Barozzi <ni...@apache.org>.
Jeff Turner wrote:
> On Wed, Dec 11, 2002 at 12:32:01AM +0100, Nicola Ken Barozzi wrote:
[...]
>>Sorry if I'm a bit strong on it, but it creates more confusion and 
>>convolution.
>>
>>If we make links not be traversed in the definition,
> 
> Who suggested that?  All links are traversed _except_ for those with
> specific schemes like 'file:' or 'javadoc:', which are handled specially.
> That is why I said the implied scheme is 'cocoon:'.

I mean if we define that a naming scheme blocks the traversing.

>>we are opening wide a door to abuse and failing of link checking.
> 
> Actually with the current implementation, we can do _more_ link checking.
> Eg, we can record http: links (without rewriting them), and then later in
> forrest.build.xml, check if they really exist.

Cocoon has to do it, let's not do Ant post-processing on this.

>>My proposed solution is to
>>1) make Cocoon use resource-exists to see if it has to process it or not
> 
> 
> We already have a resource-exists check.  The problem is that the CLI
> adds a '.html' to the URLs of static resource, because it doesn't know
> that they are static.  

It's because it doesn't know the MimeType.
Hence we need the MimeTypeAction and fix the CLI so that it doesn't add 
the "default" html to unknown mimetypes.

It's a CLI bug and a missing feature, let's fix those instead of going 
round them.

> To fix this, the CLI would need access to the
> xlink:role attribute.  Every new scheme would require a Cocoon CLI hack.
> 
> If we're going to have a proliferation of schemes, like 'site:',
> 'person:', 'javadoc:', 'mailinglist:', etc etc, doesn't it make sense to
> have 'file:' as well?  And deal with all of them in Forrest's sitemap,
> rather than the Cocoon CLI?  For instance, we _could_ hack the CLI to
> ignore links starting with 'javadoc:', but isn't it easier to prevent
> them being passed to Cocoon in the first place?  Then all support for the
> 'javadoc:' scheme is within Forrest XSLTs.

No no no, what has this to do with ignoring schemes?
You are mixing concerns, I will make a new mail on this to try and 
unroll the loops.

>>2) if it has to process it, use CAPs to understand how
>>3) have the possibility of "mounting" documents from outside, ie if I 
>>mount the /my/javadocs to the javadocs: protocol, I link to 
>>"javadocs:index.html" and it gets resolved to the correct path.
> 
> Yes.  I could implement javadocs: in half an hour if we can agree on
> where to obtain '/my/javadocs/' from.

Which is actually not the point; it has to be configurable a la linkmap...

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------


Re: file: implemented (Re: cvs commit: ...)

Posted by Jeff Turner <je...@apache.org>.
On Wed, Dec 11, 2002 at 12:32:01AM +0100, Nicola Ken Barozzi wrote:
...
> I think you jumped a bit too quickly on this, I'm -1 on it.

Reverted.

> It really looks like a big hack to get round an issue we still have to
> finish discussing.

Well it's all clear in _my_ head.. isn't that enough?  sheesh.. ;P

> I really don't like it, and would have preferred you wait a bit more
> before committing it.

Lazy consensus and all..

Actually, I just wanted to solve the "can't link to external file"
problem, but the implementation implies a larger design choice, so I've
reverted it.

> Sorry if I'm a bit strong on it, but it creates more confusion and 
> convolution.
> 
> If we make links not be traversed in the definition,

Who suggested that?  All links are traversed _except_ for those with
specific schemes like 'file:' or 'javadoc:', which are handled specially.
That is why I said the implied scheme is 'cocoon:'.

> we are opening wide a door to abuse and failing of link checking.

Actually with the current implementation, we can do _more_ link checking.
Eg, we can record http: links (without rewriting them), and then later in
forrest.build.xml, check if they really exist.
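For instance (a hypothetical Ant fragment, not what forrest.build.xml
actually does), a recorded link could be probed with Ant's built-in
<http> condition:

```xml
<!-- Hypothetical sketch: probe a recorded external link with Ant's
     built-in <http> condition.  Not part of forrest.build.xml. -->
<target name="check-external-link">
  <condition property="link.alive">
    <http url="http://jakarta.apache.org/"/>
  </condition>
  <fail unless="link.alive"
        message="Broken external link: http://jakarta.apache.org/"/>
</target>
```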

> My proposed solution is to
> 1) make Cocoon use resource-exists to see if it has to process it or not

We already have a resource-exists check.  The problem is that the CLI
adds a '.html' to the URLs of static resource, because it doesn't know
that they are static.  To fix this, the CLI would need access to the
xlink:role attribute.  Every new scheme would require a Cocoon CLI hack.

If we're going to have a proliferation of schemes, like 'site:',
'person:', 'javadoc:', 'mailinglist:', etc etc, doesn't it make sense to
have 'file:' as well?  And deal with all of them in Forrest's sitemap,
rather than the Cocoon CLI?  For instance, we _could_ hack the CLI to
ignore links starting with 'javadoc:', but isn't it easier to prevent
them being passed to Cocoon in the first place?  Then all support for the
'javadoc:' scheme is within Forrest XSLTs.

> 2) if it has to process it, use CAPs to understand how
> 3) have the possibility of "mounting" documents from outside, ie if I 
> mount the /my/javadocs to the javadocs: protocol, I link to 
> "javadocs:index.html" and it gets resolved to the correct path.

Yes.  I could implement javadocs: in half an hour if we can agree on
where to obtain '/my/javadocs/' from.


--Jeff


Re: file: implemented (Re: cvs commit: ...)

Posted by Jeff Turner <je...@apache.org>.
On Tue, Dec 10, 2002 at 05:04:06PM -0900, Matt Jones wrote:
...
> Jeff's commit looks great!  It would exactly solve a simple problem 
> simply.

Which is the best compliment.. thanks :)

> My only minor concern is that you used a scheme name that may be
> commonly found in URIs ("file"), so you might want to consider
> something less prone to conflicts.

We could use raw: or source: instead, to emphasise that this scheme is
totally specific to xdocs interpreted with Forrest, and will be rewritten
in the output HTML.

However I think that when we have XMLs packed with interesting schemes
like site:, person:, javadoc:, mailinglist:, then file: will fit in
naturally as "this URL refers to a local file".

> I'm not as excited about Nicola's proposal:
> 
> >My proposed solution is to
> >1) make Cocoon use resource-exists to see if it has to process it or not
> >2) if it has to process it, use CAPs to understand how
> >3) have the possibility of "mounting" documents from outside, ie if I 
> >mount the /my/javadocs to the javadocs: protocol, I link to 
> >"javadocs:index.html" and it gets resolved to the correct path.
> >
> 
> This is confusing.  I've been poring over Forrest for the last week, 
> trying to get it to do what I want, and this still doesn't make much 
> sense to me.  I've read cap.html and it seems mildly relevant, but 
> pretty darn indirect compared to having the document author directly 
> specify what they want done with files.  What would I do?  How would I 
> specify that x.pdf is not to be processed nor have its links mangled, 
> but y.pdf should be?  This proposal from a naive perspective seems 
> overly complex for what it provides -- very cocoon-like :)

:)

> I can see a shadow of an idea forming there in which the user need not
> specify at all how to process these files, but cocoon would
> automatically know what to do with files.

Yes, that's the idea.

> But my guess is this will actually require complex configuration for
> each site that is not at all transparent to the user.

I don't think so.. the main problem is that it pushes handling of static
files into the Cocoon command-line.  With an explicit file: system,
everything is kept within the Forrest sitemap.  Proactive link handling,
rather than reactive.

But primarily, I prefer file: because it fits in with the broader
lets-have-lots-of-schemes theory of where Forrest should go.  We still
need to vote on that, hence I reverted the patch.  To get the file:
version of Forrest, type 'cvs update -r with_file_scheme'


--Jeff

> Matt
> 

Re: file: implemented (Re: cvs commit: ...)

Posted by Matt Jones <jo...@nceas.ucsb.edu>.
Nicola Ken Barozzi wrote:
> 
> 
> Jeff Turner wrote:
> 
>> On Tue, Dec 10, 2002 at 11:56:20AM -0000, jefft@apache.org wrote:
<snip>

>>>  Log:
>>>  Add special handling of links that start with 'file:'.
>>>  These links are:
>>>   1) Not passed on to Cocoon (see filterlinks.xsl)
>>>   2) The 'file:' prefix is stripped from the HTML (see 
>>> filterlinks-html.xsl)
>>>   3) All file: links encountered during crawling are recorded in a file,
>>>   'unprocessed-files.txt' (see filterlinks.xsl and linkutils.xsl)
>>>   4) After running Cocoon, forrest.build.xml copies all files listed in
>>>   unprocessed-files.txt to build/site/ manually.  This is achieved 
>>> with a custom
>>>   selector, UncopiedFileSelector.java
>>

Jeff's commit looks great!  It would exactly solve a simple problem 
simply.  As linking in static files will be incredibly common, it would 
be best if it is as simple and transparent to do so as possible. I 
understood this approach immediately upon reading your description, and 
it makes sense.  It would easily let me include those pesky pdf files 
that don't need transformations.  My only minor concern is that you used 
a scheme name that may be commonly found in URIs ("file"), so you 
might want to consider something less prone to conflicts.

I'm not as excited about Nicola's proposal:

> My proposed solution is to
> 1) make Cocoon use resource-exists to see if it has to process it or not
> 2) if it has to process it, use CAPs to understand how
> 3) have the possibility of "mounting" documents from outside, ie if I 
> mount the /my/javadocs to the javadocs: protocol, I link to 
> "javadocs:index.html" and it gets resolved to the correct path.
> 

This is confusing.  I've been poring over Forrest for the last week, 
trying to get it to do what I want, and this still doesn't make much 
sense to me.  I've read cap.html and it seems mildly relevant, but 
pretty darn indirect compared to having the document author directly 
specify what they want done with files.  What would I do?  How would I 
specify that x.pdf is not to be processed nor have its links mangled, 
but y.pdf should be?  This proposal from a naive perspective seems 
overly complex for what it provides -- very cocoon-like :)  I can see a 
shadow of an idea forming there in which the user need not specify at 
all how to process these files, but cocoon would automatically know what 
to do with files.  But my guess is this will actually require complex 
configuration for each site that is not at all transparent to the user.

I'm psyched to check out Jeff's changes and see if it works!

My $0.02.  Thanks for the great work.

Matt


Re: file: implemented (Re: cvs commit: ...)

Posted by Nicola Ken Barozzi <ni...@apache.org>.

Jeff Turner wrote:
> On Tue, Dec 10, 2002 at 11:56:20AM -0000, jefft@apache.org wrote:
> 
>>jefft       2002/12/10 03:56:20
>>
>>  Modified:    .        build.xml status.xml
>>               src/resources/conf sitemap.xmap
>>               src/resources/forrest-shbat forrest.build.xml
>>               src/resources/fresh-site/src/documentation/content/xdocs
>>                        sample.xml
>>               src/resources/library/xslt filterlinks.xsl
>>  Added:       src/resources/forrest-shbat/tasks/org/apache/forrest
>>                        UncopiedFileSelector.java
>>               src/resources/fresh-site/src/documentation/content hello.pdf
>>               src/resources/library/xslt filterlinks-html.xsl
>>                        linkutils.xsl
>>  Log:
>>  Add special handling of links that start with 'file:'.
>>  These links are:
>>   1) Not passed on to Cocoon (see filterlinks.xsl)
>>   2) The 'file:' prefix is stripped from the HTML (see filterlinks-html.xsl)
>>   3) All file: links encountered during crawling are recorded in a file,
>>   'unprocessed-files.txt' (see filterlinks.xsl and linkutils.xsl)
>>   4) After running Cocoon, forrest.build.xml copies all files listed in
>>   unprocessed-files.txt to build/site/ manually.  This is achieved with a custom
>>   selector, UncopiedFileSelector.java
> 
> 
> 
> Btw, the seed webapp includes an example of this.  The file
> src/documentation/content/hello.pdf is linked to in samples.xml, with
> <link href="file:hello.pdf">.
> 
> The implementation is pretty ugly.  I found that appending to an external
> file in XSLT is a royal PITA.  It currently works with a Xalan
> <redirect:write> extension.  While XSLT 2.0 (Saxon) implements an
> equivalent <xsl:redirect-document>, it can't append to an existing file.
> I'm currently trying to write a Transformer that records 'file:' links to
> a WriteableSource, to replace all this hacky XSLT.

I think you jumped a bit too quickly on this, I'm -1 on it.

It really looks like a big hack to get round an issue we still have to 
finish discussing. I really don't like it, and would have preferred you 
wait a bit more before committing it.

Sorry if I'm a bit strong on it, but it creates more confusion and 
convolution.

If we make links not be traversed in the definition, we are opening wide 
a door to abuse and failing of link checking.

My proposed solution is to
1) make Cocoon use resource-exists to see if it has to process it or not
2) if it has to process it, use CAPs to understand how
3) have the possibility of "mounting" documents from outside, ie if I 
mount the /my/javadocs to the javadocs: protocol, I link to 
"javadocs:index.html" and it gets resolved to the correct path.
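In sitemap terms, point 3 might look something like this (a hypothetical
sketch only; it assumes the javadocs: link has already been rewritten to
a javadocs/ URI path, and the mount point would still need to be
configurable):

```xml
<!-- Hypothetical sitemap sketch of point 3: serve mounted javadocs
     from an external directory.  The pattern and path are
     illustrative only. -->
<map:match pattern="javadocs/**">
  <map:read src="/my/javadocs/{1}"/>
</map:match>
```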

We must also remember that our goal is not to copy anything but to 
have it processed by Cocoon in a "natural" manner.

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)
---------------------------------------------------------------------


Re: file: implemented (Re: cvs commit: ...)

Posted by Jeff Turner <je...@apache.org>.
On Tue, Dec 10, 2002 at 07:07:24PM +0100, Marc Portier wrote:
> Jeff Turner wrote:
...
> I'm still wondering about other ways, and I have some vague 
> memory of using a file-exists action before deciding on read or 
> skinned pipeline (which would be based on CAP)...
> 
> but since this is what we have currently, I'd like to support it 
>  a bit further first
> 
> one small question: suppose I take up a 
> file:./build/api/index.html, would it then also browse over all 
> the html that is referenced from there?

No, file: URIs can only specify files in src/documentation/content.  For
example, index.xml could link to its source XML with <link
href="file:xdocs/index.xml">source</link>.

There is a deeper misunderstanding here that _everyone_ is making :)
which I'll address in another email.

...
> >I'm currently trying to write a Transformer that records 'file:' links to
> >a WriteableSource, to replace all this hacky XSLT.
> 
> would you scan all the attributes (e.g. figure/@src as well?)

Not currently, but it could be arranged if required.. it currently looks like
this:

    <map:transform type="linklogger">
       <map:parameter name="tofile" value="context:/linkfile.log"/>
       <map:parameter name="schemes" value="file: java: person: info:"/>
       <map:parameter name="exclude-schemes" value="http: https:"/>
    </map:transform>

Where 'schemes' and 'exclude-schemes' specify which links to log.
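The scheme matching itself is trivial; a standalone sketch of the kind
of check such a transformer might make (hypothetical, not the attached
source):

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the scheme test a link-logging transformer
// might apply; not the attached source.
public class SchemeMatcher {
    private final List<String> schemes;
    private final List<String> excluded;

    public SchemeMatcher(String schemes, String excludeSchemes) {
        // Both parameters are space-separated lists, as in the
        // <map:parameter> example above.
        this.schemes = Arrays.asList(schemes.split("\\s+"));
        this.excluded = Arrays.asList(excludeSchemes.split("\\s+"));
    }

    // Log a link iff it starts with a configured scheme and not with
    // an excluded one.
    public boolean shouldLog(String href) {
        for (String s : excluded) {
            if (href.startsWith(s)) return false;
        }
        for (String s : schemes) {
            if (href.startsWith(s)) return true;
        }
        return false;
    }
}
```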

Attached is the not-finished-but-working source.

> if you think I (we) could be of help, just check in what you 
> already have, and express some direction of thought in there...

Will do.

--Jeff

Re: file: implemented (Re: cvs commit: ...)

Posted by Marc Portier <mp...@outerthought.org>.
Jeff Turner wrote:
> On Tue, Dec 10, 2002 at 11:56:20AM -0000, jefft@apache.org wrote:
> 
>>jefft       2002/12/10 03:56:20
>>
>>  Modified:    .        build.xml status.xml
>>               src/resources/conf sitemap.xmap
>>               src/resources/forrest-shbat forrest.build.xml
>>               src/resources/fresh-site/src/documentation/content/xdocs
>>                        sample.xml
>>               src/resources/library/xslt filterlinks.xsl
>>  Added:       src/resources/forrest-shbat/tasks/org/apache/forrest
>>                        UncopiedFileSelector.java
>>               src/resources/fresh-site/src/documentation/content hello.pdf
>>               src/resources/library/xslt filterlinks-html.xsl
>>                        linkutils.xsl
>>  Log:
>>  Add special handling of links that start with 'file:'.
>>  These links are:
>>   1) Not passed on to Cocoon (see filterlinks.xsl)
>>   2) The 'file:' prefix is stripped from the HTML (see filterlinks-html.xsl)
>>   3) All file: links encountered during crawling are recorded in a file,
>>   'unprocessed-files.txt' (see filterlinks.xsl and linkutils.xsl)
>>   4) After running Cocoon, forrest.build.xml copies all files listed in
>>   unprocessed-files.txt to build/site/ manually.  This is achieved with a custom
>>   selector, UncopiedFileSelector.java
> 
> 
I'm still wondering about other ways, and I have some vague 
memory of using a file-exists action before deciding on read or 
skinned pipeline (which would be based on CAP)...

but since this is what we have currently, I'd like to support it 
  a bit further first

one small question: suppose I take up a 
file:./build/api/index.html, would it then also browse over all 
the html that is referenced from there?

> 
> Btw, the seed webapp includes an example of this.  The file
> src/documentation/content/hello.pdf is linked to in samples.xml, with
> <link href="file:hello.pdf">.
> 
> The implementation is pretty ugly.  I found that appending to an external
> file in XSLT is a royal PITA.  It currently works with a Xalan
> <redirect:write> extension.  While XSLT 2.0 (Saxon) implements an
> equivalent <xsl:redirect-document>, it can't append to an existing file.
> I'm currently trying to write a Transformer that records 'file:' links to
> a WriteableSource, to replace all this hacky XSLT.

would you scan all the attributes (e.g. figure/@src as well?)

if you think I (we) could be of help, just check in what you 
already have, and express some direction of thought in there...

> 
> 
> --Jeff
> 

regards,
-marc=
-- 
Marc Portier                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
mpo@outerthought.org                              mpo@apache.org