Posted to dev@forrest.apache.org by Bruno Dumon <br...@outerthought.org> on 2002/12/13 11:02:56 UTC

cocoon crawler, wget, the problem of extracting links

After all the discussions about the crawler, it might be good to come
back to the original problem: suppose a user has a bunch of files
generated by some foreign tool (e.g. javadoc, but could be anything),
and wants to publish these as part of a forrest site, how should this
work?

In a live webapp there's no problem, since the browser will send
requests for specific files which will then be served using map:read.

The crawler, on the other hand, should somehow be able to find out all
the links in these files. While we might be able to implement a
link-view for CSS and HTML, it becomes practically impossible to
retrieve links from Flash movies or PDF files, or maybe some special
file type interpreted by some special browser plugin. There's no way
that we'll ever be able to support extracting links from all existing
file types. (This is a problem with both the crawler and any wget-like
solution, though we could of course choose not to support these
special file types.)

So maybe we should start thinking about a completely different way to
solve this?

The easy solution for us would be to tell the user to make the files
available somewhere on an HTTP server, and to use http: links to link
to those files.

Another solution would be to make a list of URLs for all these files
and feed that to the crawler. The thing that builds this list would of
course need to make some assumptions about how files on the filesystem
are mapped into the URL space.
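
Something like the following could build that list: just a sketch,
assuming a one-to-one mapping from a directory of generated files onto
a URL prefix (the class name and the "build/javadocs" and "apidocs/"
examples are all made up for illustration):

// Hypothetical list-builder (not part of Forrest): walks a directory of
// generated files and prints one crawler-ready URL per file, assuming the
// directory is mounted one-to-one under a prefix of the site's URL space.
import java.io.File;

public class UrlListBuilder {

    /** Recursively print "<urlPrefix><relative path>" for every file. */
    static void listUrls(File dir, String urlPrefix) {
        File[] children = dir.listFiles();
        if (children == null) {
            return; // not a directory, or unreadable
        }
        for (File child : children) {
            if (child.isDirectory()) {
                listUrls(child, urlPrefix + child.getName() + "/");
            } else {
                System.out.println(urlPrefix + child.getName());
            }
        }
    }

    public static void main(String[] args) {
        // e.g. java UrlListBuilder build/javadocs apidocs/
        listUrls(new File(args[0]), args[1]);
    }
}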

-- 
Bruno Dumon                             http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
bruno@outerthought.org


Re: cocoon crawler, wget, the problem of extracting links

Posted by Keiron Liddle <ke...@aftexsw.com>.
On Fri, 2002-12-13 at 11:02, Bruno Dumon wrote:
> After all the discussions about the crawler, it might be good to come
> back to the original problem: suppose a user has a bunch of files
> generated by some foreign tool (e.g. javadoc, but could be anything),
> and wants to publish these as part of a forrest site, how should this
> work?

If it is a clearly defined set of files, then why not copy all the data
across and keep a list of those files for checking any links into that
file set? The difference is that the servlet will return only one file
at a time, whereas the crawler could copy all the files. We would still
need to deal with links pointing out of that set of files.
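
A rough sketch of what I mean (all names made up, and plain java.nio
rather than anything Cocoon- or Forrest-specific): copy the set,
remember the relative paths, and check links against that set:

// Hypothetical sketch: copy the whole file set to the destination and keep
// the relative paths in a set, so links pointing into that set can be
// checked without parsing the files themselves.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Stream;

public class FileSetCopier {

    private final Set<String> copied = new HashSet<>();

    /** Copy every file under 'src' to 'dest' and remember its relative path. */
    void copyAll(Path src, Path dest) throws IOException {
        try (Stream<Path> files = Files.walk(src)) {
            for (Path file : (Iterable<Path>) files::iterator) {
                if (Files.isRegularFile(file)) {
                    Path rel = src.relativize(file);
                    Path target = dest.resolve(rel);
                    Files.createDirectories(target.getParent());
                    Files.copy(file, target, StandardCopyOption.REPLACE_EXISTING);
                    copied.add(rel.toString().replace('\\', '/'));
                }
            }
        }
    }

    /** A link into the copied set is valid only if that file was copied. */
    boolean isKnownLink(String relativeLink) {
        return copied.contains(relativeLink);
    }
}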

> In a live webapp there's no problem, since the browser will send
> requests for specific files which will then be served using map:read.
> 
> The crawler, on the other hand, should somehow be able to find out all
> the links in these files. While we might be able to implement a
> link-view for CSS and HTML, it becomes practically impossible to
> retrieve links from Flash movies or PDF files, or maybe some special
> file type interpreted by some special browser plugin. There's no way
> that we'll ever be able to support extracting links from all existing
> file types. (This is a problem with both the crawler and any wget-like
> solution, though we could of course choose not to support these
> special file types.)

Getting the links from a PDF file would be quite easy: all that is
needed is a simple PDF format parser and something to read the links.
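
For illustration only, here is one way that could look using Apache
PDFBox (2.x API); PDFBox is just one possible library, not something
Forrest ships or depends on:

// Illustrative only: pull the URI out of every link annotation in a PDF.
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.action.PDActionURI;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink;

public class PdfLinkExtractor {

    /** Collect the target URI of every link annotation in the document. */
    static List<String> extractLinks(File pdf) throws IOException {
        List<String> links = new ArrayList<>();
        try (PDDocument doc = PDDocument.load(pdf)) {
            for (PDPage page : doc.getPages()) {
                for (PDAnnotation annotation : page.getAnnotations()) {
                    if (annotation instanceof PDAnnotationLink) {
                        PDAnnotationLink link = (PDAnnotationLink) annotation;
                        if (link.getAction() instanceof PDActionURI) {
                            links.add(((PDActionURI) link.getAction()).getURI());
                        }
                    }
                }
            }
        }
        return links;
    }
}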

The point I would make is: make it easy to plug in such a link-view,
and encourage people to do so.

> So maybe we should start thinking about a completely different way to
> solve this?
> 
> The easy solution for us would be to tell the user to make the files
> available somewhere on an HTTP server, and to use http: links to link
> to those files.
> 
> Another solution would be to make a list of URLs for all these files
> and feed that to the crawler. The thing that builds this list would of
> course need to make some assumptions about how files on the filesystem
> are mapped into the URL space.


Re: cocoon crawler, wget, the problem of extracting links

Posted by Nicola Ken Barozzi <ni...@apache.org>.

Steven Noels wrote:
> Bruno Dumon wrote:
> 
>> Another solution would be to make a list of URLs for all these files
>> and feed that to the crawler. The thing that builds this list would of
>> course need to make some assumptions about how files on the filesystem
>> are mapped into the URL space.
> 
> Or vice-versa.
> 
> I'm still stuck with this idea of having a LinkResolverTransformer which, 
> given a configuration of schemes and their respective source resolution, 
> would rewrite links as needed. It might be "boneheaded me", and 
> orthogonal/supplementary to the sitemap and what is currently put 
> forward, but I want to do my thinking in public.

[...]

> Does this make sense at all?

Yes, it does.

It's exactly the same concept as in my "Concern 1" section about link
lookup and resolving. I modeled it as an action, but forgot to add the
transformation of links, which you explain here.

+1 (about the concept, we will see what makes more sense 
implementation-wise)

-- 
Nicola Ken Barozzi                   nicolaken@apache.org
             - verba volant, scripta manent -
    (discussions get forgotten, just code remains)


Re: cocoon crawler, wget, the problem of extracting links

Posted by Steven Noels <st...@outerthought.org>.
Bruno Dumon wrote:

> Another solution would be to make a list of URLs for all these files
> and feed that to the crawler. The thing that builds this list would of
> course need to make some assumptions about how files on the filesystem
> are mapped into the URL space.

Or vice-versa.

I'm still stuck with this idea of having a LinkResolverTransformer which, 
given a configuration of schemes and their respective source resolution, 
would rewrite links as needed. It might be "boneheaded me", and 
orthogonal/supplementary to the sitemap and what is currently put 
forward, but I want to do my thinking in public.

Let me try to explain where I'm aiming at:

<warning>Steven's massive FS capabilities ahead ;-)</warning>

instance plop.xml:

<?xml version="1.0"?>
<document>
   <p>This is a <link href="file:images/plop.png">plop</link></p>
</document>

pipeline:

<generate src="plop.xml"/>
<transform type="link" name="linkresolutionset1"/>
<transform ...
<serialize/>

and some config, perhaps using inputmodules, for that transformer:

<linkresolver>
   <scheme name="file">
     <match pattern="**">
       <pipeline target="cocoon:/{1}"/>
     </match>
   </scheme>
   <scheme name="javadoc">
     <match pattern="**">
       <static src="{context}/../ROOT/static/javadoc/{1}"/>
     </match>
   </scheme>
   <scheme name="ldap">
     <match pattern="**">
       <ldapquery...
     </match>
   </scheme>
</linkresolver>

Most of what this transformer does could be done using XSLT, but doing
it in code, using some hierarchical configuration à la JXPath, would be
coolio.
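
To make the idea concrete, here is a rough sketch of the link-rewriting
logic as a plain SAX filter rather than a real Cocoon transformer; the
hard-coded scheme map stands in for the <linkresolver> configuration
above, and the class name and mappings are made up:

// Rough sketch only: rewrites href attributes whose scheme is known,
// leaving everything else untouched. A real Cocoon transformer would read
// its scheme-to-target mapping from configuration instead of hard-coding it.
import java.util.HashMap;
import java.util.Map;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

public class LinkRewritingFilter extends XMLFilterImpl {

    /** scheme -> replacement prefix, e.g. "javadoc" -> "apidocs/". */
    private final Map<String, String> schemes = new HashMap<>();

    public LinkRewritingFilter() {
        schemes.put("javadoc", "apidocs/");
        schemes.put("file", "");
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        String href = atts.getValue("href");
        if (href != null) {
            int colon = href.indexOf(':');
            if (colon > 0) {
                String prefix = schemes.get(href.substring(0, colon));
                if (prefix != null) {
                    // Replace "scheme:rest" with "prefix" + "rest".
                    AttributesImpl rewritten = new AttributesImpl(atts);
                    rewritten.setValue(rewritten.getIndex("href"),
                                       prefix + href.substring(colon + 1));
                    super.startElement(uri, localName, qName, rewritten);
                    return;
                }
            }
        }
        super.startElement(uri, localName, qName, atts);
    }
}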

Does this make sense at all?

</Steven>
-- 
Steven Noels                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at              http://radio.weblogs.com/0103539/
stevenn at outerthought.org                stevenn at apache.org