Posted to dev@forrest.apache.org by Jeff Turner <je...@apache.org> on 2002/12/13 11:14:26 UTC

Crawlers (Re: file: implemented)

On Fri, Dec 13, 2002 at 09:53:07AM +0000, Andrew Savory wrote:
> 
> On Fri, 13 Dec 2002, Jeff Turner wrote:
> 
> > Because in the long run, I would prefer to develop a separate wget-like
> > tool with cocoon-view hacks added to it, than to develop the CLI into a
> > full-blown threaded crawler.  Why?  Because a separate tool has a _much_
> > larger audience, so will evolve faster.  Yes, a Cocoon CLI may be more
> > elegant, but a separate tool can grow geometrically while the CLI grows
> > linearly.
> 
> I can see some serious advantages to splitting the crawler from the CLI:
> when the crawler is there, it would be fantastic to add a "precacher"
> using the crawler (go hit my entire site, including internal cocoon-views)
> rather than the "traditional" approach of running wget on a site. I
> suspect various other things that rely on crawling (such as search
> implementations like the Lucene code) would benefit from the speed
> increase of a dedicated crawler, too.

Yes.  In fact, the only decent threaded Java crawler I've found so far is
in Lucene's sandbox:

http://jakarta.apache.org/lucene/docs/lucene-sandbox/larm/overview.html

Reading that overview shows what a tricky business it is to write a
_good_ crawler.  Trying to evolve the Cocoon CLI to this level of
sophistication seems... silly :)  I would rather start with this good,
external implementation, and add any Cocoon-specific hacks required.

> I think it would be best done as part of Cocoon rather than Forrest though
> (or am I missing the point *again*? ;-), as there are more ways it would
> be used there.

As a general-purpose tool, I think it should be developed outside of both
Cocoon and Forrest, to attract the greatest possible number of
users/developers.

--Jeff

> Andrew.
> 
> -- 
> Andrew Savory                                Email: andrew@luminas.co.uk
> Managing Director                              Tel:  +44 (0)870 741 6658
> Luminas Internet Applications                  Fax:  +44 (0)700 598 1135
> This is not an official statement or order.    Web:    www.luminas.co.uk
>