Posted to dev@cocoon.apache.org by Bernhard Huber <be...@a1.net> on 2002/02/02 20:07:07 UTC
Crawler/Indexer redesign
hi,
As I'm not totally happy with the Crawler and Indexer component
interfaces, I want to address some issues here:
Today CocoonCrawler exposes:
void crawl(URL), and Iterator iterator();
crawl() sets the base URL, and iterator() delivers the URLs reachable
from the base URL.
I have some headaches using URL objects in the command-line environment.
The only simple possibility is to use file: URLs, which implies storing
the crawled XML document in the filesystem. But I want to avoid storing
it in the filesystem, for the sake of performance.
Thus I was thinking of changing the interface to:
void crawl(Source), and Iterator iterator();
i.e. working with Source objects instead of URL objects.
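A minimal sketch of the proposed shape in Java; the Source stand-in here
is a hypothetical stripped-down version of Cocoon's real Source
abstraction (which is richer), and ToyCrawler is purely illustrative:

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical minimal stand-in for Cocoon's Source abstraction;
// the real interface comes from the Cocoon/Avalon code base.
interface Source {
    String getSystemId();
}

// Sketch of the proposed CocoonCrawler interface, working on
// Source objects instead of java.net.URL.
interface CocoonCrawler {
    void crawl(Source base);   // set the base source to crawl from
    Iterator iterator();       // deliver the sources reachable from it
}

// Toy implementation: "reaches" two fixed child ids, just to show the
// calling shape. It yields system-id strings for brevity; a real
// implementation would yield Source objects.
class ToyCrawler implements CocoonCrawler {
    private Source base;
    public void crawl(Source base) { this.base = base; }
    public Iterator iterator() {
        return Arrays.asList(new String[] {
            base.getSystemId() + "/a",
            base.getSystemId() + "/b" }).iterator();
    }
}
```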
The LuceneCocoonIndexer should also change from using URL to using Source.
The main reason for this change: as implemented today, crawling and
indexing work only over the http: protocol. If you want to index XML
documents of the local Cocoon, or if you want to create an index in the
command-line version of Cocoon, you may not be able to use the http
protocol. Thus I was thinking about using Source.
Perhaps someone with a broader and more detailed understanding of the
Cocoon internals could help me a bit.
bye bernhard
---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org
RE: Crawler/Indexer redesign
Posted by Vadim Gritsenko <va...@verizon.net>.
> From: Bernhard Huber [mailto:berni_huber@a1.net]
>
> hi,
>
> >How about
> >
> > Collection crawl(Source)
> >
> >? Then crawler can be ThreadSafe.
> >
> Yes, it would be ThreadSafe, storing all crawled resources in the
> collection.
> Does this work for crawling huge sites?
>
> My idea was to handle that problem by introducing the Iterator.
> Using an Iterator might allow processing some crawled resources quite
> early. Using a Collection might delay the processing of the crawled
> resources until the crawling has terminated, which might take quite
> some time.
>
> Hence it might be better:
> Iterator crawl(Source)
Go for it. Just make sure you are not buffering results from this
Iterator somewhere down the pipe ;)
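For illustration, a non-buffering consumer of the kind suggested here:
each crawled resource goes to the indexing step as soon as it is
produced, and no intermediate Collection is built up. `indexOne` is a
hypothetical placeholder for the real indexing call, and plain Strings
stand in for crawled resources:

```java
import java.util.Iterator;

// Sketch: stream the crawl results straight into indexing, without
// buffering them anywhere down the pipe.
class StreamingConsumer {
    static int indexed = 0;

    // hypothetical placeholder for the real per-resource indexing call
    static void indexOne(String systemId) { indexed++; }

    static void consume(Iterator crawlResults) {
        while (crawlResults.hasNext()) {
            indexOne((String) crawlResults.next()); // process immediately
        }
        // note: no intermediate List/Collection was created
    }
}
```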
Vadim
>
> bye bernhard
Re: Crawler/Indexer redesign
Posted by Bernhard Huber <be...@a1.net>.
hi,
>How about
>
> Collection crawl(Source)
>
>? Then crawler can be ThreadSafe.
>
Yes, it would be ThreadSafe, storing all crawled resources in the
collection.
Does this work for crawling huge sites?
My idea was to handle that problem by introducing the Iterator.
Using an Iterator might allow processing some crawled resources quite early.
Using a Collection might delay the processing of the crawled resources
until the crawling has terminated, which might take quite some time.
Hence it might be better:
Iterator crawl(Source)
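A sketch of why Iterator crawl(Source) still allows a ThreadSafe
crawler: all per-crawl state lives in the returned Iterator, not in the
component itself, so concurrent crawls don't interfere. This is
illustrative only, not Cocoon code, and a String stands in for Source:

```java
import java.util.Iterator;
import java.util.LinkedList;

// Sketch: the crawler keeps no per-crawl fields; each crawl() call
// returns its own iterator over its own frontier, so the component
// can be marked ThreadSafe.
class StatelessCrawler {
    public Iterator crawl(String base) {   // String stands in for Source
        final LinkedList frontier = new LinkedList();
        frontier.add(base);
        return new Iterator() {
            public boolean hasNext() { return !frontier.isEmpty(); }
            public Object next() {
                // a real crawler would fetch the resource here and
                // enqueue the links found in it; this toy enqueues nothing
                return frontier.removeFirst();
            }
            public void remove() { throw new UnsupportedOperationException(); }
        };
    }
}
```

Two concurrent crawl() calls each get an independent frontier, which is
what makes the stateless design safe to share.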
bye bernhard
RE: Crawler/Indexer redesign
Posted by Vadim Gritsenko <va...@verizon.net>.
> From: Bernhard Huber [mailto:berni_huber@a1.net]
>
> hi,
>
> As I'm not totally happy with the Crawler and Indexer component
> interfaces, I want to address some issues here:
>
> Today CocoonCrawler exposes:
> void crawl(URL), and Iterator iterator();
> crawl() sets the base URL, and iterator() delivers the URLs reachable
> from the base URL.
> I have some headaches using URL objects in the command-line
> environment. The only simple possibility is to use file: URLs, which
> implies storing the crawled XML document in the filesystem. But I want
> to avoid storing it in the filesystem, for the sake of performance.
>
> Thus I was thinking of changing the interface to:
> void crawl(Source), and Iterator iterator();
> i.e. working with Source objects instead of URL objects.
How about
Collection crawl(Source)
? Then crawler can be ThreadSafe.
Vadim
> The LuceneCocoonIndexer should also change from using URL to using
> Source.
>
> The main reason for this change: as implemented today, crawling and
> indexing work only over the http: protocol. If you want to index XML
> documents of the local Cocoon, or if you want to create an index in
> the command-line version of Cocoon, you may not be able to use the
> http protocol. Thus I was thinking about using Source.
>
> Perhaps someone with a broader and more detailed understanding of the
> Cocoon internals could help me a bit.
>
> bye bernhard