Posted to dev@cocoon.apache.org by Bernhard Huber <be...@a1.net> on 2002/02/02 20:07:07 UTC

Crawler/Indexer redesign

  hi,

As I'm not totally happy with the Crawler and Indexer component 
interfaces, I want to address some issues here:

Today CocoonCrawler exposes:
 void crawl(URL), and Iterator iterator();
crawl() sets the base URL, and iterator() delivers the URLs reachable 
from the base URL, one at a time.
I have some headaches using URL objects in the command-line environment.
The only simple possibility is to use file: URLs, which implies storing 
each crawled xml document to the filesystem. But I want to avoid storing 
it to the filesystem for the sake of performance.

Thus I was thinking of changing the interface to:
void crawl(Source), and Iterator iterator();
i.e. working with Source objects instead of URL objects.
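
A minimal sketch of what that shape might look like. The Source interface below is only a local stand-in for Cocoon's source abstraction, and StubCrawler is purely hypothetical, just to show the calling convention:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Stand-in for Cocoon's Source abstraction; the real one lives elsewhere.
interface Source {
    String getURI();
}

// Proposed interface: crawl() takes a Source instead of a java.net.URL.
interface CocoonCrawler {
    void crawl(Source base);     // set the base source to crawl from
    Iterator<Source> iterator(); // the sources reachable from the base
}

// Toy implementation just to illustrate the two-call protocol.
class StubCrawler implements CocoonCrawler {
    private final List<Source> found = new ArrayList<>();
    public void crawl(Source base) { found.add(base); }
    public Iterator<Source> iterator() { return found.iterator(); }
}
```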

The LuceneCocoonIndexer should also change from using URL to using Source.

The main reason for this change is that crawling and indexing today 
work only over the http: protocol.
If you want to index xml documents of the local Cocoon, or if you want 
to create an index in the command-line version of Cocoon, you may not be 
able to use the http protocol.
Thus I was thinking about using Source.

Perhaps someone with a broader and more detailed understanding of the 
Cocoon internals could help me a bit.

bye bernhard




---------------------------------------------------------------------
To unsubscribe, e-mail: cocoon-dev-unsubscribe@xml.apache.org
For additional commands, email: cocoon-dev-help@xml.apache.org


RE: Crawler/Indexer redesign

Posted by Vadim Gritsenko <va...@verizon.net>.
> From: Bernhard Huber [mailto:berni_huber@a1.net]
> 
> hi,
> 
> >How about
> >
> >  Collection crawl(Source)
> >
> >? Then crawler can be ThreadSafe.
> >
> Yes, it would be ThreadSafe, storing all crawled resources in the
> collection.
> Does this work for crawling huge sites?
> 
> My idea was to handle that problem by introducing the Iterator.
> Using an Iterator might allow some crawled resources to be processed
> quite early.
> Using a Collection might delay the processing of the crawled resources
> until the crawling has terminated, which might take quite some time.
> 
> Hence it might be better:
> Iterator crawl(Source)

Go for it. Just make sure you are not buffering results from this
Iterator somewhere down the pipe ;)

Vadim

> 
> bye bernhard




Re: Crawler/Indexer redesign

Posted by Bernhard Huber <be...@a1.net>.
hi,

>How about 
>
>  Collection crawl(Source)
>
>? Then crawler can be ThreadSafe.
>
Yes, it would be ThreadSafe, storing all crawled resources in the 
collection.
Does this work for crawling huge sites?

My idea was to handle that problem by introducing the Iterator.
Using an Iterator might allow some crawled resources to be processed 
quite early.
Using a Collection might delay the processing of the crawled resources 
until the crawling has terminated, which might take quite some time.

Hence it might be better:
Iterator crawl(Source)

bye bernhard
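
The trade-off described above can be sketched as a crawl() that returns a lazy Iterator over a breadth-first frontier, so a caller can index each resource as soon as it is discovered instead of waiting for the whole crawl. All names here are hypothetical, and the link graph is a plain Map standing in for real link extraction:

```java
import java.util.*;

// Hypothetical sketch: Iterator crawl(base) that discovers links lazily.
// Nothing is buffered up front; the frontier is expanded only when the
// caller asks for the next item.
class LazyCrawler {
    private final Map<String, List<String>> links; // uri -> outgoing links

    LazyCrawler(Map<String, List<String>> links) {
        this.links = links;
    }

    Iterator<String> crawl(String baseUri) {
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        frontier.add(baseUri);
        seen.add(baseUri);
        return new Iterator<String>() {
            public boolean hasNext() { return !frontier.isEmpty(); }
            public String next() {
                String uri = frontier.remove();
                // Expand this resource's links only now, on demand.
                for (String out : links.getOrDefault(uri, Collections.emptyList())) {
                    if (seen.add(out)) frontier.add(out);
                }
                return uri;
            }
        };
    }
}
```

This is exactly the shape Vadim warns about: collecting the iterator's output into a list somewhere downstream would silently reintroduce the Collection behaviour.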







RE: Crawler/Indexer redesign

Posted by Vadim Gritsenko <va...@verizon.net>.
> From: Bernhard Huber [mailto:berni_huber@a1.net]
> 
>   hi,
> 
> As I'm not totally happy with the Crawler and Indexer component
> interfaces, I want to address some issues here:
> 
> Today CocoonCrawler exposes:
>  void crawl(URL), and Iterator iterator();
> crawl() sets the base URL, and iterator() delivers the URLs reachable
> from the base URL, one at a time.
> I have some headaches using URL objects in the command-line environment.
> The only simple possibility is to use file: URLs, which implies storing
> each crawled xml document to the filesystem. But I want to avoid storing
> it to the filesystem for the sake of performance.
> 
> Thus I was thinking of changing the interface to:
> void crawl(Source), and Iterator iterator();
> i.e. working with Source objects instead of URL objects.

How about 

  Collection crawl(Source)

? Then crawler can be ThreadSafe.


Vadim

 
> The LuceneCocoonIndexer should also change from using URL to using
> Source.
> 
> The main reason for this change is that crawling and indexing today
> work only over the http: protocol.
> If you want to index xml documents of the local Cocoon, or if you want
> to create an index in the command-line version of Cocoon, you may not
> be able to use the http protocol.
> Thus I was thinking about using Source.
> 
> Perhaps someone with a broader and more detailed understanding of the
> Cocoon internals could help me a bit.
> 
> bye bernhard
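
The ThreadSafe point can be illustrated with a crawl() variant that keeps all its working state in local variables and returns a finished Collection, so a single crawler instance could serve concurrent callers. As before, the names and the Map-based link graph are hypothetical stand-ins:

```java
import java.util.*;

// Hypothetical sketch: because crawl() stores nothing on the instance,
// the component itself can remain stateless between calls, which is what
// makes a ThreadSafe lifestyle possible.
class CollectingCrawler {
    private final Map<String, List<String>> links; // uri -> outgoing links

    CollectingCrawler(Map<String, List<String>> links) {
        this.links = links;
    }

    // All crawl state (result, frontier, seen) is local to this call.
    Collection<String> crawl(String baseUri) {
        List<String> result = new ArrayList<>();
        Deque<String> frontier = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        frontier.add(baseUri);
        seen.add(baseUri);
        while (!frontier.isEmpty()) {
            String uri = frontier.remove();
            result.add(uri);
            for (String out : links.getOrDefault(uri, Collections.<String>emptyList())) {
                if (seen.add(out)) frontier.add(out);
            }
        }
        return result;
    }
}
```

The cost, as discussed above, is that nothing is returned to the caller until the entire crawl has terminated, which is the concern for huge sites.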

