Posted to dev@nutch.apache.org by "Guenter, Matthias" <Ma...@ipi.ch> on 2006/01/24 11:08:19 UTC

Two possible extensions

Hi
Would it be of interest to the project to have an extension of crawl that allows:
- shaping the inbound bandwidth used
- keeping the number of requests per second within a certain limit
- scheduling different limits for working hours and night-time (see the sketch below)
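
A minimal sketch of how the per-second limit with a working-hours/night schedule might look, assuming a simple standalone limiter (all class and method names here are hypothetical, not existing Nutch APIs):

  import java.util.Calendar;

  /** Sketch only: caps requests per second, with a different cap at night.
   *  Hypothetical class, not an existing Nutch API. */
  public class ScheduledRateLimiter {
      private final int dayMaxPerSecond;    // cap during working hours
      private final int nightMaxPerSecond;  // cap outside working hours
      private long windowStart = System.currentTimeMillis();
      private int requestsInWindow = 0;

      public ScheduledRateLimiter(int dayMax, int nightMax) {
          this.dayMaxPerSecond = dayMax;
          this.nightMaxPerSecond = nightMax;
      }

      private int currentLimit() {
          int hour = Calendar.getInstance().get(Calendar.HOUR_OF_DAY);
          boolean workingHours = hour >= 8 && hour < 18;  // assumed schedule
          return workingHours ? dayMaxPerSecond : nightMaxPerSecond;
      }

      /** Blocks until another request may be issued. */
      public synchronized void acquire() throws InterruptedException {
          long now = System.currentTimeMillis();
          if (now - windowStart >= 1000) {          // new one-second window
              windowStart = now;
              requestsInWindow = 0;
          }
          if (requestsInWindow >= currentLimit()) { // window full: wait it out
              Thread.sleep(1000 - (now - windowStart));
              windowStart = System.currentTimeMillis();
              requestsInWindow = 0;
          }
          requestsInWindow++;
      }
  }

A fetcher thread would call acquire() before each request; inbound bandwidth shaping could reuse the same window logic with a byte budget in place of a request count.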

And an extension that crawls only file:/http: requests which have changed after a given date.
Something like: sh ./nutch crawl -changedafter="2006-01-04"?
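
For the -changedafter idea, one plausible approach (a sketch with hypothetical names, not existing Nutch code) is to issue an HTTP HEAD request and compare the server's Last-Modified header against the cut-off date:

  import java.net.HttpURLConnection;
  import java.net.URL;
  import java.text.SimpleDateFormat;
  import java.util.Date;

  /** Sketch of a -changedafter check; hypothetical helper class. */
  public class ChangedAfterFilter {
      private final Date cutoff;

      public ChangedAfterFilter(String yyyyMmDd) throws Exception {
          this.cutoff = new SimpleDateFormat("yyyy-MM-dd").parse(yyyyMmDd);
      }

      /** True if the resource reports a change after the cut-off date.
       *  Servers that send no Last-Modified header are fetched anyway. */
      public boolean shouldFetch(String url) throws Exception {
          HttpURLConnection conn =
              (HttpURLConnection) new URL(url).openConnection();
          conn.setRequestMethod("HEAD");              // headers only, no body
          long lastModified = conn.getLastModified(); // 0 if header missing
          conn.disconnect();
          return lastModified == 0 || new Date(lastModified).after(cutoff);
      }
  }

For file: URLs the equivalent check would be java.io.File.lastModified().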

The code could be delivered by the end of April as part of a student project.

Kind regards

Matthias Günter


Re: Two possible extensions

Posted by Stefan Groschupf <sg...@media-style.com>.
Hi.
Check the mail archive; some of these things have already been discussed,
and I guess people already have some code/plans, but it is not yet
part of the sources.
In any case, such contributions are very welcome from my point of view.

Stefan




Re: Two possible extensions

Posted by Andrzej Bialecki <ab...@getopt.org>.
Guenter, Matthias wrote:
> Hi
> Would it be of interest to the project to have an extension of crawl that allows:
> - shaping the inbound bandwidth used
> - keeping the number of requests per second within a certain limit
> - scheduling different limits for working hours and night-time
>

I'm assuming we are talking about the SVN trunk/ (other branches are in 
maintenance mode only, no new features). With the current trunk/ being 
based on map-reduce, I think this would require something like a central 
"lock manager", which would come in very handy for other plugins, too. E.g. 
the protocol plugins currently don't split the fetchlists (i.e. fetching 
is performed by a single task) because they have no way to coordinate 
access to target hosts among distributed fetching tasks.
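
A rough sketch of what such a lock manager's contract could look like (a hypothetical interface, not existing Nutch code):

  /** Hypothetical contract for a central lock manager.
   *  Distributed fetch tasks would acquire a per-host lease before fetching,
   *  so per-host politeness limits hold even when the fetchlist is split. */
  public interface HostLockManager {

      /** Blocks until this task may contact the given host, then grants
       *  a lease for at most leaseMillis. */
      void lockHost(String host, long leaseMillis) throws InterruptedException;

      /** Releases the host so other fetch tasks may proceed. */
      void unlockHost(String host);
  }

Each fetch task would wrap its requests to a host in lockHost()/unlockHost(), with the lease timeout guarding against tasks that die while holding a lock.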

> And an extension that crawls only file:/http: requests which have 
> changed after a given date.
>

Please see the code in NUTCH-61.

> Something like: sh ./nutch crawl -changedafter="2006-01-04"?
>
> The code could be delivered by the end of April as part of a student project.
>   

It certainly sounds interesting. However, I think it's essential for 
acceptance by the community, and for general usefulness, that this be 
coordinated with the existing efforts and discussed on the mailing lists.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com