Posted to dev@nutch.apache.org by "Guenter, Matthias" <Ma...@ipi.ch> on 2006/01/24 11:08:19 UTC
Two possible extensions
Hi
Would it be of interest to the project to have an extension of crawl that allows:
- shaping the inbound bandwidth used
- keeping the number of requests per second within a certain limit
- scheduling these limits differently for working hours and night-time
And an extension that crawls only file:/http: URLs which have changed after a given date.
Something like sh ./nutch crawl -changedafter="2006-01-04"?
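To make the rate-limiting idea concrete, here is a minimal token-bucket sketch in Java (a hypothetical illustration only - the class name, parameters, and API are assumptions, not existing Nutch code). A fetcher would call tryAcquire() before each request; the working-hours/night schedule could simply swap in a different ratePerSecond:

```java
// Hypothetical token-bucket rate limiter (illustration, not Nutch code).
// Tokens refill continuously at ratePerSecond, up to capacity; each
// request consumes one token, so sustained throughput is capped.
public class TokenBucket {
    private final double ratePerSecond;
    private final double capacity;
    private double tokens;
    private long lastRefillNanos;

    public TokenBucket(double ratePerSecond, double capacity) {
        this.ratePerSecond = ratePerSecond;
        this.capacity = capacity;
        this.tokens = capacity;            // start full
        this.lastRefillNanos = System.nanoTime();
    }

    // Add tokens for the time elapsed since the last refill.
    private void refill() {
        long now = System.nanoTime();
        double elapsedSec = (now - lastRefillNanos) / 1e9;
        tokens = Math.min(capacity, tokens + elapsedSec * ratePerSecond);
        lastRefillNanos = now;
    }

    /** Take one token if available; false means "slow down". */
    public synchronized boolean tryAcquire() {
        refill();
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

The same mechanism limits bandwidth if tokens are counted in bytes instead of requests.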
The code could be delivered end of April as part of a student project.
Kind regards
Matthias Günter
Re: Two possible extensions
Posted by Stefan Groschupf <sg...@media-style.com>.
Hi.
Check the mail archive; some of these things have already been discussed,
and I guess people already have some code / plans, but it is not yet
part of the sources.
In any case, such contributions are very welcome from my point of view.
Stefan
Re: Two possible extensions
Posted by Andrzej Bialecki <ab...@getopt.org>.
Guenter, Matthias wrote:
> Hi
> Would it be of interest to the project to have an extension of crawl that allows:
> - shaping the inbound bandwidth used
> - keeping the number of requests per second within a certain limit
> - scheduling these limits differently for working hours and night-time
>
Assuming we are talking about the SVN trunk/ (the other branches are in
maintenance mode only, with no new features): with the current trunk/ being
based on map-reduce, I think this would require something like a central
"lock manager" - this would come in very handy for other plugins, too. E.g.
the protocol plugins currently don't split the fetchlists (i.e. fetching
is performed by a single task) because they have no way to coordinate
access to the target hosts among the distributed fetching tasks.
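To illustrate the coordination problem, a minimal in-process per-host lock manager might look like this (an assumed design sketch - the class and method names are invented here, and a real map-reduce deployment would need a distributed equivalent of this service):

```java
// Hypothetical per-host lock manager (assumed design, not existing Nutch
// code): a fetch task asks for a lease on a host before fetching from it,
// so only one task at a time hammers any given host.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HostLockManager {
    // host -> id of the task currently holding the lease
    private final Map<String, String> leases = new ConcurrentHashMap<>();

    /** Grant the lease if the host is free or already held by this task. */
    public boolean acquire(String host, String taskId) {
        return taskId.equals(leases.computeIfAbsent(host, h -> taskId));
    }

    /** Release the lease, but only if this task actually holds it. */
    public void release(String host, String taskId) {
        leases.remove(host, taskId);
    }
}
```

With such a service, the fetchlist could be split across tasks and each task would simply skip (or requeue) hosts whose lease it cannot obtain.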
> And an extension that crawls only file:/http: URLs which have changed after a given date.
>
Please see the code in NUTCH-61.
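For readers without access to that issue, the underlying HTTP mechanism is the If-Modified-Since header: the fetcher sends the cutoff date and skips any page the server reports as 304 Not Modified. A minimal sketch (the class and method names here are hypothetical, not the NUTCH-61 code):

```java
// Hypothetical changed-after check (illustration, not the NUTCH-61 code):
// parse the -changedafter date, then probe the URL with If-Modified-Since.
import java.net.HttpURLConnection;
import java.net.URL;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

public class ChangedAfterCheck {

    /** Parse a date like "2006-01-04" into epoch milliseconds (UTC). */
    public static long parseDate(String yyyyMmDd) throws Exception {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.parse(yyyyMmDd).getTime();
    }

    /** Returns false only if the server answers 304 Not Modified. */
    public static boolean modifiedSince(String url, long sinceEpochMillis)
            throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("HEAD");             // headers only, no body
        conn.setIfModifiedSince(sinceEpochMillis); // sends If-Modified-Since
        int status = conn.getResponseCode();
        conn.disconnect();
        return status != HttpURLConnection.HTTP_NOT_MODIFIED;
    }
}
```

Note that this only works for servers that honor If-Modified-Since; file: URLs would be handled by comparing the filesystem's last-modified timestamp instead.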
> Something like sh ./nutch crawl -changedafter="2006-01-04"?
>
> The code could be delivered end of April as part of a student project.
>
Certainly it sounds interesting. However, I think it is essential for
acceptance by the community, and for its general usefulness, that this work
be coordinated with the existing efforts and discussed on the mailing lists.
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  || |   Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com