Posted to dev@nutch.apache.org by "Andrzej Bialecki (JIRA)" <ji...@apache.org> on 2006/05/27 22:47:30 UTC

[jira] Commented: (NUTCH-289) CrawlDatum should store IP address

    [ http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12413604 ] 

Andrzej Bialecki  commented on NUTCH-289:
-----------------------------------------

I'm not sure how to address round-robin DNS with your approach ...

Also, I think the best place to resolve and record the IPs is in the fetcher, because it has to resolve each host anyway. The drawback is that the generator won't know a URL's IP until the cycle after its first fetch, but the load on DNS will be much lower and more evenly distributed.
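
To make the fetcher-side idea concrete, here is a minimal sketch, not actual Nutch code. The class FetchTimeIpRecorder and its methods are hypothetical names invented for illustration; only java.net.InetAddress and java.net.URI are real JDK APIs. It assumes the fetcher asks for the IP right before or after fetching a URL and stores the returned value with the entry:

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of Nutch: resolves each host at most once per
// fetch run and remembers the answer so it can be recorded with the entry.
public class FetchTimeIpRecorder {
    private final Map<String, String> ipByHost = new HashMap<>();
    private final Function<String, String> resolver;
    private int lookups = 0;

    // The resolver is injectable so the sketch can be exercised without DNS.
    public FetchTimeIpRecorder(Function<String, String> resolver) {
        this.resolver = resolver;
    }

    // Default resolver backed by the JDK; returns null when the lookup fails.
    public static String resolveWithJdk(String host) {
        try {
            return InetAddress.getByName(host).getHostAddress();
        } catch (UnknownHostException e) {
            return null;
        }
    }

    // Returns the IP to record for this URL, or null if it is unknown.
    public String ipFor(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return null;
        }
        host = host.toLowerCase(Locale.ROOT);
        if (ipByHost.containsKey(host)) {
            return ipByHost.get(host); // cached, including cached failures
        }
        lookups++;
        String ip = resolver.apply(host);
        ipByHost.put(host, ip);
        return ip;
    }

    // Number of resolver calls actually made (one per distinct host).
    public int lookupCount() {
        return lookups;
    }
}
```

A caller would construct it with FetchTimeIpRecorder::resolveWithJdk. Note that this pins one address per host for the duration of a run, which sidesteps rather than answers the round-robin DNS question above: a later run may record a different address for the same host.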

> CrawlDatum should store IP address
> ----------------------------------
>
>          Key: NUTCH-289
>          URL: http://issues.apache.org/jira/browse/NUTCH-289
>      Project: Nutch
>         Type: Bug

>   Components: fetcher
>     Versions: 0.8-dev
>     Reporter: Doug Cutting

>
> If the CrawlDatum stored the IP address of the host of its URL, then one could:
> - partition fetch lists on the basis of IP address, for better politeness;
> - truncate pages to fetch per IP address, rather than just hostname.  This would be a good way to limit the impact of domain spammers.
> The IP addresses could be resolved when a CrawlDatum is first created for a new outlink, or perhaps during CrawlDB update.
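
The two uses listed in the issue description can be sketched as follows, assuming each URL already has a recorded IP. This is an illustration only: IpFetchListSketch, partitionFor and truncatePerIp are hypothetical names, not Nutch APIs, and hashing the IP string is just one possible partitioning scheme.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Nutch code: groups and caps URLs by recorded IP.
public class IpFetchListSketch {

    // All URLs that share an IP map to the same partition, so one fetch
    // task can enforce politeness delays for that server.
    public static int partitionFor(String ip, int numPartitions) {
        // Mask the sign bit so the result is never negative.
        return (ip.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Keeps at most maxPerIp URLs for each IP, in the map's iteration order.
    // Many hostnames pointing at one IP therefore share a single budget.
    public static List<String> truncatePerIp(Map<String, String> ipByUrl, int maxPerIp) {
        Map<String, Integer> countByIp = new HashMap<>();
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, String> entry : ipByUrl.entrySet()) {
            int count = countByIp.merge(entry.getValue(), 1, Integer::sum);
            if (count <= maxPerIp) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }
}
```

Passing a LinkedHashMap (or a map ordered by score) keeps the choice of which URLs survive the cap deterministic.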

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira