Posted to dev@nutch.apache.org by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2007/06/27 08:39:27 UTC

[jira] Commented: (NUTCH-289) CrawlDatum should store IP address

    [ https://issues.apache.org/jira/browse/NUTCH-289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508445 ] 

Doğacan Güney commented on NUTCH-289:
-------------------------------------

It seems this issue has kind of died down, but this would be a great feature to have. 

Here is how I think we can do this one (my proposal is _heavily_ based on Stefan Groschupf's work):

* Add an IP field to CrawlDatum

* Fetcher always resolves the IP and stores it in crawl_fetch (even if the CrawlDatum already has one).

* An IpAddressResolver tool, similar to Stefan's, that reads crawl_fetch, crawl_parse (and probably crawldb) and (optionally) runs before updatedb. 
  - map: <url, CrawlDatum> -> <host of url, <url, CrawlDatum>>. Add a field to CrawlDatum's metadata indicating where it comes from (crawldb, crawl_fetch, or crawl_parse); this field is removed in reduce. No lookup is performed in map().

  - reduce: <host, list(<url, CrawlDatum>)> -> <url, CrawlDatum>. If any CrawlDatum already contains an IP address (IP addresses in crawl_fetch take precedence over ones in crawldb), then output all crawl_parse datums with this IP address. Otherwise, perform a lookup. This way, we will not have to resolve the IP for most URLs (in a way, we still get the benefits of the JVM cache :).

A downside of this approach is that we will either have to read crawldb twice or perform IP lookups for hosts that appear in crawldb (but not in crawl_fetch).

* Use the cached IP during generation, if it exists.
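To make the reduce-side precedence rule concrete, here is a minimal sketch in plain Java. It stands in for the real Hadoop reduce() over all datums of one host; the Datum class and its field names are hypothetical stand-ins for Nutch's CrawlDatum plus the proposed source marker, not actual Nutch API.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for CrawlDatum with the proposed metadata field.
class Datum {
    String url;
    String source; // "crawldb", "crawl_fetch", or "crawl_parse"
    String ip;     // null if not yet resolved
    Datum(String url, String source, String ip) {
        this.url = url; this.source = source; this.ip = ip;
    }
}

public class IpResolveSketch {
    // Reduce step for one host: reuse an already-known IP, with crawl_fetch
    // taking precedence over crawldb. Only if no datum carries an IP would
    // we fall back to a single DNS lookup for the whole host.
    static String resolveHostIp(List<Datum> datums) {
        String fromDb = null;
        for (Datum d : datums) {
            if (d.ip == null) continue;
            if ("crawl_fetch".equals(d.source)) return d.ip; // highest precedence
            if ("crawldb".equals(d.source)) fromDb = d.ip;
        }
        if (fromDb != null) return fromDb;
        // Fallback (one lookup per host), e.g.:
        // return InetAddress.getByName(host).getHostAddress();
        return null;
    }

    public static void main(String[] args) {
        List<Datum> perHost = Arrays.asList(
            new Datum("http://example.org/a", "crawldb",     "10.0.0.1"),
            new Datum("http://example.org/b", "crawl_fetch", "10.0.0.2"),
            new Datum("http://example.org/c", "crawl_parse", null));
        // The crawl_fetch IP wins; every crawl_parse datum for this host
        // would then be emitted with it, with no new lookup.
        System.out.println(resolveHostIp(perHost)); // prints 10.0.0.2
    }
}
```

Because all datums for a host land in one reduce call, at most one lookup per host is ever needed, which is where the savings over per-URL resolution come from.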


> CrawlDatum should store IP address
> ----------------------------------
>
>                 Key: NUTCH-289
>                 URL: https://issues.apache.org/jira/browse/NUTCH-289
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8
>            Reporter: Doug Cutting
>         Attachments: ipInCrawlDatumDraftV1.patch, ipInCrawlDatumDraftV4.patch, ipInCrawlDatumDraftV5.1.patch, ipInCrawlDatumDraftV5.patch
>
>
> If the CrawlDatum stored the IP address of the host of its URL, then one could:
> - partition fetch lists on the basis of IP address, for better politeness;
> - truncate pages to fetch per IP address, rather than just hostname.  This would be a good way to limit the impact of domain spammers.
> The IP addresses could be resolved when a CrawlDatum is first created for a new outlink, or perhaps during CrawlDB update.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.