Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2014/01/23 16:15:37 UTC

[jira] [Updated] (NUTCH-1713) IpAddressResolver and DNSCache

     [ https://issues.apache.org/jira/browse/NUTCH-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-1713:
----------------------------------------

    Attachment: NUTCH-1713-trunk.patch

Patch contributed by [~wal]. I had forgotten to open a new issue for this contribution, so I am filing one now so that it does not get lost and anyone who would like to use it can find it.

> IpAddressResolver and DNSCache
> ------------------------------
>
>                 Key: NUTCH-1713
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1713
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Lewis John McGibbney
>            Priority: Minor
>             Fix For: 2.3, 1.8
>
>         Attachments: NUTCH-1713-trunk.patch
>
>
> Hi Lewis,
> as mentioned in the mail I sent you, I am attaching my patch for storing IP addresses in apache-nutch-1.5.1.
> ( https://issues.apache.org/jira/browse/NUTCH-289 might also be appropriate!)
> In our project MIA (http://mia-marktplatz.de/) we crawl the German web. To stay polite, we had to switch to a 'byIP' policy to guarantee an interval of at least one minute between requests to the same server. Crawling 'byHost' was not an option, because many sites use up to several thousand subdomains hosted on a single server with one IP address.
> As our crawl proceeded, I noticed that crawling by IP slowed down, because while generating the URL lists Nutch has to resolve the IP address of every URL in order to build the per-IP queues.
> The patch takes a simple approach: once an IP address has been determined, it is written into the metadata field of the CrawlDatum object. After a crawl cycle has finished its fetch job, an additional map-reduce job determines the IP addresses of newly fetched and parsed URLs. New URLs are inserted into the crawldb together with their IP address whenever one could be determined.
> The patch also adds two classes, IpAddressResolver.java and DNSCache.java, which cache IP addresses already obtained from DNS and limit the number of concurrent DNS calls made by each map job. Since many URLs sharing an IP address end up in the same queue, I wanted to minimize the work needed to build the queues, and caching IP addresses in memory should not consume much memory. To avoid too many concurrent DNS requests from the crawler, I added code that restricts the number of parallel lookups.
> I have been using this code in production for about three quarters of a year and it seems to work fine. The four configuration entries should be self-explanatory.
> Cheers, Walter
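
For readers without the attachment, the cache-plus-throttle idea described above can be sketched roughly as follows. This is a minimal illustration of the technique, not the code from NUTCH-1713-trunk.patch; the class and method names are hypothetical, and the real patch integrates with Nutch's CrawlDatum metadata rather than a bare map.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical sketch: remember resolved addresses in memory and cap the
// number of DNS lookups that may run concurrently within one map task.
public class DnsCacheSketch {
    private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
    private final Semaphore dnsPermits;

    public DnsCacheSketch(int maxConcurrentLookups) {
        this.dnsPermits = new Semaphore(maxConcurrentLookups);
    }

    /** Returns the IP for host, consulting the cache first; null if unresolvable. */
    public String resolve(String host) {
        String ip = cache.get(host);
        if (ip != null) {
            return ip;                   // cache hit: no DNS traffic at all
        }
        try {
            dnsPermits.acquire();        // throttle concurrent DNS calls
            try {
                ip = InetAddress.getByName(host).getHostAddress();
            } finally {
                dnsPermits.release();
            }
        } catch (UnknownHostException e) {
            return null;                 // caller can fall back to by-host queueing
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
        cache.putIfAbsent(host, ip);
        return cache.get(host);
    }
}
```

A resolver built this way keeps repeated lookups for the thousands of subdomains on one server down to a single DNS round trip, which is the load reduction the description above is after.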



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)