
Posted to user@nutch.apache.org by Markus Jelsma <ma...@openindex.io> on 2016/02/01 11:18:22 UTC

RE: DNS caching best practices

Otis - we tried local DNS caching when we ran very large-scale crawls, but got rid of it as soon as possible because it added too much overhead. Instead, we relied on an apparently powerful DNS server made available by the ISP in the network center. If the server is fast and has a lot of RAM, the mapper won't quickly overwhelm it.

Markus
 
 
-----Original message-----
> From:Otis Gospodnetić <ot...@gmail.com>
> Sent: Sunday 31st January 2016 23:36
> To: Nutch User List <nu...@lucene.apache.org>
> Subject: DNS caching best practices
> 
> Hi,
> 
> The first item on http://wiki.apache.org/nutch/OptimizingCrawls is DNS
> caching.  Is this still something people regularly do?  Even when running
> in EC2, which I assume has nameservers that are relatively close to
> instances doing crawling and nameserver lookups?
> 
> If so, are there any recommendations for the best DNS caching server/config
> to use?
> 
> Thanks,
> Otis
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>

Re: DNS caching best practices

Posted by Alexander Sibiryakov <si...@yandex.ru>.

Otis, Markus,
it depends on the speed at which you operate your crawler. If it's relatively slow, then it's fine to use the ISP's general-purpose DNS for it.

I think the information below could be useful, just to realize what kind of problems we cause for Internet infrastructure.

I was talking with one of the guys from https://selectel.ru/ (a huge cloud and hosting provider) who is responsible for their DNS service, and he said they built a dedicated DNS cache for various crawlers and bots, to help preserve the cache in their main DNS server. Before that, during the night (the crawlers' time!), the cache was changing significantly and causing slowdowns for typical users the next day.

His recommendation was to use http://unbound.net/ as a local caching DNS service, configured without an upstream, so that it resolves DNS recursively on its own. It even provides a way to dump the cache to disk and load it back.
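
For illustration, a setup along those lines might look roughly like the fragment below. This is only a sketch: the cache sizes, thread count, and dump file path are assumptions to be tuned for your own crawl, not values recommended in this thread.

```
# unbound.conf: illustrative sketch; sizes, thread count and paths are assumptions
server:
    # Listen only on localhost and answer only local clients (the crawler).
    interface: 127.0.0.1
    access-control: 127.0.0.0/8 allow
    num-threads: 2
    # Cache sizes; the rrset cache is commonly set to about twice the message cache.
    msg-cache-size: 256m
    rrset-cache-size: 512m
    # Refresh popular entries shortly before they expire.
    prefetch: yes

# No forward-zone is defined, so unbound resolves recursively
# starting from the root servers instead of asking an upstream resolver.

remote-control:
    # Needed for unbound-control (may also require running unbound-control-setup).
    control-enable: yes

# Dumping and reloading the cache (hypothetical file path):
#   unbound-control dump_cache > /var/tmp/unbound.cache
#   unbound-control load_cache < /var/tmp/unbound.cache
```

With something like this in place, the crawler hosts would point their resolver at 127.0.0.1.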

Linux itself has no built-in DNS cache, so a local cache makes sense if your crawler makes repeated requests to the same website.
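
To show why repeated requests to the same host benefit from caching, here is a minimal Python sketch of an in-process lookup cache. It is not how Nutch or unbound work internally; the class name `TTLDnsCache`, the fixed 300-second lifetime, and the injectable `resolver` and `clock` parameters are all assumptions made for the example. A real cache would honor each record's own TTL rather than a fixed one.

```python
import socket
import time


class TTLDnsCache:
    """Tiny in-process cache mapping a hostname to its list of IP addresses."""

    def __init__(self, resolver=None, ttl_seconds=300.0, clock=time.monotonic):
        # resolver: callable taking a hostname and returning a list of addresses.
        self._resolver = resolver or self._system_resolve
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # hostname -> (expiry_time, addresses)

    @staticmethod
    def _system_resolve(hostname):
        # Ask the system resolver; collect the unique addresses it returns.
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})

    def lookup(self, hostname):
        now = self._clock()
        entry = self._entries.get(hostname)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: no DNS query is sent
        addresses = self._resolver(hostname)
        self._entries[hostname] = (now + self._ttl, addresses)
        return addresses
```

A crawler fetching many pages from one site would call `cache.lookup(host)` before each fetch, so only the first call (and the first call after expiry) reaches the DNS server.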

A.
