Posted to user@nutch.apache.org by Martin Aesch <ma...@googlemail.com> on 2013/12/09 20:06:45 UTC

DNS setup and issues

Dear nutchers,

How is your DNS setup for your production cluster? Is DNS your bottleneck, too?

Currently, I am using BIND for my (pre-)production cluster. What I have learnt so far / plan to do:

1) Recursive resolver + caching: it takes a long time until names are initially resolved.

2) Forwarding + caching, using the DNS servers of my provider: their servers are too slow or are slowed down by the number of requests, and I get angry emails.

3) Next idea: switching to powerdns-recursor and using some scripts to warm up and fill the DNS cache. Unfortunately,
   the TTL is one hour or less (for some records only 5 minutes), which is not exactly helpful. This is also a problem for 1) and 2),
   because resolved, cached entries are quickly invalidated.
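
The warm-up idea in 3) could look roughly like the sketch below. It is only an illustration, not a script from this thread: the names `resolve` and `warm_cache` and the worker count are assumptions. It resolves each distinct hostname once through the system resolver, so a local caching resolver configured on the machine would store the answers before the crawl starts.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def resolve(host):
    """Resolve one hostname through the system resolver; return its addresses."""
    try:
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    except (OSError, UnicodeError):
        return []  # unresolvable or malformed names are simply skipped
    return sorted({info[4][0] for info in infos})


def warm_cache(hosts, resolver=resolve, workers=32):
    """Look up every distinct host once so a caching resolver stores the answers.

    `resolver` is injectable so the function can be tried without network access.
    """
    unique = sorted(set(hosts))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Consume the results inside the `with` block, while the pool is alive.
        return dict(zip(unique, pool.map(resolver, unique)))
```

Something like `warm_cache(line.strip() for line in open("hosts.txt"))` would run it over a host list (`hosts.txt` is a hypothetical file). As noted above, the benefit lasts only as long as the cached records' TTL, so a warm-up alone does not fix short TTLs.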

Any suggestions are greatly appreciated.

Martin




Re: DNS setup and issues

Posted by Martin Aesch <ma...@googlemail.com>.
Thanks Julien, thanks Markus,

it seems my provider is particularly picky: I was only sending about
100 queries/sec.

However, just for the record, I found a solution that works well for my
problem.

pdns-recursor allows setting the TTL of cached records explicitly (I set
it to one week), and I am seeing more and more cache hits. It seems to work
at first glance.
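
To illustrate why a forced TTL raises the hit rate, here is a minimal toy cache with a TTL floor. It is a sketch under stated assumptions, not pdns-recursor's implementation or configuration: the class `TtlFloorCache`, its fields, and the injectable `clock` are all made up for this example.

```python
import time


class TtlFloorCache:
    """Toy DNS cache that keeps every record for at least `min_ttl` seconds."""

    def __init__(self, min_ttl, clock=time.monotonic):
        self.min_ttl = min_ttl
        self.clock = clock  # injectable so expiry can be simulated
        self.entries = {}   # name -> (expiry time, addresses)
        self.hits = 0
        self.misses = 0

    def put(self, name, addresses, record_ttl):
        # The floor: a 300 s record is stored as if its TTL were min_ttl.
        ttl = max(record_ttl, self.min_ttl)
        self.entries[name] = (self.clock() + ttl, addresses)

    def get(self, name):
        entry = self.entries.get(name)
        if entry is not None and entry[0] > self.clock():
            self.hits += 1
            return entry[1]
        self.entries.pop(name, None)  # drop the expired entry, if any
        self.misses += 1
        return None
```

With `min_ttl` set to one week (604800 seconds), a record published with a 5-minute TTL is still served from the cache an hour later, whereas with `min_ttl=0` it would already have expired. The trade-off is that a host which changes its address during that week keeps resolving to the old one until the entry expires.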




Re: DNS setup and issues

Posted by Julien Nioche <li...@gmail.com>.
Hi Martin,

We used local DNS caches on the slave nodes when we were running the crawl
for SimilarPages (10+ billion pages in Crawldb) and IIRC were using some
external DNS servers, as the ones on EC2 at the time were not very robust and
they were getting quite angry with us. Can't quite remember what we used,
maybe the Google ones.

HTH

Julien




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

RE: DNS setup and issues

Posted by Markus Jelsma <ma...@openindex.io>.
Hi - when we did very large-scale web crawling (over 1000 pages per second for many millions of domains), we did not have issues with DNS. We did try local DNS caching tools, but they did not improve anything and, in our case, made things worse. We tried unscd; it may or may not help you.
 
 