Posted to dev@nutch.apache.org by Stefan Groschupf <sg...@media-style.com> on 2005/08/03 10:19:51 UTC

dns lookup cache?

Hi there,
does Nutch cache DNS lookups in any way?
I found this paper, and section 3.7 gives some very interesting
information.
We notice that our crawlers often crash after a series of unknown host
exceptions.
We already have one dual-CPU box with a 1 Gbit network connection
running BIND.

So I have two questions:
Do people think the Java domain lookup may be a bottleneck that
crashes the crawlers?
Other crawlers have some kind of DNS cache; would it make sense to
introduce one to Nutch as well?

Thanks for any comments.
Stefan
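
On the first question: independent of Nutch, the JVM keeps its own small
cache of lookups made through java.net.InetAddress, controlled by two
standard security properties. The snippet below is a minimal sketch of
tuning them. The class name JvmDnsCacheSettings and the TTL values used with
it are illustrative assumptions, not Nutch settings, and whether this
prevents the crashes described above is not verified here.

```java
import java.security.Security;

/** Hypothetical helper (not part of Nutch) for tuning the JVM's built-in DNS cache. */
final class JvmDnsCacheSettings {

    private JvmDnsCacheSettings() {
    }

    /**
     * Sets how long the JVM caches successful and failed host lookups, in seconds.
     * Both names are standard java.net security properties.
     */
    static void apply(int positiveTtlSeconds, int negativeTtlSeconds) {
        // Cache lifetime for hosts that resolved successfully.
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveTtlSeconds));
        // Cache lifetime for hosts that failed to resolve (unknown host).
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeTtlSeconds));
    }
}
```

A crawler would call something like JvmDnsCacheSettings.apply(3600, 60) once
at startup, before the first lookup, since the JVM typically reads these
properties only once.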


Re: dns lookup cache?

Posted by Jay Pound <we...@poundwebhosting.com>.
Run it on a non-PPC machine. I hate to say it, but I've benchmarked Nutch on
all platforms but SPARC, and it runs best on an AMD64 Opteron machine. I work
next to a print shop that has G4s and G5s, and they didn't fare well with
Nutch. If your problem is the DNS server, then try an Intel machine:
Microsoft's DNS server is very fast on Intel hardware. I never have CPU over
2% when rolling with 30 lookup queries a second (I do have a quad Xeon, though).
BIND is not as fast, unfortunately :(   Some day, though.
I had BIND crash on me when I was testing it. I set it up to restart, but it
would drop the pages between restarts.
(Maybe someday Java will take advantage of AltiVec and Nutch will scream
on PPC, then we can run it on our Xbox 360s!!!)
I'm not bashing Mac or PPC hardware, it's just that if you don't have a POWER5
then it's not going to do it well.

-Jay


----- Original Message ----- 
From: "Stefan Groschupf" <sg...@media-style.com>
To: <nu...@lucene.apache.org>
Sent: Wednesday, August 03, 2005 11:53 AM
Subject: Re: dns lookup cache?


> OK, I see what you mean. You run a DNS server inside your network,
> right?
> We run BIND 9.2.2 with a 1000 Mbit network connection to our crawler
> boxes.
> The box has 2 PPC CPUs and 2 GB of RAM; however, at peaks the CPUs
> are 50% busy just doing DNS lookups.
> However, I don't think the real problem is CPU power or memory.
> As the link I posted earlier today mentions, the BIND and Java lookup
> implementations are not multithreaded, and that's why I personally guess
> the Nutch crawler runs into a kind of deadlock.
> A good read is:
> http://buzzsurf.com/java/dns/
> Heritrix uses dnsjava, and I'm working on a similar, simple solution
> that may not need any code change, just some system administration,
> but uses a real DNS cache local to the crawler box.
> If people are interested I can post results.
>
> Stefan



Re: dns lookup cache?

Posted by Stefan Groschupf <sg...@media-style.com>.
OK, I see what you mean. You run a DNS server inside your network,
right?
We run BIND 9.2.2 with a 1000 Mbit network connection to our crawler
boxes.
The box has 2 PPC CPUs and 2 GB of RAM; however, at peaks the CPUs
are 50% busy just doing DNS lookups.
However, I don't think the real problem is CPU power or memory.
As the link I posted earlier today mentions, the BIND and Java lookup
implementations are not multithreaded, and that's why I personally guess
the Nutch crawler runs into a kind of deadlock.
A good read is:
http://buzzsurf.com/java/dns/
Heritrix uses dnsjava, and I'm working on a similar, simple solution
that may not need any code change, just some system administration,
but uses a real DNS cache local to the crawler box.
If people are interested I can post results.

Stefan
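
For comparison with a system-level cache, the in-process alternative can be
sketched as a small TTL cache in front of the Java lookup that also remembers
failed lookups, so repeated unknown hosts do not go back to the DNS server.
This is only an illustration under stated assumptions: CachingResolver and
its Lookup interface are hypothetical names, and this is neither Nutch code,
nor the dnsjava API, nor the system-administration solution described above.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-process DNS cache with positive and negative TTLs. */
final class CachingResolver {

    /** Pluggable lookup, so the cache can be exercised without a network. */
    interface Lookup {
        InetAddress resolve(String host) throws UnknownHostException;
    }

    private static final class Entry {
        final InetAddress address;   // null records a failed lookup
        final long expiresAtMillis;

        Entry(InetAddress address, long expiresAtMillis) {
            this.address = address;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();
    private final Lookup delegate;
    private final long ttlMillis;
    private final long negativeTtlMillis;

    CachingResolver(Lookup delegate, long ttlMillis, long negativeTtlMillis) {
        this.delegate = delegate;
        this.ttlMillis = ttlMillis;
        this.negativeTtlMillis = negativeTtlMillis;
    }

    /** Returns a cached address, or asks the delegate if the entry is missing or expired. */
    InetAddress resolve(String host) throws UnknownHostException {
        String key = host.toLowerCase(Locale.ROOT);   // host names are case-insensitive
        long now = System.currentTimeMillis();
        Entry entry = cache.get(key);
        if (entry == null || entry.expiresAtMillis <= now) {
            // Two threads may resolve the same host at once; the last result wins.
            entry = load(key, now);
            cache.put(key, entry);
        }
        if (entry.address == null) {
            throw new UnknownHostException(host);
        }
        return entry.address;
    }

    private Entry load(String key, long now) {
        try {
            return new Entry(delegate.resolve(key), now + ttlMillis);
        } catch (UnknownHostException e) {
            // Remember the failure for a while instead of retrying every time.
            return new Entry(null, now + negativeTtlMillis);
        }
    }
}
```

In a real crawler the delegate would typically be InetAddress::getByName;
the cache never shrinks in this sketch, so a production version would also
need eviction.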



On 03.08.2005 at 17:30, Jay Pound wrote:

> You set up your own DNS server on a separate machine from your crawling
> box. It doesn't have to be powerful; it can be a 500 MHz Pentium 3, but
> you need to have at least 512 MB of RAM in it, 1 GB recommended. You
> point your fetcher machine to the DNS server as its primary DNS server
> and presto, internal DNS caching!!!
> -J
> PS: the easiest DNS server to set up if you're a Windows person is
> Windows 2000 Server or Windows Server 2003; you just enable it and it
> runs. There are many DNS servers for Linux, most distributions come
> with one on CD, and Mac OS X Server has one also.


Re: dns lookup cache?

Posted by Jay Pound <we...@poundwebhosting.com>.
You set up your own DNS server on a separate machine from your crawling box. It
doesn't have to be powerful; it can be a 500 MHz Pentium 3, but you need to
have at least 512 MB of RAM in it, 1 GB recommended. You point your fetcher
machine to the DNS server as its primary DNS server and presto, internal DNS
caching!!!
-J
PS: the easiest DNS server to set up if you're a Windows person is Windows 2000
Server or Windows Server 2003; you just enable it and it runs. There are
many DNS servers for Linux, most distributions come with one on CD, and Mac OS X
Server has one also.
----- Original Message ----- 
From: "Stefan Groschupf" <sg...@media-style.com>
To: <nu...@lucene.apache.org>
Sent: Wednesday, August 03, 2005 11:05 AM
Subject: Re: dns lookup cache?


> How do you do 'internal' domain caching?
> Thanks.
> Stefan



Re: dns lookup cache?

Posted by Stefan Groschupf <sg...@media-style.com>.
How do you do 'internal' domain caching?
Thanks.
Stefan
On 03.08.2005 at 16:51, Jay Pound wrote:

> I've got a fast internal DNS cache, so Nutch won't need one, and it did
> stop a lot of the Nutch "host not found" / timeout errors. Most ISPs'
> DNS servers are already bogged down by client requests; if you dump
> 10000 clients' worth of DNS traffic on them they can break or not
> return results, so I made my own internal DNS server cache. The
> machine, a quad Xeon with 4 GB RAM, uses over 500 MB of RAM just for
> caching the domains in memory!!!
> -Jay


Re: dns lookup cache?

Posted by Jay Pound <we...@poundwebhosting.com>.
I've got a fast internal DNS cache, so Nutch won't need one, and it did stop a
lot of the Nutch "host not found" / timeout errors. Most ISPs' DNS servers
are already bogged down by client requests; if you dump 10000 clients' worth
of DNS traffic on them they can break or not return results, so I made my own
internal DNS server cache. The machine, a quad Xeon with 4 GB RAM, uses over
500 MB of RAM just for caching the domains in memory!!!
-Jay

----- Original Message ----- 
From: "Stefan Groschupf" <sg...@media-style.com>
To: <nu...@lucene.apache.org>
Sent: Wednesday, August 03, 2005 4:19 AM
Subject: dns lookup cache?


> Hi there,
> does Nutch cache DNS lookups in any way?
> I found this paper, and section 3.7 gives some very interesting
> information.
> We notice that our crawlers often crash after a series of unknown host
> exceptions.
> We already have one dual-CPU box with a 1 Gbit network connection
> running BIND.
>
> So I have two questions:
> Do people think the Java domain lookup may be a bottleneck that
> crashes the crawlers?
> Other crawlers have some kind of DNS cache; would it make sense to
> introduce one to Nutch as well?
>
> Thanks for any comments.
> Stefan
>
>



RE: dns lookup cache?

Posted by Chirag Chaman <de...@filangy.com>.
Stefan,

We have seen the crawler crashing, but have never been able to pinpoint why. We
made a "brute-force" (read: very inelegant) workaround. A script runs just
before the fetcher, removing all the domains that were unreachable/blocked in
the last few days, and populates the DNS with entries that are good. This
stopped the crashes and cut crawl time by half.

Given that we don't use the WebDB anymore, it's a very specific solution, but
one that has proved to be successful. Maybe someone can come up with a more
elegant solution based on our collective experience.
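
The first half of that workaround, dropping URLs on hosts that recently
failed before the fetcher runs, might look roughly like the sketch below. It
is not the actual script: FetchListFilter, its method names, and the idea of
passing the bad hosts in as a set are assumptions made for illustration, and
the step that pre-populates the DNS server is not shown.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical pre-fetch filter; not the script described above and not Nutch code. */
final class FetchListFilter {

    private FetchListFilter() {
    }

    /**
     * Returns the URLs whose host is not in badHosts.
     * badHosts is expected to hold lower-case host names that were
     * unreachable or blocked in recent crawls.
     */
    static List<String> dropUnreachable(List<String> urls, Set<String> badHosts) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            String host = hostOf(url);
            // URLs without a parsable host are dropped as well.
            if (host != null && !badHosts.contains(host)) {
                kept.add(url);
            }
        }
        return kept;
    }

    /** Extracts the lower-cased host of a URL, or null if it cannot be parsed. */
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }
}
```

The set of bad hosts could be collected from the unknown host exceptions
logged by earlier fetch runs, with entries expiring after a few days so
that hosts which come back are retried.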


-----Original Message-----
From: Stefan Groschupf [mailto:sg@media-style.com] 
Sent: Wednesday, August 03, 2005 4:20 AM
To: nutch-dev@lucene.apache.org
Subject: dns lookup cache?

Hi there,
does Nutch cache DNS lookups in any way?
I found this paper, and section 3.7 gives some very interesting information.
We notice that our crawlers often crash after a series of unknown host
exceptions.
We already have one dual-CPU box with a 1 Gbit network connection running
BIND.

So I have two questions:
Do people think the Java domain lookup may be a bottleneck that crashes the
crawlers?
Other crawlers have some kind of DNS cache; would it make sense to introduce
one to Nutch as well?

Thanks for any comments.
Stefan