Posted to user@nutch.apache.org by og...@yahoo.com on 2005/09/18 06:49:00 UTC

Re: [Nutch-general] UnknownHostException for known hosts

A shot in the dark, but I'd look at your DNS server and make sure it
keeps responding even when hit with a larger number of requests at
once. Maybe you are simply overwhelming it.

Otis
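
One way to test the "overwhelmed DNS server" theory outside of Nutch is to
fire the same host lookups from a pool of threads and count how many fail.
The sketch below is only an illustration: the class name DnsLoadCheck and
its methods are made up for this example and are not part of Nutch. It
relies on the standard java.net.InetAddress lookup, which throws the same
UnknownHostException seen in the fetcher log.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

// Hypothetical helper, not part of Nutch.
public class DnsLoadCheck {

    /** Returns true if the JVM's resolver can resolve the host right now. */
    public static boolean resolves(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    /**
     * Runs the given lookup for every host on a pool of `threads` workers
     * and returns the hosts whose lookup failed, in input order.
     */
    public static List<String> failedHosts(List<String> hosts, int threads,
                                           Predicate<String> lookup) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (String host : hosts) {
                Callable<Boolean> task = () -> lookup.test(host);
                results.add(pool.submit(task));
            }
            List<String> failed = new ArrayList<>();
            for (int i = 0; i < hosts.size(); i++) {
                boolean ok;
                try {
                    ok = results.get(i).get();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    ok = false;
                } catch (ExecutionException e) {
                    ok = false; // treat an unexpected error as a failed lookup
                }
                if (!ok) {
                    failed.add(hosts.get(i));
                }
            }
            return failed;
        } finally {
            pool.shutdown();
        }
    }

    // Usage: java DnsLoadCheck <threads> <host> [<host> ...]
    public static void main(String[] args) {
        int threads = Integer.parseInt(args[0]);
        List<String> hosts = Arrays.asList(args).subList(1, args.length);
        List<String> failed = failedHosts(hosts, threads, DnsLoadCheck::resolves);
        System.out.println(threads + " threads: " + failed.size() + " of "
                + hosts.size() + " lookups failed " + failed);
    }
}
```

The JVM caches lookup results for a while, so each thread count is best
tried in a fresh JVM. If the failure count climbs as the thread count goes
from 5 to 50 or 100, that would point at the resolver rather than at Nutch.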


--- AJ Chen <ca...@gmail.com> wrote:

> I injected 275 root URLs into a new webdb, but the fetch failed on most
> of them in the first segment due to UnknownHostException. This is
> unexpected because these URLs were pre-verified using Nutch in a series
> of smaller tests. Has anybody seen a large number of
> UnknownHostException errors?
> 
> I'm doing a deep crawl on selected sites.  I notice that when
> fetcher.threads.fetch is larger (50 or 100), there is a higher chance of
> getting more UnknownHostException errors. But this type of error also
> happens sometimes even when fetcher.threads.fetch is small (5 or 10).
> Any idea what's going on?
> 
> error message example:
> 050917 202418 fetching http://www.anaspec.com/
> 050917 202435 fetch of http://www.anaspec.com/ failed with: java.lang.Exception: java.net.UnknownHostException: www.anaspec.com
> 
> segment status example:
> 050917 204030 status: segment 20050917203714, 93 pages, 183 errors, 1814813 bytes, 195091 ms
> 050917 204030 status: 0.4767006 pages/s, 72.674934 kb/s, 19514.12 bytes/page
> 050917 204107 status: segment 20050917204033, 100 pages, 5 errors, 2104690 bytes, 31806 ms
> 050917 204107 status: 3.1440609 pages/s, 516.9745 kb/s, 21046.9 bytes/page
> 
> Relevant settings that differ from the defaults:
> fetcher.threads.fetch=20
> http.max.delays=100
> fetcher.server.delay=1
> 
> Thanks,
> -AJ
> 
> _______________________________________________
> Nutch-general mailing list
> Nutch-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nutch-general
> 


Re: [Nutch-general] UnknownHostException for known hosts

Posted by AJ Chen <an...@sbcglobal.net>.
My DNS server is assigned automatically by my ISP, so it might well be 
overwhelmed.  Is there a good way in Nutch to control how many new hosts 
are looked up on the DNS server at a time?
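
For what it's worth, the general idea of capping simultaneous lookups can
be sketched in plain Java with a semaphore. This is not a Nutch setting or
class; ThrottledResolver and its methods are hypothetical names used only
to illustrate the technique, and wiring something like it into the fetcher
would need changes that are not shown here.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.concurrent.Semaphore;

// Hypothetical helper, not part of Nutch: at most `maxConcurrentLookups`
// threads may be inside InetAddress.getByName at the same time.
public class ThrottledResolver {
    private final Semaphore permits;

    public ThrottledResolver(int maxConcurrentLookups) {
        this.permits = new Semaphore(maxConcurrentLookups);
    }

    /** Blocks until a permit is free, then performs the lookup. */
    public InetAddress resolve(String host)
            throws UnknownHostException, InterruptedException {
        permits.acquire();
        try {
            return InetAddress.getByName(host);
        } finally {
            permits.release(); // always return the permit, even on failure
        }
    }

    /** Number of lookups that could start right now without blocking. */
    public int availablePermits() {
        return permits.availablePermits();
    }
}
```

With a limit of, say, 5, even 100 fetcher threads would send no more than
5 lookups to the resolver at any moment; the other threads simply wait.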

Once a URL generates an UnknownHostException, it is lost forever, 
right?  Not even a retry!  If so, I need a way to recover the injected 
URLs that are valid but generated an UnknownHostException. Any 
suggestions?
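
One possible workaround, independent of how Nutch itself treats these
URLs, is to pull the failed URLs back out of the fetcher log so they can
be injected again. The sketch below assumes the log lines look like the
"fetch of ... failed with: ..." example quoted in this thread;
FailedUrlExtractor is a made-up name, not a Nutch tool.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
public class FailedUrlExtractor {

    // Matches "fetch of <url> failed with: [<wrapper>: ]java.net.UnknownHostException".
    // \s* also matches a line break, in case the message is wrapped.
    private static final Pattern FAILED = Pattern.compile(
            "fetch of (\\S+) failed with:\\s*(?:\\S+\\s+)?java\\.net\\.UnknownHostException");

    /** Returns the distinct URLs that failed with UnknownHostException, in log order. */
    public static Set<String> unknownHostUrls(String logText) {
        Set<String> urls = new LinkedHashSet<>();
        Matcher m = FAILED.matcher(logText);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    // Usage: java FailedUrlExtractor <fetcher-log-file> > retry-urls.txt
    public static void main(String[] args) throws IOException {
        String log = new String(Files.readAllBytes(Paths.get(args[0])),
                StandardCharsets.UTF_8);
        for (String url : unknownHostUrls(log)) {
            System.out.println(url);
        }
    }
}
```

The resulting list has one URL per line, so it could be fed back through
the same injection step that was used for the original 275 root URLs.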

AJ



-- 
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting 
Marketing * BD * Software Development
748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, anjun.chen@sbcglobal.net
---------------------------------------------------