Posted to user@nutch.apache.org by AJ Chen <ca...@gmail.com> on 2005/09/18 06:04:13 UTC

UnknownHostException for known hosts

I injected 275 root URLs into a new webdb, but the fetch failed on most of 
them in the first segment due to UnknownHostException. This is unexpected 
because these URLs were pre-verified with Nutch in a series of smaller 
tests. Has anybody seen a large number of UnknownHostException errors?

I'm doing a deep crawl on selected sites.  I notice that when 
fetcher.threads.fetch is larger (50 or 100), there is a higher chance of 
getting UnknownHostException errors. But this type of error also happens 
sometimes even when fetcher.threads.fetch is small (5 or 10).  Any idea 
what's going on?

error message example:
050917 202418 fetching http://www.anaspec.com/
050917 202435 fetch of http://www.anaspec.com/ failed with: java.lang.Exception: java.net.UnknownHostException: www.anaspec.com

segment status example:
050917 204030 status: segment 20050917203714, 93 pages, 183 errors, 1814813 bytes, 195091 ms
050917 204030 status: 0.4767006 pages/s, 72.674934 kb/s, 19514.12 bytes/page
050917 204107 status: segment 20050917204033, 100 pages, 5 errors, 2104690 bytes, 31806 ms
050917 204107 status: 3.1440609 pages/s, 516.9745 kb/s, 21046.9 bytes/page

Relevant settings that differ from the defaults:
fetcher.threads.fetch=20
http.max.delays=100
fetcher.server.delay=1

Thanks,
-AJ



Re: [Nutch-general] UnknownHostException for known hosts

Posted by AJ Chen <an...@sbcglobal.net>.
The DNS server is obtained automatically from my ISP, so it might be 
overwhelmed.  Is there a good way in Nutch to control how many new hosts 
are requested from the DNS server at once?
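
One way to test the "overwhelmed DNS" theory outside of Nutch is to resolve the 
host names with a capped number of concurrent lookups and a pause between 
retries. The sketch below is not a Nutch feature: the class ThrottledResolver, 
the Lookup hook, and all parameter values are hypothetical names and numbers 
chosen for illustration. Only InetAddress.getByName is a standard JDK call. 
The pause matters because the JVM may cache failed lookups for a short time 
(the networkaddress.cache.negative.ttl security property), so an immediate 
retry may just return the cached failure.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class ThrottledResolver {
    /** Hypothetical hook so the real DNS lookup can be replaced in tests. */
    interface Lookup {
        void resolve(String host) throws UnknownHostException;
    }

    /**
     * Resolves each host with at most maxConcurrent lookups in flight and
     * returns the hosts that still failed after the given number of attempts.
     */
    static List<String> unresolved(List<String> hosts, int maxConcurrent,
                                   int attempts, long pauseMillis, Lookup lookup) {
        // A fixed-size pool caps how many lookups run at the same time.
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        List<String> failed = Collections.synchronizedList(new ArrayList<String>());
        for (String host : hosts) {
            pool.execute(() -> {
                for (int i = 0; i < attempts; i++) {
                    try {
                        lookup.resolve(host);
                        return; // resolved, nothing to record
                    } catch (UnknownHostException e) {
                        if (!pause(pauseMillis)) {
                            break; // interrupted while waiting, give up on this host
                        }
                    }
                }
                failed.add(host);
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(failed);
    }

    /** Sleeps for the given time; returns false if the thread was interrupted. */
    private static boolean pause(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed values: 5 concurrent lookups, 3 attempts, 15 s between attempts.
        List<String> failed =
            unresolved(Arrays.asList(args), 5, 3, 15_000L, InetAddress::getByName);
        for (String host : failed) {
            System.out.println(host);
        }
    }
}
```

If hosts such as www.anaspec.com resolve reliably at a low concurrency but 
fail at a high one, that would point at the resolver rather than at the sites.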

Once a URL generates UnknownHostException, it is lost forever, right?  
Not even a retry!  If so, I need a way to recover the injected URLs that 
are valid but generated UnknownHostException. Any suggestions?
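
As a stopgap for recovering those URLs, the fetcher log itself can be mined 
for them. The sketch below assumes the log format shown in the error message 
example from the original post ("fetch of <url> failed with:" followed by the 
exception text on the same or the next line). The class name FailedHostUrls 
and the idea of feeding its output back to the injector are assumptions for 
illustration, not something Nutch provides.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FailedHostUrls {
    // Matches fetcher log lines such as:
    //   050917 202435 fetch of http://www.anaspec.com/ failed with: ...
    private static final Pattern FAILED =
        Pattern.compile("fetch of (\\S+) failed with:");

    /**
     * Returns the URLs whose failure reason mentions UnknownHostException,
     * in log order and without duplicates.
     */
    static List<String> extract(List<String> logLines) {
        Set<String> urls = new LinkedHashSet<>();
        String pending = null; // URL whose failure reason may be on the next line
        for (String line : logLines) {
            Matcher m = FAILED.matcher(line);
            boolean isFailureLine = m.find();
            if (isFailureLine) {
                pending = m.group(1);
            }
            if (pending != null && line.contains("UnknownHostException")) {
                urls.add(pending);
                pending = null;
            } else if (!isFailureLine) {
                pending = null; // a different failure reason, or an unrelated line
            }
        }
        return new ArrayList<>(urls);
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a saved fetcher log (hypothetical file name chosen by the user)
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        for (String url : extract(lines)) {
            System.out.println(url);
        }
    }
}
```

The printed list could be saved to a plain text file, one URL per line, and 
injected again once the DNS problem is sorted out.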

AJ


ogjunk-nutch@yahoo.com wrote:

>A shot in the dark, but I'd look at your DNS server and make sure it
>keeps responding even when hit with a larger number of requests at
>once.
>Maybe you are simply overwhelming it.
>
>Otis
>
>

-- 
AJ (Anjun) Chen, Ph.D.
Canova Bioconsulting 
Marketing * BD * Software Development
748 Matadero Ave., Palo Alto, CA 94306, USA
Cell 650-283-4091, anjun.chen@sbcglobal.net
---------------------------------------------------

Re: [Nutch-general] UnknownHostException for known hosts

Posted by og...@yahoo.com.
A shot in the dark, but I'd look at your DNS server and make sure it
keeps responding even when hit with a larger number of requests at
once.
Maybe you are simply overwhelming it.

Otis

