Posted to user@nutch.apache.org by adu <du...@hzduozhun.com> on 2014/08/08 05:03:51 UTC

How to reduce the unfetched urls?

Hi all,
I used 10000 URLs as seeds and crawled with depth 1, but only 2000 URLs
were fetched.

I have checked the URL filter, and I can't find any log entries about HTTP
connection failures. Are there any settings in nutch-default.xml that I
should look at? Thanks in advance for your help.

Re: How to reduce the unfetched urls?

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

Are the unfetched URLs marked as such (status db_unfetched)?
You can check this using
  $NUTCH_HOME/bin/nutch readdb
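
As a minimal sketch of that check, assuming Nutch 1.x command syntax: the -stats option of readdb prints URL counts per status. The crawl db path and the fallback NUTCH_HOME below are placeholders, not values from this thread.

```shell
# Assumptions: Nutch 1.x command syntax; both defaults below are placeholder paths.
NUTCH_HOME="${NUTCH_HOME:-/opt/nutch}"
CRAWLDB="${CRAWLDB:-crawl/crawldb}"

# Print per-status URL counts (db_unfetched, db_fetched, ...) for the crawl db.
crawldb_stats() {
  "$NUTCH_HOME/bin/nutch" readdb "$CRAWLDB" -stats
}

# Only run when a Nutch installation is actually present.
if [ -x "$NUTCH_HOME/bin/nutch" ]; then
  crawldb_stats
else
  echo "nutch not found under $NUTCH_HOME; skipping" >&2
fi
```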

If yes, and since exactly 2000 URLs are fetched,
it's more likely a problem with
 -topN  <size_of_fetch_list>
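
For illustration only, and again assuming Nutch 1.x: -topN caps how many URLs the generate step puts on the fetch list, so a value below the number of seeds leaves the rest unfetched. The paths and the value 10000 below are assumptions chosen to match the seed count mentioned in the question.

```shell
# Assumptions: Nutch 1.x command syntax; paths and the value 10000 are illustrative.
NUTCH_HOME="${NUTCH_HOME:-/opt/nutch}"
CRAWLDB="${CRAWLDB:-crawl/crawldb}"
SEGMENTS="${SEGMENTS:-crawl/segments}"
TOPN="${TOPN:-10000}"   # at least the number of seed URLs

# Generate a fetch list of up to $TOPN URLs into a new segment.
generate_fetchlist() {
  "$NUTCH_HOME/bin/nutch" generate "$CRAWLDB" "$SEGMENTS" -topN "$TOPN"
}

# Only run when a Nutch installation is actually present.
if [ -x "$NUTCH_HOME/bin/nutch" ]; then
  generate_fetchlist
else
  echo "nutch not found under $NUTCH_HOME; skipping" >&2
fi
```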

Which version of Nutch is used?

Best,
Sebastian

Re: How to reduce the unfetched urls?

Posted by al...@aim.com.
What is the status of one of the unfetched URLs in the db?
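
One way to look this up, sketched under the assumption of Nutch 1.x, is the -url option of readdb, which prints the stored entry for a single URL. The crawl db path and the example URL are placeholders.

```shell
# Assumptions: Nutch 1.x command syntax; the paths and the URL are placeholders.
NUTCH_HOME="${NUTCH_HOME:-/opt/nutch}"
CRAWLDB="${CRAWLDB:-crawl/crawldb}"

# Print the crawl db entry (status, fetch time, retries, ...) for one URL.
url_status() {
  "$NUTCH_HOME/bin/nutch" readdb "$CRAWLDB" -url "$1"
}

# Only run when a Nutch installation is actually present.
if [ -x "$NUTCH_HOME/bin/nutch" ]; then
  url_status "http://example.com/"
else
  echo "nutch not found under $NUTCH_HOME; skipping" >&2
fi
```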