Posted to user@nutch.apache.org by pepe3059 <pe...@gmail.com> on 2012/02/29 02:33:24 UTC

too few db_fetched

Hello, I'm Jose. I have a question and I hope you can help me.

I have Nutch 1.4 and I'm crawling the web of one country (Mexico, .mx), so I
changed regex-urlfilter.txt to add the appropriate pattern (a simplified
sketch of the rule follows the stats below). The second thing I changed, in
the nutch script, was the Java heap size, because of an out-of-memory error.
My question: I am crawling two seed sites with depth 2, but I get very few
pages fetched. The output of readdb -stats is below:
TOTAL urls:	653
retry 0:	653
min score:	0.0
avg score:	0.0077212863
max score:	1.028
status 1 (db_unfetched):	504
status 2 (db_fetched):	139
status 3 (db_gone):	4
status 4 (db_redir_temp):	4
status 5 (db_redir_perm):	2
CrawlDb statistics: done
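For reference, the country restriction I mean in conf/regex-urlfilter.txt
looks roughly like this (a simplified sketch, not my exact rule):

# accept only hosts under the .mx TLD
+^http://([a-z0-9-]+\.)+mx/
# reject everything else
-.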

In some other posts I saw people swap "protocol-httpclient" for
"protocol-http" in nutch-site.xml, but I get the same result with both
protocols.
I did a -dump of the crawldb and manually checked some of the db_unfetched
URLs to see whether they were unavailable, but they are reachable and have
content, and no robots.txt is present on the servers. What must I do to get
more URLs fetched?


Sorry for my English. Thank you.



Re: too few db_fetched

Posted by Markus Jelsma <ma...@openindex.io>.
 Short answer: continue crawling!


 When you're going to crawl a large number of records, I wouldn't 
 encourage you to use the crawl command. It's better to build a small 
 shell script that repeats the crawl cycle over and over.

 Remember, the depth parameter is nothing more than the number of times 
 the crawl cycle is executed; depth 2 means just two cycles. You'll never 
 get far with two cycles.
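
 For example, a minimal sketch of such a script (the paths, the -topN 
 value and the number of iterations are placeholders to adjust):

 #!/bin/sh
 # repeat generate -> fetch -> parse -> updatedb; each pass is one "depth"
 # (assumes the crawldb has already been injected with your seeds)
 CRAWLDB=crawl/crawldb
 SEGMENTS=crawl/segments
 for i in 1 2 3 4 5 6 7 8 9 10; do
   bin/nutch generate $CRAWLDB $SEGMENTS -topN 1000
   SEGMENT=`ls -d $SEGMENTS/2* | tail -1`   # the segment just generated
   bin/nutch fetch $SEGMENT
   bin/nutch parse $SEGMENT
   bin/nutch updatedb $CRAWLDB $SEGMENT
 done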

 On Wed, 29 Feb 2012 05:12:08 +0200, remi tassing 
 <ta...@gmail.com> wrote:
> [...]



Re: too few db_fetched

Posted by pepe3059 <pe...@gmail.com>.
Thank you for your answers. Remi: you can increase the Java heap used by
Nutch by modifying the variable "JAVA_HEAP_MAX=-Xmx1000m" in the bin/nutch
script; 1 GB is currently assigned.
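That is, the line to edit is:

# in bin/nutch; raise the -Xmx value for a bigger heap
JAVA_HEAP_MAX=-Xmx1000m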



Another question about my problem: I know MapReduce is used by default, and I
read in one post that map and reduce tasks can interfere with the fetch
process. Is that correct? Also, where can I find information about the status
codes and the other values dumped by readdb? I got the following record for
one URL:

http://cca.inegi.org.mx/en-contacto/foro-del-cca	Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Feb 28 17:11:55 CST 2012
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.030734694
Signature: null
Metadata: 
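For reference, these are the readdb invocations I mean (crawl/crawldb is just
my path):

# the statistics shown earlier
bin/nutch readdb crawl/crawldb -stats
# dump the whole db as text
bin/nutch readdb crawl/crawldb -dump dumpdir
# the record above, for a single URL
bin/nutch readdb crawl/crawldb -url http://cca.inegi.org.mx/en-contacto/foro-del-cca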

Thank you.



Re: too few db_fetched

Posted by remi tassing <ta...@gmail.com>.
Hi Jose,

We get this question very often, and the short answer, judging from the
'stats' printout, is that everything is probably fine. For a more complete
answer, please search the mailing list or Google.

By the way, how did you change the heap size? I get an IOException when the
topN is 'too' high.

Remi

On Wednesday, February 29, 2012, pepe3059 <pe...@gmail.com> wrote:
> [...]