Posted to user@nutch.apache.org by derevo <da...@inbox.ru> on 2007/05/09 19:29:23 UTC

fetch problem

Hi,

I have two servers running Nutch 0.9 and Hadoop.

I generate a segment:

bin/nutch generate crawlbooks/crawldb crawlbooks/segments -topN 15000
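
[Editor's note: the $segment variable used for the fetch below is never defined in the post. A minimal sketch of one way to set it, assuming the newest directory under crawlbooks/segments is the one just generated; on HDFS, as used in this thread, the listing would go through bin/hadoop dfs -ls instead, with output parsing that depends on the Hadoop version:]

# Sketch: capture the newly generated segment name (assumes a local
# filesystem and that the newest directory is the one just created).
segment=crawlbooks/segments/`ls crawlbooks/segments | sort | tail -1`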

In mapred-default.xml:

<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
</property>

Then I run the fetch:

bin/nutch fetch $segment

After fetching:

bin/nutch readseg -list $segment

NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20070509104954  7500       2007-05-09T10:54:33  2007-05-09T10:56:26  2470     2464


I also tried -topN 50000 and 100000; the FETCHED count stays around 2400-2600 every time.

I'm fetching my own host; the injected links are of the form:

http://myhost.com/arc/*.txt



Thanks


Re: fetch problem

Posted by derevo <da...@inbox.ru>.
Espen wrote:
> Try
> bin/nutch readdb crawlbooks/crawldb -stats
> and see if there are more URLs than ~2500
> Are you doing updatedb after a fetch?


bin/nutch readdb crawlbooks/crawldb -stats
CrawlDb statistics start: crawlbooks/crawldb
Statistics for CrawlDb: crawlbooks/crawldb
TOTAL urls:     152437
retry 0:        152437
min score:      1.0
avg score:      1.0
max score:      1.0
status 1 (db_unfetched):        112795
status 2 (db_fetched):  39640
status 3 (db_gone):     2
CrawlDb statistics: done


Then I generate a new segment:

bin/nutch generate crawlbooks/crawldb crawlbooks/segments -topN 20000

bin/nutch readseg -list on the new segment shows:

NAME            GENERATED  FETCHER START  FETCHER END  FETCHED  PARSED
20070510040137  10000


Next step:

bin/nutch fetch $segment && bin/nutch updatedb crawlbooks/crawldb $segment


bin/nutch readseg -list crawlbooks/segments/20070510040137
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20070510040137  10000      2007-05-10T04:05:41  2007-05-10T04:14:54  3036     2805

Only 3036 FETCHED.

Each downloaded file is about 40,000 bytes (txt files).

I'm using the Hadoop distributed file system (HDFS, formerly NDFS) and MapReduce across two servers.








Re: fetch problem

Posted by Espen Amble Kolstad <es...@trank.no>.
Try
bin/nutch readdb crawlbooks/crawldb -stats
and see if there are more URLs than ~2500

Are you doing updatedb after a fetch?

- Espen
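
[Editor's note: to make the cycle concrete, a minimal sketch of repeated generate/fetch/updatedb rounds, assuming a local filesystem and the crawlbooks paths from this thread. The updatedb step records each round's fetch results back into the crawldb, so the next generate selects URLs that are still unfetched:]

# Sketch of repeated crawl rounds (assumes local filesystem and the
# crawlbooks paths used in this thread).
for round in 1 2 3; do
  bin/nutch generate crawlbooks/crawldb crawlbooks/segments -topN 20000
  segment=crawlbooks/segments/`ls crawlbooks/segments | sort | tail -1`
  bin/nutch fetch $segment
  # Record fetch results so the next generate picks new URLs:
  bin/nutch updatedb crawlbooks/crawldb $segment
done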

derevo wrote:
> I have two servers running Nutch 0.9 and Hadoop.
> [...]
> I also tried -topN 50000 and 100000; the FETCHED count stays
> around 2400-2600 every time.


Re: fetch problem

Posted by derevo <da...@inbox.ru>.
I think it downloads (fetches) roughly the same number of bytes each time, but I don't know where this limit is set.
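
[Editor's note: for reference, a few Nutch 0.9 properties that can cap what a single-host fetch brings back. Whether any of these is the limit seen in this thread is an assumption, not something the thread establishes; the values below are the 0.9 defaults, and overrides would go in conf/nutch-site.xml:]

<!-- Sketch of a conf/nutch-site.xml override; values shown are the
     Nutch 0.9 defaults, listed here only for reference. -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- Max bytes kept per document; the ~40,000-byte files in this
         thread fit under this default. -->
    <value>65536</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <!-- Concurrent requests allowed per host; with a single host this
         bounds overall fetch throughput. -->
    <value>1</value>
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <!-- Seconds between successive requests to the same host. -->
    <value>5.0</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <!-- Max URLs per host in one segment; -1 means unlimited. -->
    <value>-1</value>
  </property>
</configuration>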