Posted to user@nutch.apache.org by derevo <da...@inbox.ru> on 2007/05/09 19:29:23 UTC
fetch problem
hi,
I have two servers running Nutch 0.9 with Hadoop.
I generate a segment by running:
bin/nutch generate crawlbooks/crawldb crawlbooks/segments -topN 15000
In mapred-default.xml I have:
<name>mapred.map.tasks</name>
<value>2</value>
<name>mapred.reduce.tasks</name>
<value>2</value>
Then I run:
bin/nutch fetch $segment
After fetching:
$ bin/nutch readseg -list $segment
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20070509104954  7500       2007-05-09T10:54:33  2007-05-09T10:56:26  2470     2464
I have also tried -topN 50000 and 100000; the FETCHED count is always around 2400-2600.
I'm fetching my own host; the injected links have the form
http://myhost.com/arc/*.txt
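As a side note, whether such URLs get crawled at all is governed by the URL filters; a minimal sketch of a conf/regex-urlfilter.txt entry matching this pattern (the regex below is an assumption, not something from this thread):

# hypothetical rules restricting the crawl to the injected pattern
+^http://myhost\.com/arc/.*\.txt$
-.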
Thanks
Re: fetch problem
Posted by derevo <da...@inbox.ru>.
Espen wrote:
> Try
> bin/nutch readdb crawlbooks/crawldb -stats
> and see if there are more URLs than ~2500
> Are you doing updatedb after a fetch?
bin/nutch readdb crawlbooks/crawldb -stats
CrawlDb statistics start: crawlbooks/crawldb
Statistics for CrawlDb: crawlbooks/crawldb
TOTAL urls: 152437
retry 0: 152437
min score: 1.0
avg score: 1.0
max score: 1.0
status 1 (db_unfetched): 112795
status 2 (db_fetched): 39640
status 3 (db_gone): 2
CrawlDb statistics: done
Then I generate a new segment:
bin/nutch generate crawlbooks/crawldb crawlbooks/segments -topN 20000
readseg now shows:
NAME            GENERATED  FETCHER START  FETCHER END  FETCHED  PARSED
20070510040137  10000
Next step:
bin/nutch fetch $segment && bin/nutch updatedb crawlbooks/crawldb $segment
bin/nutch readseg -list crawlbooks/segments/20070510040137
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20070510040137  10000      2007-05-10T04:05:41  2007-05-10T04:14:54  3036     2805
Only 3036 were fetched.
The size of one downloaded file is around 40000 bytes (they are txt files).
I'm running Nutch on the Hadoop distributed file system (HDFS, formerly NDFS) with
MapReduce, across the two servers.
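A minimal sketch of where such a flat ceiling often comes from, assuming Nutch 0.9 defaults (the diagnosis and the values below are assumptions; only the property names are taken from nutch-default.xml of that era): when every generated URL points at a single host, fetcher threads queue up behind the per-host politeness delay and give up on their URLs after http.max.delays waits, so the fetched count stays roughly constant no matter how large -topN is. Possible conf/nutch-site.xml overrides:

<!-- illustrative overrides; the values are assumptions, not recommendations -->
<property>
  <name>fetcher.threads.per.host</name>
  <value>2</value> <!-- default 1: only one thread may fetch from a host at a time -->
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value> <!-- default 5.0: seconds to wait between requests to one host -->
</property>
<property>
  <name>http.max.delays</name>
  <value>1000</value> <!-- number of waits on a busy host before a thread drops the URL -->
</property>

Raising fetcher.threads.per.host is normally impolite, but may be acceptable here since the host being fetched is the poster's own.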
Re: fetch problem
Posted by Espen Amble Kolstad <es...@trank.no>.
Try
bin/nutch readdb crawlbooks/crawldb -stats
and see if there are more URLs than ~2500.
Are you doing updatedb after a fetch?
- Espen
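A minimal sketch of the full generate/fetch/updatedb cycle Espen is describing, reusing the paths from this thread (the segment name is just the example from the message above):

bin/nutch generate crawlbooks/crawldb crawlbooks/segments -topN 20000
# set $segment to the directory generate just created; on HDFS it can be listed with:
#   bin/hadoop dfs -ls crawlbooks/segments
segment=crawlbooks/segments/20070510040137
bin/nutch fetch $segment
bin/nutch updatedb crawlbooks/crawldb $segment
bin/nutch readdb crawlbooks/crawldb -stats   # db_unfetched should shrink each cycle

Without the updatedb step, generate keeps selecting from the same unfetched list, which is presumably why Espen asks about it.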
derevo wrote:
> [...]
Re: fetch problem
Posted by derevo <da...@inbox.ru>.
I think it downloads (fetches) a nearly identical quantity of bytes each time, but I
don't know where this restriction is.
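One hedged way to hunt for that restriction (the log path below is a typical default, not verified from the thread):

# search the Hadoop/Nutch logs for fetcher skip/retry messages
egrep -ri "retry|delay|exceed" logs/
# and check which limits are overridden locally
diff conf/nutch-default.xml conf/nutch-site.xml

Note that http.content.limit (65536 bytes by default) truncates individual pages; since the files here are only ~40000 bytes, it would not explain a cap on the whole run.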