Posted to user@nutch.apache.org by James Ford <si...@gmail.com> on 2012/03/22 11:48:40 UTC
Generator taking time
Hello,
I am having problems with the generator step of my crawls: it takes a long
time compared to fetching and indexing. Right now the generator step takes
about 50 minutes, while fetching, parsing, and indexing together take only
about 5-10 minutes. It seems like the RegexURLNormalizer is taking up the
time:
2012-03-22 11:13:28,277 INFO regex.RegexURLNormalizer - can't find rules
for scope 'partition', using default
2012-03-22 11:16:00,734 INFO crawl.FetchScheduleFactory - Using
FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-03-22 11:16:00,734 INFO crawl.AbstractFetchSchedule -
defaultInterval=2592000
2012-03-22 11:16:00,734 INFO crawl.AbstractFetchSchedule -
maxInterval=7776000
Crawldb dump:
2012-03-21 14:32:10,310 INFO crawl.CrawlDbReader - Statistics for CrawlDb: crawldb/
2012-03-21 14:32:10,310 INFO crawl.CrawlDbReader - TOTAL urls: 7819485
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 0: 7811052
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 1: 2994
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 2: 1214
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 3: 1125
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 4: 1124
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 5: 1303
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 6: 673
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - min score: 0.0
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - avg score: 0.0015287232
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - max score: 2.0
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - status 1 (db_unfetched): 6946135
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - status 2 (db_fetched): 795070
2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - status 3 (db_gone): 34358
2012-03-21 14:32:10,312 INFO crawl.CrawlDbReader - status 4 (db_redir_temp): 21861
2012-03-21 14:32:10,312 INFO crawl.CrawlDbReader - status 5 (db_redir_perm): 22044
2012-03-21 14:32:10,312 INFO crawl.CrawlDbReader - status 6 (db_notmodified): 17
2012-03-21 14:32:10,312 INFO crawl.CrawlDbReader - CrawlDb statistics: done
Does anyone have a clue how to fix this?
--
View this message in context: http://lucene.472066.n3.nabble.com/Generator-taking-time-tp3848106p3848106.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Generator taking time
Posted by Greg Fields <gr...@gmail.com>.
I have the same problem. I have ~5000 URLs in my seed list and fetch 15000
pages each iteration. The fetching/indexing time is fast, but the time spent
in the RegexURLNormalizer doubles with each iteration. When should I use the
[-noFilter] and [-noNorm] flags? Does the normalizer go through the whole
unfetched list every time?
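For intuition about why this step scales with CrawlDB size: regex URL normalization is an ordered list of substitutions applied to every candidate URL the Generator considers. Here is a toy shell sketch of that idea — it is not Nutch's implementation, and the two sed rules are hypothetical stand-ins for entries in regex-normalize.xml:

```shell
# Toy model of regex URL normalization (NOT Nutch code): an ordered
# list of substitutions applied to every candidate URL. The more URLs
# the Generator considers, the longer this step takes.
normalize() {
  # rule 1: strip a session-id query parameter
  # rule 2: collapse /x/../ path segments
  echo "$1" | sed -e 's/[?&]sid=[0-9]*//' -e 's#/[^/]*/\.\./#/#'
}

normalize 'http://example.com/a/../b?sid=123'   # -> http://example.com/b
normalize 'http://example.com/index.html'       # unchanged
```

Because every rule is tried against every URL, generate time grows with the unfetched list even when fetch size per iteration stays constant.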
--
View this message in context: http://lucene.472066.n3.nabble.com/Generator-taking-time-tp3848106p3851151.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Generator taking time
Posted by Markus Jelsma <ma...@openindex.io>.
bin/nutch generate
Usage: Generator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers
numFetchers] [-adddays numDays] [-noFilter] [-noNorm][-maxNumSegments num]
Use the -noNorm option, and likely the -noFilter option as well. But again, only
do this if you are sure the contents of the CrawlDB are already normalized and
properly filtered.
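Assuming a standard Nutch 1.x layout (the crawl/crawldb and crawl/segments paths below are placeholders), a generate invocation that skips both steps might look like:

```shell
# Skip URL normalization and filtering during segment generation;
# only safe if the CrawlDB contents are already normalized and filtered.
# -topN 15000 is just an example matching the fetch size mentioned earlier.
bin/nutch generate crawl/crawldb crawl/segments -topN 15000 -noNorm -noFilter
```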
On Thursday 22 March 2012 12:18:24 James Ford wrote:
> Thanks for the answer, Markus,
>
> But I don't think I follow you; I am new to Nutch. How can I make Nutch
> apply the normalizer only when necessary? I tried changing the order of
> the normalizers in the config, but nothing happened.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Generator-taking-time-tp3848106p3848158.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
--
Markus Jelsma - CTO - Openindex
Re: Generator taking time
Posted by James Ford <si...@gmail.com>.
Thanks for the answer, Markus,
But I don't think I follow you; I am new to Nutch. How can I make Nutch
apply the normalizer only when necessary? I tried changing the order of the
normalizers in the config, but nothing happened.
--
View this message in context: http://lucene.472066.n3.nabble.com/Generator-taking-time-tp3848106p3848158.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Re: Generator taking time
Posted by Markus Jelsma <ma...@openindex.io>.
If the contents of your CrawlDB are already normalized, then do not use a
normalizer unless you really have to. The same is true for filtering in this
step.
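One way to keep the CrawlDB in a normalized, filtered state is to apply both when updating it: in Nutch 1.x the updatedb command accepts -normalize and -filter switches (the paths and segment name below are placeholders):

```shell
# Normalize and filter URLs while merging the segment back into the CrawlDB,
# so subsequent generate runs can safely use -noNorm -noFilter.
bin/nutch updatedb crawl/crawldb crawl/segments/20120322114840 -normalize -filter
```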
On Thursday 22 March 2012 11:48:40 James Ford wrote:
> Hello,
>
> I am having problems with the generator step of my crawls: it takes a long
> time compared to fetching and indexing. Right now the generator step takes
> about 50 minutes, while fetching, parsing, and indexing together take only
> about 5-10 minutes. It seems like the RegexURLNormalizer is taking up the
> time:
>
> 2012-03-22 11:13:28,277 INFO regex.RegexURLNormalizer - can't find rules
> for scope 'partition', using default
> 2012-03-22 11:16:00,734 INFO crawl.FetchScheduleFactory - Using
> FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
> 2012-03-22 11:16:00,734 INFO crawl.AbstractFetchSchedule -
> defaultInterval=2592000
> 2012-03-22 11:16:00,734 INFO crawl.AbstractFetchSchedule -
> maxInterval=7776000
>
> Crawldb dump:
>
> 2012-03-21 14:32:10,310 INFO crawl.CrawlDbReader - Statistics for CrawlDb: crawldb/
> 2012-03-21 14:32:10,310 INFO crawl.CrawlDbReader - TOTAL urls: 7819485
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 0: 7811052
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 1: 2994
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 2: 1214
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 3: 1125
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 4: 1124
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 5: 1303
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - retry 6: 673
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - min score: 0.0
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - avg score: 0.0015287232
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - max score: 2.0
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - status 1 (db_unfetched): 6946135
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - status 2 (db_fetched): 795070
> 2012-03-21 14:32:10,311 INFO crawl.CrawlDbReader - status 3 (db_gone): 34358
> 2012-03-21 14:32:10,312 INFO crawl.CrawlDbReader - status 4 (db_redir_temp): 21861
> 2012-03-21 14:32:10,312 INFO crawl.CrawlDbReader - status 5 (db_redir_perm): 22044
> 2012-03-21 14:32:10,312 INFO crawl.CrawlDbReader - status 6 (db_notmodified): 17
> 2012-03-21 14:32:10,312 INFO crawl.CrawlDbReader - CrawlDb statistics: done
> done
>
> Does anyone have a clue how to fix this?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Generator-taking-time-tp3848106p3848106.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
--
Markus Jelsma - CTO - Openindex