Posted to user@nutch.apache.org by James Ford <si...@gmail.com> on 2012/03/22 11:48:40 UTC

Generator taking time

Hello,

I am having problems with the Generator step of my crawls. It takes a lot of
time compared to fetching and indexing. Right now the generator step takes
about 50 min, whereas fetching, parsing and indexing together take only
about 5-10 min. It seems like the RegexURLNormalizer is taking up the time:

2012-03-22 11:13:28,277 INFO  regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
2012-03-22 11:16:00,734 INFO  crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-03-22 11:16:00,734 INFO  crawl.AbstractFetchSchedule - defaultInterval=2592000
2012-03-22 11:16:00,734 INFO  crawl.AbstractFetchSchedule - maxInterval=7776000

Crawldb dump:

2012-03-21 14:32:10,310 INFO  crawl.CrawlDbReader - Statistics for CrawlDb: crawldb/
2012-03-21 14:32:10,310 INFO  crawl.CrawlDbReader - TOTAL urls: 7819485
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 0:    7811052
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 1:    2994
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 2:    1214
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 3:    1125
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 4:    1124
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 5:    1303
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - retry 6:    673
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - min score:  0.0
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - avg score:  0.0015287232
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - max score:  2.0
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - status 1 (db_unfetched):    6946135
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - status 2 (db_fetched):      795070
2012-03-21 14:32:10,311 INFO  crawl.CrawlDbReader - status 3 (db_gone):         34358
2012-03-21 14:32:10,312 INFO  crawl.CrawlDbReader - status 4 (db_redir_temp):   21861
2012-03-21 14:32:10,312 INFO  crawl.CrawlDbReader - status 5 (db_redir_perm):   22044
2012-03-21 14:32:10,312 INFO  crawl.CrawlDbReader - status 6 (db_notmodified):  17
2012-03-21 14:32:10,312 INFO  crawl.CrawlDbReader - CrawlDb statistics: done

Does anyone have a clue how to fix this?

--
View this message in context: http://lucene.472066.n3.nabble.com/Generator-taking-time-tp3848106p3848106.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Generator taking time

Posted by Greg Fields <gr...@gmail.com>.
I have the same problem. I have ~5000 URLs in my seed list and fetch 15000
pages each iteration. Fetching and indexing are fast, but the time spent in
the RegexURLNormalizer doubles with each iteration. When should I use the
[-noFilter] and [-noNorm] flags? Does the normalizer go through the whole
unfetched list every time?

Re: Generator taking time

Posted by Markus Jelsma <ma...@openindex.io>.
bin/nutch generate
Usage: Generator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers 
numFetchers] [-adddays numDays] [-noFilter] [-noNorm][-maxNumSegments num]

Use the -noNorm option, and likely the -noFilter option as well. But again,
only do this if you are sure the CrawlDB is already normalized and properly
filtered.
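
As a minimal sketch of the advice above: the paths crawl/crawldb and
crawl/segments and the -topN value are placeholders, not taken from this
thread. The script only prints the command it would run unless DRY_RUN=0 is
set, so it can be checked before touching a real CrawlDB.

```shell
#!/bin/sh
# Placeholder paths and batch size (assumptions, adjust to your setup).
CRAWLDB="crawl/crawldb"
SEGMENTS="crawl/segments"
TOPN=15000

# By default only print the command; set DRY_RUN=0 to actually run Nutch.
DRY_RUN="${DRY_RUN:-1}"

run_generate() {
    # -noFilter and -noNorm are the flags shown in the Generator usage output;
    # they skip URL filtering and normalization during the generate step.
    set -- bin/nutch generate "$CRAWLDB" "$SEGMENTS" -topN "$TOPN" -noFilter -noNorm
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

run_generate
```

Skipping both steps is only safe under the condition stated above: every URL
in the CrawlDB has already passed the normalizers and filters.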



On Thursday 22 March 2012 12:18:24 James Ford wrote:
> Thanks for the answer, Markus,
> 
> But I don't think I follow you. I am new to Nutch. How could I make Nutch
> use the normalizer only when I have to? I tried removing the order of the
> normalizers in the config, but nothing happened.

-- 
Markus Jelsma - CTO - Openindex

Re: Generator taking time

Posted by James Ford <si...@gmail.com>.
Thanks for the answer, Markus,

But I don't think I follow you. I am new to Nutch. How could I make Nutch
use the normalizer only when I have to? I tried removing the order of the
normalizers in the config, but nothing happened.

Re: Generator taking time

Posted by Markus Jelsma <ma...@openindex.io>.
If the state of your CrawlDB is already normalized then do not use a 
normalizer unless you really have to. The same is true for filtering in this 
step.
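
One possible way to get a CrawlDB into a normalized and filtered state once,
instead of paying for it on every generate run, is to rewrite it with
mergedb. This is a sketch under assumptions: the paths are placeholders, and
the -normalize and -filter options should be confirmed against the usage
output of bin/nutch mergedb for your Nutch version. The script prints the
command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Placeholder paths (assumptions, not from this thread).
CRAWLDB="crawl/crawldb"
CLEAN_CRAWLDB="crawl/crawldb_clean"

# By default only print the command; set DRY_RUN=0 to actually run Nutch.
DRY_RUN="${DRY_RUN:-1}"

run_mergedb() {
    # mergedb writes a new CrawlDB to the first path; -normalize and -filter
    # apply the configured URL normalizers and filters to each entry.
    set -- bin/nutch mergedb "$CLEAN_CRAWLDB" "$CRAWLDB" -normalize -filter
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

run_mergedb
```

The original CrawlDB is left untouched, so the result can be inspected before
it replaces the old one.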

-- 
Markus Jelsma - CTO - Openindex