Posted to user@nutch.apache.org by Michael Ji <fj...@yahoo.com> on 2006/04/10 03:17:42 UTC

refetching interval

Hi there,

I have a webdb with over 60,000 pages (according to the
nutch/admin dumptxt command), and the refetching interval
is set to 1 day:

<property>
  <name>db.default.fetch.interval</name>
  <value>1</value>
  <description>The default number of days between
re-fetches of a page.
  </description>
</property>

But when I crawl based on this webdb the next day, the
generate log shows that only around 8,000 pages are
generated for fetching, and 7,500 pages are actually
fetched.

Is there any reason why it behaves like that? Shouldn't
all 60,000 pages be fetched this time?

thanks,

Michael,

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

RE: refetching interval

Posted by Michael Ji <fj...@yahoo.com>.
Hi Gal:

Yes, I set db.max.per.host == 1.

Another interesting thing I found: when I print out the
page information in FetchListTool.java for debugging
during generation and check the log, I see
"...Next fetch: Fri Apr 14 19:49:3...". I generated this
webdb on April 9, and the refetching interval is set to
1 day.

Shouldn't the "Next fetch" date be around April 10th?

Why does this happen?
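
To make the mismatch concrete, here is a minimal sketch of the
date arithmetic only. It is not Nutch code: the class and method
names are hypothetical, the year 2006 is assumed from the posting
date, and it assumes the next fetch date is simply the last fetch
date plus the interval in days.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class NextFetchCheck {
    // Hypothetical helper (not part of Nutch): last fetch date plus
    // the configured interval in days.
    public static LocalDate expectedNextFetch(LocalDate lastFetch, int intervalDays) {
        return lastFetch.plusDays(intervalDays);
    }

    public static void main(String[] args) {
        LocalDate generated = LocalDate.of(2006, 4, 9);   // webdb generated (year assumed)
        LocalDate logged = LocalDate.of(2006, 4, 14);     // "Next fetch" date seen in the log
        LocalDate expected = expectedNextFetch(generated, 1); // db.default.fetch.interval = 1

        // Days between the date I expected and the date in the log.
        long gapDays = ChronoUnit.DAYS.between(expected, logged);
        System.out.println("expected=" + expected + " logged=" + logged
                + " gapDays=" + gapDays);
    }
}
```

With these inputs the expected date is April 10, four days
earlier than the date in the log.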

thanks,

Michael,

--- Gal Nitzan <gn...@usa.net> wrote:

> 
> What about db.max.per.host? is it set to -1 ?
> 
> 

RE: refetching interval

Posted by Gal Nitzan <gn...@usa.net>.
What about db.max.per.host? is it set to -1 ?

