Posted to dev@nutch.apache.org by Stefan Groschupf <sg...@media-style.com> on 2006/05/08 09:01:27 UTC

nutch is losing not modified pages

Hi,

In the fetcher (line 192), when the status is NOTMODIFIED we collect
null as the content, even though we already have the content.
I'm worried about what happens to a page that does not change for 60
days, since the concept of Nutch is to delete segments that are older
than "db.default.fetch.interval", isn't it?

If this is true, maybe someone with write access can change null to
content.
Thanks for any comments.
Stefan
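
A minimal sketch of the change proposed above, not the actual fetcher code: the class, enum, and method names are hypothetical, and it assumes, as the message states, that the previously stored content is at hand when a NOTMODIFIED status comes back.

```java
public class NotModifiedSketch {
    // Hypothetical fetch outcomes; not Nutch's real status constants.
    public enum Status { SUCCESS, NOTMODIFIED }

    /**
     * Chooses the content to write into the new segment.
     * For NOTMODIFIED the server sent no body, so the previously
     * stored content is carried forward instead of null.
     */
    public static byte[] contentToCollect(Status status,
                                          byte[] fetched,
                                          byte[] previous) {
        if (status == Status.NOTMODIFIED) {
            return previous; // the proposal: reuse content rather than collect null
        }
        return fetched;
    }

    public static void main(String[] args) {
        byte[] old = "unchanged page".getBytes();
        byte[] kept = contentToCollect(Status.NOTMODIFIED, null, old);
        System.out.println(new String(kept));
    }
}
```

With this choice the newest segment still holds a copy of the page, so deleting older segments would not drop it.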




Re: nutch is losing not modified pages

Posted by Andrzej Bialecki <ab...@getopt.org>.
Stefan Groschupf wrote:
> Hi,
>
> In the fetcher (line 192), when the status is NOTMODIFIED we collect
> null as the content, even though we already have the content.
> I'm worried about what happens to a page that does not change for 60
> days, since the concept of Nutch is to delete segments that are older
> than "db.default.fetch.interval", isn't it?
>
> If this is true, maybe someone with write access can change null to
> content.

This requires a more systematic approach, which is part of the
adaptive fetch patch. In that patch, pages that are older than the
maximum fetch interval (a system-wide setting) will be forced onto the
fetchlist, no matter what their state. This also ensures that pages in
the GONE state are checked from time to time.
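
The rule described above could be sketched roughly as follows. This is an illustration only, not the adaptive fetch patch itself: the class, the field names, and the per-page schedule are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class ForcedRefetchSketch {
    // Hypothetical page states; names are illustrative only.
    public enum Status { FETCHED, NOTMODIFIED, GONE }

    // Hypothetical stand-in for one crawl database entry.
    public static final class PageEntry {
        final String url;
        final Status status;
        final long lastFetchMillis; // when the page was last fetched
        final long nextFetchMillis; // the page's own scheduled refetch time

        public PageEntry(String url, Status status,
                         long lastFetchMillis, long nextFetchMillis) {
            this.url = url;
            this.status = status;
            this.lastFetchMillis = lastFetchMillis;
            this.nextFetchMillis = nextFetchMillis;
        }
    }

    /** Decides whether a page goes on the fetchlist at time 'now'. */
    public static boolean shouldFetch(PageEntry p, long now, long maxIntervalMillis) {
        // Forced: older than the system-wide maximum, whatever the state.
        if (now - p.lastFetchMillis > maxIntervalMillis) {
            return true;
        }
        // Assumption: GONE pages are otherwise skipped.
        if (p.status == Status.GONE) {
            return false;
        }
        // Everything else follows its own schedule.
        return now >= p.nextFetchMillis;
    }

    /** Collects the URLs that should be fetched, in input order. */
    public static List<String> buildFetchList(List<PageEntry> pages,
                                              long now, long maxIntervalMillis) {
        List<String> urls = new ArrayList<>();
        for (PageEntry p : pages) {
            if (shouldFetch(p, now, maxIntervalMillis)) {
                urls.add(p.url);
            }
        }
        return urls;
    }
}
```

Because the age check runs before the state check, a GONE or NOTMODIFIED page past the maximum interval is always selected.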

I'll be working on this patch next week, with the goal of committing it, 
and I could use some testing and code review then ...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com