Posted to dev@nutch.apache.org by Piotr Kosiorowski <pk...@gmail.com> on 2005/06/06 15:54:46 UTC

-refetchonly investigation

Hello,
I started to investigate the -refetchonly flag because of some questions
on the nutch-user mailing list. I was sure it worked as described in one
of the emails on the list:
"-refetchonly generates you an segment(FetchList) that only contains the 
urls that need to be refetched based on your refetch interval.
Right, new discovered links are not in the fetchlist that will be 
generated by using this option."

But after reading the code and running some experiments, it looks like 
this is not true.

I inserted 1 url into WebDB - http://lucene.apache.org/nutch/.
I generated a segment, fetched it, and updated the db.
There are 21 pages in WebDB after the update.
When I do:
bin/nutch generate db segments/ -refetchonly
a new segment is created that contains 20 pages in its fetchlist.
The http://lucene.apache.org/nutch/ page is missing - as it should be, 
because its nextFetchTime is greater than now. But all the other, newly 
discovered pages are generated into the fetchlist.

They are not fetched when I run "bin/nutch fetch" because they all have 
the fetch flag set to false, so the fetcher does not even try to fetch 
them. During update they are handled as "pageContentsUnchanged".
So in fact they are not fetched but their nextFetchTime is updated - I 
am not sure why such a feature might be useful.
They also take up space in the segment, which affects fetchlists 
generated with the -topN option.

So in my opinion this behavior is not correct.
I would suggest the following steps:
1) if we simply skip the page during fetchlist generation - everything 
should run without problems and users would get the expected behavior - 
I can prepare such a patch (after finishing with the others on my nutch 
patch list :)).
2) http://issues.apache.org/jira/browse/NUTCH-49 - the patch presented 
there will have exactly the same problem (but working in the opposite 
direction) - while preparing the patch for 1) I can take it into account.
3) FetchListEntry.fetch field - I cannot find anything else this field 
is responsible for right now. I will look deeper, but at the moment I 
think this field can be removed from the object, making the fetchlist 
smaller on disk (always a good thing) and removing the handling of this 
field from the fetcher and updatedb.
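
To make step 1) concrete, here is a rough sketch in Java of the skip I 
have in mind. It is not the actual generator code: PageInfo, 
fetchedBefore and generate are hypothetical names made up for this 
example, the example.org urls are placeholders, and the rule that 
-refetchonly treats never-fetched pages as "do not fetch" is my 
assumption based on the experiment above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RefetchOnlySketch {

    /** Hypothetical stand-in for a WebDB page entry; not the real Nutch class. */
    static final class PageInfo {
        final String url;
        final long nextFetchTime;    // millis; the page is due when nextFetchTime <= now
        final boolean fetchedBefore; // assumption: false for newly discovered links

        PageInfo(String url, long nextFetchTime, boolean fetchedBefore) {
            this.url = url;
            this.nextFetchTime = nextFetchTime;
            this.fetchedBefore = fetchedBefore;
        }
    }

    /** Returns the urls that would be written to the fetchlist under proposal 1). */
    static List<String> generate(List<PageInfo> pages, long now, boolean refetchOnly) {
        List<String> fetchlist = new ArrayList<>();
        for (PageInfo page : pages) {
            if (page.nextFetchTime > now) {
                continue; // not due yet, same as the current behavior
            }
            if (refetchOnly && !page.fetchedBefore) {
                // Proposed change: skip the page entirely instead of
                // emitting an entry with the fetch flag set to false.
                continue;
            }
            fetchlist.add(page.url);
        }
        return fetchlist;
    }

    public static void main(String[] args) {
        long now = 1_000_000L;
        List<PageInfo> pages = Arrays.asList(
            new PageInfo("http://lucene.apache.org/nutch/", now + 1000, true),
            new PageInfo("http://example.org/new", 0, false),
            new PageInfo("http://example.org/stale", now - 1, true));

        System.out.println("with -refetchonly:    " + generate(pages, now, true));
        System.out.println("without -refetchonly: " + generate(pages, now, false));
    }
}
```

With a skip like this, skipped pages keep their nextFetchTime and no 
longer use up -topN slots, because they never enter the segment.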

Maybe I am missing some important aspects of this issue so please 
correct me if I am wrong before I start coding.

Regards,
Piotr




Re: -refetchonly investigation

Posted by Doug Cutting <cu...@nutch.org>.
Piotr Kosiorowski wrote:
> I started to investigate the -refetchonly flag because of some questions
> on the nutch-user mailing list. I was sure it worked as described in one
> of the emails on the list:
> "-refetchonly generates you an segment(FetchList) that only contains the 
> urls that need to be refetched based on your refetch interval.
> Right, new discovered links are not in the fetchlist that will be 
> generated by using this option."

The original rationale for the "-refetchonly" option was to permit 
indexing of all of the urls known to the database, with anchor text, 
but without fetching them.  Thus one can, e.g., provide an index of 10M 
urls while only actually fetching 1M urls.  I have never actually used 
this feature myself.  I don't know whether other folks have ever used 
it successfully, nor whether such a feature is in fact desired.

Doug