Posted to user@nutch.apache.org by Dima Mazmanov <nu...@proservice.ge> on 2006/07/12 14:18:16 UTC
Re[2]: Adddays confusion - easy question for the experts

Hi, Matthew.

Could you please show your reindex script once again?

You wrote on 12 July 2006, 1:51:21:

> Honda-Search Administrator wrote:
>> Reader's Digest version:
>> How can I ensure that nutch only crawls the urls I inject into the 
>> fetchlist and not recrawl the entire webdb?
>> Can anyone explain to me (in simple terms) exactly what adddays does?
>>
>> Long version:
>> My setup is simple.  I crawl a number of internet forums.  This 
>> requires me to scan new posts every night to stay on top of things.
>>
>> I crawled all of the older posts on these forums a while ago, and now
>> have to just worry about newer posts.  I have written a small script
>> that injects the pages that have changed or the new pages each night.
>>
>> When I run the recrawl script, I only want to crawl the pages that are
>> injected into the fetchlist (via bin/nutch inject).  I have also 
>> changed the default nutch recrawl time interval (normally 30 days)  to
>> a VERY large number to ensure that nutch will not recrawl old pages
>> for a very long time.
>>
>> Anyway, back to my original question.
>>
>> I recrawled today hoping that nutch would ONLY recrawl the 3000 
>> documents I injected (via bin/nutch inject).  I used depth of 1 and
>> left the adddays parameter blank (because I really can't get a clear
>> idea of what it does). Depth of 1 is used because I only want to crawl
>> the URLs I have injected into the fetchlist and not have nutch go 
>> crazy on other domains, documents, etc.  Using the regex-urlfilter I
>> have also ensured that it will only crawl the domains I want it to crawl.
>>
>> So my command looks something like this:
>>
>> /home/nutch/recrawl.sh /home/nutch/database 1
>>
>> my recrawl script can be seen here:  
>> http://www.honda-search.com/script.html
>>
>> Much to my surprise, Nutch is recrawling EVERY document in my webdb
>> (plus, I assume, the newly injected documents).  Is this because the
>> adddays variable is left blank?  Should I set the adddays variable 
>> really high?  How can I ensure that it only crawls the urls that are
>> injected?
>>
>> Can anyone explain what adddays does (in easy-to-understand terms)?
>> The wiki isn't very clear for a newbie like myself.
>>
> I was looking for similar info. The adddays option advances the clock
> however many days you specify. The default for page reindexing is 30
> days, so every 30 days the page will expire and nutch will reindex it.
> However, if you pass the param -adddays 31, it will advance the clock 31
> days and cause every page to be reindexed.

> If you pass the param -adddays 27 and you have the default reindexing
> set to be 30 days, nutch will reindex all pages older than 3 days. 
> Correct me if I'm wrong.
>   Matt
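
To make the arithmetic in Matt's description concrete, here is a minimal sketch of the rule as he states it. It is not Nutch's actual code: the function name is_due, its parameters, and the exact comparison are assumptions made for illustration only. It treats a page as due when its last fetch time plus the refetch interval is no later than "now" advanced by adddays.

```python
from datetime import datetime, timedelta

def is_due(last_fetch, now, interval_days=30, adddays=0):
    """Hypothetical model of the rule described above (not Nutch's real code).

    A page is considered due for refetch when its next scheduled fetch
    (last fetch + interval) is not later than the clock advanced by adddays.
    """
    next_fetch = last_fetch + timedelta(days=interval_days)
    advanced_clock = now + timedelta(days=adddays)
    return next_fetch <= advanced_clock

now = datetime(2006, 7, 12)

# With a 30-day interval and adddays=27, pages fetched 3 or more days ago are due.
for age_days in (1, 3, 5):
    last_fetch = now - timedelta(days=age_days)
    print(age_days, is_due(last_fetch, now, interval_days=30, adddays=27))
```

Under this model, adddays=31 with a 30-day interval makes every page due, even one fetched today, which matches the "recrawl everything" behaviour described above.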



-- 
Regards,
 Dima                          mailto:nuther@proservice.ge