Posted to user@nutch.apache.org by Dima Mazmanov <nu...@proservice.ge> on 2006/07/12 14:18:16 UTC
Re[2]: Adddays confusion - easy question for the experts
Hi, Matthew.
Could you please show your reindex script once again?
You wrote on 12 July 2006, 1:51:21:
> Honda-Search Administrator wrote:
>> Reader's Digest version:
>> How can I ensure that nutch only crawls the urls I inject into the
>> fetchlist and not recrawl the entire webdb?
>> Can anyone explain to me (in simple terms) exactly what adddays does?
>>
>> Long version:
>> My setup is simple. I crawl a number of internet forums. This
>> requires me to scan new posts every night to stay on top of things.
>>
>> I crawled all of the older posts on these forums a while ago, and now
>> have to just worry about newer posts. I have written a small script
>> that injects the pages that have changed or the new pages each night.
>>
>> When I run the recrawl script, I only want to crawl the pages that are
>> injected into the fetchlist (via bin/nutch inject). I have also
>> changed the default nutch recrawl time interval (normally 30 days) to
>> a VERY large number to ensure that nutch will not recrawl old pages
>> for a very long time.
>>
>> Anyway, back to my original question.
>>
>> I recrawled today hoping that nutch would ONLY recrawl the 3000
>> documents I injected (via bin/nutch inject). I used depth of 1 and
>> left the adddays parameter blank (because I really can't get a clear
>> idea of what it does). Depth of 1 is used because I only want to crawl
>> the URLs I have injected into the fetchlist and not have nutch go
>> crazy on other domains, documents, etc. Using the regex-urlfilter I
>> have also ensured that it will only crawl the domains I want it to crawl.
>>
>> So my command looks something like this:
>>
>> /home/nutch/recrawl.sh /home/nutch/database 1
>>
>> my recrawl script can be seen here:
>> http://www.honda-search.com/script.html
>>
>> Much to my surprise, Nutch is recrawling EVERY document in my webdb
>> (plus, I assume, the newly injected documents). Is this because the
>> adddays variable is left blank? Should I set the adddays variable
>> really high? How can I ensure that it only crawls the urls that are
>> injected?
>>
>> Can anyone explain what adddays does (in easy-to-understand terms)?
>> The wiki isn't very clear for a newbie like myself.
>>
> I was looking for similar info. The adddays option advances the clock
> however many days you specify. The default for page reindexing is 30
> days, so every 30 days the page will expire and nutch will reindex it.
> However, if you pass the param -adddays 31, it will advance the clock 31
> days and cause every page to be reindexed.
> If you pass the param -adddays 27 and you have the default reindexing
> set to be 30 days, nutch will reindex all pages last fetched more than
> 3 days ago (30 - 27 = 3).
> Correct me if I'm wrong.
> Matt
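
The arithmetic in the explanation above can be checked with a small sketch. It models only the rule as Matt describes it (a page is due once its last fetch time plus the refetch interval is no later than "now" shifted forward by adddays). It is not taken from the Nutch source, and the function name is_due and its parameters are made up for illustration.

```python
from datetime import datetime, timedelta


def is_due(last_fetched, now, interval_days=30, adddays=0):
    """Return True if a page would be selected under the rule described above.

    Hypothetical helper: it mirrors the quoted explanation, not Nutch's code.
    """
    # adddays moves the notion of "now" forward by that many days.
    effective_now = now + timedelta(days=adddays)
    # A page is due once its refetch interval has elapsed relative to that clock.
    return last_fetched + timedelta(days=interval_days) <= effective_now


now = datetime(2006, 7, 12)
fetched_5_days_ago = now - timedelta(days=5)
fetched_2_days_ago = now - timedelta(days=2)

# With a 30-day interval and adddays=27, the cutoff is 30 - 27 = 3 days.
print(is_due(fetched_5_days_ago, now, adddays=27))  # True
print(is_due(fetched_2_days_ago, now, adddays=27))  # False
# With adddays=31, every page fetched before "now" is due.
print(is_due(fetched_2_days_ago, now, adddays=31))  # True
```

Under this model, leaving adddays at 0 with a 30-day interval would select neither of the two sample pages, so the sketch does not by itself explain a full recrawl of the webdb.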
--
Regards,
Dima mailto:nuther@proservice.ge