Posted to user@nutch.apache.org by Matthew Holt <mh...@redhat.com> on 2006/07/11 22:51:21 UTC

Re: Adddays confusion - easy question for the experts

Honda-Search Administrator wrote:
> Reader's Digest version:
> How can I ensure that nutch only crawls the urls I inject into the 
> fetchlist and doesn't recrawl the entire webdb?
> Can anyone explain to me (in simple terms) exactly what adddays does?
>
> Long version:
> My setup is simple.  I crawl a number of internet forums.  This 
> requires me to scan new posts every night to stay on top of things.
>
> I crawled all of the older posts on these forums a while ago, and now 
> have to just worry about newer posts.  I have written a small script 
> that injects the pages that have changed or the new pages each night.
>
> When I run the recrawl script, I only want to crawl the pages that are 
> injected into the fetchlist (via bin/nutch inject).  I have also 
> changed the default nutch recrawl time interval (normally 30 days)  to 
> a VERY large number to ensure that nutch will not recrawl old pages 
> for a very long time.
>
> Anyway, back to my original question.
>
> I recrawled today hoping that nutch would ONLY recrawl the 3000 
> documents I injected (via bin/nutch inject).  I used depth of 1 and 
> left the adddays parameter blank (because I really can't get a clear 
> idea of what it does). Depth of 1 is used because I only want to crawl 
> the URLs I have injected into the fetchlist and not have nutch go 
> crazy on other domains, documents, etc.  Using the regex-urlfilter I 
> have also ensured that it will only crawl the domains I want it to crawl.
>
> So my command looks something like this:
>
> /home/nutch/recrawl.sh /home/nutch/database 1
>
> my recrawl script can be seen here:  
> http://www.honda-search.com/script.html
>
> Much to my surprise, Nutch is recrawling EVERY document in my webdb 
> (plus, I assume, the newly injected documents).  Is this because the 
> adddays variable is left blank?  Should I set the adddays variable 
> really high?  How can I ensure that it only crawls the urls that are 
> injected?
>
> Can anyone explain what adddays does (in easy-to-understand terms)?  
> The wiki isn't very clear for a newbie like myself.
>
I was looking for similar info. The adddays option advances the clock by 
however many days you specify when the fetchlist is generated. The default 
refetch interval is 30 days, so a page normally expires 30 days after it 
was last fetched and nutch refetches (and reindexes) it on the next 
recrawl. However, if you pass the param -adddays 31, the clock is advanced 
31 days, so every page looks expired and gets refetched.

If you pass the param -adddays 27 and you have the default 30-day interval, 
nutch will refetch every page that was last fetched more than 3 days ago. 
Correct me if I'm wrong.
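
To make that concrete, -adddays goes on the generate step of the recrawl 
loop. Here's a rough sketch of a wiki-style Nutch 0.7 recrawl (not your 
actual script -- the paths, variable names, and the trailing index/dedup 
steps are placeholders, so treat it as an illustration only):

  #!/bin/bash
  # Sketch only: crawl dir, depth, and adddays come in as arguments.
  crawl_dir=$1
  depth=$2
  adddays=$3

  webdb_dir=$crawl_dir/db
  segments_dir=$crawl_dir/segments

  for (( i = 0; i < depth; i++ ))
  do
    # -adddays shifts "now" forward, so pages whose fetch interval would
    # expire within the next $adddays days also end up on the fetchlist.
    bin/nutch generate $webdb_dir $segments_dir -adddays $adddays
    segment=`ls -d $segments_dir/* | tail -1`
    bin/nutch fetch $segment
    bin/nutch updatedb $webdb_dir $segment
  done
  # ...then the usual index / dedup / merge steps on the new segments.
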
  Matt

Re: Adddays confusion - easy question for the experts

Posted by Chris Newton <ne...@radian6.com>.
On the original question...  I'm trying to do something similar, using
nutch (crawl a list of sites... and just my list). I.e., I don't want a
predefined refetch time, and I don't want links on the pages I've crawled
to be crawled next time... just my injected URLs, please and thanks. Anyone
know if this is possible in nutch?
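
The "VERY large number" trick mentioned above is, I'm guessing, the
db.default.fetch.interval property (it's in days), so something like this
in conf/nutch-site.xml -- assuming that's the right property name for your
Nutch version -- should push refetching far into the future:

  <property>
    <name>db.default.fetch.interval</name>
    <!-- days between refetches; set very high so already-fetched pages
         effectively never come due on their own -->
    <value>3650</value>
  </property>

That still wouldn't stop newly discovered links from being added to the
webdb and fetched on a later pass, though, which is the other half of what
I'm after.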

chris



-- 
Chris Newton,
CTO Radian6, www.radian6.com
Phone: 506-452-9039

Re: Adddays confusion - easy question for the experts

Posted by Honda-Search Administrator <ad...@honda-search.com>.
That's an awesome explanation Matt... Thanks :)



Re[2]: Adddays confusion - easy question for the experts

Posted by Dima Mazmanov <nu...@proservice.ge>.
Hi, Matthew.

Could you please show your reindex script once again?

You wrote on 12 July 2006, 1:51:21:







-- 
Regards,
 Dima                          mailto:nuther@proservice.ge