Posted to user@nutch.apache.org by Peter Swoboda <pr...@gmx.de> on 2007/02/22 14:44:01 UTC

re-fetch

Hi,
what does the property

<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description>The default number of days between re-fetches of a page.
  </description>
</property>

exactly do?
Does it mean that any changes to an injected URL will be picked up?
Who/what re-fetches the page?
What do I have to do if I want Nutch to reflect (in the search results) that an injected URL has changed?
Do I have to do a complete recrawl (like in the script)?

Thanks
Peter




Re: Incremental crawl using Nutch

Posted by Andrzej Bialecki <ab...@getopt.org>.
rubdabadub wrote:
> On 2/23/07, Andrzej Bialecki <ab...@getopt.org> wrote:
>> rubdabadub wrote:
>> > http://issues.apache.org/jira/browse/NUTCH-61
>> >
>> > Question is: can you fix it? :-) And share it with the rest. :-)
>> >
>>
>> I will upload the new version of the patch in a few days. The latest one
>> is incomplete.
>
> Great! You mean the latest JIRA patch is incomplete? Or your latest
> one that you will upload? :-0

Ah, sorry, it wasn't clear ... The latest patch in JIRA is incomplete. I 
hope my patch will be complete - that's what I'm aiming for, but time will 
tell ... ;)

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Incremental crawl using Nutch

Posted by rubdabadub <ru...@gmail.com>.
On 2/23/07, Andrzej Bialecki <ab...@getopt.org> wrote:
> rubdabadub wrote:
> > http://issues.apache.org/jira/browse/NUTCH-61
> >
> > Question is: can you fix it? :-) And share it with the rest. :-)
> >
>
> I will upload the new version of the patch in a few days. The latest one
> is incomplete.

Great! You mean the latest JIRA patch is incomplete? Or your latest
one that you will upload? :-0

> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>

Re: Incremental crawl using Nutch

Posted by Andrzej Bialecki <ab...@getopt.org>.
rubdabadub wrote:
> http://issues.apache.org/jira/browse/NUTCH-61
>
> Question is: can you fix it? :-) And share it with the rest. :-)
>

I will upload the new version of the patch in a few days. The latest one 
is incomplete.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Incremental crawl using Nutch

Posted by rubdabadub <ru...@gmail.com>.
http://issues.apache.org/jira/browse/NUTCH-61

Question is: can you fix it? :-) And share it with the rest. :-)

Regards

On 2/23/07, Dennis Kubes <nu...@dragonflymc.com> wrote:
> sandeep pujar wrote:
> > By incremental I meant that after a full crawl, subsequent
> > crawls should fetch only the changed pages.
>
> The problem with fetching changed pages is that you need to know which pages
> have changed.  Once you do, you can load only the changed pages through
> an inject, generate, fetch cycle and then merge the crawldb and segments
> with previously fetched results.  The python script performs this type
> of process, though for new unfetched links rather than changed pages.  You may
> be able to modify it to fetch only changed pages.
>
> Dennis Kubes
> >
> > I was not clear on how I could use the python
> > automation script for that.
> >
> > Is there something I am missing here ?
> >
> >
> > --- Dennis Kubes <nu...@dragonflymc.com> wrote:
> >
> >> You can use the python automation script found at:
> >>
> >>
> > http://wiki.apache.org/nutch/Automating_Fetches_with_Python
> >> I almost have a new version ready.  Will post it in
> >> the next couple of
> >> days to the wiki.
> >>
> >> Dennis Kubes
> >>
> >> sandeep pujar wrote:
> >>> Greetings,
> >>>
> >>> Are there ways we can initiate incremental
> >> crawl/index
> >>> using Nutch.
> >>>
> >>> I tried to lookup  wikis and other sources and did
> >> not
> >>> find much information.
> >>>
> >>> Any ideas pointers,
> >>>
> >>> Thanks,
> >>> Sandeep

Re: Incremental crawl using Nutch

Posted by Dennis Kubes <nu...@dragonflymc.com>.
sandeep pujar wrote:
> By incremental I meant that after a full crawl, subsequent
> crawls should fetch only the changed pages.

The problem with fetching changed pages is that you need to know which pages 
have changed.  Once you do, you can load only the changed pages through 
an inject, generate, fetch cycle and then merge the crawldb and segments 
with previously fetched results.  The python script performs this type 
of process, though for new unfetched links rather than changed pages.  You may 
be able to modify it to fetch only changed pages.
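
As a rough sketch, the inject/generate/fetch cycle described above looks 
something like this on the command line. Directory names are illustrative, 
and exact tool usage may differ between Nutch versions:

```shell
# Hypothetical one-pass crawl cycle for a list of new or changed URLs.
# 'urls/' holds a text file listing the URLs to (re)fetch.
bin/nutch inject crawl/crawldb urls/

# Generate a fetch list from the crawldb into a new segment
bin/nutch generate crawl/crawldb crawl/segments

# Pick up the newest segment directory and fetch it
SEG=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $SEG

# Fold the fetch results back into the crawldb
bin/nutch updatedb crawl/crawldb $SEG
```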

Dennis Kubes
> 
> I was not clear on how I could use the python
> automation script for that.
> 
> Is there something I am missing here ?
> 
> 
> --- Dennis Kubes <nu...@dragonflymc.com> wrote:
> 
>> You can use the python automation script found at:
>>
>>
> http://wiki.apache.org/nutch/Automating_Fetches_with_Python
>> I almost have a new version ready.  Will post it in
>> the next couple of 
>> days to the wiki.
>>
>> Dennis Kubes
>>
>> sandeep pujar wrote:
>>> Greetings,
>>>
>>> Are there ways we can initiate incremental
>> crawl/index
>>> using Nutch.
>>>
>>> I tried to lookup  wikis and other sources and did
>> not
>>> find much information.
>>>
>>> Any ideas pointers,
>>>
>>> Thanks,
>>> Sandeep

Re: Incremental crawl using Nutch

Posted by sandeep pujar <sa...@yahoo.com>.
By incremental I meant that after a full crawl, subsequent
crawls should fetch only the changed pages.

I was not clear on how I could use the python
automation script for that.

Is there something I am missing here ?


--- Dennis Kubes <nu...@dragonflymc.com> wrote:

> You can use the python automation script found at:
> 
>
http://wiki.apache.org/nutch/Automating_Fetches_with_Python
> 
> I almost have a new version ready.  Will post it in
> the next couple of 
> days to the wiki.
> 
> Dennis Kubes
> 
> sandeep pujar wrote:
> > Greetings,
> > 
> > Are there ways we can initiate incremental
> crawl/index
> > using Nutch.
> > 
> > I tried to lookup  wikis and other sources and did
> not
> > find much information.
> > 
> > Any ideas pointers,
> > 
> > Thanks,
> > Sandeep

Re: Incremental crawl using Nutch

Posted by Dennis Kubes <nu...@dragonflymc.com>.
You can use the python automation script found at:

http://wiki.apache.org/nutch/Automating_Fetches_with_Python

I almost have a new version ready.  I will post it to the wiki in the 
next couple of days.

Dennis Kubes

sandeep pujar wrote:
> Greetings,
> 
> Are there ways to initiate an incremental crawl/index
> using Nutch?
> 
> I tried to look up the wikis and other sources and did not
> find much information.
> 
> Any ideas/pointers?
> 
> Thanks,
> Sandeep

Incremental crawl using Nutch

Posted by sandeep pujar <sa...@yahoo.com>.
Greetings,

Are there ways to initiate an incremental crawl/index
using Nutch?

I tried to look up the wikis and other sources and did not
find much information.

Any ideas/pointers?

Thanks,
Sandeep





Re: re-fetch

Posted by Dennis Kubes <nu...@dragonflymc.com>.
Peter Swoboda wrote:
> Hi,
> what does the property
> 
> <property>
>   <name>db.default.fetch.interval</name>
>   <value>30</value>
>   <description>The default number of days between re-fetches of a page.
>   </description>
> </property>
> 
> exactly do?

URLs in the CrawlDb are set to be re-fetched after a given interval.  The 
default is 30 days; this property sets that interval.
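
In pseudocode terms, the eligibility check amounts to comparing the time 
since the last fetch against the interval. This is an illustrative sketch, 
not Nutch's actual implementation; the names are made up:

```python
from datetime import datetime, timedelta

# Mirrors db.default.fetch.interval (days); illustrative, not Nutch internals.
DEFAULT_FETCH_INTERVAL_DAYS = 30

def is_due_for_refetch(last_fetch_time, now,
                       interval_days=DEFAULT_FETCH_INTERVAL_DAYS):
    """A URL becomes eligible for re-fetch once the interval has elapsed."""
    return now - last_fetch_time >= timedelta(days=interval_days)

last = datetime(2007, 1, 1)
print(is_due_for_refetch(last, datetime(2007, 1, 15)))  # False: only 14 days
print(is_due_for_refetch(last, datetime(2007, 2, 5)))   # True: 35 days
```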

> Does it mean that any changes to an injected URL will be picked up?
> Who/what re-fetches the page?

The Fetcher will, once the interval has expired.  This does not happen 
automatically; a fetch job has to be run.

> What do I have to do if I want Nutch to reflect (in the search results) that an injected URL has changed?
> Do I have to do a complete recrawl (like in the script)?

If you know specific URLs have changed, you can create a fetch list of 
only those URLs (through a separate crawldb and segments on a separate 
inject, generate, fetch process...don't use the same path).  Then you can 
merge those results using mergedb for the CrawlDb and mergesegs for the 
segments.  You should not have to do a full recrawl unless you don't know 
which pages changed.
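
A hedged sketch of that separate-path-plus-merge approach with Nutch's 
command-line tools follows. The paths and the changed_urls/ directory are 
hypothetical names, and exact tool arguments may vary by Nutch version:

```shell
# 1. Inject the known-changed URLs into a scratch crawldb (separate path!)
bin/nutch inject changed/crawldb changed_urls/

# 2. Generate a fetch list into a scratch segments dir and fetch it
bin/nutch generate changed/crawldb changed/segments
SEG=`ls -d changed/segments/* | tail -1`
bin/nutch fetch $SEG

# 3. Update the scratch crawldb with the fetch results
bin/nutch updatedb changed/crawldb $SEG

# 4. Merge the scratch results back into the main crawl
bin/nutch mergedb crawl/crawldb_merged crawl/crawldb changed/crawldb
bin/nutch mergesegs crawl/segments_merged -dir crawl/segments
```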

Dennis Kubes
> 
> Thanks
> Peter
> 
> 
>