Posted to user@nutch.apache.org by Peter Swoboda <pr...@gmx.de> on 2007/02/22 14:44:01 UTC
re-fetch
Hi,
What does the property
<property>
<name>db.default.fetch.interval</name>
<value>30</value>
<description>The default number of days between re-fetches of a page.
</description>
</property>
exactly do?
Does it mean that any changes to an injected URL will be picked up?
Who or what re-fetches the page?
What do I have to do if I want Nutch to reflect (in the search results) that an injected URL has changed?
Do I have to do a complete recrawl (like in the script)?
Thanks
Peter
--
Re: Incremental crawl using Nutch
Posted by Andrzej Bialecki <ab...@getopt.org>.
rubdabadub wrote:
> On 2/23/07, Andrzej Bialecki <ab...@getopt.org> wrote:
>> rubdabadub wrote:
>> > http://issues.apache.org/jira/browse/NUTCH-61
>> >
>> > Question is: can you fix it? :-) And share it with the rest? :-)
>> >
>>
>> I will upload the new version of the patch in a few days. The latest one
>> is incomplete.
>
> Great! You mean the latest JIRA patch is incomplete, or the latest
> one that you will upload? :-0
Ah, sorry, it wasn't clear ... The latest patch in JIRA is incomplete. I
hope my patch will be complete - that's what I'm aiming for, but time will
tell ... ;)
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Incremental crawl using Nutch
Posted by rubdabadub <ru...@gmail.com>.
On 2/23/07, Andrzej Bialecki <ab...@getopt.org> wrote:
> rubdabadub wrote:
> > http://issues.apache.org/jira/browse/NUTCH-61
> >
> > Question is: can you fix it? :-) And share it with the rest? :-)
> >
>
> I will upload the new version of the patch in a few days. The latest one
> is incomplete.
Great! You mean the latest JIRA patch is incomplete, or the latest
one that you will upload? :-0
Re: Incremental crawl using Nutch
Posted by Andrzej Bialecki <ab...@getopt.org>.
rubdabadub wrote:
> http://issues.apache.org/jira/browse/NUTCH-61
>
> Question is: can you fix it? :-) And share it with the rest? :-)
>
I will upload the new version of the patch in a few days. The latest one
is incomplete.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Incremental crawl using Nutch
Posted by rubdabadub <ru...@gmail.com>.
http://issues.apache.org/jira/browse/NUTCH-61
Question is: can you fix it? :-) And share it with the rest? :-)
Regards
On 2/23/07, Dennis Kubes <nu...@dragonflymc.com> wrote:
> sandeep pujar wrote:
> > By incremental I meant that after a full crawl, subsequent
> > crawls should fetch only the changed pages.
>
> The problem with fetching changed pages is that you need to know which pages
> have changed. Once you do, you can load only the changed pages through
> an inject, generate, fetch cycle and then merge the crawldb and segments
> with the previously fetched results. The python script performs this type
> of process, but for new unfetched links rather than changed pages. You may
> be able to modify it to fetch only changed pages.
>
> Dennis Kubes
Re: Incremental crawl using Nutch
Posted by Dennis Kubes <nu...@dragonflymc.com>.
sandeep pujar wrote:
> By incremental I meant that after a full crawl, subsequent
> crawls should fetch only the changed pages.
The problem with fetching changed pages is that you need to know which pages
have changed. Once you do, you can load only the changed pages through
an inject, generate, fetch cycle and then merge the crawldb and segments
with the previously fetched results. The python script performs this type
of process, but for new unfetched links rather than changed pages. You may
be able to modify it to fetch only changed pages.
Dennis Kubes
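For concreteness, the inject, generate, fetch cycle Dennis describes could be sketched as the bin/nutch invocations below, assembled in Python. This is only a sketch: the delta/ and changed_urls paths are hypothetical placeholders, and the exact command syntax may vary between Nutch versions.

```python
# Sketch: assemble the commands for a separate "delta" crawl of changed pages.
# Directory names (delta/, changed_urls/) are hypothetical placeholders.
def delta_crawl_commands(seed_dir, delta_dir, segment):
    return [
        f"bin/nutch inject {delta_dir}/crawldb {seed_dir}",             # load only the changed URLs
        f"bin/nutch generate {delta_dir}/crawldb {delta_dir}/segments", # build a fetch list
        f"bin/nutch fetch {segment}",                                   # fetch the generated segment
        f"bin/nutch updatedb {delta_dir}/crawldb {segment}",            # record fetch results
    ]
```

The resulting crawldb and segments would then be merged with the main crawl, as discussed further down the thread.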
Re: Incremental crawl using Nutch
Posted by sandeep pujar <sa...@yahoo.com>.
By incremental I meant that after a full crawl, subsequent
crawls should fetch only the changed pages.
I was not clear on how I could use the python
automation script for that.
Is there something I am missing here?
--- Dennis Kubes <nu...@dragonflymc.com> wrote:
> You can use the python automation script found at:
>
>
http://wiki.apache.org/nutch/Automating_Fetches_with_Python
>
> I almost have a new version ready. Will post it in
> the next couple of
> days to the wiki.
>
> Dennis Kubes
Re: Incremental crawl using Nutch
Posted by Dennis Kubes <nu...@dragonflymc.com>.
You can use the python automation script found at:
http://wiki.apache.org/nutch/Automating_Fetches_with_Python
I almost have a new version ready. Will post it in the next couple of
days to the wiki.
Dennis Kubes
sandeep pujar wrote:
> Greetings,
>
> Are there ways we can initiate an incremental crawl/index
> using Nutch?
>
> I tried to look up wikis and other sources and did not
> find much information.
>
> Any ideas or pointers?
>
> Thanks,
> Sandeep
Incremental crawl using Nutch
Posted by sandeep pujar <sa...@yahoo.com>.
Greetings,
Are there ways we can initiate an incremental crawl/index
using Nutch?
I tried to look up wikis and other sources and did not
find much information.
Any ideas or pointers?
Thanks,
Sandeep
Re: re-fetch
Posted by Dennis Kubes <nu...@dragonflymc.com>.
Peter Swoboda wrote:
> Hi,
> What does the property
>
> <property>
> <name>db.default.fetch.interval</name>
> <value>30</value>
> <description>The default number of days between re-fetches of a page.
> </description>
> </property>
>
> exactly do?
URLs in the CrawlDb are set to be re-fetched after a given interval. The
default is 30 days. This property sets that interval.
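As a sketch of the bookkeeping involved (an illustrative model, not Nutch's actual CrawlDb code; the names here are made up), the interval check amounts to:

```python
import datetime

# Mirrors db.default.fetch.interval; 30 days is the Nutch default.
DEFAULT_FETCH_INTERVAL_DAYS = 30

def due_for_refetch(last_fetch, now, interval_days=DEFAULT_FETCH_INTERVAL_DAYS):
    """True once at least interval_days have passed since the last fetch.

    Illustrative model only, not Nutch's actual implementation.
    """
    return now - last_fetch >= datetime.timedelta(days=interval_days)
```

A generate run would then put only URLs for which this check is true onto the fetch list.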
> Does it mean that any changes to an injected URL will be picked up?
> Who or what re-fetches the page?
The Fetcher does, once the interval has expired. This does not happen
automatically; a fetch job has to be run.
> What do I have to do if I want Nutch to reflect (in the search results) that an injected URL has changed?
> Do I have to do a complete recrawl (like in the script)?
If you know that specific URLs have changed, you can create a fetch list of
only those URLs (through a separate crawldb and segments on a separate
inject, generate, fetch process...don't use the same path). Then you can
merge those results using mergedb for the CrawlDb and mergesegs for the
segments. You should not have to do a full recrawl unless you don't know
which pages were changed.
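The merge step could be sketched as the commands below, built in Python. All paths are hypothetical placeholders; check the usage output of `bin/nutch mergedb` and `bin/nutch mergesegs` for your Nutch version before relying on the exact argument order.

```python
# Sketch: merge a separate "delta" crawl back into the main crawl.
# All paths are hypothetical placeholders.
def merge_commands(main_crawldb, delta_crawldb, segments, out_dir):
    return [
        # Merge the two CrawlDbs into a fresh output CrawlDb.
        f"bin/nutch mergedb {out_dir}/crawldb {main_crawldb} {delta_crawldb}",
        # Merge all listed segments into a fresh output segments directory.
        f"bin/nutch mergesegs {out_dir}/segments " + " ".join(segments),
    ]
```

After merging, the merged crawldb and segments replace the originals for subsequent indexing.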
Dennis Kubes