Posted to user@nutch.apache.org by al...@aim.com on 2011/06/01 08:18:54 UTC
keeping index up to date
Hello,
I use Nutch 1.2 to index about 3000 sites. One of them has about 1500 PDF files, which do not change over time.
I wondered if there is a way of configuring Nutch not to fetch unchanged documents again and again, but to keep the old index entries for them.
Thanks.
Alex.
Re: keeping index up to date
Posted by Radim Kolar <hs...@sendmail.cz>.
On 26.7.2011 21:55, Markus Jelsma wrote:
> We have the injector for that ;)
>
What will the injector do if an injected URL is already in the database? Will it be
injected with priority 1.0 and re-scheduled for an immediate fetch?
Re: keeping index up to date
Posted by Markus Jelsma <ma...@openindex.io>.
We have the injector for that ;)
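For example, a sketch of that workflow, assuming the usual Nutch 1.x layout with the crawl database in crawl/crawldb and seed URLs in urls/ (the paths and the example URL are illustrative, not from this thread):

```shell
# Append the new URLs to a seed file (one URL per line).
echo "http://www.example.com/new-page.html" >> urls/seed.txt

# Inject them into the existing crawldb; entries already present in the
# database keep their current status and fetch schedule.
bin/nutch inject crawl/crawldb urls/
```

The next generate/fetch/updatedb cycle will then pick the new URLs up alongside the existing ones.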
> Hello,
>
> One more question. Is there a way of adding new URLs to a crawldb created in
> previous crawls so that they are included in subsequent recrawls?
>
> Thanks.
> Alex.
>
>
>
Re: keeping index up to date
Posted by al...@aim.com.
Hello,
One more question. Is there a way of adding new URLs to a crawldb created in previous crawls so that they are included in subsequent recrawls?
Thanks.
Alex.
Re: keeping index up to date
Posted by lewis john mcgibbney <le...@gmail.com>.
Hi,
To add to Markus' comments, if you take a look at the script, it is written
in such a way that running it in safe mode protects us against errors which
may occur. If that happens we can recover segments etc. and take
appropriate action to resolve them.
--
*Lewis*
Re: keeping index up to date
Posted by Markus Jelsma <ma...@openindex.io>.
> Hi,
>
> I took a look at the recrawl script and noticed that all the steps except
> URL injection are repeated at each subsequent indexing, and wondered why
> we would generate new segments. Is it possible to do fetch and update for
> all previous $s1..$sn, then the invertlinks and index steps?
No, the generator generates a segment with a list of URLs for the fetcher to
fetch. You can, if you like, then merge segments.
>
> Thanks.
> Alex.
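A minimal recrawl loop along those lines, sketched against the Nutch 1.x command-line tools (the crawl/ layout and the depth of 3 are assumptions, not part of the thread):

```shell
# Illustrative Nutch 1.x recrawl loop; paths and depth are assumptions.
depth=3
for i in $(seq 1 $depth); do
  # The generator creates a fresh segment listing the URLs due for
  # fetching -- this is why a new segment is needed on every round.
  bin/nutch generate crawl/crawldb crawl/segments
  segment=$(ls -d crawl/segments/* | tail -1)
  bin/nutch fetch "$segment"                   # fetch the generated list
  bin/nutch updatedb crawl/crawldb "$segment"  # fold results into the crawldb
done
# Optionally merge the accumulated segments before link inversion and indexing.
bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
```

Merging keeps the number of segments manageable across repeated recrawls, at the cost of an extra MapReduce pass.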
Re: keeping index up to date
Posted by al...@aim.com.
Hi,
I took a look at the recrawl script and noticed that all the steps except URL injection are repeated at each subsequent indexing, and wondered why we would generate new segments.
Is it possible to do fetch and update for all previous $s1..$sn, then the invertlinks and index steps?
Thanks.
Alex.
Re: keeping index up to date
Posted by Julien Nioche <li...@gmail.com>.
You should use the adaptive fetch schedule. See
http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ for
details.
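For reference, enabling it in Nutch 1.x is a matter of overriding the scheduler class in conf/nutch-site.xml. The property names below come from the stock nutch-default.xml; the rate and interval values are only illustrative and should be tuned to how often the content actually changes:

```xml
<!-- conf/nutch-site.xml: switch to the adaptive fetch scheduler. -->
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<!-- Grow the interval when a page is found unmodified... -->
<property>
  <name>db.fetch.schedule.adaptive.inc_rate</name>
  <value>0.4</value>
</property>
<!-- ...and shrink it when a page has changed. -->
<property>
  <name>db.fetch.schedule.adaptive.dec_rate</name>
  <value>0.2</value>
</property>
<!-- Unchanged documents (such as static PDFs) drift toward this cap. -->
<property>
  <name>db.fetch.interval.max</name>
  <value>7776000</value> <!-- 90 days, in seconds -->
</property>
```

With this in place, documents that never change are refetched less and less often, which addresses the original question about the static PDF files.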
On 1 June 2011 07:18, <al...@aim.com> wrote:
> Hello,
>
> I use Nutch 1.2 to index about 3000 sites. One of them has about 1500 PDF
> files, which do not change over time.
> I wondered if there is a way of configuring Nutch not to fetch unchanged
> documents again and again, but to keep the old index entries for them.
>
>
> Thanks.
> Alex.
>
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com