Posted to user@nutch.apache.org by al...@aim.com on 2011/06/01 08:18:54 UTC

keeping index up to date

Hello,

I use nutch-1.2 to index about 3000 sites. One of them has about 1500 pdf files which do not change over time. 
I wondered if there is a way of configuring nutch not to fetch unchanged documents again and again, but keep the old index for them.


Thanks.
Alex.

Re: keeping index up to date

Posted by Radim Kolar <hs...@sendmail.cz>.
On 26.7.2011 21:55, Markus Jelsma wrote:
> We have the injector for that ;)
>
What will the injector do if an injected URL is already in the database? Will 
it be injected with priority 1.0 and rescheduled for an immediate fetch?


Re: keeping index up to date

Posted by Markus Jelsma <ma...@openindex.io>.
We have the injector for that ;)
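
As a sketch (the directory and file names here are assumptions, not taken from the thread), new seeds can be folded into an existing crawldb like this:

```shell
# Append the new URL to a seed file and inject it into the existing crawldb.
# URLs already present in the crawldb generally keep their existing crawl
# state; only genuinely new URLs get fresh entries.
echo "http://example.com/new-page.html" >> urls/seeds.txt
bin/nutch inject crawl/crawldb urls
```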

>  Hello,
> 
> One more question. Is there a way of adding new URLs to a crawldb created in
> previous crawls so that they are included in subsequent recrawls?
> 
> Thanks.
> Alex.
> 
> 
> 
> -----Original Message-----
> From: lewis john mcgibbney <le...@gmail.com>
> To: user <us...@nutch.apache.org>; markus.jelsma
> <ma...@openindex.io> Sent: Tue, Jun 7, 2011 1:16 pm
> Subject: Re: keeping index up to date
> 
> 
> Hi,
> 
> To add to Markus' comments, if you take a look at the script, it is written
> in such a way that, if run in safe mode, it protects us against errors which
> may occur. If that happens, we can recover segments etc. and take
> appropriate action to resolve the problem.
> 
> On Tue, Jun 7, 2011 at 9:01 PM, Markus Jelsma <ma...@openindex.io> wrote:
> > >  Hi,
> > > 
> > > I took a look at the recrawl script and noticed that all the steps
> > > except URL injection are repeated on each subsequent crawl, and wondered
> > > why we would generate new segments. Is it possible to do the fetch and
> > > update steps for all previous segments $s1..$sn, then invertlinks and
> > > index?
> > 
> > No, the generator generates a segment with a list of URLs for the fetcher
> > to fetch. You can, if you like, then merge segments.
> > 
> > > Thanks.
> > > Alex.
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > -----Original Message-----
> > > From: Julien Nioche <li...@gmail.com>
> > > To: user <us...@nutch.apache.org>
> > > Sent: Wed, Jun 1, 2011 12:59 am
> > > Subject: Re: keeping index up to date
> > > 
> > > 
> > > You should use the adaptive fetch schedule. See
> > > http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ for
> > > details
> > > 
> > > On 1 June 2011 07:18, <al...@aim.com> wrote:
> > > > Hello,
> > > > 
> > > > I use nutch-1.2 to index about 3000 sites. One of them has about 1500
> > > > pdf files which do not change over time.
> > > > I wondered if there is a way of configuring nutch not to fetch
> > > > unchanged documents again and again, but keep the old index for them.
> > > > 
> > > > 
> > > > Thanks.
> > > > Alex.

Re: keeping index up to date

Posted by al...@aim.com.
 

 Hello,

One more question. Is there a way of adding new URLs to a crawldb created in previous crawls so that they are included in subsequent recrawls? 

Thanks.
Alex. 

 

-----Original Message-----
From: lewis john mcgibbney <le...@gmail.com>
To: user <us...@nutch.apache.org>; markus.jelsma <ma...@openindex.io>
Sent: Tue, Jun 7, 2011 1:16 pm
Subject: Re: keeping index up to date


Hi,

To add to Markus' comments, if you take a look at the script, it is written
in such a way that, if run in safe mode, it protects us against errors which
may occur. If that happens, we can recover segments etc. and take
appropriate action to resolve the problem.

On Tue, Jun 7, 2011 at 9:01 PM, Markus Jelsma <ma...@openindex.io> wrote:

>
> >  Hi,
> >
> > I took a look at the recrawl script and noticed that all the steps except
> > URL injection are repeated on each subsequent crawl, and wondered why we
> > would generate new segments. Is it possible to do the fetch and update
> > steps for all previous segments $s1..$sn, then invertlinks and index?
>
> No, the generator generates a segment with a list of URLs for the fetcher to
> fetch. You can, if you like, then merge segments.
>
> >
> > Thanks.
> > Alex.
> >
> >
> >
> >
> >
> >
> > -----Original Message-----
> > From: Julien Nioche <li...@gmail.com>
> > To: user <us...@nutch.apache.org>
> > Sent: Wed, Jun 1, 2011 12:59 am
> > Subject: Re: keeping index up to date
> >
> >
> > You should use the adaptive fetch schedule. See
> > http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ for
> > details
> >
> > On 1 June 2011 07:18, <al...@aim.com> wrote:
> > > Hello,
> > >
> > > I use nutch-1.2 to index about 3000 sites. One of them has about 1500
> > > pdf files which do not change over time.
> > > I wondered if there is a way of configuring nutch not to fetch
> > > unchanged documents again and again, but keep the old index for them.
> > >
> > >
> > > Thanks.
> > > Alex.
>



-- 
*Lewis*

 

Re: keeping index up to date

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi,

To add to Markus' comments, if you take a look at the script, it is written
in such a way that, if run in safe mode, it protects us against errors which
may occur. If that happens, we can recover segments etc. and take
appropriate action to resolve the problem.

On Tue, Jun 7, 2011 at 9:01 PM, Markus Jelsma <ma...@openindex.io> wrote:

>
> >  Hi,
> >
> > I took a look at the recrawl script and noticed that all the steps except
> > URL injection are repeated on each subsequent crawl, and wondered why we
> > would generate new segments. Is it possible to do the fetch and update
> > steps for all previous segments $s1..$sn, then invertlinks and index?
>
> No, the generator generates a segment with a list of URLs for the fetcher to
> fetch. You can, if you like, then merge segments.
>
> >
> > Thanks.
> > Alex.
> >
> >
> >
> >
> >
> >
> > -----Original Message-----
> > From: Julien Nioche <li...@gmail.com>
> > To: user <us...@nutch.apache.org>
> > Sent: Wed, Jun 1, 2011 12:59 am
> > Subject: Re: keeping index up to date
> >
> >
> > You should use the adaptive fetch schedule. See
> > http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ for
> > details
> >
> > On 1 June 2011 07:18, <al...@aim.com> wrote:
> > > Hello,
> > >
> > > I use nutch-1.2 to index about 3000 sites. One of them has about 1500
> > > pdf files which do not change over time.
> > > I wondered if there is a way of configuring nutch not to fetch
> > > unchanged documents again and again, but keep the old index for them.
> > >
> > >
> > > Thanks.
> > > Alex.
>



-- 
*Lewis*

Re: keeping index up to date

Posted by Markus Jelsma <ma...@openindex.io>.
>  Hi,
> 
> I took a look at the recrawl script and noticed that all the steps except
> URL injection are repeated on each subsequent crawl, and wondered why we
> would generate new segments. Is it possible to do the fetch and update steps
> for all previous segments $s1..$sn, then invertlinks and index?

No, the generator generates a segment with a list of URLs for the fetcher to 
fetch. You can, if you like, then merge segments.
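
The generate/fetch/update cycle described here can be sketched as a shell loop (paths, the number of rounds, and -topN are illustrative assumptions; this is not the exact recrawl script discussed in the thread):

```shell
CRAWL=crawl                     # assumed crawl directory from an earlier crawl
for i in 1 2 3; do              # each round creates a NEW segment
  bin/nutch generate $CRAWL/crawldb $CRAWL/segments -topN 1000
  SEGMENT=$CRAWL/segments/$(ls "$CRAWL/segments" | sort | tail -1)
  bin/nutch fetch $SEGMENT                      # fetch the generated segment
  bin/nutch updatedb $CRAWL/crawldb $SEGMENT    # fold results into the crawldb
done
# invert links and index across all segments, old and new
bin/nutch invertlinks $CRAWL/linkdb -dir $CRAWL/segments
bin/nutch index $CRAWL/indexes $CRAWL/crawldb $CRAWL/linkdb $CRAWL/segments/*
```

Segments from earlier rounds can afterwards be combined with `bin/nutch mergesegs` if keeping many of them around becomes unwieldy.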

> 
> Thanks.
> Alex.
> 
> 
> 
> 
> 
> 
> -----Original Message-----
> From: Julien Nioche <li...@gmail.com>
> To: user <us...@nutch.apache.org>
> Sent: Wed, Jun 1, 2011 12:59 am
> Subject: Re: keeping index up to date
> 
> 
> You should use the adaptive fetch schedule. See
> http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ for
> details
> 
> On 1 June 2011 07:18, <al...@aim.com> wrote:
> > Hello,
> > 
> > I use nutch-1.2 to index about 3000 sites. One of them has about 1500 pdf
> > files which do not change over time.
> > I wondered if there is a way of configuring nutch not to fetch unchanged
> > documents again and again, but keep the old index for them.
> > 
> > 
> > Thanks.
> > Alex.

Re: keeping index up to date

Posted by al...@aim.com.
 

 Hi,

I took a look at the recrawl script and noticed that all the steps except URL injection are repeated on each subsequent crawl, and wondered why we would generate new segments.
Is it possible to do the fetch and update steps for all previous segments $s1..$sn, then invertlinks and index?

Thanks.
Alex.


 

 

-----Original Message-----
From: Julien Nioche <li...@gmail.com>
To: user <us...@nutch.apache.org>
Sent: Wed, Jun 1, 2011 12:59 am
Subject: Re: keeping index up to date


You should use the adaptive fetch schedule. See
http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ for
details

On 1 June 2011 07:18, <al...@aim.com> wrote:

> Hello,
>
> I use nutch-1.2 to index about 3000 sites. One of them has about 1500 pdf
> files which do not change over time.
> I wondered if there is a way of configuring nutch not to fetch unchanged
> documents again and again, but keep the old index for them.
>
>
> Thanks.
> Alex.
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

 

Re: keeping index up to date

Posted by Julien Nioche <li...@gmail.com>.
You should use the adaptive fetch schedule. See
http://pascaldimassimo.com/2010/06/11/how-to-re-crawl-with-nutch/ for
details
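
A sketch of what that looks like in conf/nutch-site.xml (the property names mirror nutch-default.xml; the interval values are illustrative assumptions, not recommendations):

```xml
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value> <!-- start at 30 days -->
</property>
<property>
  <name>db.fetch.schedule.adaptive.min_interval</name>
  <value>86400</value> <!-- 1 day for pages that keep changing -->
</property>
<property>
  <name>db.fetch.schedule.adaptive.max_interval</name>
  <value>31536000</value> <!-- unchanged pages drift toward 1 year -->
</property>
```

With this schedule, documents whose content signature does not change are refetched less and less often, which addresses the static-PDF case directly.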

On 1 June 2011 07:18, <al...@aim.com> wrote:

> Hello,
>
> I use nutch-1.2 to index about 3000 sites. One of them has about 1500 pdf
> files which do not change over time.
> I wondered if there is a way of configuring nutch not to fetch unchanged
> documents again and again, but keep the old index for them.
>
>
> Thanks.
> Alex.
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com