Posted to user@nutch.apache.org by Fred Zimmerman <wf...@nimblebooks.com> on 2011/10/05 16:57:52 UTC

when and how to delete old crawls?

Hi,

I have a bunch of old test crawls sitting around. Most of them are indexed
in Solr and, per my Nutch configuration, are set to be recrawled after 30
days. These old crawls are a subset of (and redundant to) my current
"master" crawl. How should I get rid of them so that Nutch doesn't run them
again and they no longer clutter up my directories? Also, are they all
creating duplicate entries in the Solr index?

Fred

Re: when and how to delete old crawls?

Posted by Fred Zimmerman <wf...@nimblebooks.com>.
I mean directories like these:

crawl-20110920160208
crawl-20110920211805
etc ...




On Wed, Oct 5, 2011 at 11:08, Markus Jelsma <ma...@openindex.io> wrote:

> "crawls" or segment directories? You can delete old segment files if all
> files are fetched in newer segments, that is, older than 30 days if your
> crawl can keep up with the limit.

Re: when and how to delete old crawls?

Posted by Markus Jelsma <ma...@openindex.io>.
"Crawls" or segment directories? You can delete old segments once all of
their URLs have been fetched again in newer segments, that is, segments older
than 30 days, provided your crawl can keep up with that limit.
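
As a rough illustration of the age check described above, here is a minimal Python sketch that lists directories whose names end in a 14-digit timestamp (as in crawl-20110920160208) older than a cutoff. The helper names parse_stamp and stale_dirs, and the 30-day default, are assumptions made for this example; they are not part of Nutch. The script only prints candidates and deletes nothing, and age alone does not prove that every URL was fetched again in a newer segment, so check that before removing anything.

```python
from datetime import datetime, timedelta
from pathlib import Path

STAMP_FORMAT = "%Y%m%d%H%M%S"  # matches names such as 20110920160208


def parse_stamp(name):
    """Return the timestamp at the end of a directory name, or None."""
    stamp = name.rsplit("-", 1)[-1]  # "crawl-2011..." -> "2011..."
    if len(stamp) != 14 or not stamp.isdigit():
        return None
    try:
        return datetime.strptime(stamp, STAMP_FORMAT)
    except ValueError:  # 14 digits, but not a valid date
        return None


def stale_dirs(names, now, max_age_days=30):
    """Return the sorted names whose timestamp is older than the cutoff."""
    cutoff = now - timedelta(days=max_age_days)
    stale = []
    for name in names:
        stamp = parse_stamp(name)
        if stamp is not None and stamp < cutoff:
            stale.append(name)
    return sorted(stale)


def main():
    # Dry run: report candidates in the current directory, delete nothing.
    names = [p.name for p in Path(".").iterdir() if p.is_dir()]
    for name in stale_dirs(names, datetime.now()):
        print("candidate for deletion:", name)


if __name__ == "__main__":
    main()
```

Run it from the directory that holds the crawl or segment directories. Names without a trailing timestamp are ignored.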

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350