Posted to user@nutch.apache.org by Abhay Ratnaparkhi <ab...@gmail.com> on 2021/04/06 19:53:44 UTC

Nutch getting rid of older segments

Hello,

I have a large number of segments occupying disk space. Is it a good
strategy to delete old segments, or is it better to merge them?

Thank you
Abhay

Re: Nutch getting rid of older segments

Posted by Abhay Ratnaparkhi <ab...@gmail.com>.
We frequently recrawl URLs (adaptive fetch interval from 3 to 30 days), so
there seems to be no harm in deleting segments older than a month.

Thank you.

On Wed, Apr 7, 2021 at 5:24 AM Markus Jelsma <ma...@openindex.io>
wrote:

> Hello Abhay,
>
> You only need to keep or merge old segments if you need to reindex the
> data 'quickly' and are unable to start with a fresh crawl. If you
> frequently recrawl all URLs, e.g. within a month, then segments older than
> a month can safely be removed.
>
> You can also do daily and monthly merges, like we do. This makes it
> possible to revisit old data for research, in case websites change their
> layout, or are no longer customers and are not being crawled anymore.
>
> Regards,
> Markus
>

Re: Nutch getting rid of older segments

Posted by Markus Jelsma <ma...@openindex.io>.
Hello Abhay,

You only need to keep or merge old segments if you need to reindex the
data 'quickly' and are unable to start with a fresh crawl. If you
frequently recrawl all URLs, e.g. within a month, then segments older than
a month can safely be removed.
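
As a rough illustration of the "older than a month" rule, here is a minimal
Python sketch that only selects candidate segments. It assumes segment
directories carry the usual timestamp names (yyyyMMddHHmmss); the function
name segments_older_than and the 30-day threshold are hypothetical choices
for this example, not part of Nutch. The actual deletion (local filesystem
or HDFS) is deliberately left out.

```python
from datetime import datetime, timedelta

# Assumption: segment directories are named by creation time, yyyyMMddHHmmss.
SEGMENT_NAME_FORMAT = "%Y%m%d%H%M%S"


def segments_older_than(segment_names, max_age_days, now):
    """Return the timestamp-named segments older than max_age_days, sorted."""
    cutoff = now - timedelta(days=max_age_days)
    old = []
    for name in segment_names:
        try:
            created = datetime.strptime(name, SEGMENT_NAME_FORMAT)
        except ValueError:
            # Not a timestamp-named segment (e.g. a temp directory): skip it.
            continue
        if created < cutoff:
            old.append(name)
    return sorted(old)


if __name__ == "__main__":
    names = ["20210301120000", "20210405080000", "_temporary"]
    # With "now" fixed at 2021-04-07, only the March segment is past 30 days.
    print(segments_older_than(names, 30, datetime(2021, 4, 7)))
```

Passing "now" explicitly keeps the selection reproducible, so the list can
be reviewed before anything is removed.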

You can also do daily and monthly merges, like we do. This makes it
possible to revisit old data for research, in case websites change their
layout, or are no longer customers and are not being crawled anymore.
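
A monthly merge could be prepared along these lines. This is only a sketch
under assumptions: it groups timestamp-named segments by month and builds
argument lists of the form "bin/nutch mergesegs <output_dir> <seg1> <seg2>
...", which is the mergesegs usage I am assuming here (please check it
against your Nutch version). The function name monthly_merge_commands and
the directory layout are hypothetical, and nothing is executed.

```python
from collections import defaultdict


def monthly_merge_commands(segment_names, segments_dir, merged_dir):
    """Build one mergesegs argument list per calendar month of segments."""
    by_month = defaultdict(list)
    for name in sorted(segment_names):
        # Assumption: segment names are 14-digit timestamps (yyyyMMddHHmmss).
        if len(name) == 14 and name.isdigit():
            by_month[name[:6]].append(name)

    commands = []
    for month in sorted(by_month):
        inputs = [f"{segments_dir}/{name}" for name in by_month[month]]
        # Hypothetical layout: merged output goes to <merged_dir>/<yyyyMM>.
        commands.append(
            ["bin/nutch", "mergesegs", f"{merged_dir}/{month}", *inputs]
        )
    return commands


if __name__ == "__main__":
    names = ["20210301120000", "20210315120000", "20210405080000"]
    for cmd in monthly_merge_commands(names, "crawl/segments", "crawl/merged"):
        print(" ".join(cmd))
```

Each returned list could be handed to subprocess.run once it has been
reviewed; the original segments would only be removed after the merged
output has been verified.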

Regards,
Markus
