Posted to user@nutch.apache.org by Bai Shen <ba...@gmail.com> on 2012/08/03 16:17:50 UTC

Nutch 2 fetched content cleanup

In Nutch 1.4, after I indexed a segment, I could delete it to save space.
Is something like this possible with Nutch 2?

Thanks.

Re: Nutch 2 fetched content cleanup

Posted by Ferdy Galema <fe...@kalooga.com>.
Hi,

I don't think this is directly supported in Nutch 2. One way would be to write
a tool that deletes all fields not needed for general crawling. (You would
still want to keep the fields that indicate a URL has already been fetched,
for example.) The big fields that can be deleted after indexing include
'content' and 'text'.
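
As a rough illustration of such a tool, the sketch below models rows as plain
Python dicts rather than using the real Gora or Nutch API. The field names
'content' and 'text' come from the suggestion above; the 'indexed' flag, the
'status' field, and the function name strip_bulky_fields are assumptions made
up for the example.

```python
# Hypothetical model: each row is a dict keyed by field name.
# This is not the Gora/Nutch API, only the selection logic.
BULKY_FIELDS = ("content", "text")


def strip_bulky_fields(rows, bulky_fields=BULKY_FIELDS):
    """Remove bulky fields from rows that are already indexed.

    Rows are modified in place. Returns the number of fields removed.
    """
    removed = 0
    for row in rows.values():
        # Assumed flag: only touch rows whose content is already in the index.
        if not row.get("indexed", False):
            continue
        for field in bulky_fields:
            if field in row:
                del row[field]
                removed += 1
    return removed


rows = {
    "http://example.com/a": {"status": "fetched", "indexed": True,
                             "content": b"<html>...</html>", "text": "hello"},
    "http://example.com/b": {"status": "fetched", "indexed": False,
                             "content": b"<html>...</html>", "text": "world"},
}
removed_count = strip_bulky_fields(rows)
```

Only the indexed row loses its bulky fields; bookkeeping fields such as the
fetch status stay in place, so the crawler still knows the URL was fetched.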

Delete support is currently not optimal in Gora, so you might want to
implement a workaround by using your store-specific API directly. (Of
course, this would not be of any benefit to the other datastores.)

If you do not need inlinks (anchor texts), you could strip out the part of
the DbUpdateReducer that writes the inlinks for every row. (Just skip the
actual writing of the inlinks to each row, but keep the scoring
functionality that depends on the inlinks.) This requires some coding too.
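
The idea can be sketched with a simplified stand-in for the reduce step. This
is not the actual DbUpdateReducer code; the function update_row, the
(source_url, anchor_text, score) tuple layout, and the plain sum used as a
scoring rule are all assumptions chosen to keep the example short.

```python
def update_row(row, inlinks, store_inlinks=False):
    """Fold inlink score contributions into a row.

    inlinks: iterable of (source_url, anchor_text, score) tuples
    (an assumed layout, not the real Nutch data model).
    """
    inlinks = list(inlinks)
    # Scoring still sees every inlink, whether or not inlinks are stored.
    row["score"] = row.get("score", 0.0) + sum(score for _, _, score in inlinks)
    if store_inlinks:
        # This is the write that would be skipped to save space.
        row["inlinks"] = {source: anchor for source, anchor, _ in inlinks}
    return row
```

With store_inlinks left at False, the row gets the same score as before but
never carries the per-row inlink map.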

Feel free to share other suggestions.

Ferdy.

On Fri, Aug 3, 2012 at 4:17 PM, Bai Shen <ba...@gmail.com> wrote:

> In Nutch 1.4, after I indexed a segment, I could delete it to save space.
> Is something like this possible with Nutch 2?
>
> Thanks.
>