Posted to user@nutch.apache.org by "S.L" <si...@gmail.com> on 2014/03/19 23:13:19 UTC

fetcher.store.content property

Hi All,

I am not using the Nutch indexer. Instead, I index each page with my own
utility method right after it is fetched, so I need to bypass any additional
steps that Nutch executes in a crawl. Along those lines, I have identified the
following steps to implement.


   1. Disable LinkDB creation by commenting out the LinkDB.invert() method call.
   2. Do not store the fetched content in the segment, which is used to create
   an index, by setting the property fetcher.store.content to false.


I am clear about #1 from the discussion I had with Sebastian earlier.

About #2: would setting fetcher.store.content to false be a good idea?


Thanks.

Re: fetcher.store.content property

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

It's possible to set
 fetcher.store.content = false
in combination with
 fetcher.parse = true

If disk space is scarce or disks are slow, this combination may make sense.
But there are serious reasons why the parser is run as a separate job
by default and why, as a precondition, raw content is kept: see NUTCH-872
and http://wiki.apache.org/nutch/FAQ#Can_I_parse_during_the_fetching_process.3F
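
For reference, a minimal sketch of how this combination could look as
overrides in conf/nutch-site.xml. The two property names are the ones given
above; the file location and the surrounding <configuration> element are
assumptions about a standard Nutch setup, so check them against the
nutch-default.xml shipped with your version.

```xml
<?xml version="1.0"?>
<!-- Sketch only: overrides assumed to live in conf/nutch-site.xml -->
<configuration>
  <!-- Do not keep raw fetched content in the segment -->
  <property>
    <name>fetcher.store.content</name>
    <value>false</value>
  </property>
  <!-- Parse during fetching, since no raw content is left for a separate parse job -->
  <property>
    <name>fetcher.parse</name>
    <value>true</value>
  </property>
</configuration>
```

With raw content discarded, a page that fails to parse would presumably have
to be fetched again rather than re-parsed from the segment.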

Sebastian
