Posted to user@nutch.apache.org by "S.L" <si...@gmail.com> on 2014/03/19 23:13:19 UTC
fetcher.store.content property
Hi All,
I am not using the Nutch indexer; instead, I index with my own utility method
after every page is fetched, so I need to bypass any additional steps that
Nutch executes in a crawl. Along those lines, I have identified the following
steps to implement.
1. Disable LinkDB creation by commenting out the call to the LinkDB.invert() method.
2. Do not store the fetch_content in the segment, which is used to create an
index; to do this, set the property fetcher.store.content to false.
I am clear about #1 from the discussion I had with Sebastian earlier.
About #2, I need to know whether setting fetcher.store.content to false would
be a good idea.
Thanks.
Re: fetcher.store.content property
Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,
It's possible to set
fetcher.store.content = false
in combination with
fetcher.parse = true
If disk space is scarce or disks are slow, this combination may make sense.
But there are serious reasons why the parser is run as a separate job
by default and, as a precondition, raw content is kept: see NUTCH-872,
and http://wiki.apache.org/nutch/FAQ#Can_I_parse_during_the_fetching_process.3F
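For illustration, the combination above could be written as overrides in conf/nutch-site.xml. This is only a sketch: the two property names are the ones discussed in this thread, while the file location and the surrounding layout are assumed to follow the usual Nutch/Hadoop configuration format, so check them against the nutch-default.xml shipped with your version.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Do not write the raw fetched content into the segment. -->
  <property>
    <name>fetcher.store.content</name>
    <value>false</value>
  </property>
  <!-- Parse inside the fetcher, because no stored content is left for a separate parse job. -->
  <property>
    <name>fetcher.parse</name>
    <value>true</value>
  </property>
</configuration>
```

With these settings, pages that fail to parse cannot be re-parsed later without fetching them again, which is one of the trade-offs behind the default of keeping raw content.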
Sebastian