Posted to user@nutch.apache.org by Eelco Lempsink <le...@paragin.nl> on 2006/10/25 16:02:46 UTC
Preventing pages to be indexed based on content
Hello,
I'm looking for a solution to a problem typical of domain-specific
search engines: a way to prevent certain pages from being indexed,
based on their content, while keeping the outlinks of those pages.
When searching this mailing list I noticed this question (or
something similar) being asked and answered before:
Howie Wang wrote @ Tue, 24 Jan 2006 16:27:11 -0800:
> You can do this within a custom HtmlParser and IndexFilter. In
> the HtmlParser, look at the page and decide whether you want it
> or not, then insert a metadata property called "index" and
> set it to "true" or "false".
>
> In the filter method of the index filter, look up the "index"
> metadata, and if it's false, just return without indexing anything.
This seems like a good idea. What I don't understand is the last
part: "just return without indexing anything". Looking at the code
to see how the various filters work, I would say they're not really
filters, because they don't have the ability to 'skip' a document.
The only way to make a filter 'skip' is by throwing an exception, but
this seems crude to me, since the behaviour is intended, not
exceptional.
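To make the first half of Howie's suggestion concrete, here is a minimal sketch of the tagging step in plain Java. It does not use Nutch's classes: ParsedPage, IndexFlagTagger and the keyword-based relevance rule are all hypothetical stand-ins, and only the "index" metadata key comes from the quoted message. The point is that the decision is recorded as metadata while the outlinks are left alone.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a parsed page; this is NOT a Nutch class.
class ParsedPage {
    final String text;
    final List<String> outlinks;
    final Map<String, String> metadata = new HashMap<>();

    ParsedPage(String text, List<String> outlinks) {
        this.text = text;
        this.outlinks = outlinks;
    }
}

class IndexFlagTagger {
    // Assumed relevance rule, for illustration only: the page text must
    // mention the given keyword (case-insensitive).
    static ParsedPage tag(ParsedPage page, String keyword) {
        boolean relevant = page.text.toLowerCase().contains(keyword.toLowerCase());
        // Record the decision under the "index" key from the quoted message.
        page.metadata.put("index", Boolean.toString(relevant));
        // Outlinks are not touched, so they stay available to the crawl.
        return page;
    }

    // What an indexing step could check later; a missing flag counts as false.
    static boolean shouldIndex(ParsedPage page) {
        return Boolean.parseBoolean(page.metadata.get("index"));
    }
}
```

A page that fails the rule ends up with "index" set to "false" but still carries its full outlink list. How the indexing side can act on that flag is the open question in this message.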
Andrzej Bialecki wrote @ Wed, 04 Jan 2006 04:09:44 -0800:
> Aled Jones wrote:
>> Hi
>>
>> Is there a way to remove certain urls from a crawled set of data?
>
> Please see the PruneIndexTool. This removes just the index entries,
> without actually removing the content from segments. This means
> that you will no longer see the hits from these urls, but it
> doesn't prevent you from collecting the same urls in the next round
> of fetching. To prevent that, you need to modify your URLFilters.
Of course, for high volumes of data first indexing, and afterwards
removing it, doesn't sound like a good option in my case where only a
small part of the fetched data needs to be indexed.
Has anyone solved this problem (elegantly)? I mainly wonder if it's
feasible to do it only using plugins, since I suspect I must
implement my own Indexer.
--
Regards,
Eelco Lempsink
Re: Preventing pages to be indexed based on content
Posted by Eelco Lempsink <le...@paragin.nl>.
On 25-okt-2006, at 18:26, Andrzej Bialecki wrote:
> Eelco Lempsink wrote:
>> Of course, for high volumes of data first indexing, and afterwards
>> removing it, doesn't sound like a good option in my case where
>> only a small part of the fetched data needs to be indexed.
>>
>> Has anyone solved this problem (elegantly)? I mainly wonder if
>> it's feasible to do it only using plugins, since I suspect I must
>> implement my own Indexer.
>
> Plugins may also return null doc. Standard Indexer would have to be
> modified to handle this gracefully, but it's trivial:
Thank you, that's indeed a good solution. The only thing that
bothers me is that plugins _may_ return null docs, but the Indexer
doesn't handle that well. (In other words, from reading the code I
didn't get the idea that returning a null doc would be okay.) I
submitted a bug report for this
(https://issues.apache.org/jira/browse/NUTCH-393).
--
Regards,
Eelco Lempsink
Re: Preventing pages to be indexed based on content
Posted by Andrzej Bialecki <ab...@getopt.org>.
Eelco Lempsink wrote:
> Of course, for high volumes of data first indexing, and afterwards
> removing it, doesn't sound like a good option in my case where only a
> small part of the fetched data needs to be indexed.
>
> Has anyone solved this problem (elegantly)? I mainly wonder if it's
> feasible to do it only using plugins, since I suspect I must implement
> my own Indexer.
Plugins may also return null doc. Standard Indexer would have to be
modified to handle this gracefully, but it's trivial:
Indexer.java:239
    try {
      // run indexing filters
      doc = this.filters.filter(doc, parse, (Text)key, fetchDatum, inlinks);
    } catch (IndexingException e) {
      if (LOG.isWarnEnabled()) { LOG.warn("Error indexing "+key+": "+e); }
      return;
    }
+   if (doc == null) return;
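For readers who want to try the idea outside Nutch, the following is a minimal, self-contained sketch of the same pattern: a filter signals "skip" by returning null, and the indexing loop treats null as a normal outcome rather than an error. DocFilter and NullSkippingIndexer are hypothetical stand-ins, not Nutch's IndexingFilter or Indexer, and documents are modelled as plain string maps purely for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an indexing filter; NOT the Nutch interface.
interface DocFilter {
    // Returns the (possibly modified) document, or null to skip it.
    Map<String, String> filter(Map<String, String> doc);
}

class NullSkippingIndexer {
    private final List<DocFilter> filters;
    // Documents that passed every filter, standing in for the real index.
    final List<Map<String, String>> indexed = new ArrayList<>();

    NullSkippingIndexer(List<DocFilter> filters) {
        this.filters = filters;
    }

    // Runs the filters in order and stops at the first one that returns null,
    // so later filters never see a null document.
    Map<String, String> runFilters(Map<String, String> doc) {
        for (DocFilter f : filters) {
            doc = f.filter(doc);
            if (doc == null) {
                return null;
            }
        }
        return doc;
    }

    void index(Map<String, String> doc) {
        Map<String, String> result = runFilters(doc);
        if (result == null) {
            return; // same guard as the one-line patch: skip quietly, no exception
        }
        indexed.add(result);
    }
}
```

A filter that honours an "index" metadata flag could then be as small as `doc -> "false".equals(doc.get("index")) ? null : doc`. Stopping the chain at the first null is a design choice of this sketch; whether the real filter chain does so is not stated in this thread.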
--
Best regards,
Andrzej Bialecki <><