Posted to user@nutch.apache.org by Eelco Lempsink <le...@paragin.nl> on 2006/10/25 16:02:46 UTC

Preventing pages to be indexed based on content

Hello,

I'm looking for a solution to a problem typical of domain-specific
search engines: a way to prevent certain pages from being indexed,
based on their content, while keeping the outlinks of the page.  When
searching this mailing list I noticed that this question (or something
similar) has been asked and answered before:

Howie Wang wrote @ Tue, 24 Jan 2006 16:27:11 -0800:
> You can do this within a custom HtmlParser and IndexFilter. In
> the HtmlParser, look at the page and decide whether you want it
> or not, then insert a metadata property called "index" and
> set it to "true" or "false".
>
> In the filter method of the index filter, look up the "index"
> metadata, and if it's false, just return without indexing anything.

This seems like a good idea.  The thing I don't understand is the
last part: "just return without indexing anything".  Looking at how
the various filters work in the code, I would say they're not really
filters, because they don't have the ability to 'skip' a document.

The only way to make a filter 'skip' is by throwing an exception, but  
this seems crude to me, since the behaviour is intended, not  
exceptional.
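
To make the first half of Howie's suggestion concrete, here is a minimal sketch of marking a page at parse time while leaving its outlinks alone.  ParsedPage and IndexFlagSketch are hypothetical stand-ins for illustration, not the real Nutch parse API, and the keyword test is only a placeholder for whatever content check applies:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a parsed page; NOT the real Nutch Parse/ParseData types.
class ParsedPage {
    final String text;
    final List<String> outlinks;
    final Map<String, String> metadata = new HashMap<>();

    ParsedPage(String text, List<String> outlinks) {
        this.text = text;
        this.outlinks = new ArrayList<>(outlinks);
    }
}

class IndexFlagSketch {
    // Metadata key named in the quoted suggestion.
    static final String INDEX_KEY = "index";

    // Sets the "index" flag to "true" or "false" using a simple
    // case-insensitive keyword test (an assumption made for this sketch).
    // The outlinks are not touched, so the crawler can still follow them.
    static ParsedPage mark(ParsedPage page, String requiredKeyword) {
        String haystack = page.text.toLowerCase(Locale.ROOT);
        boolean wanted = haystack.contains(requiredKeyword.toLowerCase(Locale.ROOT));
        page.metadata.put(INDEX_KEY, Boolean.toString(wanted));
        return page;
    }
}
```

The flag only records the decision; something later in the indexing step still has to act on it, which is the part I'm unsure about.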

Andrzej Bialecki wrote @ Wed, 04 Jan 2006 04:09:44 -0800:
> Aled Jones wrote:
>> Hi
>>
>> Is there a way to remove certain urls from a crawled set of data?
>
> Please see the PruneIndexTool. This removes just the index entries,  
> without actually removing the content from segments. This means  
> that you will no longer see the hits from these urls, but it  
> doesn't prevent you from collecting the same urls in the next round  
> of fetching. To prevent that, you need to modify your URLFilters.

Of course, with high volumes of data, indexing everything first and
removing it afterwards doesn't sound like a good option in my case,
where only a small part of the fetched data needs to be indexed.

Has anyone solved this problem (elegantly)?  I mainly wonder if it's  
feasible to do it only using plugins, since I suspect I must  
implement my own Indexer.

-- 
Regards,

Eelco Lempsink

Re: Preventing pages to be indexed based on content

Posted by Eelco Lempsink <le...@paragin.nl>.

On 25 Oct 2006, at 18:26, Andrzej Bialecki wrote:
> Eelco Lempsink wrote:
>> Of course, for high volumes of data first indexing, and afterwards  
>> removing it, doesn't sound like a good option in my case where  
>> only a small part of the fetched data needs to be indexed.
>>
>> Has anyone solved this problem (elegantly)?  I mainly wonder if  
>> it's feasible to do it only using plugins, since I suspect I must  
>> implement my own Indexer.
>
> Plugins may also return null doc. Standard Indexer would have to be  
> modified to handle this gracefully, but it's trivial:

Thank you, that's indeed a good solution.  The only thing that
bothers me is that plugins _may_ return a null doc, but the Indexer
doesn't handle it well.  (In other words, reading the code didn't
give me the idea that returning a null doc would be okay.)  I
submitted a bug report for this
(https://issues.apache.org/jira/browse/NUTCH-393).

-- 
Regards,

Eelco Lempsink

Re: Preventing pages to be indexed based on content

Posted by Andrzej Bialecki <ab...@getopt.org>.

Eelco Lempsink wrote:
> Of course, for high volumes of data first indexing, and afterwards 
> removing it, doesn't sound like a good option in my case where only a 
> small part of the fetched data needs to be indexed.
>
> Has anyone solved this problem (elegantly)?  I mainly wonder if it's 
> feasible to do it only using plugins, since I suspect I must implement 
> my own Indexer.

Plugins may also return null doc. Standard Indexer would have to be 
modified to handle this gracefully, but it's trivial:

Indexer.java:239

    try {
      // run indexing filters
      doc = this.filters.filter(doc, parse, (Text)key, fetchDatum, inlinks);
    } catch (IndexingException e) {
      if (LOG.isWarnEnabled()) { LOG.warn("Error indexing "+key+": "+e); }
      return;
    }
+   if (doc == null) return;
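
For illustration, the same pattern can be sketched outside Nutch. DocFilter, NullDocSketch and the map-based documents below are hypothetical stand-ins (the real indexing filter takes more arguments, as in the call above); the sketch only shows a filter returning null for pages flagged index=false and the indexing step treating null as an intended skip:

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an indexing filter; NOT the real Nutch interface.
interface DocFilter {
    // Returns the (possibly modified) document, or null to skip it.
    Map<String, String> filter(Map<String, String> doc, Map<String, String> parseMeta);
}

class NullDocSketch {
    // Skips documents whose parse metadata carries index=false.
    static final DocFilter SKIP_UNWANTED =
        (doc, meta) -> "false".equals(meta.get("index")) ? null : doc;

    // Runs the filters in order and stops as soon as one returns null.
    static Map<String, String> runFilters(List<DocFilter> filters,
                                          Map<String, String> doc,
                                          Map<String, String> meta) {
        for (DocFilter f : filters) {
            doc = f.filter(doc, meta);
            if (doc == null) {
                return null;
            }
        }
        return doc;
    }

    // Mirrors the one-line patch: a null document is dropped without
    // raising an exception; anything else is added to the index.
    static void index(List<Map<String, String>> index,
                      List<DocFilter> filters,
                      Map<String, String> doc,
                      Map<String, String> meta) {
        Map<String, String> filtered = runFilters(filters, doc, meta);
        if (filtered == null) {
            return; // intended skip, not an error
        }
        index.add(filtered);
    }
}
```

Since the page is still parsed as usual, its outlinks are unaffected; only the index entry is left out.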

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com