Posted to dev@nutch.apache.org by "cwinay@yahoo.com (JIRA)" <ji...@apache.org> on 2009/10/12 13:15:31 UTC

[jira] Commented: (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

    [ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12764651#action_12764651 ] 

cwinay@yahoo.com commented on NUTCH-585:
----------------------------------------

Hi,
Is it possible for you to share the code with me?
I seem to have found a use for the facility you wish to add to Nutch.
I'm using a content management system called Infoglue to create my website.
The pages I create for my site have a fixed template containing header, footer and a menu system.
I would like Nutch to index the template content only on the home page, and to index just the relevant (non-template) content on the inner pages.

So please share your idea and/or code.
Details of the implementation would be appreciated.
So far I have been just a naive Nutch user.

Thanks a lot.
Winz

Quoted from: 
http://www.nabble.com/-jira--Created%3A-%28NUTCH-585%29--PARSE-HTML-plugin--Block-certain-parts-of-HTML-code-from-being-indexed-tp14023775p14023775.html



> [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-585
>                 URL: https://issues.apache.org/jira/browse/NUTCH-585
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>         Environment: All operating systems
>            Reporter: Andrea Spinelli
>            Priority: Minor
>
> We are using Nutch to index our own web sites; we would like to exclude certain parts of our pages from the index, because we know they are not relevant (for instance, there are several links to change the background color) and they generate spurious matches.
> We have modified the plugin so that it ignores HTML code between certain HTML comments, like
> <!-- START-IGNORE -->
> ... ignored part ...
> <!-- STOP-IGNORE -->
> We feel this might be useful to someone else, perhaps with the comment strings factored out as properties in the configuration files (say parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml).
> We are almost ready to contribute our code snippet. We look forward to any expression of interest, or to an explanation of why what we are doing is plain wrong!
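
For anyone following this thread, the comment-delimited approach described in the issue could be sketched roughly as below. This is not the reporter's code (none is attached to this message) and it is not part of the parse-html plugin; the class name IgnoreBlockFilter and its strip method are hypothetical, and the two property names are only the ones suggested in the issue. The sketch also assumes the filtering is done on the raw HTML string before it reaches the parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not an existing Nutch class.
class IgnoreBlockFilter {
    // Property names as suggested in the issue; they are an assumption,
    // not existing Nutch configuration keys.
    static final String START_KEY = "parser.html.ignore.start";
    static final String STOP_KEY = "parser.html.ignore.stop";

    private final Pattern ignoredBlock;

    IgnoreBlockFilter(String startMarker, String stopMarker) {
        // Quote both markers so they are matched literally, and match
        // lazily so each start marker pairs with the nearest stop marker.
        // DOTALL lets an ignored block span several lines.
        this.ignoredBlock = Pattern.compile(
                Pattern.quote(startMarker) + ".*?" + Pattern.quote(stopMarker),
                Pattern.DOTALL);
    }

    // Returns the HTML with every start..stop block (markers included) removed.
    // A start marker without a matching stop marker is left untouched.
    String strip(String html) {
        return ignoredBlock.matcher(html).replaceAll("");
    }
}
```

With the markers from the issue, new IgnoreBlockFilter("<!-- START-IGNORE -->", "<!-- STOP-IGNORE -->").strip(page) would drop the template parts of an inner page, while the home page could be left unmarked so that its template content is still indexed.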

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.