Posted to dev@nutch.apache.org by "Bojan Tomic (JIRA)" <ji...@apache.org> on 2013/05/11 15:27:17 UTC

[jira] [Comment Edited] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

    [ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13655269#comment-13655269 ] 

Bojan Tomic edited comment on NUTCH-585 at 5/11/13 1:26 PM:
------------------------------------------------------------

I adapted Elisabeth Adler's plugin for use with Nutch 2.1 and added two small features:

* the ability to protect certain URLs from filtering
* the ability to configure the field where the filtered content is stored (overwriting the "text" field by default)

I didn't realize at first that the common practice is to submit a patch, so I put my code on GitHub: https://github.com/veggen/nutch-element-selector
If anyone is interested in including it, I will gladly create a patch as well (and change the package names, etc.).
                
> [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-585
>                 URL: https://issues.apache.org/jira/browse/NUTCH-585
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>         Environment: All operating systems
>            Reporter: Andrea Spinelli
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.7
>
>         Attachments: blacklist_whitelist_plugin.patch, nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch
>
>
> We are using Nutch to index our own web sites; we would like to exclude certain parts of our pages from the index, because we know they are not relevant (for instance, there are several links to change the background color) and they generate spurious matches.
> We have modified the plugin so that it ignores HTML code between certain HTML comments, such as:
> <!-- START-IGNORE -->
> ... ignored part ...
> <!-- STOP-IGNORE -->
> We feel this might be useful to someone else, perhaps with the comment strings factored out as settings in the configuration files (say parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml).
> We are almost ready to contribute our code snippet. We look forward to any expression of interest, or to an explanation of why what we are doing is plain wrong!
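
For readers who want to see the idea described in the issue, here is a minimal sketch in Java. It is not the reporter's code and not any of the attached patches. The class name IgnoreMarkers and the method strip are hypothetical, and the sketch works on the raw HTML string, whereas the parse-html plugin works on a parsed document. The two marker strings are passed in as plain arguments; in a real plugin they would presumably come from the proposed parser.html.ignore.start and parser.html.ignore.stop settings, which are only a suggestion in the issue. Dropping everything after an unterminated start marker is also an assumption.

```java
/**
 * Hypothetical helper: removes every region between a start marker and the
 * next stop marker (markers included) from a raw HTML string.
 */
final class IgnoreMarkers {
    private IgnoreMarkers() {}

    static String strip(String html, String start, String stop) {
        StringBuilder out = new StringBuilder(html.length());
        int pos = 0;
        while (true) {
            int s = html.indexOf(start, pos);
            if (s < 0) {
                // No further start marker: keep the remainder unchanged.
                out.append(html, pos, html.length());
                break;
            }
            // Keep the text that precedes the start marker.
            out.append(html, pos, s);
            int e = html.indexOf(stop, s + start.length());
            if (e < 0) {
                // Assumption: an unterminated region is ignored up to the end.
                break;
            }
            // Resume scanning right after the stop marker.
            pos = e + stop.length();
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String page = "<p>Content</p><!-- START-IGNORE --><a href=\"#\">red</a>"
                + "<!-- STOP-IGNORE --><p>More</p>";
        System.out.println(strip(page, "<!-- START-IGNORE -->", "<!-- STOP-IGNORE -->"));
    }
}
```

A plain indexOf scan is used instead of a regular expression so that the marker strings need no escaping and an unterminated region is handled explicitly.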
