Posted to dev@nutch.apache.org by "Marcos Bori (JIRA)" <ji...@apache.org> on 2017/09/26 15:10:02 UTC

[jira] [Created] (NUTCH-2433) Html Parser: keep htmltag where the outlinks are found

Marcos Bori created NUTCH-2433:
----------------------------------

             Summary: Html Parser: keep htmltag where the outlinks are found
                 Key: NUTCH-2433
                 URL: https://issues.apache.org/jira/browse/NUTCH-2433
             Project: Nutch
          Issue Type: New Feature
          Components: parser
    Affects Versions: 1.13
         Environment: Apache Nutch release 1.13.
            Reporter: Marcos Bori


When parsing HTML pages, I need to know which HTML tag each outlink was found in (for example, 'a', 'script' or 'img').

I propose adding a new configuration property, "parser.html.outlinks.htmlnode_metadata_name".
If this property is not empty, every outlink found is assigned a metadata entry whose name is the value of the property and whose value is the name of the HTML tag in which the outlink was found.
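
To illustrate the intended behaviour, here is a minimal, self-contained sketch in plain Java. It is not the code from the pull request and it does not use Nutch's classes: TaggedLink, extract and extractFromXhtml are hypothetical names, the key "htmltag" is only an example value for the proposed property, and the page is parsed with the JDK's XML parser, so it assumes well-formed XHTML input.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class OutlinkTagSketch {

    /** Hypothetical stand-in for an outlink and its metadata (not Nutch's Outlink class). */
    static final class TaggedLink {
        final String url;
        final Map<String, String> metadata = new LinkedHashMap<>();

        TaggedLink(String url) {
            this.url = url;
        }
    }

    /**
     * Walks the DOM depth-first and collects every href/src attribute value.
     * If metadataName is non-empty, the lower-cased tag name is stored under that key.
     */
    static List<TaggedLink> extract(Node node, String metadataName, List<TaggedLink> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            for (String attr : new String[] {"href", "src"}) {
                if (el.hasAttribute(attr)) {
                    TaggedLink link = new TaggedLink(el.getAttribute(attr));
                    // An empty or missing property value means: do not record the tag.
                    if (metadataName != null && !metadataName.isEmpty()) {
                        link.metadata.put(metadataName,
                                el.getTagName().toLowerCase(Locale.ROOT));
                    }
                    out.add(link);
                }
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            extract(children.item(i), metadataName, out);
        }
        return out;
    }

    /** Parses a well-formed XHTML string and extracts the tagged links. */
    static List<TaggedLink> extractFromXhtml(String xhtml, String metadataName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        return extract(doc.getDocumentElement(), metadataName, new ArrayList<>());
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<a href=\"http://example.org/a\">x</a>"
                + "<img src=\"http://example.org/i.png\"/>"
                + "<script src=\"http://example.org/s.js\"></script>"
                + "</body></html>";
        // "htmltag" plays the role of the value of the proposed property.
        for (TaggedLink link : extractFromXhtml(page, "htmltag")) {
            System.out.println(link.url + " " + link.metadata);
        }
    }
}
```

Passing an empty string as metadataName leaves the metadata map of every link empty, which corresponds to leaving the proposed property unset.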

I will now submit a pull request with my implementation.


