Posted to dev@nutch.apache.org by "Marcos Bori (JIRA)" <ji...@apache.org> on 2017/09/26 15:10:02 UTC
[jira] [Created] (NUTCH-2433) Html Parser: keep htmltag where the outlinks are found
Marcos Bori created NUTCH-2433:
----------------------------------
Summary: Html Parser: keep htmltag where the outlinks are found
Key: NUTCH-2433
URL: https://issues.apache.org/jira/browse/NUTCH-2433
Project: Nutch
Issue Type: New Feature
Components: parser
Affects Versions: 1.13
Environment: Apache Nutch release 1.13.
Reporter: Marcos Bori
When parsing HTML pages, I need to know in which HTML tag each outlink was found (for example 'a', 'script' or 'img').
I propose to add a new configuration value, "parser.html.outlinks.htmlnode_metadata_name".
If this property is not empty, every outlink found will carry a metadata entry whose key is the value of the property and whose value is the name of the HTML tag in which the outlink was found.
I will now send the pull request with my code implementation.
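For illustration only, here is a minimal Java sketch of the behaviour proposed above. It is independent of the pull request and does not use Nutch's own classes: OutlinkTagSketch, SketchOutlink and extract are hypothetical names, the input is assumed to be well-formed XHTML so that the JDK's XML parser can read it, and only the href and src attributes are treated as link sources. The metadataName argument stands in for the value of "parser.html.outlinks.htmlnode_metadata_name".

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch, not Nutch code: tags each extracted link with the
// name of the element it came from.
class OutlinkTagSketch {

  /** Hypothetical stand-in for an outlink: a target URL plus metadata. */
  static final class SketchOutlink {
    final String url;
    final Map<String, String> metadata = new LinkedHashMap<>();

    SketchOutlink(String url) {
      this.url = url;
    }
  }

  /**
   * Collects links from a well-formed XHTML string. metadataName plays the
   * role of the proposed configuration value; when it is null or empty,
   * no tag metadata is recorded.
   */
  static List<SketchOutlink> extract(String xhtml, String metadataName) {
    Document doc;
    try {
      DocumentBuilder builder =
          DocumentBuilderFactory.newInstance().newDocumentBuilder();
      doc = builder.parse(new InputSource(new StringReader(xhtml)));
    } catch (Exception e) {
      throw new IllegalArgumentException("input is not well-formed XHTML", e);
    }
    List<SketchOutlink> out = new ArrayList<>();
    walk(doc.getDocumentElement(), metadataName, out);
    return out;
  }

  // Pre-order walk of the DOM, so links come out in document order.
  private static void walk(Node node, String metadataName,
      List<SketchOutlink> out) {
    if (node.getNodeType() == Node.ELEMENT_NODE) {
      Element el = (Element) node;
      // Assumption: only these two attributes carry links in this sketch.
      for (String attr : new String[] {"href", "src"}) {
        if (el.hasAttribute(attr)) {
          SketchOutlink link = new SketchOutlink(el.getAttribute(attr));
          if (metadataName != null && !metadataName.isEmpty()) {
            link.metadata.put(metadataName,
                el.getTagName().toLowerCase(Locale.ROOT));
          }
          out.add(link);
        }
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      walk(children.item(i), metadataName, out);
    }
  }

  public static void main(String[] args) {
    String page = "<html><body>"
        + "<a href=\"http://example.org/a\">x</a>"
        + "<img src=\"http://example.org/i.png\"/>"
        + "</body></html>";
    // "htmltag" is an arbitrary example value for the property.
    for (SketchOutlink link : extract(page, "htmltag")) {
      System.out.println(link.url + " " + link.metadata);
    }
  }
}
```

Leaving the metadata name empty keeps the output identical to a parser without the feature, which matches the opt-in behaviour described in the proposal.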
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)