Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2017/09/29 11:50:02 UTC
[jira] [Resolved] (NUTCH-2433) Html Parser: keep htmltag where the outlinks are found
[ https://issues.apache.org/jira/browse/NUTCH-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel resolved NUTCH-2433.
------------------------------------
Resolution: Fixed
Fix Version/s: 1.14
Thanks, committed to 1.x, [777e759|https://github.com/apache/nutch/commit/777e759ada24eac84072a5f1722938442432eadc].
> Html Parser: keep htmltag where the outlinks are found
> ------------------------------------------------------
>
> Key: NUTCH-2433
> URL: https://issues.apache.org/jira/browse/NUTCH-2433
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Affects Versions: 1.13
> Environment: Apache Nutch release 1.13.
> Reporter: Marcos Bori
> Labels: html, outlink
> Fix For: 1.14
>
>
> When parsing HTML pages, I need to know in which HTML tag the outlinks were found (for example, 'a', 'script', 'img', etc).
> I propose to add a new configuration value, "parser.html.outlinks.htmlnode_metadata_name".
> If this property is not empty, each outlink found will carry a metadata entry whose name is the property's value and whose value is the name of the HTML tag in which the outlink was found.
> I will now send the pull request with my code implementation.
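
A minimal sketch of the behaviour proposed above, for illustration only. It is not the code committed for this issue and does not use Nutch classes: OutlinkTagSketch, TaggedLink, the tag-to-attribute table, and the metadata name "html_tag" are all hypothetical, and the input is assumed to be well-formed XHTML so that the JDK's XML parser can read it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration; not the Nutch implementation.
class OutlinkTagSketch {

  /** A link target plus optional metadata (stand-in for an outlink). */
  static final class TaggedLink {
    final String url;
    final Map<String, String> metadata;

    TaggedLink(String url, Map<String, String> metadata) {
      this.url = url;
      this.metadata = metadata;
    }
  }

  // Assumed mapping from tag name to the attribute that holds the link.
  private static final Map<String, String> LINK_ATTRS = new LinkedHashMap<>();
  static {
    LINK_ATTRS.put("a", "href");
    LINK_ATTRS.put("img", "src");
    LINK_ATTRS.put("script", "src");
  }

  /**
   * Collects links in document order. If metadataName is null or empty,
   * no metadata is attached, mirroring the "property is empty" case.
   */
  static List<TaggedLink> extract(String xhtml, String metadataName) {
    try {
      DocumentBuilder builder =
          DocumentBuilderFactory.newInstance().newDocumentBuilder();
      Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
      List<TaggedLink> links = new ArrayList<>();
      walk(doc.getDocumentElement(), metadataName, links);
      return links;
    } catch (Exception e) {
      throw new IllegalArgumentException("input is not well-formed XHTML", e);
    }
  }

  // Depth-first walk; records the tag name of every element that has a link.
  private static void walk(Node node, String metadataName, List<TaggedLink> out) {
    if (node.getNodeType() == Node.ELEMENT_NODE) {
      Element el = (Element) node;
      String tag = el.getTagName().toLowerCase(Locale.ROOT);
      String attr = LINK_ATTRS.get(tag);
      if (attr != null && el.hasAttribute(attr)) {
        Map<String, String> md = new LinkedHashMap<>();
        if (metadataName != null && !metadataName.isEmpty()) {
          md.put(metadataName, tag);
        }
        out.add(new TaggedLink(el.getAttribute(attr), md));
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      walk(children.item(i), metadataName, out);
    }
  }

  public static void main(String[] args) {
    String page = "<html><body><a href=\"/a.html\">A</a>"
        + "<img src=\"/i.png\"/><script src=\"/s.js\"></script></body></html>";
    // "html_tag" is an assumed metadata name, chosen only for this example.
    for (TaggedLink link : extract(page, "html_tag")) {
      System.out.println(link.url + " " + link.metadata);
    }
  }
}
```

Passing an empty metadata name leaves every link without metadata, which corresponds to leaving "parser.html.outlinks.htmlnode_metadata_name" unset in the proposal.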
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)