Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2017/09/29 11:50:02 UTC

[jira] [Resolved] (NUTCH-2433) Html Parser: keep htmltag where the outlinks are found

     [ https://issues.apache.org/jira/browse/NUTCH-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-2433.
------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.14

Thanks, committed to 1.x, [777e759|https://github.com/apache/nutch/commit/777e759ada24eac84072a5f1722938442432eadc].

> Html Parser: keep htmltag where the outlinks are found
> ------------------------------------------------------
>
>                 Key: NUTCH-2433
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2433
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.13
>         Environment: Apache Nutch release 1.13.
>            Reporter: Marcos Bori
>              Labels: html, outlink
>             Fix For: 1.14
>
>
> When parsing HTML pages, I need to know in which HTML tag each outlink was found (for example, 'a', 'script', or 'img').
> I propose to add a new configuration value, "parser.html.outlinks.htmlnode_metadata_name".
> If this configuration property is not empty, every outlink found will be assigned a metadata entry whose name is the value of this property and whose value is the name of the HTML tag where the outlink was found.
> I will now send the pull request with my code implementation.
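
The behavior described above can be illustrated with a minimal sketch. This is not the Nutch implementation (Nutch is written in Java, and the committed code is in the linked commit); it only mimics the idea using Python's standard html.parser module. The names OutlinkCollector, extract_outlinks, LINK_ATTRS, and the metadata name "htmltag" are hypothetical, and the choice of 'href' and 'src' as link-carrying attributes is an assumption made for illustration.

```python
from html.parser import HTMLParser

# Attributes treated as carrying a link target (an assumption for this sketch).
LINK_ATTRS = ("href", "src")


class OutlinkCollector(HTMLParser):
    """Collects (url, metadata) pairs from the start tags of an HTML page."""

    def __init__(self, metadata_name=""):
        super().__init__()
        # Stand-in for the value of the proposed configuration property
        # "parser.html.outlinks.htmlnode_metadata_name".
        self.metadata_name = metadata_name
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lower case.
        for name, value in attrs:
            if name in LINK_ATTRS and value:
                metadata = {}
                # Only record the tag name when the property is not empty.
                if self.metadata_name:
                    metadata[self.metadata_name] = tag
                self.outlinks.append((value, metadata))


def extract_outlinks(html, metadata_name=""):
    """Return a list of (url, metadata) pairs found in the given HTML string."""
    collector = OutlinkCollector(metadata_name)
    collector.feed(html)
    collector.close()
    return collector.outlinks
```

With a non-empty metadata name, each outlink carries the tag it came from, so a link taken from an 'img' element can be told apart from one taken from an 'a' element. With an empty name, the metadata stays empty, which matches the "if not empty" condition in the proposal.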


