Posted to dev@nutch.apache.org by "Sebastian Nagel (Jira)" <ji...@apache.org> on 2021/06/18 08:53:00 UTC

[jira] [Updated] (NUTCH-2880) parse-html/tika: update/complete HTML elements to extract outlinks from

     [ https://issues.apache.org/jira/browse/NUTCH-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2880:
-----------------------------------
    Labels: help-wanted  (was: )

> parse-html/tika: update/complete HTML elements to extract outlinks from
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-2880
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2880
>             Project: Nutch
>          Issue Type: Improvement
>          Components: parser, plugin
>    Affects Versions: 1.18
>            Reporter: Sebastian Nagel
>            Priority: Major
>              Labels: help-wanted
>             Fix For: 1.19
>
>
> The list of HTML elements from which outlinks are extracted (in [DOMContentUtils (parse-html)|https://github.com/apache/nutch/blob/master/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java] and [DOMContentUtils (parse-tika)|https://github.com/apache/nutch/blob/master/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java]) needs to be updated and completed to cover elements common in HTML5. Cf. a [related question on Stack Overflow about the <object> element|https://stackoverflow.com/questions/68024834/nutchsolr-how-do-you-index-a-pdf-embedded-in-html].
> A (mostly?) up-to-date list of HTML elements could be taken from the [extractor in iipc/webarchive-commons|https://github.com/iipc/webarchive-commons/blob/26b1e7af27abec102ab36faf6a786dfedf9436fd/src/main/java/org/archive/resource/html/ExtractingParseObserver.java#L49].
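
For illustration only, here is a minimal sketch of the idea: a table mapping element names to their link-carrying attributes, extended with some HTML5 elements such as <object>, <embed>, <video>, <audio>, <source> and <track>, plus a DOM walk that collects the attribute values. The class name OutlinkElementsSketch, the LINK_ATTRS table and both helper methods are hypothetical and are not the existing DOMContentUtils code. The table is neither complete nor the list from webarchive-commons, and the JDK XML parser is used only to keep the example self-contained, so the input must be well-formed markup.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class OutlinkElementsSketch {

  // Hypothetical table: element name -> attributes that may hold a URL.
  // The first group roughly matches classic HTML4 link elements,
  // the second group adds HTML5 embedding and media elements.
  public static final Map<String, List<String>> LINK_ATTRS = new LinkedHashMap<>();
  static {
    LINK_ATTRS.put("a", List.of("href"));
    LINK_ATTRS.put("area", List.of("href"));
    LINK_ATTRS.put("link", List.of("href"));
    LINK_ATTRS.put("img", List.of("src"));
    LINK_ATTRS.put("script", List.of("src"));
    LINK_ATTRS.put("frame", List.of("src"));
    LINK_ATTRS.put("iframe", List.of("src"));
    LINK_ATTRS.put("form", List.of("action"));
    // Assumed additions for HTML5 content:
    LINK_ATTRS.put("object", List.of("data"));
    LINK_ATTRS.put("embed", List.of("src"));
    LINK_ATTRS.put("video", List.of("src", "poster"));
    LINK_ATTRS.put("audio", List.of("src"));
    LINK_ATTRS.put("source", List.of("src"));
    LINK_ATTRS.put("track", List.of("src"));
  }

  /** Collects raw (unresolved) link targets in document order. */
  public static List<String> extractLinks(Node root) {
    List<String> links = new ArrayList<>();
    collect(root, links);
    return links;
  }

  private static void collect(Node node, List<String> links) {
    if (node.getNodeType() == Node.ELEMENT_NODE) {
      Element el = (Element) node;
      List<String> attrs = LINK_ATTRS.get(el.getTagName().toLowerCase(Locale.ROOT));
      if (attrs != null) {
        for (String attr : attrs) {
          // getAttribute returns an empty string when the attribute is absent
          String value = el.getAttribute(attr).trim();
          if (!value.isEmpty()) {
            links.add(value);
          }
        }
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collect(children.item(i), links);
    }
  }

  /** Parses well-formed (XHTML-like) markup; real pages need an HTML parser. */
  public static Node parseXhtml(String xhtml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalArgumentException("Input is not well-formed markup", e);
    }
  }

  public static void main(String[] args) {
    String page = "<html><body>"
        + "<a href=\"page.html\">link</a>"
        + "<object data=\"doc.pdf\" type=\"application/pdf\"></object>"
        + "<video poster=\"thumb.png\"><source src=\"clip.mp4\"/></video>"
        + "</body></html>";
    System.out.println(extractLinks(parseXhtml(page)));
  }
}
```

With a table of this shape, supporting a further element means adding one entry. Resolving the collected values against the page's base URL and filtering them would still have to happen afterwards.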



--
This message was sent by Atlassian Jira
(v8.3.4#803005)