Posted to dev@manifoldcf.apache.org by "Julien Massiera (Jira)" <ji...@apache.org> on 2021/02/12 17:00:03 UTC

[jira] [Commented] (CONNECTORS-1656) HTML extractor produces invalid XML

    [ https://issues.apache.org/jira/browse/CONNECTORS-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283832#comment-17283832 ] 

Julien Massiera commented on CONNECTORS-1656:
---------------------------------------------

[~kwright@metacarta.com], is the patch OK?

> HTML extractor produces invalid XML
> -----------------------------------
>
>                 Key: CONNECTORS-1656
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1656
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: HTML extractor
>    Affects Versions: ManifoldCF 2.17
>            Reporter: Julien Massiera
>            Assignee: Karl Wright
>            Priority: Major
>             Fix For: ManifoldCF next
>
>         Attachments: patch-CONNECTORS-1656
>
>
> When the 'Strip HTML' option is disabled, the HTML extractor connector produces a valid HTML document but invalid XML: some tags, such as img, have no closing tag. In some cases this is problematic. For example, when Tika is used downstream, it processes the document as an XML document, and most of the time a parse exception is raised.
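
The failure mode in the description can be reproduced outside ManifoldCF. The following is only a minimal sketch, assuming Python's standard-library XML parser as a stand-in for whatever strict XML parser sits downstream; it is not the attached patch, and the sample markup and the helper name is_well_formed_xml are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample output: valid HTML, but the img tag is never closed.
HTML_STYLE = '<html><body><p>Logo</p><img src="logo.png"></body></html>'

# Same content with the void element self-closed, which is well-formed XML.
XML_STYLE = '<html><body><p>Logo</p><img src="logo.png"/></body></html>'


def is_well_formed_xml(text: str) -> bool:
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # Raised here when </body> arrives while <img> is still open.
        return False


if __name__ == "__main__":
    print("HTML-style img accepted as XML:", is_well_formed_xml(HTML_STYLE))
    print("Self-closed img accepted as XML:", is_well_formed_xml(XML_STYLE))
```

A check of this kind could serve as a regression test for the extractor output, under the assumption that well-formed XML is what the downstream consumer requires.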



--
This message was sent by Atlassian Jira
(v8.3.4#803005)