Posted to dev@nutch.apache.org by "Sebastian Nagel (Jira)" <ji...@apache.org> on 2020/03/13 09:10:00 UTC

[jira] [Resolved] (NUTCH-2773) SegmentReader (-dump or -get): show HTML content as UTF-8

     [ https://issues.apache.org/jira/browse/NUTCH-2773?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-2773.
------------------------------------
    Resolution: Implemented

> SegmentReader (-dump or -get): show HTML content as UTF-8
> ---------------------------------------------------------
>
>                 Key: NUTCH-2773
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2773
>             Project: Nutch
>          Issue Type: Improvement
>          Components: segment
>    Affects Versions: 1.16
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.17
>
>
> SegmentReader dumps and the output shown by -get are first converted to Java strings and then written using UTF-8 as the output encoding. The HTML page content is held by the container class "Content" as byte[], and if the original page encoding is a charset other than UTF-8, the output of SegmentReader may look garbled. The reader could use the encoding already detected by the parser (if available) and try to properly recode the HTML page content to UTF-8.
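
The recoding idea described in the issue can be sketched roughly as follows. This is a minimal illustration, not the change committed for NUTCH-2773: the class ContentRecoder and its methods are hypothetical names, and the sketch assumes the detected charset is available as a plain string (possibly null) alongside the raw bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of Nutch): recode raw page bytes to UTF-8. */
final class ContentRecoder {

  /**
   * Decode raw bytes with the detected charset, falling back to UTF-8
   * when no charset was detected or the name is invalid or unsupported.
   */
  static String decode(byte[] raw, String detectedCharset) {
    Charset charset = StandardCharsets.UTF_8;
    if (detectedCharset != null) {
      try {
        charset = Charset.forName(detectedCharset);
      } catch (IllegalArgumentException e) {
        // Covers IllegalCharsetNameException and UnsupportedCharsetException:
        // keep the UTF-8 fallback.
      }
    }
    return new String(raw, charset);
  }

  /** Recode raw bytes from the detected charset to UTF-8 bytes. */
  static byte[] toUtf8(byte[] raw, String detectedCharset) {
    return decode(raw, detectedCharset).getBytes(StandardCharsets.UTF_8);
  }
}
```

With this sketch, a page stored as ISO-8859-1 bytes would be passed as ContentRecoder.toUtf8(bytes, "ISO-8859-1") before being written to the dump. If no encoding was detected, the bytes are interpreted as UTF-8, which corresponds to the behavior the issue describes as the current one.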



--
This message was sent by Atlassian Jira
(v8.3.4#803005)