Posted to issues@beam.apache.org by "Ayman Farhat (Jira)" <ji...@apache.org> on 2022/01/24 12:18:00 UTC

[jira] [Commented] (BEAM-11875) XmlIO.Read does not handle XML encoding per spec

    [ https://issues.apache.org/jira/browse/BEAM-11875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17481046#comment-17481046 ] 

Ayman Farhat commented on BEAM-11875:
-------------------------------------

Hello! I'm following up on this issue to find out whether a fix is planned for the near future. We recently ran into it in a data pipeline running on Dataflow whose content contains multi-byte Latin characters, and we had to apply several workarounds to reach a barely acceptable result. It would still be extremely helpful if this were fixed directly in XmlIO.Read.

> XmlIO.Read does not handle XML encoding per spec
> ------------------------------------------------
>
>                 Key: BEAM-11875
>                 URL: https://issues.apache.org/jira/browse/BEAM-11875
>             Project: Beam
>          Issue Type: Bug
>          Components: io-java-xml
>    Affects Versions: 2.28.0
>            Reporter: Elliotte Rusty Harold
>            Priority: P1
>
> Not sure what the implementation problem is but based on the API doc, there's a real flaw in XmlIO.Read:
>  
> By default, UTF-8 charset is used. To specify a different charset, use [{{XmlIO.Read.withCharset(java.nio.charset.Charset)}}|https://beam.apache.org/releases/javadoc/2.2.0/org/apache/beam/sdk/io/xml/XmlIO.Read.html#withCharset-java.nio.charset.Charset-].
> Currently, only XML files that use single-byte characters are supported. Using a file that contains multi-byte characters may result in data loss or duplication.
>  
> Properly handled, there is never any need to specify the character encoding when reading an XML document. XML documents fully identify their character encoding. The developer at this level doesn't need to know and shouldn't think about the character encoding. Perhaps in the source code someone is using a Reader where they should be using an InputStream instead? That might lead to this problem.
> Also, the text contradicts itself. UTF-8 is a multibyte character set. I hope that doesn't lead to data loss or duplication by default.
>  
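To illustrate the Reader-versus-InputStream point raised in the description, here is a minimal sketch using the JDK's StAX API (javax.xml.stream). It is not taken from the XmlIO source, and the class name, method names, and sample document are hypothetical. It assumes a document declared and encoded as ISO-8859-1. Handing the raw bytes to the parser lets it pick up the encoding from the XML declaration, while wrapping them in a Reader fixed to UTF-8 decodes the bytes before the parser ever sees the declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; not part of Beam.
class XmlEncodingSketch {

  // Sample document (an assumption for this sketch): declared and encoded as ISO-8859-1.
  static byte[] sampleLatin1() {
    String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\u00e9</name>";
    return xml.getBytes(StandardCharsets.ISO_8859_1);
  }

  // The parser receives raw bytes and determines the encoding from the document itself.
  static String textFromStream(byte[] bytes) throws XMLStreamException {
    InputStream in = new ByteArrayInputStream(bytes);
    XMLStreamReader parser = XMLInputFactory.newInstance().createXMLStreamReader(in);
    return firstElementText(parser);
  }

  // The bytes are decoded as UTF-8 up front, so the parser only ever sees characters
  // and cannot apply the encoding named in the XML declaration.
  static String textFromReader(byte[] bytes) throws XMLStreamException {
    Reader reader =
        new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
    XMLStreamReader parser = XMLInputFactory.newInstance().createXMLStreamReader(reader);
    return firstElementText(parser);
  }

  // Returns the text content of the first element, or null if there is none.
  private static String firstElementText(XMLStreamReader parser) throws XMLStreamException {
    try {
      while (parser.hasNext()) {
        if (parser.next() == XMLStreamConstants.START_ELEMENT) {
          return parser.getElementText();
        }
      }
      return null;
    } finally {
      parser.close();
    }
  }

  public static void main(String[] args) throws XMLStreamException {
    byte[] doc = sampleLatin1();
    System.out.println("via InputStream: " + textFromStream(doc));
    System.out.println("via UTF-8 Reader: " + textFromReader(doc));
  }
}
```

In this sketch the InputStream path should recover the original text, while the Reader path should not, because the single byte for the accented character is not valid UTF-8. Whether XmlIO.Read fails for this exact reason is not confirmed by this thread.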



--
This message was sent by Atlassian Jira
(v8.20.1#820001)