Posted to issues@beam.apache.org by "Niels Basjes (Jira)" <ji...@apache.org> on 2020/11/24 10:45:00 UTC

[jira] [Commented] (BEAM-10883) Xmlio parsing of multibyte characters

    [ https://issues.apache.org/jira/browse/BEAM-10883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17238039#comment-17238039 ] 

Niels Basjes commented on BEAM-10883:
-------------------------------------

I suspect that this code comment in XmlSource.java describes the cause of the problem:
https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L335

{quote}
// We use Woodstox because the StAX implementation provided by OpenJDK reports
// character locations incorrectly. Note that Woodstox still currently reports *byte*
// locations incorrectly when parsing documents that contain multi-byte characters.
{quote}
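
To illustrate why a mismatch between character and byte locations matters, here is a minimal sketch. It is not code from Beam or Woodstox; the class name OffsetDemo and the helper byteOffset are made up for this example. It only shows that in a UTF-8 document, the character offset and the byte offset of the same element diverge as soon as a multi-byte character precedes it. If a reader that splits the input on byte ranges were given a character offset where it expects a byte offset, it could plausibly start a record in the wrong place, which would be consistent with the duplicates and data loss reported in this issue.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only: not part of Beam or Woodstox.
class OffsetDemo {

    // Converts a Java char offset (UTF-16 code units) into the offset
    // the same position has in the UTF-8 encoded bytes of the text.
    static int byteOffset(String text, int charOffset) {
        return text.substring(0, charOffset)
                .getBytes(StandardCharsets.UTF_8)
                .length;
    }

    public static void main(String[] args) {
        // "\u00e9" (e with acute accent) takes 1 char but 2 bytes in UTF-8.
        String xml = "<r><a>\u00e9</a><a>x</a></r>";

        // Position of the second <a> record, counted in chars.
        int charOffset = xml.indexOf("<a>x");
        int byteOff = byteOffset(xml, charOffset);

        // The two offsets differ by one because of the single 2-byte character.
        System.out.println("char offset = " + charOffset
                + ", byte offset = " + byteOff);
    }
}
```

Every additional multi-byte character before a record widens the gap, so the error would grow with the amount of non-ASCII text in the file.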

> Xmlio parsing of multibyte characters
> -------------------------------------
>
>                 Key: BEAM-10883
>                 URL: https://issues.apache.org/jira/browse/BEAM-10883
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-core
>            Reporter: Duncan Lew
>            Priority: P1
>              Labels: Clarified
>
> We are running into issues with parsing multi-byte characters that result in duplicates and/or data loss as described in this document: https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/io/xml/XmlIO.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)