Posted to issues@beam.apache.org by "Niels Basjes (Jira)" <ji...@apache.org> on 2020/11/24 10:52:00 UTC

[jira] [Comment Edited] (BEAM-10883) Xmlio parsing of multibyte characters

    [ https://issues.apache.org/jira/browse/BEAM-10883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17238039#comment-17238039 ] 

Niels Basjes edited comment on BEAM-10883 at 11/24/20, 10:51 AM:
-----------------------------------------------------------------

This is a documented limitation:

{quote}Currently, only XML files that use single-byte characters are supported. Using a file that contains multi-byte characters may result in data loss or duplication.{quote}
 
I suspect this comment describes the root cause of this problem:
 [https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L335]
{quote}// We use Woodstox because the StAX implementation provided by OpenJDK reports
 // character locations incorrectly. Note that Woodstox still currently reports *byte*
 // locations incorrectly when parsing documents that contain multi-byte characters.
{quote}
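
To illustrate the general mismatch that comment points at, here is a minimal sketch, assuming UTF-8 input. It is not Beam's or Woodstox's code; the class {{OffsetDemo}} and the method {{byteOffsetOf}} are hypothetical names made up for this example. It only shows that a character offset and a byte offset stop agreeing as soon as a multi-byte character precedes the position in question, which is the kind of discrepancy that could plausibly lead to records being skipped or read twice when a file is processed by byte ranges.

```java
import java.nio.charset.StandardCharsets;

public class OffsetDemo {

    // Hypothetical helper: number of UTF-8 bytes that precede the
    // character at charIndex in text.
    static int byteOffsetOf(String text, int charIndex) {
        return text.substring(0, charIndex)
                   .getBytes(StandardCharsets.UTF_8)
                   .length;
    }

    public static void main(String[] args) {
        // Same document twice; only the text inside <a> differs.
        String ascii = "<r><a>e</a><b/></r>";
        String multi = "<r><a>\u00e9</a><b/></r>"; // U+00E9 takes 2 bytes in UTF-8

        for (String doc : new String[] {ascii, multi}) {
            int charOffset = doc.indexOf("<b/>");
            int byteOffset = byteOffsetOf(doc, charOffset);
            System.out.println("char offset = " + charOffset
                    + ", byte offset = " + byteOffset);
        }
    }
}
```

In the ASCII-only document both offsets of {{<b/>}} are the same; in the second document the byte offset is one larger than the character offset, and the gap grows with every additional multi-byte character.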



> Xmlio parsing of multibyte characters
> -------------------------------------
>
>                 Key: BEAM-10883
>                 URL: https://issues.apache.org/jira/browse/BEAM-10883
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-core
>            Reporter: Duncan Lew
>            Priority: P1
>              Labels: Clarified
>
> We are running into issues with parsing multi-byte characters that result in duplicates and/or data loss as described in this document: https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/io/xml/XmlIO.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)