Posted to issues@beam.apache.org by "Niels Basjes (Jira)" <ji...@apache.org> on 2020/11/24 10:45:00 UTC
[jira] [Commented] (BEAM-10883) Xmlio parsing of multibyte characters
[ https://issues.apache.org/jira/browse/BEAM-10883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17238039#comment-17238039 ]
Niels Basjes commented on BEAM-10883:
-------------------------------------
I suspect that this code comment in XmlSource describes the cause of the problem:
https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L335
{quote}
// We use Woodstox because the StAX implementation provided by OpenJDK reports
// character locations incorrectly. Note that Woodstox still currently reports *byte*
// locations incorrectly when parsing documents that contain multi-byte characters.
{quote}
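To illustrate why this matters, here is a minimal sketch (not code from Beam or Woodstox; the class and method names are made up for this example) showing how character offsets and UTF-8 byte offsets drift apart as soon as a document contains a multi-byte character. If positions are tracked in bytes while the parser reports them in characters, every element after the first multi-byte character appears to start earlier than it really does, which could plausibly lead to records being read twice or skipped.

```java
import java.nio.charset.StandardCharsets;

public class OffsetDemo {
    // Hypothetical helper: number of UTF-8 bytes that precede the char at charIndex.
    // Assumes charIndex does not fall in the middle of a surrogate pair.
    public static int utf8ByteOffset(String text, int charIndex) {
        return text.substring(0, charIndex).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // \u00e9 is 'e' with an acute accent: one Java char, two bytes in UTF-8.
        String xml = "<r><a>\u00e9</a><b>x</b></r>";
        int charIndex = xml.indexOf("<b>");

        // The two offsets differ by one because of the single two-byte character.
        System.out.println("char offset of <b>: " + charIndex);
        System.out.println("byte offset of <b>: " + utf8ByteOffset(xml, charIndex));
    }
}
```

With only ASCII content the two numbers are identical, which would explain why the problem only shows up for input containing multi-byte characters.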
> Xmlio parsing of multibyte characters
> -------------------------------------
>
> Key: BEAM-10883
> URL: https://issues.apache.org/jira/browse/BEAM-10883
> Project: Beam
> Issue Type: Bug
> Components: sdk-java-core
> Reporter: Duncan Lew
> Priority: P1
> Labels: Clarified
>
> We are running into issues with parsing multi-byte characters that result in duplicates and/or data loss as described in this document: https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/io/xml/XmlIO.html
--
This message was sent by Atlassian Jira
(v8.3.4#803005)