Posted to issues@beam.apache.org by "Changming Ma (Jira)" <ji...@apache.org> on 2019/11/07 19:34:00 UTC

[jira] [Created] (BEAM-8579) Strip UTF-8 BOM bytes (if present) in TextSource.

Changming Ma created BEAM-8579:
----------------------------------

             Summary: Strip UTF-8 BOM bytes (if present) in TextSource.
                 Key: BEAM-8579
                 URL: https://issues.apache.org/jira/browse/BEAM-8579
             Project: Beam
          Issue Type: Bug
          Components: io-java-text
    Affects Versions: 2.15.0
            Reporter: Changming Ma


TextSource in the org.apache.beam.sdk.io package can handle UTF-8 encoded files, but when a file contains a byte order mark (BOM), TextSource preserves it in the output. According to the Unicode Standard ([http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf]): "Use of a BOM is neither required nor recommended for UTF-8". UTF-8 with a BOM can also cause problems for some Java implementations (e.g., [https://bugs.java.com/bugdatabase/view_bug.do?bug_id=4508058]). As a general practice, UTF-8 without a BOM is recommended.

Proposal: remove BOM bytes in the output from TextSource.
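
For illustration only, here is a minimal sketch of what stripping could look like. It is not the actual TextSource change: the class Utf8BomStripper and its methods are hypothetical names, and the sketch assumes the bytes passed in are the very first bytes of a file, since a BOM is only meaningful at the start of the stream. The UTF-8 BOM is the three-byte sequence EF BB BF (the encoding of U+FEFF).

```java
import java.util.Arrays;

/** Hypothetical helper (not part of Beam): drops a leading UTF-8 BOM from a byte array. */
public final class Utf8BomStripper {
  // UTF-8 encoding of U+FEFF, the byte order mark.
  private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

  private Utf8BomStripper() {}

  /** Returns true if the data starts with the three UTF-8 BOM bytes. */
  public static boolean startsWithBom(byte[] data) {
    if (data.length < UTF8_BOM.length) {
      return false;
    }
    for (int i = 0; i < UTF8_BOM.length; i++) {
      if (data[i] != UTF8_BOM[i]) {
        return false;
      }
    }
    return true;
  }

  /**
   * Returns a copy of the data without the leading BOM, or the same array if no BOM is present.
   * Assumes the array holds the first bytes of the file.
   */
  public static byte[] stripBom(byte[] data) {
    return startsWithBom(data)
        ? Arrays.copyOfRange(data, UTF8_BOM.length, data.length)
        : data;
  }
}
```

Calling Utf8BomStripper.stripBom on the first bytes read from a file and decoding the result as UTF-8 would yield text without the leading U+FEFF character. Input without a BOM is returned unchanged.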



--
This message was sent by Atlassian Jira
(v8.3.4#803005)