Posted to commits@beam.apache.org by "Damien GOUYETTE (JIRA)" <ji...@apache.org> on 2017/04/24 13:34:04 UTC
[jira] [Updated] (BEAM-2060) XmlSource uses hardcoded Charset
[ https://issues.apache.org/jira/browse/BEAM-2060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Damien GOUYETTE updated BEAM-2060:
----------------------------------
Description:
When I use a file encoded in ISO-8859-1 that contains the character *é*, I get an exception like:
{code}
Caused by: java.io.CharConversionException: Invalid UTF-8 middle byte 0x64 (at char #1061, byte #1012)
at com.ctc.wstx.io.UTF8Reader.reportInvalidOther(UTF8Reader.java:314)
at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:205)
at com.ctc.wstx.io.MergedReader.read(MergedReader.java:105)
at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:86)
at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:56)
at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:1001)
... 19 more
{code}
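A likely reading of the reported byte, offered as a sketch rather than a diagnosis of the actual file: in ISO-8859-1, é is the single byte 0xE9. A UTF-8 decoder treats 0xE9 as the lead byte of a three-byte sequence, so the next byte must be a continuation byte (0x80 to 0xBF). An ASCII letter such as d (0x64) is not one, which would match the "Invalid UTF-8 middle byte 0x64" message. The class name `Latin1AsUtf8Demo` and the sample word are assumptions for illustration; only JDK classes are used.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Latin1AsUtf8Demo {

    /** Strictly decodes bytes as UTF-8; returns null when they are not valid UTF-8. */
    static String decodeStrictUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // "\u00e9dition" in ISO-8859-1 starts with the bytes 0xE9 0x64.
        byte[] latin1 = "\u00e9dition".getBytes(StandardCharsets.ISO_8859_1);

        // 0xE9 announces a three-byte UTF-8 sequence, but 0x64 is not a continuation byte.
        System.out.println("Valid UTF-8? " + (decodeStrictUtf8(latin1) != null));

        // Decoding with the charset the file was written in recovers the text.
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1));
    }
}
```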
The encoding is hardcoded in XmlSource at these lines:
https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L190
https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L238
https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L342
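To illustrate what a configurable encoding could look like at the parser level, here is a minimal sketch using the JDK's StAX API (the stack trace shows Woodstox, which is also a StAX implementation). It does not reproduce what XmlSource does at the linked lines; it only assumes that the bytes end up in an `XMLStreamReader`, and uses the standard `XMLInputFactory.createXMLStreamReader(InputStream, String)` overload to pass the encoding in. The class and method names (`CharsetAwareXmlRead`, `readText`) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class CharsetAwareXmlRead {

    /** Concatenates all character data in the document, decoding the bytes with the given charset. */
    static String readText(byte[] xml, Charset charset) {
        try {
            // The encoding is a parameter here instead of a fixed "UTF-8" literal.
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new ByteArrayInputStream(xml), charset.name());
            StringBuilder text = new StringBuilder();
            while (reader.hasNext()) {
                // Character data may arrive in several events, so append each piece.
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                }
            }
            reader.close();
            return text.toString();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML as " + charset, e);
        }
    }

    public static void main(String[] args) {
        // Sample document written as ISO-8859-1 bytes, with no XML declaration.
        byte[] xml = "<ROOT_ELEMENT><MyClass>\u00e9dition</MyClass></ROOT_ELEMENT>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(readText(xml, StandardCharsets.ISO_8859_1));
    }
}
```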
It would be great if I could specify it like this:
{code}
XmlSource.from[MyClass](input)
.withRootElement("ROOT_ELEMENT")
.withRecordElement("MyClass")
.withRecordClass(classOf[MyClass])
.withCharset(StandardCharsets.ISO_8859_1)
{code}
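On the library side, such an option could be modelled as one more immutable "with" setter that defaults to UTF-8, so existing pipelines keep their current behaviour. The sketch below is purely illustrative: `XmlReadConfig` and all of its methods are hypothetical stand-ins, not Beam's actual XmlSource API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Objects;

/** Hypothetical stand-in for a source configuration; not Beam's XmlSource. */
final class XmlReadConfig {
    private final String rootElement;
    private final String recordElement;
    private final Charset charset;

    private XmlReadConfig(String rootElement, String recordElement, Charset charset) {
        this.rootElement = rootElement;
        this.recordElement = recordElement;
        this.charset = charset;
    }

    /** Starts from UTF-8 so callers that never set a charset see no change. */
    static XmlReadConfig create() {
        return new XmlReadConfig(null, null, StandardCharsets.UTF_8);
    }

    XmlReadConfig withRootElement(String rootElement) {
        return new XmlReadConfig(rootElement, recordElement, charset);
    }

    XmlReadConfig withRecordElement(String recordElement) {
        return new XmlReadConfig(rootElement, recordElement, charset);
    }

    /** Returns a copy that uses the given charset; the receiver is left untouched. */
    XmlReadConfig withCharset(Charset charset) {
        return new XmlReadConfig(rootElement, recordElement, Objects.requireNonNull(charset));
    }

    String getRootElement() { return rootElement; }
    String getRecordElement() { return recordElement; }
    Charset getCharset() { return charset; }

    public static void main(String[] args) {
        XmlReadConfig config = XmlReadConfig.create()
                .withRootElement("ROOT_ELEMENT")
                .withRecordElement("MyClass")
                .withCharset(StandardCharsets.ISO_8859_1);
        System.out.println(config.getCharset());
    }
}
```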
was:
When I use a file encoded in ISO-8859-1 that contains the character `é`, I get an exception like:
{code}
Caused by: java.io.CharConversionException: Invalid UTF-8 middle byte 0x64 (at char #1061, byte #1012)
at com.ctc.wstx.io.UTF8Reader.reportInvalidOther(UTF8Reader.java:314)
at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:205)
at com.ctc.wstx.io.MergedReader.read(MergedReader.java:105)
at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:86)
at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:56)
at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:1001)
... 19 more
{code}
The encoding is hardcoded in XmlSource at these lines:
https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L190
https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L238
https://github.com/apache/beam/blob/master/sdks/java/io/xml/src/main/java/org/apache/beam/sdk/io/xml/XmlSource.java#L342
It would be great if I could specify it like this:
{code}
XmlSource.from[MyClass](input)
.withRootElement("ROOT_ELEMENT")
.withRecordElement("MyClass")
.withRecordClass(classOf[MyClass])
.withCharset(StandardCharsets.ISO_8859_1)
{code}
> XmlSource uses hardcoded Charset
> --------------------------------
>
> Key: BEAM-2060
> URL: https://issues.apache.org/jira/browse/BEAM-2060
> Project: Beam
> Issue Type: Improvement
> Components: sdk-java-core
> Affects Versions: 0.6.0
> Reporter: Damien GOUYETTE
> Assignee: Davor Bonaci