Posted to user@flink.apache.org by David Dreyfus <dd...@gmail.com> on 2018/08/08 15:45:21 UTC

UTF-16 support for TextInputFormat

Hello -

It does not appear that Flink supports a charset encoding of "UTF-16". In
particular, it doesn't appear that Flink consumes the Byte Order Mark (BOM)
to establish whether a UTF-16 file is UTF-16LE or UTF-16BE. Are there any
plans to enhance Flink to handle UTF-16 with BOM?

Thank you,
David
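The decoder default behind this question can be reproduced in plain Java, independent of Flink: the generic "UTF-16" charset honors a leading BOM on decode, but falls back to big-endian when none is present. A minimal demonstration:

```java
import java.nio.charset.StandardCharsets;

public class Utf16DefaultDemo {
    public static void main(String[] args) {
        // 'A' encoded as UTF-16LE with no BOM. Decoded via the generic
        // "UTF-16" charset, the missing BOM makes Java assume big-endian,
        // so the two bytes are read as U+4100 instead of U+0041.
        byte[] leNoBom = {0x41, 0x00};
        String wrong = new String(leNoBom, StandardCharsets.UTF_16);
        System.out.println((int) wrong.charAt(0)); // 16640 (0x4100), not 65

        // With the little-endian BOM (0xFF 0xFE) in front, the decoder
        // consumes it and correctly reads the data as little-endian.
        byte[] leWithBom = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        String right = new String(leWithBom, StandardCharsets.UTF_16);
        System.out.println(right); // A
    }
}
```

Only the first record of a file (or split) can carry a BOM, so any per-record decode of UTF-16LE data hits the wrong branch above.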

Re: UTF-16 support for TextInputFormat

Posted by Fabian Hueske <fh...@gmail.com>.
Thanks for creating FLINK-10134 and adding your suggestions!

Best, Fabian

2018-08-13 23:55 GMT+02:00 David Dreyfus <dd...@gmail.com>:

> Hi Fabian,
>
> I've added FLINK-10134. FLINK-10134
> <https://issues.apache.org/jira/browse/FLINK-10134>. I'm not sure you'd
> consider it a blocker or that I've identified the right component.
> I'm afraid I don't have the bandwidth or knowledge to make the kind of
> pull request you really need. I do hope my suggestions prove a little
> useful.
>
> Thank you,
> David
>
> On Fri, Aug 10, 2018 at 5:41 AM Fabian Hueske <fh...@gmail.com> wrote:
>
>> Hi David,
>>
>> Thanks for digging into the code! I had a quick look into the classes as
>> well.
>> As far as I can see, your analysis is correct and the BOM handling in
>> DelimitedInputFormat and TextInputFormat (and other text-based IFs such as
>> CsvInputFormat) is broken.
>> In fact, its obvious that nobody paid attention to this yet.
>>
>> It would be great if you could open a Jira issue and copy your analysis
>> and solution proposal into it.
>> While on it, we could also deprecated the (duplicated) setCharsetName()
>> method from TextInputFormat and redirect it to DelimitedInputFormat.
>> setCharset().
>>
>> Would you also be interested in contributing a fix for this problem?
>>
>> Best, Fabian
>>
>> [1] https://github.com/apache/flink/blob/master/flink-java/
>> src/main/java/org/apache/flink/api/java/io/TextInputFormat.java#L95
>>
>> 2018-08-09 14:55 GMT+02:00 David Dreyfus <dd...@gmail.com>:
>>
>>> Hi Fabian,
>>>
>>> Thank you for taking my email.
>>> TextInputFormat.setCharsetName("UTF-16") appears to set the private
>>> variable TextInputFormat.charsetName.
>>> It doesn't appear to cause additional behavior that would help interpret
>>> UTF-16 data.
>>>
>>> The method I've tested is calling DelimitedInputFormat.setCharset("UTF-16"),
>>> which then sets TextInputFormat.charsetName and then modifies the
>>> previously set delimiterString to construct the proper byte string encoding
>>> of the the delimiter. This same charsetName is also used in
>>> TextInputFormat.readRecord() to interpret the bytes read from the file.
>>>
>>> There are two problems that this implementation would seem to have when
>>> using UTF-16.
>>>
>>>    1. delimiterString.getBytes(getCharset()) in
>>>    DelimitedInputFormat.java will return a Big Endian byte sequence including
>>>    the Byte Order Mark (BOM). The actual text file will not contain a BOM at
>>>    each line ending, so the delimiter will never be read. Moreover, if the
>>>    actual byte encoding of the file is Little Endian, the bytes will be
>>>    interpreted incorrectly.
>>>    2. TextInputFormat.readRecord() will not see a BOM each time it
>>>    decodes a byte sequence with the String(bytes, offset, numBytes, charset)
>>>    call. Therefore, it will assume Big Endian, which may not always be correct.
>>>
>>> While there are likely many solutions, I would think that all of them
>>> would have to start by reading the BOM from the file when a Split is opened
>>> and then using that BOM to modify the specified encoding to a BOM specific
>>> one when the caller doesn't specify one, and to overwrite the caller's
>>> specification if the BOM is in conflict with the caller's specification.
>>> That is, if the BOM indicates Little Endian and the caller indicates
>>> UTF-16BE, Flink should rewrite the charsetName as UTF-16LE.
>>>
>>> I hope this makes sense and that I haven't been testing incorrectly or
>>> misreading the code.
>>>
>>> Thank you,
>>> David
>>>
>>> On Thu, Aug 9, 2018 at 4:04 AM Fabian Hueske <fh...@gmail.com> wrote:
>>>
>>>> Hi David,
>>>>
>>>> Did you try to set the encoding on the TextInputFormat with
>>>>
>>>> TextInputFormat tif = ...
>>>> tif.setCharsetName("UTF-16");
>>>>
>>>> Best, Fabian
>>>>
>>>> 2018-08-08 17:45 GMT+02:00 David Dreyfus <dd...@gmail.com>:
>>>>
>>>>> Hello -
>>>>>
>>>>> It does not appear that Flink supports a charset encoding of "UTF-16".
>>>>> It particular, it doesn't appear that Flink consumes the Byte Order Mark
>>>>> (BOM) to establish whether a UTF-16 file is UTF-16LE or UTF-16BE. Are there
>>>>> any plans to enhance Flink to handle UTF-16 with BOM?
>>>>>
>>>>> Thank you,
>>>>> David
>>>>>
>>>>
>>>>
>>

Re: UTF-16 support for TextInputFormat

Posted by David Dreyfus <dd...@gmail.com>.
Hi Fabian,

I've added FLINK-10134
<https://issues.apache.org/jira/browse/FLINK-10134>. I'm not sure you'd
consider it a blocker or that I've identified the right component.
I'm afraid I don't have the bandwidth or knowledge to make the kind of pull
request you really need. I do hope my suggestions prove a little useful.

Thank you,
David

On Fri, Aug 10, 2018 at 5:41 AM Fabian Hueske <fh...@gmail.com> wrote:


Re: UTF-16 support for TextInputFormat

Posted by Fabian Hueske <fh...@gmail.com>.
Hi David,

Thanks for digging into the code! I had a quick look into the classes as
well.
As far as I can see, your analysis is correct and the BOM handling in
DelimitedInputFormat and TextInputFormat (and other text-based IFs such as
CsvInputFormat) is broken.
In fact, it's obvious that nobody has paid attention to this yet.

It would be great if you could open a Jira issue and copy your analysis and
solution proposal into it.
While at it, we could also deprecate the (duplicated) setCharsetName()
method from TextInputFormat and redirect it to
DelimitedInputFormat.setCharset().

Would you also be interested in contributing a fix for this problem?

Best, Fabian

[1]
https://github.com/apache/flink/blob/master/flink-java/src/main/java/org/apache/flink/api/java/io/TextInputFormat.java#L95

2018-08-09 14:55 GMT+02:00 David Dreyfus <dd...@gmail.com>:


Re: UTF-16 support for TextInputFormat

Posted by David Dreyfus <dd...@gmail.com>.
Hi Fabian,

Thank you for taking my email.
TextInputFormat.setCharsetName("UTF-16") appears to set the private
variable TextInputFormat.charsetName.
It doesn't appear to cause additional behavior that would help interpret
UTF-16 data.

The method I've tested is calling
DelimitedInputFormat.setCharset("UTF-16"), which then sets
TextInputFormat.charsetName and then modifies the previously set
delimiterString to construct the proper byte string encoding of the
delimiter. This same charsetName is also used in
TextInputFormat.readRecord() to interpret the bytes read from the file.

There are two problems that this implementation would seem to have when
using UTF-16.

   1. delimiterString.getBytes(getCharset()) in DelimitedInputFormat.java
   will return a Big Endian byte sequence including the Byte Order Mark (BOM).
   The actual text file will not contain a BOM at each line ending, so the
   delimiter will never be read. Moreover, if the actual byte encoding of the
   file is Little Endian, the bytes will be interpreted incorrectly.
   2. TextInputFormat.readRecord() will not see a BOM each time it decodes
   a byte sequence with the String(bytes, offset, numBytes, charset) call.
   Therefore, it will assume Big Endian, which may not always be correct.
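The first problem is easy to reproduce outside Flink: Java's generic "UTF-16" charset prepends a big-endian BOM on every encode, so the encoded delimiter never matches the bare line-ending bytes in the file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DelimiterBytes {
    public static void main(String[] args) {
        // The generic "UTF-16" charset emits a big-endian BOM (0xFE 0xFF)
        // before the encoded text, so a "\n" delimiter becomes 4 bytes:
        byte[] withBom = "\n".getBytes(StandardCharsets.UTF_16);
        System.out.println(Arrays.toString(withBom)); // [-2, -1, 0, 10]

        // The endianness-specific charsets emit no BOM; these are the byte
        // patterns that actually appear at line endings in the file:
        byte[] be = "\n".getBytes(StandardCharsets.UTF_16BE);
        byte[] le = "\n".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(Arrays.toString(be)); // [0, 10]
        System.out.println(Arrays.toString(le)); // [10, 0]
    }
}
```

Since the file contains either the BE or LE pattern but never the BOM-prefixed one, the 4-byte delimiter can never be found.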

While there are likely many solutions, I would think that all of them would
have to start by reading the BOM from the file when a Split is opened, then
using that BOM to refine the specified encoding to a BOM-specific one when
the caller hasn't given one, and to override the caller's specification if
the BOM conflicts with it. That is, if the BOM indicates Little Endian and
the caller indicates UTF-16BE, Flink should rewrite the charsetName as
UTF-16LE.
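That override could be sketched as follows. This is a minimal illustration of the proposed resolution logic, not Flink API; the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: when a split is opened, peek at the first two bytes
// and let a BOM, if present, override an ambiguous or conflicting
// caller-specified charset.
public class CharsetResolver {
    static Charset resolve(Charset requested, byte[] firstBytes) {
        Charset fromBom = null;
        if (firstBytes.length >= 2) {
            int b0 = firstBytes[0] & 0xFF, b1 = firstBytes[1] & 0xFF;
            if (b0 == 0xFF && b1 == 0xFE) fromBom = StandardCharsets.UTF_16LE;
            if (b0 == 0xFE && b1 == 0xFF) fromBom = StandardCharsets.UTF_16BE;
        }
        if (fromBom == null) {
            // No BOM: keep the caller's choice, pinning the ambiguous
            // generic UTF-16 to big-endian (the Java decoder's default).
            return requested.equals(StandardCharsets.UTF_16)
                    ? StandardCharsets.UTF_16BE : requested;
        }
        // A BOM wins over a conflicting caller specification.
        return fromBom;
    }
}
```

With this, a caller asking for UTF-16BE on a file that starts with 0xFF 0xFE would be silently corrected to UTF-16LE, and all subsequent per-record decodes would use the endianness-specific charset so the missing BOM no longer matters.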

I hope this makes sense and that I haven't been testing incorrectly or
misreading the code.

Thank you,
David

On Thu, Aug 9, 2018 at 4:04 AM Fabian Hueske <fh...@gmail.com> wrote:


Re: UTF-16 support for TextInputFormat

Posted by Fabian Hueske <fh...@gmail.com>.
Hi David,

Did you try to set the encoding on the TextInputFormat with

TextInputFormat tif = ...
tif.setCharsetName("UTF-16");

Best, Fabian

2018-08-08 17:45 GMT+02:00 David Dreyfus <dd...@gmail.com>:
