Posted to dev@hive.apache.org by "Zoltan Ivanfi (JIRA)" <ji...@apache.org> on 2016/09/27 12:59:20 UTC
[jira] [Created] (HIVE-14846) Char encoding does not apply to newline chars
Zoltan Ivanfi created HIVE-14846:
------------------------------------
Summary: Char encoding does not apply to newline chars
Key: HIVE-14846
URL: https://issues.apache.org/jira/browse/HIVE-14846
Project: Hive
Issue Type: Bug
Affects Versions: 1.1.0
Reporter: Zoltan Ivanfi
Priority: Minor
I created and populated a table with UTF-16 encoding:
hive> create external table utf16 (col1 timestamp, col2 string) row format delimited fields terminated by "," location '/tmp/utf16';
hive> alter table utf16 set serdeproperties ('serialization.encoding'='UTF-16');
hive> insert into utf16 values('2010-01-01 00:00:00.000', 'hőség');
Then I checked the contents of the file:
$ hadoop fs -cat /tmp/utf16/000000_0 | hd
00000000 fe ff 00 32 00 30 00 31 00 30 00 2d 00 30 00 31 |...2.0.1.0.-.0.1|
00000010 00 2d 00 30 00 34 00 20 00 30 00 30 00 3a 00 30 |.-.0.4. .0.0.:.0|
00000020 00 30 00 3a 00 30 00 30 00 2c 00 63 00 69 00 70 |.0.:.0.0.,.c.i.p|
00000030 01 51 0a |.Q.|
00000033
The newline character is represented as 0a instead of the expected 00 0a.
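As a quick cross-check of the expected bytes, here is a minimal Python sketch. It is not part of the original report: Python's standard codecs stand in for Hive's serializer, and the variable names are only illustrative. It encodes a row terminator as big-endian UTF-16 and compares it with the tail of the dump above.

```python
# Tail of the file as shown in the hex dump: "p", "ő", then a lone 0a byte.
observed_tail = bytes.fromhex("00 70 01 51 0a")

# What a consistently encoded UTF-16 (big-endian) tail would look like:
# the newline becomes a full 16-bit code unit, 00 0a.
expected_tail = "pő\n".encode("utf-16-be")

print(expected_tail.hex(" "))   # 00 70 01 51 00 0a
print(observed_tail.hex(" "))   # 00 70 01 51 0a

# The observed tail has an odd number of bytes, so it cannot be
# a whole number of 16-bit code units.
print(len(observed_tail) % 2)   # 1
```

The only difference between the two byte strings is the missing 00 before 0a, which matches the observation that the encoding is applied to the field data but not to the line terminator.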
Conversely, if I put correctly encoded UTF-16 files into HDFS and query them from Hive, every row in the output ends with a Unicode replacement character (U+FFFD):
hive> select * from utf16;
2010-01-01 00:00:00 hőség�
2010-01-02 00:00:00 város�
2010-01-03 00:00:00 füzet�
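One possible explanation for the trailing character, which this report does not confirm, is that records are split on the single byte 0a before decoding, which leaves the 00 half of each UTF-16 newline dangling at the end of the row. The Python sketch below reproduces that effect under this assumption. The splitting logic is hypothetical and is not taken from Hive's code.

```python
rows = ["2010-01-01 00:00:00,hőség", "2010-01-02 00:00:00,város"]

# A correctly encoded file: BOM, then every character (newline included)
# as a 2-byte big-endian code unit.
data = b"\xfe\xff" + "".join(row + "\n" for row in rows).encode("utf-16-be")

# Assumed reader behaviour: cut records at the single byte 0a, then decode.
records = [rec for rec in data.split(b"\x0a") if rec]

# Each record now ends with an unpaired 00 byte; decoding with
# errors="replace" turns that truncated code unit into U+FFFD.
decoded = [
    rec.decode("utf-16-be", errors="replace").lstrip("\ufeff")
    for rec in records
]

for line in decoded:
    print(repr(line))
```

Under this assumption each decoded row ends in U+FFFD, which is consistent with the query output above.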