Posted to dev@hive.apache.org by "Szehon Ho (JIRA)" <ji...@apache.org> on 2013/12/07 01:54:36 UTC

[jira] [Commented] (HIVE-3245) UTF encoded data not displayed correctly by Hive driver

    [ https://issues.apache.org/jira/browse/HIVE-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841965#comment-13841965 ] 

Szehon Ho commented on HIVE-3245:
---------------------------------

I created the table as described in the JIRA and ran select * both from Beeline and from my own Java program embedding the JDBC driver.  In both cases, the Japanese characters displayed correctly:

0: jdbc:hive2://localhost:10000> select * from japan_j;
+-------+------------------------------------------------+------+
| rnum  |                       c1                       | ord  |
+-------+------------------------------------------------+------+
| 11    | (1)インデックス                                     | 36   |
| 12    | <5>Switches                                    | 37   |
| 10    | 400ranku                                       | 39   |
| 9     | 666Sink                                        | 40   |
| 14    | P-Cabels                                       | 35   |
| 13    | R-Bench                                        | 38   |
| 27    | エコー                                            | 34   |
| 26    | エチャント                                          | 24   |
| 25    | ガード                                            | 4    |
| 28    | コート                                            | 3    |
| 29    | ゴム                                             | 1    |
| 41    | ざぶと                                            | 2    |
| 40    | さんしょう                                          | 6    |
| 31    | ズボン                                            | 5    |
| 30    | スワップ                                           | 41   |
| 37    | せっけい                                           | 42   |
| 36    | せんたくざい                                         | 46   |
| 32    | ダイエル                                           | 45   |
| 39    | はっぽ                                            | 43   |
| 38    | はつ剤                                            | 44   |
| 34    | ファイル                                           | 48   |
| 33    | フィルター                                          | 50   |
| 35    | フッコク                                           | 49   |
| 8     | 「2」計画                                          | 47   |
| 46    | 暗視                                             | 9    |
| 45    | 音楽                                             | 8    |
| 47    | 音声認識                                           | 7    |
| 44    | 記載                                             | 10   |
| 43    | 記録機                                            | 11   |
| 42    | 高機能                                            | 15   |
| 50    | 国家利益                                           | 14   |
| 48    | 国立公園                                           | 18   |
| 49    | 国立大学                                           | 22   |
| 7     | ⑤号線路                                           | 21   |
| 5     | (Ⅰ)番号列                                         | 23   |
| 1     | 356CAL                                         | 17   |
| 2     | 980Series                                      | 16   |
| 6     | <ⅸ>Pattern                                     | 20   |
| 3     | PVDF                                           | 19   |
| 4     | ROMAN-8                                        | 13   |
| 15    | アンカー                                           | 12   |
| 16    | エンジン                                          | 30   |
| 19    | カットマシン                                         | 29   |
| 20    | カード                                           | 28   |
| 18    | コーラ                                            | 26   |
| 17    | ゴールド                                         | 25   |
| 24    | サイフ                                            | 27   |
| 21    | ツーウィング                                        | 32   |
| 23    | フォルダー                                         | 33   |
| 22    | マンボ                                           | 31   |
+-------+------------------------------------------------+------+


I tested with the new JDBCDriver (org.apache.hive.jdbc.HiveDriver) against HiveServer2.  

The platform running Beeline should be set to UTF-8 (check with "echo $LANG"), and any other Java application using the JDBC driver should be started with UTF-8 JVM args ("java -Dfile.encoding=UTF-8").  That should already be a requirement for clients wishing to display UTF-8 characters.
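
For anyone who wants to check the client side before suspecting the driver, here is a minimal sketch in plain Java (no Hive dependency).  The class and method names (EncodingCheck, render, survivesRoundTrip) are hypothetical and only for illustration.  It reports the JVM default charset and shows that a string survives a PrintStream using UTF-8 but not one using US-ASCII, which I assume is roughly what happens when the locale or file.encoding is not UTF-8:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

public class EncodingCheck {
    // Writes text through a PrintStream that uses the named charset and
    // returns the raw bytes that would reach the terminal or file.
    static byte[] render(String text, String charsetName) throws UnsupportedEncodingException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buf, true, charsetName);
        out.print(text);
        out.flush();
        return buf.toByteArray();
    }

    // True if decoding the rendered bytes with the same charset gives back
    // the original string, i.e. no character was replaced while encoding.
    static boolean survivesRoundTrip(String text, String charsetName) throws UnsupportedEncodingException {
        byte[] bytes = render(text, charsetName);
        return text.equals(new String(bytes, charsetName));
    }

    public static void main(String[] args) throws Exception {
        // Katakana sample written with Unicode escapes so the result does not
        // depend on the encoding used to compile this source file.
        String sample = "\u30a4\u30f3\u30c7\u30c3\u30af\u30b9";
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("UTF-8 round trip ok: " + survivesRoundTrip(sample, "UTF-8"));
        System.out.println("US-ASCII round trip ok: " + survivesRoundTrip(sample, "US-ASCII"));
    }
}
```

If the reported default charset is not UTF-8, the client environment is a more likely culprit than the driver.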

The code that Mark Grover mentioned does not apply anymore, as the new JDBCDriver gets results from HiveServer directly via a ThriftString field and does not do another round of serialization/deserialization on the client side, which is where the error was said to occur.  So in my opinion, the issue can be closed for the Hive driver.

> UTF encoded data not displayed correctly by Hive driver
> -------------------------------------------------------
>
>                 Key: HIVE-3245
>                 URL: https://issues.apache.org/jira/browse/HIVE-3245
>             Project: Hive
>          Issue Type: Bug
>          Components: JDBC
>    Affects Versions: 0.8.0
>            Reporter: N Campbell
>            Assignee: Szehon Ho
>         Attachments: ASF.LICENSE.NOT.GRANTED--screenshot-1.jpg, CERT.TLJA.txt
>
>
> various foreign language data (i.e. japanese, thai etc) is loaded into string columns via tab delimited text files. A simple projection of the columns in the table is not displaying the correct data. Exporting the data from Hive and looking at the files implies the data is loaded properly. it appears to be an encoding issue at the driver but unaware of any required URL connection properties re encoding that Hive JDBC requires.
> create table if not exists CERT.TLJA_JP_E ( RNUM int , C1 string, ORD int)
> row format delimited
> fields terminated by '\t'
> stored as textfile;
> create table if not exists CERT.TLJA_JP ( RNUM int , C1 string, ORD int)
> stored as sequencefile;
> load data local inpath '/home/hadoopadmin/jdbc-cert/CERT/CERT.TLJA_JP.txt'
> overwrite into table CERT.TLJA_JP_E;
> insert overwrite table CERT.TLJA_JP  select * from CERT.TLJA_JP_E;



--
This message was sent by Atlassian JIRA
(v6.1#6144)