Posted to issues-all@impala.apache.org by "Quanlong Huang (Jira)" <ji...@apache.org> on 2020/11/13 00:42:00 UTC
[jira] [Updated] (IMPALA-10319) Support arbitrary encodings on Text/Sequence files
[ https://issues.apache.org/jira/browse/IMPALA-10319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Quanlong Huang updated IMPALA-10319:
------------------------------------
Summary: Support arbitrary encodings on Text/Sequence files (was: Support arbitrary encodings on text files)
> Support arbitrary encodings on Text/Sequence files
> --------------------------------------------------
>
> Key: IMPALA-10319
> URL: https://issues.apache.org/jira/browse/IMPALA-10319
> Project: IMPALA
> Issue Type: New Feature
> Reporter: Quanlong Huang
> Priority: Major
> Attachments: gbk_names.txt
>
>
> ORC/Parquet/Avro files store strings as UTF-8 encoded bytes. However, text files can be in arbitrary encodings. Hive supports specifying an arbitrary encoding on text tables via the "serialization.encoding" table property (HIVE-7142). Impala is currently not aware of this table property and treats all strings as byte arrays. It would be good to support at least reading from these text files.
> *Example*
> Create a text table in Hive using GBK encoding and load a GBK encoded text file into it:
> {code:sql}
> hive> create table gbk_names (name string) stored as textfile tblproperties("serialization.encoding"="GBK");
> hive> load data local inpath '/home/quanlong/workspace/Impala/gbk_names.txt' into table gbk_names;
> hive> select * from gbk_names;
> +-----------------+
> | gbk_names.name |
> +-----------------+
> | 张三 |
> | 李四 |
> | 王五 |
> +-----------------+
> {code}
> Impala reads strings as byte arrays, so it can't decode them correctly:
> {code:sql}
> impala-shell> invalidate metadata gbk_names;
> impala-shell> select * from gbk_names;
> +------+
> | name |
> +------+
> | ���� |
> | ���� |
> | ���� |
> +------+
> {code}
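The mismatch above can be reproduced outside of Hive and Impala. The following is a minimal Python sketch, not Impala code: it encodes the three sample names as GBK, shows that decoding those bytes as UTF-8 yields replacement characters, and shows one way a reader could honor the "serialization.encoding" table property by transcoding to UTF-8. The helper names `read_as` and `to_utf8` and the `tbl_properties` dict are hypothetical and exist only for illustration.

```python
# Illustrative sketch only; this is not how Impala or Hive is implemented.
ENCODING_PROP = "serialization.encoding"  # table property named in the issue


def read_as(raw: bytes, encoding: str) -> str:
    """Decode raw field bytes, substituting U+FFFD for undecodable bytes."""
    return raw.decode(encoding, errors="replace")


def to_utf8(raw: bytes, tbl_properties: dict) -> bytes:
    """Hypothetical helper: transcode a field to UTF-8 using the table's
    declared encoding, assuming UTF-8 when the property is absent."""
    source_encoding = tbl_properties.get(ENCODING_PROP, "UTF-8")
    return raw.decode(source_encoding).encode("utf-8")


names = ["张三", "李四", "王五"]
gbk_rows = [name.encode("gbk") for name in names]  # bytes as stored in the file

for raw in gbk_rows:
    # Treating GBK bytes as UTF-8 produces replacement characters.
    print(repr(read_as(raw, "utf-8")), "vs", read_as(raw, "gbk"))

# Honoring the table property recovers the original strings.
props = {ENCODING_PROP: "GBK"}
recovered = [to_utf8(raw, props).decode("utf-8") for raw in gbk_rows]
print(recovered)
```

Transcoding to UTF-8 at read time is only one possible design; it is shown here because the other file formats mentioned in the issue already store strings as UTF-8.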
--
This message was sent by Atlassian Jira
(v8.3.4#803005)