Posted to issues-all@impala.apache.org by "Quanlong Huang (Jira)" <ji...@apache.org> on 2020/03/17 03:50:00 UTC
[jira] [Commented] (IMPALA-9410) Support resolving ORC file columns by names
[ https://issues.apache.org/jira/browse/IMPALA-9410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17060613#comment-17060613 ]
Quanlong Huang commented on IMPALA-9410:
----------------------------------------
Hive resolves nested struct columns (but not table-level columns) by name, and the matching is case-sensitive. As a result, our results are inconsistent with Hive's in this case:
{code:sql}
$ beeline -u jdbc:hive2://localhost:11050 -e "select nested_struct.a from functional_orc_def.complextypestbl"
+-------+
| a |
+-------+
| NULL |
| NULL |
| NULL |
| NULL |
| NULL |
| NULL |
| NULL |
| -1 |
+-------+
$ bin/impala-shell.sh -q "select nested_struct.a from functional_orc_def.complextypestbl"
+-----------------+
| nested_struct.a |
+-----------------+
| -1 |
| 1 |
| NULL |
| NULL |
| NULL |
| NULL |
| NULL |
| 7 |
+-----------------+
{code}
Table functional_orc_def.complextypestbl contains two files: nullable.orc and nonnullable.orc. The inconsistency arises because, in nullable.orc, the "nested_struct" column has a subcolumn named "A" rather than "a". Here are the schemas of the two ORC files:
{code}
nullable.orc:
struct<id:bigint,int_array:array<int>,int_array_Array:array<array<int>>,int_map:map<string,int>,int_Map_Array:array<map<string,int>>,nested_struct:struct<A:int,b:array<int>,C:struct<d:array<array<struct<E:int,F:string>>>>,g:map<string,struct<H:struct<i:array<double>>>>>>
nonnullable.orc:
struct<ID:bigint,Int_Array:array<int>,int_array_array:array<array<int>>,Int_Map:map<string,int>,int_map_array:array<map<string,int>>,nested_Struct:struct<a:int,B:array<int>,c:struct<D:array<array<struct<e:int,f:string>>>>,G:map<string,struct<h:struct<i:array<double>>>>>>
{code}
Impala currently resolves ORC columns by index. When we add support for resolving columns by name, we need to define the expected behavior: whether name matching in nested columns is case-sensitive or not.
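To make the options concrete, here is a minimal Python sketch (not Impala or Hive code) that compares the three resolution strategies on the "nested_struct" fields from the two schemas above. The function names and the lower-case table-side field list are assumptions made for illustration only.

```python
# Top-level field names of "nested_struct", copied from the two file schemas above.
NESTED_STRUCT_FIELDS = {
    "nullable.orc": ["A", "b", "C", "g"],
    "nonnullable.orc": ["a", "B", "c", "G"],
}

# Assumption for illustration: the table schema lists the fields in lower case.
TABLE_FIELDS = ["a", "b", "c", "g"]


def resolve_by_index(file_fields, table_fields, name):
    """Position-based resolution: reuse the field's position in the table schema."""
    idx = table_fields.index(name)
    return idx if idx < len(file_fields) else None


def resolve_by_name(file_fields, name, case_sensitive):
    """Name-based resolution: return the index of the matching file field, or None."""
    for idx, field in enumerate(file_fields):
        if case_sensitive:
            matched = field == name
        else:
            matched = field.lower() == name.lower()
        if matched:
            return idx
    return None


if __name__ == "__main__":
    for file_name, fields in NESTED_STRUCT_FIELDS.items():
        print(
            file_name,
            "index:", resolve_by_index(fields, TABLE_FIELDS, "a"),
            "name (case-sensitive):", resolve_by_name(fields, "a", True),
            "name (case-insensitive):", resolve_by_name(fields, "a", False),
        )
```

In this sketch, a result of None stands for a field that is missing from the file, which a reader would surface as NULL. Only the case-sensitive lookup fails to find "a" in nullable.orc, which is consistent with the NULLs in the Hive output above, while the index-based and case-insensitive lookups find the field in both files.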
> Support resolving ORC file columns by names
> -------------------------------------------
>
> Key: IMPALA-9410
> URL: https://issues.apache.org/jira/browse/IMPALA-9410
> Project: IMPALA
> Issue Type: New Feature
> Reporter: Quanlong Huang
> Priority: Major
>
> Currently we resolve ORC file columns by indices. We should provide a query option like PARQUET_FALLBACK_SCHEMA_RESOLUTION for Parquet (IMPALA-2835), to resolve ORC file columns by names.
> Note that Hive only writes column names to ORC files since Hive 2.x (HIVE-4243). For older versions of Hive, the column names in ORC files are placeholders like _col0, _col1, ..., _col99. So this feature is only needed when deployed with Hive 2+.