You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Quanlong Huang (Jira)" <ji...@apache.org> on 2020/03/17 03:50:00 UTC
[jira] [Commented] (IMPALA-9410) Support resolving ORC file columns by names

    [ https://issues.apache.org/jira/browse/IMPALA-9410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17060613#comment-17060613 ] 

Quanlong Huang commented on IMPALA-9410:
----------------------------------------

Hive resolves nested struct columns (not table level columns) by names and it's case sensitive. So we have inconsistent results with Hive for this case:
{code:sql}
$ beeline -u jdbc:hive2://localhost:11050 -e "select nested_struct.a from functional_orc_def.complextypestbl"
+-------+
|   a   |
+-------+
| NULL  |
| NULL  |
| NULL  |
| NULL  |
| NULL  |
| NULL  |
| NULL  |
| -1    |
+-------+
$ bin/impala-shell.sh -q "select nested_struct.a from functional_orc_def.complextypestbl"
+-----------------+
| nested_struct.a |
+-----------------+
| -1              |
| 1               |
| NULL            |
| NULL            |
| NULL            |
| NULL            |
| NULL            |
| 7               |
+-----------------+
{code}
Table functional_orc_def.complextypestbl contains two files: nullable.orc and nonnullable.orc. The cause for this case is that nullable.orc has subcolumn "A" but not "a" in the "nested_struct" column. Here are the schemas of these two orc files:
{code}
nullable.orc:
struct<id:bigint,int_array:array<int>,int_array_Array:array<array<int>>,int_map:map<string,int>,int_Map_Array:array<map<string,int>>,nested_struct:struct<A:int,b:array<int>,C:struct<d:array<array<struct<E:int,F:string>>>>,g:map<string,struct<H:struct<i:array<double>>>>>>

nonnullable.orc:
struct<ID:bigint,Int_Array:array<int>,int_array_array:array<array<int>>,Int_Map:map<string,int>,int_map_array:array<map<string,int>>,nested_Struct:struct<a:int,B:array<int>,c:struct<D:array<array<struct<e:int,f:string>>>>,G:map<string,struct<h:struct<i:array<double>>>>>>
{code}

Impala currently resolves orc columns by index. We need to define the expected behavior (case sensitive or not in nested columns) when we support resolving column by names. 

> Support resolving ORC file columns by names
> -------------------------------------------
>
>                 Key: IMPALA-9410
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9410
>             Project: IMPALA
>          Issue Type: New Feature
>            Reporter: Quanlong Huang
>            Priority: Major
>
> Currently we resolve ORC file columns by indices. We should provide an query option like PARQUET_FALLBACK_SCHEMA_RESOLUTION for Parquet (IMPALA-2835), to resolve ORC file columns by names.
> Note that Hive only writes column names to ORC files after Hive-2.x (HIVE-4243). For older versions of Hive, the column names in ORC files are something like _col0, _col1,....,_col99. So this feature is only required when deployed with Hive 2+.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org