Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2022/06/10 13:16:00 UTC

[jira] [Commented] (IMPALA-11346) Migrated partitioned Iceberg tables might return ERROR when WHERE condition is used on partition column

    [ https://issues.apache.org/jira/browse/IMPALA-11346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552744#comment-17552744 ] 

ASF subversion and git services commented on IMPALA-11346:
----------------------------------------------------------

Commit 1a1536bd1d0162a168877a6f33dd75f9544a82f3 in impala's branch refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=1a1536bd1 ]

IMPALA-11346: Migrated partitioned Iceberg tables might return ERROR when WHERE condition is used on partition column

Identity-partitioned columns are not necessarily stored in the data
files. E.g. when we migrate a legacy partitioned table to Iceberg
without rewriting the data files, the partition columns won't be
present in the files.
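
To illustrate the point above, here is a minimal sketch (not Impala's actual code) of how a reader might fill in an identity-partitioned column from partition metadata when the data file does not store it. The function name `read_rows` and the dictionary-based row and partition representations are hypothetical and chosen only for this example.

```python
from typing import Any, Dict, List


def read_rows(
    file_rows: List[Dict[str, Any]],
    partition_values: Dict[str, Any],
    wanted_columns: List[str],
) -> List[Dict[str, Any]]:
    """Return rows restricted to wanted_columns.

    A value is taken from the data file when the column is stored there,
    and from the (hypothetical) partition metadata otherwise.
    """
    result = []
    for row in file_rows:
        out = {}
        for col in wanted_columns:
            if col in row:
                # Column is physically present in the data file.
                out[col] = row[col]
            elif col in partition_values:
                # Identity-partitioned column missing from the file:
                # every row in this file shares the partition's value.
                out[col] = partition_values[col]
            else:
                raise KeyError(f"column {col!r} not found in file or partition")
        result.append(out)
    return result


# A migrated file that stores only 'i'; p_bool and p_int come from metadata.
rows = read_rows([{"i": 1}], {"p_bool": True, "p_int": 1}, ["i", "p_bool", "p_int"])
```

Because the file may or may not contain the partition column, the sketch checks the file first and falls back to the partition value only when the column is absent.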

The Parquet scanner applies several optimizations to eliminate row groups,
e.g. filtering based on stats, bloom filters, etc. When a column that has
a predicate on it is not present in the data file, the scanner assumes
that the whole row group doesn't pass the filtering criteria.

But for Iceberg, some files might contain partition columns while
other files don't, so we need to prepare the scanners to handle
such cases.
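
The decision described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this commit: `EqPredicate`, `can_skip_row_group`, the min/max stats dictionary, and the `partition_values` dictionary are all hypothetical names. It shows an equality predicate being checked against file stats when the column is stored in the file, against the partition's constant value when the column is an identity partition column absent from the file, and falling back to the assumption described above (the row group fails the filter) only when the column is missing from both.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass
class EqPredicate:
    """Hypothetical representation of a `column = value` predicate."""
    column: str
    value: Any


def can_skip_row_group(
    pred: EqPredicate,
    column_stats: Dict[str, Tuple[Any, Any]],
    partition_values: Dict[str, Any],
) -> bool:
    """Return True if the row group cannot contain a matching row."""
    stats = column_stats.get(pred.column)
    if stats is not None:
        # Column is stored in the file: use its min/max statistics.
        lo, hi = stats
        return not (lo <= pred.value <= hi)
    if pred.column in partition_values:
        # Identity-partitioned column not stored in the file: compare
        # the predicate against the partition's constant value instead.
        return partition_values[pred.column] != pred.value
    # Column is absent from both file and partition metadata: keep the
    # assumption that the row group does not pass the filter.
    return True


# File stores only 'i' (min = max = 1); p_int = 1 comes from the partition.
stats = {"i": (1, 1)}
part = {"p_int": 1}
keep = not can_skip_row_group(EqPredicate("p_int", 1), stats, part)
```

With this shape, `where p_int=1` keeps the row group and `where p_int=3` skips it, regardless of whether `p_int` is physically stored in the file.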

The ORC scanner doesn't have that many optimizations, so it didn't
run into this issue.

Testing:
 * e2e tests

Change-Id: Ie706317888981f634d792fb570f3eab1ec11a4f4
Reviewed-on: http://gerrit.cloudera.org:8080/18605
Reviewed-by: Csaba Ringhofer <cs...@cloudera.com>
Reviewed-by: Tamas Mate <tm...@apache.org>
Reviewed-by: <li...@sensorsdata.cn>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Migrated partitioned Iceberg tables might return ERROR when WHERE condition is used on partition column
> -------------------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-11346
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11346
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>              Labels: impala-iceberg
>
> {noformat}
> [localhost:21050] default> select * from functional_parquet.iceberg_alltypes_part where p_bool=false;
> Fetched 0 row(s) in 0.11s
> [localhost:21050] default> select * from functional_parquet.iceberg_alltypes_part where p_bool=true;
> ERROR: Unable to find SchemaNode for path 'functional_parquet.iceberg_alltypes_part.p_bool' in the schema of file 'hdfs://localhost:20500/test-warehouse/iceberg_test/hadoop_catalog/ice/iceberg_alltypes_part/p_bool=true/p_int=1/p_bigint=11/p_float=1.1/p_double=2.222/p_decimal=123.321/p_date=2022-02-22/p_string=impala/000000_0'.
> [localhost:21050] default> select * from functional_parquet.iceberg_alltypes_part where i=3;
> Fetched 0 row(s) in 0.12s
> [localhost:21050] default> select * from functional_parquet.iceberg_alltypes_part where i=1;
> +---+--------+-------+----------+---------------+----------+-----------+------------+----------+
> | i | p_bool | p_int | p_bigint | p_float       | p_double | p_decimal | p_date     | p_string |
> +---+--------+-------+----------+---------------+----------+-----------+------------+----------+
> | 1 | true   | 1     | 11       | 1.10000002384 | 2.222    | 123.321   | 2022-02-22 | impala   |
> +---+--------+-------+----------+---------------+----------+-----------+------------+----------+
> Fetched 1 row(s) in 0.12s
> [localhost:21050] default> select * from functional_parquet.iceberg_alltypes_part where p_int=1;
> ERROR: Unable to find SchemaNode for path 'functional_parquet.iceberg_alltypes_part.p_int' in the schema of file 'hdfs://localhost:20500/test-warehouse/iceberg_test/hadoop_catalog/ice/iceberg_alltypes_part/p_bool=true/p_int=1/p_bigint=11/p_float=1.1/p_double=2.222/p_decimal=123.321/p_date=2022-02-22/p_string=impala/000000_0'.
> [localhost:21050] default> select * from functional_parquet.iceberg_alltypes_part where p_int=3;
> Fetched 0 row(s) in 0.11s{noformat}
> So at least we don't get incorrect results, but we do get errors when filtering on partition column values that exist.
> It seems like it works well with ORC.


