Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2023/03/27 23:02:00 UTC

[jira] [Commented] (IMPALA-11707) Wrong results when global runtime IN-list filters are applied

    [ https://issues.apache.org/jira/browse/IMPALA-11707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705709#comment-17705709 ] 

ASF subversion and git services commented on IMPALA-11707:
----------------------------------------------------------

Commit 4f284b0f15ef4db3abc24027fe3c09e6eaf870c3 in impala's branch refs/heads/branch-4.1.2 from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=4f284b0f1 ]

IMPALA-11707: Fix global runtime IN-list filter of numeric types are AlwaysFalse

Global runtime filters are published to the coordinator and then
distributed to all executors that need them. The filter is serialized and
deserialized using protobuf. While deserializing a global runtime filter
of numeric type from protobuf, the InsertBatch() method forgot to update
the total_entries_ counter. The filter is then treated as an empty
list, which rejects all files/rows.

This patch adds the missing update of total_entries_. Some DCHECKs are
added to make sure total_entries_ is consistent with the actual size of
the value set. This patch also fixes a type error (long_val -> int_val)
in ToProtobuf() of Date type IN-list filter.
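
The failure mode described above can be illustrated with a small model.
This is only a sketch in Python, not Impala's C++ code: the class
InListFilterModel, the helper clone_via_wire, the update_counter flag and
the sorted-list "wire format" are all hypothetical names invented for
illustration. It assumes, as the description suggests, that emptiness is
judged by the counter rather than by the value set itself.

```python
from dataclasses import dataclass, field


@dataclass
class InListFilterModel:
    """Toy stand-in for a numeric IN-list runtime filter (hypothetical)."""
    values: set = field(default_factory=set)
    total_entries: int = 0  # models the total_entries_ counter

    def insert(self, value):
        # Local build path: the value set and the counter move together.
        if value not in self.values:
            self.values.add(value)
            self.total_entries += 1

    def insert_batch(self, batch, update_counter=True):
        # Deserialization path. update_counter=False models the missing
        # counter update described in the commit message.
        self.values.update(batch)
        if update_counter:
            self.total_entries = len(self.values)

    def always_false(self):
        # Assumption: an "empty" filter is detected via the counter.
        return self.total_entries == 0

    def find(self, value):
        return not self.always_false() and value in self.values


def clone_via_wire(src, update_counter):
    """Models the serialize/deserialize round trip (hypothetical format)."""
    wire = sorted(src.values)
    dst = InListFilterModel()
    dst.insert_batch(wire, update_counter=update_counter)
    return dst


original = InListFilterModel()
original.insert(3)  # arbitrary key value chosen for the example

buggy = clone_via_wire(original, update_counter=False)
fixed = clone_via_wire(original, update_counter=True)

# The buggy clone holds the value but still rejects it.
print(original.find(3), buggy.find(3), fixed.find(3))
```

In this model the buggy clone contains the value yet reports itself as
always false, which mirrors a filter that rejects every file. A check
that total_entries equals the size of the value set, similar in spirit
to the DCHECKs mentioned above, would catch the inconsistency.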

Tests:
- Added BE tests to verify the filter cloned from protobuf has the same
  behavior as the original one.
- Added e2e regression tests
- Ran TestInListFilters 200 times.

Change-Id: Ie90b2bce5e5ec6f6906ce9d2090b0ab19d48cc78
Reviewed-on: http://gerrit.cloudera.org:8080/19220
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Qifan Chen <qf...@hotmail.com>


> Wrong results when global runtime IN-list filters are applied
> -------------------------------------------------------------
>
>                 Key: IMPALA-11707
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11707
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 4.1.0, Impala 4.1.1
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Critical
>              Labels: correctness
>             Fix For: Impala 4.2.0
>
>
> Found this bug when running a large-scale TPC-H benchmark. The bug can be reproduced by the following query:
> {code:sql}
> use tpch_orc_def;
> set enabled_runtime_filter_types=in_list;
> select count(*) from supplier, nation, region
> where s_nationkey = n_nationkey
>   and n_regionkey = r_regionkey
>   and r_name = 'EUROPE';{code}
> The result is 0, which is wrong. The expected result is 1987. The summary shows that the ScanNode on the "nation" table returns 0 rows:
> {noformat}
> 04:HASH JOIN                  1      1  445.629us  445.629us      0       2.00K    1.98 MB        1.94 MB  INNER JOIN, BROADCAST 
> |--07:EXCHANGE                1      1   40.466us   40.466us      1           1   16.00 KB       16.00 KB  BROADCAST             
> |  F02:EXCHANGE SENDER        1      1  217.341us  217.341us                       8.60 KB       99.20 KB                        
> |  02:SCAN HDFS               1      1    4.507ms    4.507ms      1           1  917.09 KB       96.00 MB  tpch_orc_def.region   
> 03:HASH JOIN                  1      1    2.112ms    2.112ms      0      10.00K    1.97 MB        1.94 MB  INNER JOIN, BROADCAST 
> |--06:EXCHANGE                1      1   27.803us   27.803us      0          25          0       16.00 KB  BROADCAST             
> |  F01:EXCHANGE SENDER        1      1   89.872us   89.872us                      25.59 KB       32.00 KB                        
> |  01:SCAN HDFS               1      1   12.833ms   12.833ms      0          25   32.00 KB       64.00 MB  tpch_orc_def.nation   
> 00:SCAN HDFS                  1      1  371.636us  371.636us      0      10.00K   16.00 KB       32.00 MB  tpch_orc_def.supplier {noformat}
> There is a runtime IN-list filter applied on this node:
> {noformat}
> 01:SCAN HDFS [tpch_orc_def.nation, RANDOM]
>    HDFS partitions=1/1 files=1 size=1.74KB
>    runtime filters: RF000[in_list] -> n_regionkey
>    stored statistics:
>      table: rows=25 size=1.74KB
>      columns: all 
>    extrapolated-rows=disabled max-scan-range-rows=25
>    mem-estimate=64.00MB mem-reservation=32.00KB thread-reservation=1
>    tuple-ids=1 row-size=4B cardinality=25
>    in pipelines: 01(GETNEXT){noformat}
> The filter is generated from a build side that reads the "region" table with the predicate "r_name = 'EUROPE'". Note that it's a global runtime filter generated by other impalads (not the impalad scanning the "nation" table).
> The profile shows that this filter rejects one file, which is the only file of the "nation" table.
> {noformat}
>         Filter 0 (2.00 KB):
>            - Files processed: 1 (1)
>            - Files rejected: 1 (1)
>            - Files total: 1 (1){noformat}
> This is wrong since at least 5 rows in the file should pass the filter:
> {code:java}
> impala-shell> select count(*) from nation, region where n_regionkey = r_regionkey and r_name = 'EUROPE';
> +----------+
> | count(*) |
> +----------+
> | 5        |
> +----------+{code}


