You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Tim Armstrong (Jira)" <ji...@apache.org> on 2020/09/22 16:55:00 UTC

[jira] [Updated] (IMPALA-10182) Rows with NULLs filtered out with duplicate columns in subquery select inside UNION ALL

     [ https://issues.apache.org/jira/browse/IMPALA-10182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong updated IMPALA-10182:
-----------------------------------
    Description: 
Bug report from here - https://community.cloudera.com/t5/Support-Questions/quot-union-all-quot-dropping-records-with-all-null-empty/m-p/303153#M221415

Repro:

{noformat}
create database if not exists as_adventure;
use as_adventure;
CREATE tABLE IF NOT EXISTS
    as_adventure.t1 
    ( 
        productsubcategorykey INT, 
        productline STRING);

insert into t1 values (1,'l1');
insert into t1 values (2,'l1');
insert into t1 values (1,'l2');
insert into t1 values (3,'l3');
insert into t1 values (null,'');

select * from t1; 

SELECT
    MIN(t_53.c_41)       c_41,
    CAST(NULL AS DOUBLE) c_43,
    CAST(NULL AS BIGINT) c_44,
    t_53.c2              c2,
    t_53.c3s0            c3s0,
    t_53.c4              c4,
    t_53.c5s0            c5s0
FROM
    (   SELECT
            t.productsubcategorykey c_41,
            t.productline           c2,
            t.productline           c3s0,
            t.productsubcategorykey c4,
            t.productsubcategorykey c5s0
        FROM
            as_adventure.t1 t
        WHERE
            true
        GROUP BY
            2,
            3,
            4,
            5 ) t_53
GROUP BY
    4,
    5,
    6,
    7
 
UNION ALL

SELECT
    MIN(t_53.c_41)       c_41,
    CAST(NULL AS DOUBLE) c_43,
    CAST(NULL AS BIGINT) c_44,
    t_53.c2              c2,
    t_53.c3s0            c3s0,
    t_53.c4              c4,
    t_53.c5s0            c5s0
FROM
    (   SELECT
            t.productsubcategorykey c_41,
            t.productline           c2,
            t.productline           c3s0,
            t.productsubcategorykey c4,
            t.productsubcategorykey c5s0
        FROM
            as_adventure.t1 t
        WHERE
            true
        GROUP BY
            2,
            3,
            4,
            5 ) t_53
GROUP BY
    4,
    5,
    6,
    7
{noformat}

Somewhat similar to IMPALA-7957 in that the inferred predicates from the column equivalences get placed in a Select node. It's a bit different in that the NULLs that are filtered out from the predicates come from the base table.

{noformat}
+------------------------------------------------------------------------------------------------+
| Explain String                                                                                 |
+------------------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=136.02MB Threads=6                                   |
| Per-Host Resource Estimates: Memory=576MB                                                      |
| WARNING: The following tables are missing relevant table and/or column statistics.             |
| as_adventure.t1                                                                                |
|                                                                                                |
| PLAN-ROOT SINK                                                                                 |
| |                                                                                              |
| 13:EXCHANGE [UNPARTITIONED]                                                                    |
| |                                                                                              |
| 00:UNION                                                                                       |
| |  row-size=52B cardinality=2                                                                  |
| |                                                                                              |
| |--08:SELECT                                                                                   |
| |  |  predicates: t_53.c4 = t_53.c5s0, t_53.c2 = t_53.c3s0                                     |
| |  |  row-size=36B cardinality=1                                                               |
| |  |                                                                                           |
| |  07:AGGREGATE [FINALIZE]                                                                     |
| |  |  output: min(t.productsubcategorykey)                                                     |
| |  |  group by: t.productline, t.productline, t.productsubcategorykey, t.productsubcategorykey |
| |  |  row-size=36B cardinality=2                                                               |
| |  |                                                                                           |
| |  12:AGGREGATE [FINALIZE]                                                                     |
| |  |  group by: t.productline, t.productsubcategorykey                                         |
| |  |  row-size=16B cardinality=2                                                               |
| |  |                                                                                           |
| |  11:EXCHANGE [HASH(t.productline,t.productsubcategorykey)]                                   |
| |  |                                                                                           |
| |  06:AGGREGATE [STREAMING]                                                                    |
| |  |  group by: t.productline, t.productsubcategorykey                                         |
| |  |  row-size=16B cardinality=2                                                               |
| |  |                                                                                           |
| |  05:SCAN HDFS [as_adventure.t1 t]                                                            |
| |     HDFS partitions=1/1 files=5 size=24B                                                     |
| |     row-size=16B cardinality=2                                                               |
| |                                                                                              |
| 04:SELECT                                                                                      |
| |  predicates: t_53.c4 = t_53.c5s0, t_53.c2 = t_53.c3s0                                        |
| |  row-size=36B cardinality=1                                                                  |
| |                                                                                              |
| 03:AGGREGATE [FINALIZE]                                                                        |
| |  output: min(t.productsubcategorykey)                                                        |
| |  group by: t.productline, t.productline, t.productsubcategorykey, t.productsubcategorykey    |
| |  row-size=36B cardinality=2                                                                  |
| |                                                                                              |
| 10:AGGREGATE [FINALIZE]                                                                        |
| |  group by: t.productline, t.productsubcategorykey                                            |
| |  row-size=16B cardinality=2                                                                  |
| |                                                                                              |
| 09:EXCHANGE [HASH(t.productline,t.productsubcategorykey)]                                      |
| |                                                                                              |
| 02:AGGREGATE [STREAMING]                                                                       |
| |  group by: t.productline, t.productsubcategorykey                                            |
| |  row-size=16B cardinality=2                                                                  |
| |                                                                                              |
| 01:SCAN HDFS [as_adventure.t1 t]                                                               |
|    HDFS partitions=1/1 files=5 size=24B                                                        |
|    row-size=16B cardinality=2                                                                  |
+------------------------------------------------------------------------------------------------+

{noformat}

Here are some more related issues:

{noformat}
SELECT
    t_53.c4              c4,
    t_53.c5s0            c5s0
FROM
    (   SELECT
            t.productsubcategorykey c4,
            t.productsubcategorykey c5s0
        FROM
            as_adventure.t1 t
        GROUP BY
            1, 2 ) t_53
GROUP BY 1, 2;
{noformat}
results in a bogus predicate in the scan node:
{noformat}
| 00:SCAN HDFS [as_adventure.t1 t, RANDOM]                                                  |
|    HDFS partitions=1/1 files=5 size=24B                                                   |
|    predicates: t.productsubcategorykey = t.productsubcategorykey                          |
|    stored statistics:                                                                     |
|      table: rows=unavailable size=unavailable                                             |
|      columns: unavailable                                                                 |
|    extrapolated-rows=disabled max-scan-range-rows=unavailable                             |
|    mem-estimate=32.00MB mem-reservation=8.00KB thread-reservation=1                       |
|    tuple-ids=0 row-size=4B cardinality=1                                                  |
|    in pipelines: 00(GETNEXT)                                                              |
{noformat}

This query is a simplified version of the original one:
{noformat}
SELECT
    t_53.c4              c4,
    t_53.c5s0            c5s0
FROM
    (   SELECT
            t.productsubcategorykey c4,
            t.productsubcategorykey c5s0
        FROM
            as_adventure.t1 t
        GROUP BY
            1,
            2 ) t_53
GROUP BY 1,2

UNION ALL

SELECT
    t_53.c4              c4,
    t_53.c5s0            c5s0
FROM
    (   SELECT
            t.productsubcategorykey c4,
            t.productsubcategorykey c5s0
        FROM
            as_adventure.t1 t
        GROUP BY
            1, 2 ) t_53
GROUP BY 1, 2;
{noformat}

  was:
Bug report from here - https://community.cloudera.com/t5/Support-Questions/quot-union-all-quot-dropping-records-with-all-null-empty/m-p/303153#M221415

Repro:

{noformat}
create database if not exists as_adventure;
use as_adventure;
CREATE tABLE IF NOT EXISTS
    as_adventure.t1 
    ( 
        productsubcategorykey INT, 
        productline STRING);

insert into t1 values (1,'l1');
insert into t1 values (2,'l1');
insert into t1 values (1,'l2');
insert into t1 values (3,'l3');
insert into t1 values (null,'');

select * from t1; 

SELECT
    MIN(t_53.c_41)       c_41,
    CAST(NULL AS DOUBLE) c_43,
    CAST(NULL AS BIGINT) c_44,
    t_53.c2              c2,
    t_53.c3s0            c3s0,
    t_53.c4              c4,
    t_53.c5s0            c5s0
FROM
    (   SELECT
            t.productsubcategorykey c_41,
            t.productline           c2,
            t.productline           c3s0,
            t.productsubcategorykey c4,
            t.productsubcategorykey c5s0
        FROM
            as_adventure.t1 t
        WHERE
            true
        GROUP BY
            2,
            3,
            4,
            5 ) t_53
GROUP BY
    4,
    5,
    6,
    7
 
UNION ALL

SELECT
    MIN(t_53.c_41)       c_41,
    CAST(NULL AS DOUBLE) c_43,
    CAST(NULL AS BIGINT) c_44,
    t_53.c2              c2,
    t_53.c3s0            c3s0,
    t_53.c4              c4,
    t_53.c5s0            c5s0
FROM
    (   SELECT
            t.productsubcategorykey c_41,
            t.productline           c2,
            t.productline           c3s0,
            t.productsubcategorykey c4,
            t.productsubcategorykey c5s0
        FROM
            as_adventure.t1 t
        WHERE
            true
        GROUP BY
            2,
            3,
            4,
            5 ) t_53
GROUP BY
    4,
    5,
    6,
    7
{noformat}

Somewhat similar to IMPALA-7957 in that the inferred predicates from the column equivalences get placed in a Select node. It's a bit different in that the NULLs that are filtered out from the predicates come from the base table.

{noformat}
+------------------------------------------------------------------------------------------------+
| Explain String                                                                                 |
+------------------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=136.02MB Threads=6                                   |
| Per-Host Resource Estimates: Memory=576MB                                                      |
| WARNING: The following tables are missing relevant table and/or column statistics.             |
| as_adventure.t1                                                                                |
|                                                                                                |
| PLAN-ROOT SINK                                                                                 |
| |                                                                                              |
| 13:EXCHANGE [UNPARTITIONED]                                                                    |
| |                                                                                              |
| 00:UNION                                                                                       |
| |  row-size=52B cardinality=2                                                                  |
| |                                                                                              |
| |--08:SELECT                                                                                   |
| |  |  predicates: t_53.c4 = t_53.c5s0, t_53.c2 = t_53.c3s0                                     |
| |  |  row-size=36B cardinality=1                                                               |
| |  |                                                                                           |
| |  07:AGGREGATE [FINALIZE]                                                                     |
| |  |  output: min(t.productsubcategorykey)                                                     |
| |  |  group by: t.productline, t.productline, t.productsubcategorykey, t.productsubcategorykey |
| |  |  row-size=36B cardinality=2                                                               |
| |  |                                                                                           |
| |  12:AGGREGATE [FINALIZE]                                                                     |
| |  |  group by: t.productline, t.productsubcategorykey                                         |
| |  |  row-size=16B cardinality=2                                                               |
| |  |                                                                                           |
| |  11:EXCHANGE [HASH(t.productline,t.productsubcategorykey)]                                   |
| |  |                                                                                           |
| |  06:AGGREGATE [STREAMING]                                                                    |
| |  |  group by: t.productline, t.productsubcategorykey                                         |
| |  |  row-size=16B cardinality=2                                                               |
| |  |                                                                                           |
| |  05:SCAN HDFS [as_adventure.t1 t]                                                            |
| |     HDFS partitions=1/1 files=5 size=24B                                                     |
| |     row-size=16B cardinality=2                                                               |
| |                                                                                              |
| 04:SELECT                                                                                      |
| |  predicates: t_53.c4 = t_53.c5s0, t_53.c2 = t_53.c3s0                                        |
| |  row-size=36B cardinality=1                                                                  |
| |                                                                                              |
| 03:AGGREGATE [FINALIZE]                                                                        |
| |  output: min(t.productsubcategorykey)                                                        |
| |  group by: t.productline, t.productline, t.productsubcategorykey, t.productsubcategorykey    |
| |  row-size=36B cardinality=2                                                                  |
| |                                                                                              |
| 10:AGGREGATE [FINALIZE]                                                                        |
| |  group by: t.productline, t.productsubcategorykey                                            |
| |  row-size=16B cardinality=2                                                                  |
| |                                                                                              |
| 09:EXCHANGE [HASH(t.productline,t.productsubcategorykey)]                                      |
| |                                                                                              |
| 02:AGGREGATE [STREAMING]                                                                       |
| |  group by: t.productline, t.productsubcategorykey                                            |
| |  row-size=16B cardinality=2                                                                  |
| |                                                                                              |
| 01:SCAN HDFS [as_adventure.t1 t]                                                               |
|    HDFS partitions=1/1 files=5 size=24B                                                        |
|    row-size=16B cardinality=2                                                                  |
+------------------------------------------------------------------------------------------------+

{noformat}

Here are some more related issues:

{noformat}
SELECT
    t_53.c4              c4,
    t_53.c5s0            c5s0
FROM
    (   SELECT
            t.productsubcategorykey c4,
            t.productsubcategorykey c5s0
        FROM
            as_adventure.t1 t
        GROUP BY
            1, 2 ) t_53
GROUP BY 1, 2;
{noformat}
results in a bogus predicate in the scan node:
{noformat}
| 00:SCAN HDFS [as_adventure.t1 t, RANDOM]                                                  |
|    HDFS partitions=1/1 files=5 size=24B                                                   |
|    predicates: t.productsubcategorykey = t.productsubcategorykey                          |
|    stored statistics:                                                                     |
|      table: rows=unavailable size=unavailable                                             |
|      columns: unavailable                                                                 |
|    extrapolated-rows=disabled max-scan-range-rows=unavailable                             |
|    mem-estimate=32.00MB mem-reservation=8.00KB thread-reservation=1                       |
|    tuple-ids=0 row-size=4B cardinality=1                                                  |
|    in pipelines: 00(GETNEXT)                                                              |
{noformat}


> Rows with NULLs filtered out with duplicate columns in subquery select inside UNION ALL
> ---------------------------------------------------------------------------------------
>
>                 Key: IMPALA-10182
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10182
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>            Reporter: Tim Armstrong
>            Assignee: Aman Sinha
>            Priority: Blocker
>              Labels: correctness
>
> Bug report from here - https://community.cloudera.com/t5/Support-Questions/quot-union-all-quot-dropping-records-with-all-null-empty/m-p/303153#M221415
> Repro:
> {noformat}
> create database if not exists as_adventure;
> use as_adventure;
> CREATE tABLE IF NOT EXISTS
>     as_adventure.t1 
>     ( 
>         productsubcategorykey INT, 
>         productline STRING);
> insert into t1 values (1,'l1');
> insert into t1 values (2,'l1');
> insert into t1 values (1,'l2');
> insert into t1 values (3,'l3');
> insert into t1 values (null,'');
> select * from t1; 
> SELECT
>     MIN(t_53.c_41)       c_41,
>     CAST(NULL AS DOUBLE) c_43,
>     CAST(NULL AS BIGINT) c_44,
>     t_53.c2              c2,
>     t_53.c3s0            c3s0,
>     t_53.c4              c4,
>     t_53.c5s0            c5s0
> FROM
>     (   SELECT
>             t.productsubcategorykey c_41,
>             t.productline           c2,
>             t.productline           c3s0,
>             t.productsubcategorykey c4,
>             t.productsubcategorykey c5s0
>         FROM
>             as_adventure.t1 t
>         WHERE
>             true
>         GROUP BY
>             2,
>             3,
>             4,
>             5 ) t_53
> GROUP BY
>     4,
>     5,
>     6,
>     7
>  
> UNION ALL
> SELECT
>     MIN(t_53.c_41)       c_41,
>     CAST(NULL AS DOUBLE) c_43,
>     CAST(NULL AS BIGINT) c_44,
>     t_53.c2              c2,
>     t_53.c3s0            c3s0,
>     t_53.c4              c4,
>     t_53.c5s0            c5s0
> FROM
>     (   SELECT
>             t.productsubcategorykey c_41,
>             t.productline           c2,
>             t.productline           c3s0,
>             t.productsubcategorykey c4,
>             t.productsubcategorykey c5s0
>         FROM
>             as_adventure.t1 t
>         WHERE
>             true
>         GROUP BY
>             2,
>             3,
>             4,
>             5 ) t_53
> GROUP BY
>     4,
>     5,
>     6,
>     7
> {noformat}
> Somewhat similar to IMPALA-7957 in that the inferred predicates from the column equivalences get placed in a Select node. It's a bit different in that the NULLs that are filtered out from the predicates come from the base table.
> {noformat}
> +------------------------------------------------------------------------------------------------+
> | Explain String                                                                                 |
> +------------------------------------------------------------------------------------------------+
> | Max Per-Host Resource Reservation: Memory=136.02MB Threads=6                                   |
> | Per-Host Resource Estimates: Memory=576MB                                                      |
> | WARNING: The following tables are missing relevant table and/or column statistics.             |
> | as_adventure.t1                                                                                |
> |                                                                                                |
> | PLAN-ROOT SINK                                                                                 |
> | |                                                                                              |
> | 13:EXCHANGE [UNPARTITIONED]                                                                    |
> | |                                                                                              |
> | 00:UNION                                                                                       |
> | |  row-size=52B cardinality=2                                                                  |
> | |                                                                                              |
> | |--08:SELECT                                                                                   |
> | |  |  predicates: t_53.c4 = t_53.c5s0, t_53.c2 = t_53.c3s0                                     |
> | |  |  row-size=36B cardinality=1                                                               |
> | |  |                                                                                           |
> | |  07:AGGREGATE [FINALIZE]                                                                     |
> | |  |  output: min(t.productsubcategorykey)                                                     |
> | |  |  group by: t.productline, t.productline, t.productsubcategorykey, t.productsubcategorykey |
> | |  |  row-size=36B cardinality=2                                                               |
> | |  |                                                                                           |
> | |  12:AGGREGATE [FINALIZE]                                                                     |
> | |  |  group by: t.productline, t.productsubcategorykey                                         |
> | |  |  row-size=16B cardinality=2                                                               |
> | |  |                                                                                           |
> | |  11:EXCHANGE [HASH(t.productline,t.productsubcategorykey)]                                   |
> | |  |                                                                                           |
> | |  06:AGGREGATE [STREAMING]                                                                    |
> | |  |  group by: t.productline, t.productsubcategorykey                                         |
> | |  |  row-size=16B cardinality=2                                                               |
> | |  |                                                                                           |
> | |  05:SCAN HDFS [as_adventure.t1 t]                                                            |
> | |     HDFS partitions=1/1 files=5 size=24B                                                     |
> | |     row-size=16B cardinality=2                                                               |
> | |                                                                                              |
> | 04:SELECT                                                                                      |
> | |  predicates: t_53.c4 = t_53.c5s0, t_53.c2 = t_53.c3s0                                        |
> | |  row-size=36B cardinality=1                                                                  |
> | |                                                                                              |
> | 03:AGGREGATE [FINALIZE]                                                                        |
> | |  output: min(t.productsubcategorykey)                                                        |
> | |  group by: t.productline, t.productline, t.productsubcategorykey, t.productsubcategorykey    |
> | |  row-size=36B cardinality=2                                                                  |
> | |                                                                                              |
> | 10:AGGREGATE [FINALIZE]                                                                        |
> | |  group by: t.productline, t.productsubcategorykey                                            |
> | |  row-size=16B cardinality=2                                                                  |
> | |                                                                                              |
> | 09:EXCHANGE [HASH(t.productline,t.productsubcategorykey)]                                      |
> | |                                                                                              |
> | 02:AGGREGATE [STREAMING]                                                                       |
> | |  group by: t.productline, t.productsubcategorykey                                            |
> | |  row-size=16B cardinality=2                                                                  |
> | |                                                                                              |
> | 01:SCAN HDFS [as_adventure.t1 t]                                                               |
> |    HDFS partitions=1/1 files=5 size=24B                                                        |
> |    row-size=16B cardinality=2                                                                  |
> +------------------------------------------------------------------------------------------------+
> {noformat}
> Here are some more related issues:
> {noformat}
> SELECT
>     t_53.c4              c4,
>     t_53.c5s0            c5s0
> FROM
>     (   SELECT
>             t.productsubcategorykey c4,
>             t.productsubcategorykey c5s0
>         FROM
>             as_adventure.t1 t
>         GROUP BY
>             1, 2 ) t_53
> GROUP BY 1, 2;
> {noformat}
> results in a bogus predicate in the scan node:
> {noformat}
> | 00:SCAN HDFS [as_adventure.t1 t, RANDOM]                                                  |
> |    HDFS partitions=1/1 files=5 size=24B                                                   |
> |    predicates: t.productsubcategorykey = t.productsubcategorykey                          |
> |    stored statistics:                                                                     |
> |      table: rows=unavailable size=unavailable                                             |
> |      columns: unavailable                                                                 |
> |    extrapolated-rows=disabled max-scan-range-rows=unavailable                             |
> |    mem-estimate=32.00MB mem-reservation=8.00KB thread-reservation=1                       |
> |    tuple-ids=0 row-size=4B cardinality=1                                                  |
> |    in pipelines: 00(GETNEXT)                                                              |
> {noformat}
> This query is a simplified version of the original one:
> {noformat}
> SELECT
>     t_53.c4              c4,
>     t_53.c5s0            c5s0
> FROM
>     (   SELECT
>             t.productsubcategorykey c4,
>             t.productsubcategorykey c5s0
>         FROM
>             as_adventure.t1 t
>         GROUP BY
>             1,
>             2 ) t_53
> GROUP BY 1,2
> UNION ALL
> SELECT
>     t_53.c4              c4,
>     t_53.c5s0            c5s0
> FROM
>     (   SELECT
>             t.productsubcategorykey c4,
>             t.productsubcategorykey c5s0
>         FROM
>             as_adventure.t1 t
>         GROUP BY
>             1, 2 ) t_53
> GROUP BY 1, 2;
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org