You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@impala.apache.org by "Matthew Jacobs (JIRA)" <ji...@apache.org> on 2017/08/02 19:36:00 UTC

[jira] [Resolved] (IMPALA-2826) Outer join w/ old HJ returns wrong results if 0 outer tbl cols referenced

     [ https://issues.apache.org/jira/browse/IMPALA-2826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthew Jacobs resolved IMPALA-2826.
------------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.10.0

Old HJ removed in IMPALA-4674

> Outer join w/ old HJ returns wrong results if 0 outer tbl cols referenced
> -------------------------------------------------------------------------
>
>                 Key: IMPALA-2826
>                 URL: https://issues.apache.org/jira/browse/IMPALA-2826
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 2.3.0
>            Reporter: Matthew Jacobs
>              Labels: correctness, downgraded
>             Fix For: Impala 2.10.0
>
>
> We are not handling null tuples correctly in the old HJ for outer joins when there are no columns referenced from the outer table.
> For example:
> {code}
> select t1.int_col, t3.c1
> from alltypestiny t1
> left outer join
> (select 0 as c1 from alltypestiny) t3
> on t1.int_col = t3.c1 limit 10
> +---------+------+
> | int_col | c1   |
> +---------+------+
> | 0       | NULL |
> | 0       | NULL |
> | 0       | NULL |
> | 0       | NULL |
> | 0       | NULL |
> | 0       | NULL |
> | 0       | NULL |
> | 0       | NULL |
> | 1       | NULL |
> | 0       | NULL |
> +---------+------+
> Fetched 10 row(s) in 0.12s
> {code}
> When the right side of the left outer join has no columns referenced, it produces a null tuple which doesn't work with the plan generated for left outer joins as the planner sets the build expr to if(TupleIsNull(), NULL, 0).
> The reason that this works with the PHJ is that the BufferedTupleStream returns a non-null (but still 0 slots) Tuple for the build rows since the tuple isn't supposed to be nullable (see node 3 w/ non-nullable tupleId 1 in the plan below).
> {code}
> explain select t1.int_col, t3.c1
> from alltypestiny t1
> left outer join
> (select 0 as c1 from alltypestiny) t3
> on t1.int_col = t3.c1
> +---------------------------------------------------------------+
> | Explain String                                                |
> +---------------------------------------------------------------+
> | Estimated Per-Host Requirements: Memory=80.00MB VCores=2      |
> |                                                               |
> | F02:PLAN FRAGMENT [UNPARTITIONED]                             |
> |   04:EXCHANGE [UNPARTITIONED]                                 |
> |      hosts=1 per-host-mem=unavailable                         |
> |      tuple-ids=0,1N row-size=4B cardinality=8                 |
> |                                                               |
> | F00:PLAN FRAGMENT [RANDOM]                                    |
> |   DATASTREAM SINK [FRAGMENT=F02, EXCHANGE=04, UNPARTITIONED]  |
> |   02:HASH JOIN [LEFT OUTER JOIN, BROADCAST]                   |
> |   |  hash predicates: t1.int_col = if(TupleIsNull(), NULL, 0) |
> |   |  hosts=1 per-host-mem=0B                                  |
> |   |  tuple-ids=0,1N row-size=4B cardinality=8                 |
> |   |                                                           |
> |   |--03:EXCHANGE [BROADCAST]                                  |
> |   |     hosts=1 per-host-mem=0B                               |
> |   |     tuple-ids=1 row-size=0B cardinality=8                 |
> |   |                                                           |
> |   00:SCAN HDFS [functional.alltypestiny t1, RANDOM]           |
> |      partitions=4/4 files=4 size=460B                         |
> |      table stats: 8 rows total                                |
> |      column stats: all                                        |
> |      hosts=1 per-host-mem=80.00MB                             |
> |      tuple-ids=0 row-size=4B cardinality=8                    |
> |                                                               |
> | F01:PLAN FRAGMENT [RANDOM]                                    |
> |   DATASTREAM SINK [FRAGMENT=F00, EXCHANGE=03, BROADCAST]      |
> |   01:SCAN HDFS [functional.alltypestiny, RANDOM]              |
> |      partitions=4/4 files=4 size=460B                         |
> |      table stats: 8 rows total                                |
> |      column stats: all                                        |
> |      hosts=1 per-host-mem=80.00MB                             |
> |      tuple-ids=1 row-size=0B cardinality=8                    |
> +---------------------------------------------------------------+
> Fetched 33 row(s) in 0.02s
> {code}
> We're confusing what it means to be NULL. It's strange that a tuple that is not supposed to be nullable can be NULL. There are a few things we can consider doing here:
> 1) Not emitting NULL tuples when the tuple isn't supposed to be nullable. This may have potential perf implications, but it seems like it would really make things easier to reason about.
> 2) Special NULL handling in cases like this. We found this issue here, but there may be others.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)