You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Tim Armstrong (Jira)" <ji...@apache.org> on 2020/06/22 15:55:00 UTC
[jira] [Commented] (IMPALA-9877) uuid() generates multiple values per row when joined from a CTE

    [ https://issues.apache.org/jira/browse/IMPALA-9877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142213#comment-17142213 ] 

Tim Armstrong commented on IMPALA-9877:
---------------------------------------

[~koke] yeah it's not necessarily that well-defined when the random functions are evaluated when they're used in inline views. This is a side-effect of how we (effectively) inline the views into the rest of the query and then evaluate uuid() after doing the join. I think that's undesirable in this case and we don't have the special handling of random functions we'd need in inline views.

So you're not missing anything and it would be desirable to fix this.

You could work around in various ways, e.g. adding a union to the subquery (this exploits a quirk in the impala planner). You could also add a distinct to the subquery, 

Something like this should work.
{noformat}
with sessions as (
    select
        uid,
        uuid() session_id,
        min(ts) session_start
    from events
    group by uid
    union all
    select uid, '', ''
    from events
    limit 0
)
select
    session_id,
    events.*
from sessions join events on sessions.uid = events.uid;
{noformat}


This is the explain plan i get running on some tables I have locally:
{noformat}
explain select v.id, v.int_col, t.id, uuid from (select id, int_col, uuid() as uuid from functional_parquet.alltypestiny UNION ALL select 0, 0, '' from functional_parquet.alltypes limit 0) v join functional_parquet.alltypessmall t on v.int_col = t.int_col;
Query: explain select v.id, v.int_col, t.id, uuid from (select id, int_col, uuid() as uuid from functional_parquet.alltypestiny UNION ALL select 0, 0, '' from functional_parquet.alltypes limit 0) v join functional_parquet.alltypessmall t on v.int_col = t.int_col
+------------------------------------------------------------------------------------+
| Explain String                                                                     |
+------------------------------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=3.97MB Threads=6                         |
| Per-Host Resource Estimates: Memory=68MB                                           |
| WARNING: The following tables are missing relevant table and/or column statistics. |
| functional_parquet.alltypessmall, functional_parquet.alltypestiny                  |
|                                                                                    |
| PLAN-ROOT SINK                                                                     |
| |                                                                                  |
| 06:EXCHANGE [UNPARTITIONED]                                                        |
| |                                                                                  |
| 03:HASH JOIN [INNER JOIN, PARTITIONED]                                             |
| |  hash predicates: int_col = t.int_col                                            |
| |  runtime filters: RF000 <- t.int_col                                             |
| |  row-size=28B cardinality=758                                                    |
| |                                                                                  |
| |--05:EXCHANGE [HASH(t.int_col)]                                                   |
| |  |                                                                               |
| |  02:SCAN HDFS [functional_parquet.alltypessmall t]                               |
| |     HDFS partitions=4/4 files=4 size=14.78KB                                     |
| |     row-size=8B cardinality=940                                                  |
| |                                                                                  |
| 04:EXCHANGE [HASH(int_col)]                                                        |
| |                                                                                  |
| 00:UNION                                                                           |
| |  row-size=20B cardinality=758                                                    |
| |                                                                                  |
| 01:SCAN HDFS [functional_parquet.alltypestiny]                                     |
|    HDFS partitions=4/4 files=4 size=11.92KB                                        |
|    runtime filters: RF000 -> functional_parquet.alltypestiny.int_col               |
|    row-size=8B cardinality=758                                                     |
+------------------------------------------------------------------------------------+
Fetched 29 row(s) in 0.04s
{noformat}

> uuid() generates multiple values per row when joined from a CTE
> ---------------------------------------------------------------
>
>                 Key: IMPALA-9877
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9877
>             Project: IMPALA
>          Issue Type: Bug
>    Affects Versions: Impala 2.12.0
>         Environment: impalad version 2.12.0-cdh5.16.1 RELEASE (build 4a3775ef6781301af81b23bca45a9faeca5e761d)
>            Reporter: Jorge Bernal
>            Priority: Major
>
> I'm not sure if I'm misunderstanding the execution model here, but I would expect that when I use uuid() on a CTE, it returns a different value for each row of the CTE, and not one value per row of the final join.  
> It's hard to explain, so here's an example:
>  
> {code:java}
> drop table if exists events;
> create table events (uid int, ts int);
> insert into events values (1,1), (2,2), (1,3);
> +-----+----+
> | uid | ts |
> +-----+----+
> | 1   | 1  |
> | 2   | 2  |
> | 1   | 3  |
> +-----+----+{code}
>  
> If you then use uuid() on a CTE, and join it with something else, the resulting values might have a different uuid for the same input row:
> {code:java}
> with sessions as (
>     select
>         uid,
>         uuid() session_id,
>         min(ts) session_start
>     from events
>     group by uid
> )
> select
>     session_id,
>     events.*
> from sessions join events on sessions.uid = events.uid;
> drop table if exists koke_events;
> +--------------------------------------+-----+----+
> | session_id                           | uid | ts |
> +--------------------------------------+-----+----+
> | 71d3671a-fa97-4c00-8d90-d1584ad7cf18 | 2   | 2  |
> | af564c03-2db3-44cb-bf5b-7748e231af72 | 1   | 3  |
> | d4bc2779-1bd9-4a5e-a5e2-dc28b00d2f34 | 1   | 1  |
> +--------------------------------------+-----+----+
> {code}
> I would expect that rows with the same {{uid}} would have the same {{session_id}} in this result.
> When I run the same query in Hive, that's the behavior:
> {code:java}
> +---------------------------------------+-------------+------------+--+
> |              session_id               | events.uid  | events.ts  |
> +---------------------------------------+-------------+------------+--+
> | e39555b2-c99b-4521-bfb4-9c4d0a5fd3cc  | 1           | 1          |
> | e39555b2-c99b-4521-bfb4-9c4d0a5fd3cc  | 1           | 3          |
> | 72f8a922-738e-4151-ba93-3b5e50eb9140  | 2           | 2          |
> +---------------------------------------+-------------+------------+--+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org