You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Jorge Bernal (Jira)" <ji...@apache.org> on 2020/06/22 09:27:00 UTC

[jira] [Updated] (IMPALA-9877) uuid() generates multiple values per row when joined from a CTE

     [ https://issues.apache.org/jira/browse/IMPALA-9877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Bernal updated IMPALA-9877:
---------------------------------
    Description: 
I'm not sure if I'm misunderstanding the execution model here, but I would expect that when I use uuid() on a CTE, it returns a different value for each row of the CTE, and not one value per row of the final join.  

It's hard to explain, so here's an example:

 
{code:java}
drop table if exists events;
create table events (uid int, ts int);
insert into events values (1,1), (2,2), (1,3);
+-----+----+
| uid | ts |
+-----+----+
| 1   | 1  |
| 2   | 2  |
| 1   | 3  |
+-----+----+{code}
 

If you then use uuid() on a CTE, and join it with something else, the resulting values might have a different uuid for the same input row:
{code:java}
with sessions as (
    select
        uid,
        uuid() session_id,
        min(ts) session_start
    from events
    group by uid
)
select
    session_id,
    events.*
from sessions join events on sessions.uid = events.uid;
drop table if exists koke_events;
+--------------------------------------+-----+----+
| session_id                           | uid | ts |
+--------------------------------------+-----+----+
| 71d3671a-fa97-4c00-8d90-d1584ad7cf18 | 2   | 2  |
| af564c03-2db3-44cb-bf5b-7748e231af72 | 1   | 3  |
| d4bc2779-1bd9-4a5e-a5e2-dc28b00d2f34 | 1   | 1  |
+--------------------------------------+-----+----+
{code}
I would expect that rows with the same {{uid}} would have the same {{session_id}} in this result.

When I run the same query in Hive, that's the behavior:
{code:java}
+---------------------------------------+-------------+------------+--+
|              session_id               | events.uid  | events.ts  |
+---------------------------------------+-------------+------------+--+
| e39555b2-c99b-4521-bfb4-9c4d0a5fd3cc  | 1           | 1          |
| e39555b2-c99b-4521-bfb4-9c4d0a5fd3cc  | 1           | 3          |
| 72f8a922-738e-4151-ba93-3b5e50eb9140  | 2           | 2          |
+---------------------------------------+-------------+------------+--+
{code}
 

  was:
I'm not sure if I'm misunderstanding the execution model here, but I would expect that when I use uuid() on a CTE, it returns a different value for each row of the CTE, and not one value per row of the final join.  

It's hard to explain, so here's an example:

 
{code:java}
drop table if exists events;
create table events (uid int, ts int);
insert into events values (1,1), (2,2), (1,3);
+-----+----+
| uid | ts |
+-----+----+
| 1 | 1 |
| 2 | 2 |
| 1 | 3 |
+-----+----+{code}
 

If you then use uuid() on a CTE, and join it with something else, the resulting values might have a different uuid for the same input row:
{code:java}
with sessions as (
    select
        uid,
        uuid() session_id,
        min(ts) session_start
    from events
    group by uid
)
select
    session_id,
    events.*
from sessions join events on sessions.uid = events.uid;
drop table if exists koke_events;
+--------------------------------------+-----+----+
| session_id                           | uid | ts |
+--------------------------------------+-----+----+
| 71d3671a-fa97-4c00-8d90-d1584ad7cf18 | 2   | 2  |
| af564c03-2db3-44cb-bf5b-7748e231af72 | 1   | 3  |
| d4bc2779-1bd9-4a5e-a5e2-dc28b00d2f34 | 1   | 1  |
+--------------------------------------+-----+----+
{code}
I would expect that rows with the same {{uid}} would have the same {{session_id}} in this result.

When I run the same query in Hive, that's the behavior:
{code:java}
+---------------------------------------+-------------+------------+--+
|              session_id               | events.uid  | events.ts  |
+---------------------------------------+-------------+------------+--+
| e39555b2-c99b-4521-bfb4-9c4d0a5fd3cc  | 1           | 1          |
| e39555b2-c99b-4521-bfb4-9c4d0a5fd3cc  | 1           | 3          |
| 72f8a922-738e-4151-ba93-3b5e50eb9140  | 2           | 2          |
+---------------------------------------+-------------+------------+--+
{code}
 


> uuid() generates multiple values per row when joined from a CTE
> ---------------------------------------------------------------
>
>                 Key: IMPALA-9877
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9877
>             Project: IMPALA
>          Issue Type: Bug
>    Affects Versions: Impala 2.12.0
>         Environment: impalad version 2.12.0-cdh5.16.1 RELEASE (build 4a3775ef6781301af81b23bca45a9faeca5e761d)
>            Reporter: Jorge Bernal
>            Priority: Major
>
> I'm not sure if I'm misunderstanding the execution model here, but I would expect that when I use uuid() on a CTE, it returns a different value for each row of the CTE, and not one value per row of the final join.  
> It's hard to explain, so here's an example:
>  
> {code:java}
> drop table if exists events;
> create table events (uid int, ts int);
> insert into events values (1,1), (2,2), (1,3);
> +-----+----+
> | uid | ts |
> +-----+----+
> | 1   | 1  |
> | 2   | 2  |
> | 1   | 3  |
> +-----+----+{code}
>  
> If you then use uuid() on a CTE, and join it with something else, the resulting values might have a different uuid for the same input row:
> {code:java}
> with sessions as (
>     select
>         uid,
>         uuid() session_id,
>         min(ts) session_start
>     from events
>     group by uid
> )
> select
>     session_id,
>     events.*
> from sessions join events on sessions.uid = events.uid;
> drop table if exists koke_events;
> +--------------------------------------+-----+----+
> | session_id                           | uid | ts |
> +--------------------------------------+-----+----+
> | 71d3671a-fa97-4c00-8d90-d1584ad7cf18 | 2   | 2  |
> | af564c03-2db3-44cb-bf5b-7748e231af72 | 1   | 3  |
> | d4bc2779-1bd9-4a5e-a5e2-dc28b00d2f34 | 1   | 1  |
> +--------------------------------------+-----+----+
> {code}
> I would expect that rows with the same {{uid}} would have the same {{session_id}} in this result.
> When I run the same query in Hive, that's the behavior:
> {code:java}
> +---------------------------------------+-------------+------------+--+
> |              session_id               | events.uid  | events.ts  |
> +---------------------------------------+-------------+------------+--+
> | e39555b2-c99b-4521-bfb4-9c4d0a5fd3cc  | 1           | 1          |
> | e39555b2-c99b-4521-bfb4-9c4d0a5fd3cc  | 1           | 3          |
> | 72f8a922-738e-4151-ba93-3b5e50eb9140  | 2           | 2          |
> +---------------------------------------+-------------+------------+--+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org