Posted to issues@impala.apache.org by "Jinchul Kim (JIRA)" <ji...@apache.org> on 2017/10/27 08:20:00 UTC
[jira] [Created] (IMPALA-6117) Parallelization of sub execution plan makes incorrectness

Jinchul Kim created IMPALA-6117:
-----------------------------------

             Summary: Parallelization of sub execution plan makes incorrectness
                 Key: IMPALA-6117
                 URL: https://issues.apache.org/jira/browse/IMPALA-6117
             Project: IMPALA
          Issue Type: Bug
            Reporter: Jinchul Kim
            Assignee: Jinchul Kim


- Symptom:
I noticed that rand(...) behaves unexpectedly when it is used in 'create table as select' or in an aggregation: every row gets the same value. See the examples:

** On Impala:
{code:java}
> select rand(1) from t1;
+--------------------+
| rand(1)            |
+--------------------+
| 0.2219843274084778 |
| 0.3161931793746507 |
| 0.2793945173171323 |
| 0.3648608677856908 |
| 0.4869666437092082 |
+--------------------+
> create table t2 as select rand(1) from t1;
+-------------------+
| summary           |
+-------------------+
| Inserted 5 row(s) |
+-------------------+
> select * from t2;
+--------------------+
| _c0                |
+--------------------+
| 0.2219843274084778 |
| 0.2219843274084778 |
| 0.2219843274084778 |
| 0.2219843274084778 |
| 0.2219843274084778 |
+--------------------+
> select count(*), rand(1) from t1 group by rand(1);
+----------+--------------------+
| count(*) | rand(1)            |
+----------+--------------------+
| 5        | 0.2219843274084778 |
+----------+--------------------+
{code}

** On PostgreSQL: 

{code:java}
# select setseed(0.1);
# select random() from t1;
       random
--------------------
  0.727818949148059
 0.0379444309510291
  0.314393464010209
  0.900541861541569
  0.918851081747562
# select setseed(0.1);
# create table t2 as select random() from t1;
SELECT 5
# select * from t2;
       random
--------------------
  0.727818949148059
 0.0379444309510291
  0.314393464010209
  0.900541861541569
  0.918851081747562
# select setseed(0.1);
# select random() from t1 group by random();
       random
--------------------
  0.918851081747562
  0.727818949148059
  0.900541861541569
 0.0379444309510291
  0.314393464010209
{code}

** On MariaDB:

{code:java}
> select rand(1) from t1;
+---------------------+
| rand(1)             |
+---------------------+
| 0.40540353712197724 |
|  0.8716141803857071 |
|  0.1418603212962489 |
| 0.09445909605776807 |
| 0.04671454713373868 |
+---------------------+
> create table t2 as select rand(1) from t1;
> select * from t2;
+---------------------+
| rand(1)             |
+---------------------+
| 0.40540353712197724 |
|  0.8716141803857071 |
|  0.1418603212962489 |
| 0.09445909605776807 |
| 0.04671454713373868 |
+---------------------+
> select rand(1) from t2 group by rand(1);
+---------------------+
| rand(1)             |
+---------------------+
| 0.04671454713373868 |
| 0.09445909605776807 |
|  0.1418603212962489 |
| 0.40540353712197724 |
|  0.8716141803857071 |
+---------------------+
{code}

- Cause:
The current implementation of the random expression does not account for parallelization of sub execution plans. Intermediate results are pulled up and then consumed on each query executor. Each executor performs the following steps:

1) The scalar expression evaluator creates a FunctionContext object
2) In the preparation phase of the random expression, local storage is allocated to hold the seed value
3) A random value is generated repeatedly
4) The clean-up phase of the random expression runs

Developers should be aware of the scope of a FunctionContext: it cannot hold a value that is shared across executors during expression evaluation if the query plan is distributed.
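
The effect of the steps above can be reproduced with a small simulation. This is only a sketch of the explanation given here, not Impala code: RandState is a hypothetical stand-in for the per-FunctionContext state, and Python's random module replaces Impala's generator, so the actual values differ from the output shown earlier.

```python
import random


class RandState:
    """Hypothetical stand-in for the state rand(seed) keeps in one FunctionContext."""

    def __init__(self, seed):
        # Preparation phase: the seed is stored in context-local state.
        self.rng = random.Random(seed)

    def next(self):
        # Evaluation: draw the next value of this context's sequence.
        return self.rng.random()


def single_context(seed, num_rows):
    """All rows are evaluated through one context, so the sequence advances."""
    state = RandState(seed)
    return [state.next() for _ in range(num_rows)]


def context_per_executor(seed, rows_per_executor):
    """Each executor prepares its own context from the same seed."""
    values = []
    for num_rows in rows_per_executor:
        state = RandState(seed)  # the sequence restarts for every executor
        values.extend(state.next() for _ in range(num_rows))
    return values


serial = single_context(1, 5)
parallel = context_per_executor(1, [1, 1, 1, 1, 1])

print(len(set(serial)))    # number of distinct values with one context
print(len(set(parallel)))  # number of distinct values with one context per executor
```

When every context evaluates a single row, all rows receive the first value of the seeded sequence, which matches the shape of the 'create table as select' result above.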

- Solutions:

My initial idea is to generate a non-distributed (sub) plan when the rand function is present. This guarantees a consistent random sequence whether or not a seed value is given, but it may hurt performance. If I have to choose between correctness and performance, I always choose correctness.
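
A rough sketch of that idea follows. Everything in it is hypothetical and for illustration only: the Expr class, the NON_DETERMINISTIC set, and the plan mode names are assumptions and do not correspond to Impala's planner classes.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: function names whose results depend on hidden per-context state.
NON_DETERMINISTIC = {"rand", "random"}


@dataclass
class Expr:
    """Hypothetical expression tree node: a function name plus child expressions."""
    fn: str
    children: List["Expr"] = field(default_factory=list)


def contains_non_deterministic(expr):
    """Return True if expr or any of its descendants calls a non-deterministic function."""
    if expr.fn.lower() in NON_DETERMINISTIC:
        return True
    return any(contains_non_deterministic(child) for child in expr.children)


def choose_plan_mode(exprs):
    """Fall back to a non-distributed plan when any expression needs a single sequence."""
    if any(contains_non_deterministic(e) for e in exprs):
        return "UNPARTITIONED"
    return "DISTRIBUTED"


# select count(*), rand(1) ... -> evaluated in one place
print(choose_plan_mode([Expr("count"), Expr("rand", [Expr("literal")])]))
# select count(*) ... -> free to run distributed
print(choose_plan_mode([Expr("count")]))
```

The check is deliberately conservative: any occurrence of rand, however deeply nested, disables distribution for that (sub) plan, which is where the performance cost mentioned above would come from.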

I believe the current behavior produces wrong results, as shown above. Please share a better idea if you have one.


