Posted to issues@impala.apache.org by "Jinchul Kim (JIRA)" <ji...@apache.org> on 2017/10/27 08:20:00 UTC
[jira] [Created] (IMPALA-6117) Parallelization of sub execution plan makes incorrectness
Jinchul Kim created IMPALA-6117:
-----------------------------------
Summary: Parallelization of sub execution plan makes incorrectness
Key: IMPALA-6117
URL: https://issues.apache.org/jira/browse/IMPALA-6117
Project: IMPALA
Issue Type: Bug
Reporter: Jinchul Kim
Assignee: Jinchul Kim
- Symptom:
I noticed that rand(...) behaves unexpectedly when it is used in a 'create table as select' statement or in an aggregation: every row receives the same value. See the examples below:
** On Impala:
{code:java}
> select rand(1) from t1;
+--------------------+
| rand(1)            |
+--------------------+
| 0.2219843274084778 |
| 0.3161931793746507 |
| 0.2793945173171323 |
| 0.3648608677856908 |
| 0.4869666437092082 |
+--------------------+
> create table t2 as select rand(1) from t1;
+-------------------+
| summary           |
+-------------------+
| Inserted 5 row(s) |
+-------------------+
> select * from t2;
+--------------------+
| _c0                |
+--------------------+
| 0.2219843274084778 |
| 0.2219843274084778 |
| 0.2219843274084778 |
| 0.2219843274084778 |
| 0.2219843274084778 |
+--------------------+
> select count(*), rand(1) from t1 group by rand(1);
+----------+--------------------+
| count(*) | rand(1)            |
+----------+--------------------+
| 5        | 0.2219843274084778 |
+----------+--------------------+
{code}
** On PostgreSQL:
{code:java}
# select setseed(0.1);
# select random() from t1;
random
--------------------
0.727818949148059
0.0379444309510291
0.314393464010209
0.900541861541569
0.918851081747562
# select setseed(0.1);
# create table t2 as select random() from t1;
SELECT 5
# select * from t2;
random
--------------------
0.727818949148059
0.0379444309510291
0.314393464010209
0.900541861541569
0.918851081747562
# select setseed(0.1);
# select random() from t1 group by random();
random
--------------------
0.918851081747562
0.727818949148059
0.900541861541569
0.0379444309510291
0.314393464010209
{code}
** On MariaDB:
{code:java}
> select rand(1) from t1;
+---------------------+
| rand(1)             |
+---------------------+
| 0.40540353712197724 |
|  0.8716141803857071 |
|  0.1418603212962489 |
| 0.09445909605776807 |
| 0.04671454713373868 |
+---------------------+
> create table t2 as select rand(1) from t1;
> select * from t2;
+---------------------+
| rand(1)             |
+---------------------+
| 0.40540353712197724 |
|  0.8716141803857071 |
|  0.1418603212962489 |
| 0.09445909605776807 |
| 0.04671454713373868 |
+---------------------+
> select rand(1) from t2 group by rand(1);
+---------------------+
| rand(1)             |
+---------------------+
| 0.04671454713373868 |
| 0.09445909605776807 |
|  0.1418603212962489 |
| 0.40540353712197724 |
|  0.8716141803857071 |
+---------------------+
{code}
- Cause:
The current implementation of the random expression does not take into account that sub execution plans are parallelized. Intermediate results are pulled up and then consumed on each query executor. Each executor performs the following steps:
1) The scalar expression evaluator creates a FunctionContext object
2) In the prepare phase of the random expression, local storage is allocated to keep the seed value
3) A random value is generated repeatedly
4) The clean-up phase of the random expression runs
Developers should be aware of the scope of FunctionContext: it cannot hold a value that is shared across executors during expression evaluation if the query plan is distributable. As a result, every executor starts its own sequence from the same seed.
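To illustrate the steps above, here is a minimal Python sketch. It is not Impala code: RandEvaluator, evaluate_single and evaluate_distributed are hypothetical names, and Python's random.Random stands in for the real generator, so the values differ from Impala's. It assumes that each executor prepares its own state from the same seed:

```python
import random


class RandEvaluator:
    """Hypothetical stand-in for a per-executor expression evaluator."""

    def __init__(self, seed):
        # "Prepare" phase: executor-local state holding the seeded generator.
        self._rng = random.Random(seed)

    def next_value(self):
        # Each call advances this evaluator's private sequence only.
        return self._rng.random()


def evaluate_single(seed, num_rows):
    """One evaluator processes every row, so the sequence advances per row."""
    evaluator = RandEvaluator(seed)
    return [evaluator.next_value() for _ in range(num_rows)]


def evaluate_distributed(seed, rows_per_executor):
    """Each executor prepares its own evaluator from the same seed."""
    results = []
    for num_rows in rows_per_executor:
        evaluator = RandEvaluator(seed)  # state is not shared across executors
        results.extend(evaluator.next_value() for _ in range(num_rows))
    return results


single = evaluate_single(1, 5)
distributed = evaluate_distributed(1, [1, 1, 1, 1, 1])
print(len(set(single)))       # number of distinct values, one evaluator
print(len(set(distributed)))  # number of distinct values, one evaluator per row
```

With one row per executor, every executor returns the first value of the same sequence, which would match the repeated value seen in t2 above if the rows were evaluated that way.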
- Solutions:
My initial idea is to generate a non-distributed (sub) plan whenever the rand function is present. This guarantees a consistent random sequence, with or without a given seed value, but it might hurt performance. If I have to choose between correctness and performance, I always choose correctness.
I believe the current behavior produces wrong results, as shown above. Please share a better idea if you have one.
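A rough sketch of this idea in Python follows. It is only an illustration under assumptions, not Impala's planner: choose_parallelism, run_rand and NONDETERMINISTIC_FUNCTIONS are hypothetical names, and random.Random stands in for the real generator.

```python
import random

# Assumption: function names the planner would treat as stateful.
NONDETERMINISTIC_FUNCTIONS = {"rand"}


def choose_parallelism(function_names, available_executors):
    """Hypothetical planner rule: use one executor if rand is present."""
    if NONDETERMINISTIC_FUNCTIONS & set(function_names):
        return 1
    return available_executors


def run_rand(seed, num_rows, executors):
    """Split rows across executors; each seeds its own generator."""
    values = []
    for i in range(executors):
        rng = random.Random(seed)
        # Rows are dealt out as evenly as possible across executors.
        share = num_rows // executors + (1 if i < num_rows % executors else 0)
        values.extend(rng.random() for _ in range(share))
    return values


executors = choose_parallelism(["rand"], 5)
print(executors)
print(len(set(run_rand(1, 5, executors))))
```

The cost is visible in the rule itself: any query containing rand loses its parallelism for that (sub) plan, which is the performance concern mentioned above.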