You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by "benj (JIRA)" <ji...@apache.org> on 2019/01/07 11:27:00 UTC
[jira] [Created] (DRILL-6948) Defaults values not homogeneous
benj created DRILL-6948:
---------------------------
Summary: Defaults values not homogeneous
Key: DRILL-6948
URL: https://issues.apache.org/jira/browse/DRILL-6948
Project: Apache Drill
Issue Type: Bug
Components: Storage - Parquet
Affects Versions: 1.15.0
Reporter: benj
Several problems can occur when using Parquet file with differents columns.
{code:java}
CREATE TABLE tmp2.`mytable1` AS SELECT 1 AS myc1,
'col3_1' AS myc3;
CREATE TABLE tmp2.`mytable2` AS SELECT 2 AS myc1, 'col2_2' AS
myc2, 'col3_2' AS myc3, 'col4_2' AS myc4;
CREATE TABLE tmp2.`mytable3` AS SELECT 0 AS myc0, 3 AS myc1, 'col2_3' AS
myc2;
{code}
1 - SELECT ... FROM tmp2.`mytable*`; can pass or fail depending of the random order of treatment of the files.
2 - (When passing) the default value is not stable between the files, example :
{code:java}
SELECT myc0, myc1, myc2, myc3, myc4 FROM tmp2.`mytable*`;
+-------+-------+---------+---------+---------+
| myc0 | myc1 | myc2 | myc3 | myc4 |
+-------+-------+---------+---------+---------+
| 0 | 3 | col2_3 | null | null |
| 0 | 2 | col2_2 | col3_2 | col4_2 |
| 0 | 1 | | col3_1 | |
+-------+-------+---------+---------+---------+
{code}
Here, column myc4 only exists on mytable2, but appears in
* empty string for mytable1
* null for mytable3
The repartition of NULL and empty may change (function of the random order of the treatment of the file) and lead to extremely different results (function of WHERE part for example).
The problem is sometimes "worse"
{code:java}
SELECT * FROM tmp2.`mytable*`;
+-----------+-------------+-------+---------+
| dir0 | myc0 | myc1 | myc2 |
+-----------+-------------+-------+---------+
| mytable3 | 0 | 3 | col2_3 |
| mytable2 | 0 | 2 | col2_2 |
| mytable1 | 1635023213 | 1 | |
+-----------+-------------+-------+---------+{code}
In reality, only the file mytable3 has the column myc0 defined (with the value 0), but these queries give us different non NULL values for
* mytable2 : 0 - disturbing because of the 0 of mytable3
* mytable1 : 1635023213 - disturbing because these value can change with each call
It's very difficult to work with queries that produce random value and/or random NULL value and can fail randomly.
In my opinion, +with Parquet files+, if a column is unknown, this column could be treated but the default value should always be NULL.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)