You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by "benj (JIRA)" <ji...@apache.org> on 2019/01/07 11:27:00 UTC
[jira] [Created] (DRILL-6948) Defaults values not homogeneous

benj created DRILL-6948:
---------------------------

             Summary: Defaults values not homogeneous
                 Key: DRILL-6948
                 URL: https://issues.apache.org/jira/browse/DRILL-6948
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - Parquet
    Affects Versions: 1.15.0
            Reporter: benj


Several problems can occur when using Parquet file with differents columns.
{code:java}
CREATE TABLE tmp2.`mytable1` AS SELECT            1 AS myc1,
      'col3_1' AS myc3;
CREATE TABLE tmp2.`mytable2` AS SELECT            2 AS myc1, 'col2_2' AS
myc2, 'col3_2' AS myc3, 'col4_2' AS myc4;
CREATE TABLE tmp2.`mytable3` AS SELECT 0 AS myc0, 3 AS myc1, 'col2_3' AS
myc2;
{code}
1 - SELECT ... FROM tmp2.`mytable*`; can pass or fail depending of the random order of treatment of the files.

2 - (When passing) the default value is not stable between the files, example :

 
{code:java}
SELECT myc0, myc1, myc2, myc3, myc4 FROM tmp2.`mytable*`;
+-------+-------+---------+---------+---------+
| myc0  | myc1  |  myc2   |  myc3   |  myc4   |
+-------+-------+---------+---------+---------+
| 0     | 3     | col2_3  | null    | null    |
| 0     | 2     | col2_2  | col3_2  | col4_2  |
| 0     | 1     |         | col3_1  |         |
+-------+-------+---------+---------+---------+
{code}
Here, column myc4 only exists on mytable2, but appears in

 
 * empty string for mytable1
 * null for mytable3

The repartition of NULL and empty may change (function of the random order of the treatment of the file) and lead to extremely different results (function of WHERE part for example).

The problem is sometimes "worse"

 
{code:java}
SELECT * FROM tmp2.`mytable*`;
+-----------+-------------+-------+---------+
|   dir0    |    myc0     | myc1  |  myc2   |
+-----------+-------------+-------+---------+
| mytable3  | 0           | 3     | col2_3  |
| mytable2  | 0           | 2     | col2_2  |
| mytable1  | 1635023213  | 1     |         |
+-----------+-------------+-------+---------+{code}
In reality, only the file mytable3 has the column myc0 defined (with the value 0), but these queries give us different non NULL values for 

 
 * mytable2 : 0 - disturbing because of the 0 of mytable3
 * mytable1 : 1635023213 - disturbing because these value can change with each call

 

It's very difficult to work with queries that produce random value and/or random NULL value and can fail randomly.

In my opinion, +with Parquet files+, if a column is unknown, this column could be treated but the default value should always be NULL.

 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)