Posted to issues@drill.apache.org by "benj (JIRA)" <ji...@apache.org> on 2019/03/14 16:07:00 UTC

[jira] [Updated] (DRILL-7104) Change of data type when parquet with multiple fragment

     [ https://issues.apache.org/jira/browse/DRILL-7104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

benj updated DRILL-7104:
------------------------
    Description: 
When creating a Parquet table with a column filled only with "CAST(NULL AS VARCHAR)", if the Parquet output has several fragments, the type is read as INT instead of VARCHAR.

First, create +Parquet with only one fragment+ - all is fine (the type of "demo" is correct).
{code:java}
CREATE TABLE ....`nobug` AS 
 (SELECT CAST(NULL AS VARCHAR) AS demo
  , md5(CAST(rand() AS VARCHAR)) AS jam 
  FROM ....`onebigfile` LIMIT 1000000);
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 10000000                   |

SELECT drilltypeof(demo) AS goodtype FROM ....`nobug` LIMIT 1;
+--------------------+
| goodtype           |
+--------------------+
| VARCHAR            |
{code}
Second, create +Parquet with at least 2 fragments+ - the type of "demo" changes to INT
{code:java}
CREATE TABLE ....`bug` AS 
((SELECT CAST(NULL AS VARCHAR) AS demo
  ,md5(CAST(rand() AS VARCHAR)) AS jam 
  FROM ....`onebigfile` LIMIT 1000000) 
 UNION 
 (SELECT CAST(NULL AS VARCHAR) AS demo
  ,md5(CAST(rand() AS VARCHAR)) AS jam
  FROM ....`onebigfile` LIMIT 1000000));
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 1_1       | 1000276                    |
| 1_0       | 999724                     |


SELECT drilltypeof(demo) AS badtype FROM ....`bug` LIMIT 1;
+--------------------+
| badtype            |
+--------------------+
| INT                |{code}
The change of type is really terrible...
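To narrow down which written fragment reports which type, a per-file check along these lines might help. This is only a sketch under assumptions: it relies on Drill's implicit "filename" column, keeps the same elided "...." path as the statements above, and the aliases demo_type and nb_rows are illustrative names, not part of the original report.
{code:sql}
-- Hypothetical diagnostic (not from the original report):
-- show the type Drill reports for "demo" in each Parquet file of the table
SELECT filename
  , drilltypeof(demo) AS demo_type
  , COUNT(*) AS nb_rows
  FROM ....`bug`
  GROUP BY filename, drilltypeof(demo);
{code}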

  was:
When creating a Parquet table with a column filled only with "CAST(NULL AS VARCHAR)", if the Parquet output has several fragments, the type is read as INT instead of VARCHAR.

First, create +Parquet with only one fragment+ - all is fine (the type of "demo" is correct).
{code:java}
CREATE TABLE ....`bug` AS 
 (SELECT CAST(NULL AS VARCHAR) AS demo
  , md5(CAST(rand() AS VARCHAR)) AS jam 
  FROM ....`onebigfile` LIMIT 1000000);
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 0_0       | 10000000                   |

SELECT drilltypeof(demo) AS goodtype FROM ....`bug` LIMIT 1;
+--------------------+
| goodtype           |
+--------------------+
| VARCHAR            |
{code}
Second, create +Parquet with at least 2 fragments+ - the type of "demo" changes to INT
{code:java}
CREATE TABLE ....`bug` AS 
((SELECT CAST(NULL AS VARCHAR) AS demo
  ,md5(CAST(rand() AS VARCHAR)) AS jam 
  FROM ....`onebigfile` LIMIT 1000000) 
 UNION 
 (SELECT CAST(NULL AS VARCHAR) AS demo
  ,md5(CAST(rand() AS VARCHAR)) AS jam
  FROM ....`onebigfile` LIMIT 1000000));
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 1_1       | 1000276                    |
| 1_0       | 999724                     |


SELECT drilltypeof(demo) AS badtype FROM ....`bug` LIMIT 1;
+--------------------+
| badtype            |
+--------------------+
| INT                |{code}
The change of type is really terrible...

> Change of data type when parquet with multiple fragment
> -------------------------------------------------------
>
>                 Key: DRILL-7104
>                 URL: https://issues.apache.org/jira/browse/DRILL-7104
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.15.0
>            Reporter: benj
>            Priority: Major
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)