You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by "benj (Jira)" <ji...@apache.org> on 2019/09/27 09:56:00 UTC

[jira] [Created] (DRILL-7389) JSON empty list avoid Parquet creation

benj created DRILL-7389:
---------------------------

             Summary: JSON empty list avoid Parquet creation
                 Key: DRILL-7389
                 URL: https://issues.apache.org/jira/browse/DRILL-7389
             Project: Apache Drill
          Issue Type: Improvement
          Components: Storage - JSON, Storage - Parquet
    Affects Versions: 1.16.0
            Reporter: benj


With a JSON file with only one row with an empty list as below, it's possible to request the file but there is an error when trying to create a Parquet
 File ANIMALS_1.json:
{code:json}
{"animals": {"Rhinoceros":{"detected":false,"gender":"1","obsdate":"20171229"},"Horse":{}}}
{code}
{code:sql}
SELECT * FROM ....`ANIMALS_1.json`;
+-------------------------------------------------------------------------------+
|                                animals                                        |
+-------------------------------------------------------------------------------+
| {"Rhinoceros":{"detected":"false","gender":"1","obsdate":"20171229"},"Horse":{}} |
+-------------------------------------------------------------------------------+

CREATE TABLE ....`ANIMALS_1_pqt` AS 
(SELECT * FROM ....`ANIMALS_1.json`);
=>
Error: SYSTEM ERROR: InvalidSchemaException: Cannot write a schema with an empty group: optional group Horse {}
{code}
 
 But if the json file contains a second line with a non-empty list for "Horse", it's possible to request file and create the Parquet
 File ANIMALS_2.json:
{code:json}
{"animals": {"Rhinoceros":{"detected":false,"gender":"1","obsdate":"20171229"},"Horse":{}}}
{"animals": {"Rhinoceros":{"detected":false,"gender":"1","obsdate":"20171229"},"Horse":{"detected":false,"gender":"1","obsdate":"20171229"}}}
{code}
{code:sql}
SELECT * FROM ....`ANIMALS_2.json`;
+-------------------------------------------------------------------------------+
|                                     animals                          +-------------------------------------------------------------------------------+
| {"Rhinoceros":{"detected":"false","gender":"1","obsdate":"20171229"},"Horse":{}} |
| {"Rhinoceros":{"detected":"false","gender":"1","obsdate":"20171229"},"Horse":{"detected":"false","gender":"1","obsdate":"20171229"}} |
+-------------------------------------------------------------------------------+

CREATE TABLE ....`ANIMALS_2_pqt` AS 
(SELECT * FROM ....`ANIMALS_2.json`);
+----------+---------------------------+
| Fragment | Number of records written |
+----------+---------------------------+
| 0_0      | 2                         |
+----------+---------------------------+
{code}
 
Many problems appears with this when manipulating multiple JSON with "rare" value (and when do not master the generation).

It's very annoying to have no possibility push data in parquet where there is missing/null value in JSON. 
The possibility to cast in varchar (DRILL-7375) the data could allow the parquet storage 
 
In the simple case of the example discussed here, it's possible to change the type of the input file from JSON to CSV and it will work. But it does not answer all the problems and it doesn't allow to keep some part in "json"  and some other in "text"

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)