You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by "benj (Jira)" <ji...@apache.org> on 2019/09/27 09:56:00 UTC
[jira] [Created] (DRILL-7389) JSON empty list avoid Parquet
creation
benj created DRILL-7389:
---------------------------
Summary: JSON empty list avoid Parquet creation
Key: DRILL-7389
URL: https://issues.apache.org/jira/browse/DRILL-7389
Project: Apache Drill
Issue Type: Improvement
Components: Storage - JSON, Storage - Parquet
Affects Versions: 1.16.0
Reporter: benj
With a JSON file with only one row with an empty list as below, it's possible to request the file but there is an error when trying to create a Parquet
File ANIMALS_1.json:
{code:json}
{"animals": {"Rhinoceros":{"detected":false,"gender":"1","obsdate":"20171229"},"Horse":{}}}
{code}
{code:sql}
SELECT * FROM ....`ANIMALS_1.json`;
+-------------------------------------------------------------------------------+
| animals |
+-------------------------------------------------------------------------------+
| {"Rhinoceros":{"detected":"false","gender":"1","obsdate":"20171229"},"Horse":{}} |
+-------------------------------------------------------------------------------+
CREATE TABLE ....`ANIMALS_1_pqt` AS
(SELECT * FROM ....`ANIMALS_1.json`);
=>
Error: SYSTEM ERROR: InvalidSchemaException: Cannot write a schema with an empty group: optional group Horse {}
{code}
But if the json file contains a second line with a non-empty list for "Horse", it's possible to request file and create the Parquet
File ANIMALS_2.json:
{code:json}
{"animals": {"Rhinoceros":{"detected":false,"gender":"1","obsdate":"20171229"},"Horse":{}}}
{"animals": {"Rhinoceros":{"detected":false,"gender":"1","obsdate":"20171229"},"Horse":{"detected":false,"gender":"1","obsdate":"20171229"}}}
{code}
{code:sql}
SELECT * FROM ....`ANIMALS_2.json`;
+-------------------------------------------------------------------------------+
| animals +-------------------------------------------------------------------------------+
| {"Rhinoceros":{"detected":"false","gender":"1","obsdate":"20171229"},"Horse":{}} |
| {"Rhinoceros":{"detected":"false","gender":"1","obsdate":"20171229"},"Horse":{"detected":"false","gender":"1","obsdate":"20171229"}} |
+-------------------------------------------------------------------------------+
CREATE TABLE ....`ANIMALS_2_pqt` AS
(SELECT * FROM ....`ANIMALS_2.json`);
+----------+---------------------------+
| Fragment | Number of records written |
+----------+---------------------------+
| 0_0 | 2 |
+----------+---------------------------+
{code}
Many problems appears with this when manipulating multiple JSON with "rare" value (and when do not master the generation).
It's very annoying to have no possibility push data in parquet where there is missing/null value in JSON.
The possibility to cast in varchar (DRILL-7375) the data could allow the parquet storage
In the simple case of the example discussed here, it's possible to change the type of the input file from JSON to CSV and it will work. But it does not answer all the problems and it doesn't allow to keep some part in "json" and some other in "text"
--
This message was sent by Atlassian Jira
(v8.3.4#803005)