You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by "benj (Jira)" <ji...@apache.org> on 2019/09/27 13:07:00 UTC
[jira] [Created] (DRILL-7390) kvgen/flatten doesn't produce same
result from .json or .parquet
benj created DRILL-7390:
---------------------------
Summary: kvgen/flatten doesn't produce same result from .json or .parquet
Key: DRILL-7390
URL: https://issues.apache.org/jira/browse/DRILL-7390
Project: Apache Drill
Issue Type: Bug
Components: Execution - Data Types, Functions - Drill, Storage - JSON, Storage - Parquet
Affects Versions: 1.16.0
Reporter: benj
Attachments: ANIMALS_json.tar.gz, ANIMALS_pqt.tar.gz
With a Parquet produce from JSON (_ANIMALS_json_ and _ANIMALS_pqt_ in attachment in tar.gz format)
{code:sql}
CREATE TABLE ....`ANIMALS_pqt` AS
(SELECT * FROM ....`ANIMALS_json`);
{code}
Same request, using kvgen and flatten, applied on JSON and Parquet doesn't produce the same results
{code:sql}
SELECT count(*) FROM
(SELECT f FROM
(SELECT flatten(k) AS f FROM
(SELECT kvgen(animals) AS k FROM ....`ANIMALS_json`)) AS x)
=>
8482290
SELECT count(*) FROM
(SELECT f FROM
(SELECT flatten(k) AS f FROM
(SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt`)) AS x)
=>
929430
{code}
Or another example:
{code:sql}
SELECT count(*) FROM
(SELECT f FROM
(SELECT flatten(k) AS f FROM
(SELECT kvgen(animals) AS k FROM ....`ANIMALS_json`)) AS x WHERE x.f.key='Cat')
=>
121368
SELECT count(*) FROM
(SELECT f FROM
(SELECT flatten(k) AS f FROM
(SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt`)) AS x WHERE x.f.key='Cat')
=>
13470
{code}
The real result is the json one, as proved by:
{code:bash}
cat ANIMALS_json/*.json | grep -c "Cat"
121368
{code}
Please note that, here, It's appear the particular file _ANIMALS_pqt/1_0_0.parquet_ is not well computed but the other are correct:
{code:sql}
SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt/1_0_0.parquet`)) AS x WHERE x.f.key='Cat');
=> 107898
SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt/1_1_0.parquet`)) AS x WHERE x.f.key='Cat');
=> 2429
SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt/1_2_0.parquet`)) AS x WHERE x.f.key='Cat');
=> 5419
SELECT count(*) FROM (SELECT f FROM (SELECT flatten(k) AS f FROM (SELECT kvgen(animals) AS k FROM ....`ANIMALS_pqt/1_3_0.parquet`)) AS x WHERE x.f.key='Cat');
=> 5622
{code}
2429+5419+5622=13470 (result of request on ANIMALS_pqt)
107898+2429+5419+5622=121368 (result of request on ANIMALS_json)
--
This message was sent by Atlassian Jira
(v8.3.4#803005)