You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Aman Sinha (JIRA)" <ji...@apache.org> on 2016/05/25 19:00:15 UTC
[jira] [Updated] (DRILL-4694) CTAS in JSON format produces
extraneous NULL fields
[ https://issues.apache.org/jira/browse/DRILL-4694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aman Sinha updated DRILL-4694:
------------------------------
Description:
Consider the following JSON file:
{noformat}
// file t2.json
{
"X" : {
"key1" : "value1",
"key2" : "value2"
}
}
{
"X" : {
"key3" : "value3",
"key4" : "value4"
}
}
{
"X" : {
"key5" : "value5",
"key6" : "value6"
}
}
{noformat}
Now create a table in Json format using CTAS:
{noformat}
0: jdbc:drill:zk=local> alter session set `store.format` = 'json';
0: jdbc:drill:zk=local> create table dfs.tmp.jt12 as select t.`X` from `t2.json` t;
+-----------+----------------------------+
| Fragment | Number of records written |
+-----------+----------------------------+
| 0_0 | 3 |
+-----------+----------------------------+
{noformat}
The output file has rows with union schema of all the fields in all the records. This creates extraneous Null fields in the output:
{noformat}
$ cat jt12/0_0_0.json
{
"X" : {
"key1" : "value1",
"key2" : "value2",
"key3" : null,
"key4" : null,
"key5" : null,
"key6" : null
}
} {
"X" : {
"key1" : null,
"key2" : null,
"key3" : "value3",
"key4" : "value4",
"key5" : null,
"key6" : null
}
} {
"X" : {
"key1" : null,
"key2" : null,
"key3" : null,
"key4" : null,
"key5" : "value5",
"key6" : "value6"
}
}
{noformat}
Note that if I change the output format to CSV or Parquet, there are no Null fields created in the output file. The expectation for a CTAS in json format is that the output should match that of the input json data.
was:
Consider the following JSON file:
{noformat}
// file t2.json
{
"X" : {
"key1" : "value1",
"key2" : "value2"
}
}
{
"X" : {
"key3" : "value3",
"key4" : "value4"
}
}
{
"X" : {
"key5" : "value5",
"key6" : "value6"
}
}
{noformat}
Now create a table in Json format using CTAS:
{noformat}
0: jdbc:drill:zk=local> alter session set `store.format` = 'json';
0: jdbc:drill:zk=local> create table dfs.tmp.jt12 as select t.`X` from `t2.json` t;
+-----------+----------------------------+
| Fragment | Number of records written |
+-----------+----------------------------+
| 0_0 | 3 |
+-----------+----------------------------+
{noformat}
The output file has rows with union schema of all the fields in all the records. This creates extraneous Null fields in the output:
{noformat}
$ cat jt12/0_0_0.json
{
"X" : {
"key1" : "value1",
"key2" : "value2",
"key3" : null,
"key4" : null,
"key5" : null,
"key6" : null
}
} {
"X" : {
"key1" : null,
"key2" : null,
"key3" : "value3",
"key4" : "value4",
"key5" : null,
"key6" : null
}
} {
"X" : {
"key1" : null,
"key2" : null,
"key3" : null,
"key4" : null,
"key5" : "value5",
"key6" : "value6"
}
}
{noformat}
Note that if I change the output format to CSV or Parquet, there are no Null fields created in the output file.
> CTAS in JSON format produces extraneous NULL fields
> ---------------------------------------------------
>
> Key: DRILL-4694
> URL: https://issues.apache.org/jira/browse/DRILL-4694
> Project: Apache Drill
> Issue Type: Bug
> Components: Storage - JSON
> Affects Versions: 1.6.0
> Reporter: Aman Sinha
>
> Consider the following JSON file:
> {noformat}
> // file t2.json
> {
> "X" : {
> "key1" : "value1",
> "key2" : "value2"
> }
> }
> {
> "X" : {
> "key3" : "value3",
> "key4" : "value4"
> }
> }
> {
> "X" : {
> "key5" : "value5",
> "key6" : "value6"
> }
> }
> {noformat}
> Now create a table in Json format using CTAS:
> {noformat}
> 0: jdbc:drill:zk=local> alter session set `store.format` = 'json';
> 0: jdbc:drill:zk=local> create table dfs.tmp.jt12 as select t.`X` from `t2.json` t;
> +-----------+----------------------------+
> | Fragment | Number of records written |
> +-----------+----------------------------+
> | 0_0 | 3 |
> +-----------+----------------------------+
> {noformat}
> The output file has rows with union schema of all the fields in all the records. This creates extraneous Null fields in the output:
> {noformat}
> $ cat jt12/0_0_0.json
> {
> "X" : {
> "key1" : "value1",
> "key2" : "value2",
> "key3" : null,
> "key4" : null,
> "key5" : null,
> "key6" : null
> }
> } {
> "X" : {
> "key1" : null,
> "key2" : null,
> "key3" : "value3",
> "key4" : "value4",
> "key5" : null,
> "key6" : null
> }
> } {
> "X" : {
> "key1" : null,
> "key2" : null,
> "key3" : null,
> "key4" : null,
> "key5" : "value5",
> "key6" : "value6"
> }
> }
> {noformat}
> Note that if I change the output format to CSV or Parquet, there are no Null fields created in the output file. The expectation for a CTAS in json format is that the output should match that of the input json data.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)