You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Paul Rogers (Jira)" <ji...@apache.org> on 2020/01/13 03:22:00 UTC
[jira] [Created] (DRILL-7522) JSON reader (v1) omits null columns
in SELECT *
Paul Rogers created DRILL-7522:
----------------------------------
Summary: JSON reader (v1) omits null columns in SELECT *
Key: DRILL-7522
URL: https://issues.apache.org/jira/browse/DRILL-7522
Project: Apache Drill
Issue Type: Bug
Affects Versions: 1.17.0
Reporter: Paul Rogers
Run the following unit test: {{TestStarQueries.testSelStarOrderBy}}, runs the following query:
{code:sql}
select * from cp.`employee.json` order by last_name
{code}
The query reads a Foodmart file {{customer.json}} that has records like this:
{code:json}
{"employee_id":53,...","end_date":null,"salary":...}
{code}
The field {{end_date}} turns out to be null for all records in {{customer.json}}.
Then, look at the verification query. It carefully includes all fields *except* {{end_date}}. That is, the test was written to expect that the JSON reader will omit a column that has all NULL values.
While it might seem OK to omit all-NULL columns (they don't have any data), the problem is that Drill is a distributed system. Suppose we query a directory of 50 such files, some of which have all-NULLs in one field, some of which have all-NULLs in another. Although the files have the same schema, {{SELECT *}} will return different schemas (depending on which file has which non-NULL columns.)
A downstream operator will have to merge these schemas. And, since Drill fills in a Nullable INT field for missing columns, we might end up with a schema change exception because the actual field type is VARCHAR when it appears.
One can argue that {{SELECT *}} means "return all columns", not "return all columns except those that happen to be null in the first batch." Yes, we have the problem of not knowing the actual field type. Eventually, provided schemas will resolve such issues.
Note that in the "V2" JSON reader, {{end_date}} is included in the query.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)