You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Paul Rogers (Jira)" <ji...@apache.org> on 2020/01/13 03:22:00 UTC

[jira] [Created] (DRILL-7522) JSON reader (v1) omits null columns in SELECT *

Paul Rogers created DRILL-7522:
----------------------------------

             Summary: JSON reader (v1) omits null columns in SELECT *
                 Key: DRILL-7522
                 URL: https://issues.apache.org/jira/browse/DRILL-7522
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.17.0
            Reporter: Paul Rogers


Run the following unit test: {{TestStarQueries.testSelStarOrderBy}}, runs the following query:

{code:sql}
select * from cp.`employee.json` order by last_name
{code}

The query reads a Foodmart file {{customer.json}} that has records like this:

{code:json}
{"employee_id":53,...","end_date":null,"salary":...}
{code}

The field {{end_date}} turns out to be null for all records in {{customer.json}}.

Then, look at the verification query. It carefully includes all fields *except* {{end_date}}. That is, the test was written to expect that the JSON reader will omit a column that has all NULL values.

While it might seem OK to omit all-NULL columns (they don't have any data), the problem is that Drill is a distributed system. Suppose we query a directory of 50 such files, some of which have all-NULLs in one field, some of which have all-NULLs in another. Although the files have the same schema, {{SELECT *}} will return different schemas (depending on which file has which non-NULL columns.)

A downstream operator will have to merge these schemas. And, since Drill fills in a Nullable INT field for missing columns, we might end up with a schema change exception because the actual field type is VARCHAR when it appears.

One can argue that {{SELECT *}} means "return all columns", not "return all columns except those that happen to be null in the first batch." Yes, we have the problem of not knowing the actual field type. Eventually, provided schemas will resolve such issues.

Note that in the "V2" JSON reader, {{end_date}} is included in the query.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)