Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/05/29 22:29:04 UTC

[jira] [Created] (DRILL-5551) `columns` changes meaning for CSV files depending on query

Paul Rogers created DRILL-5551:
----------------------------------

             Summary: `columns` changes meaning for CSV files depending on query
                 Key: DRILL-5551
                 URL: https://issues.apache.org/jira/browse/DRILL-5551
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.10.0
            Reporter: Paul Rogers
            Priority: Minor


Drill's CSV column reader supports two forms of files:

* Files with column headers as the first line of the file.
* Files without column headers.

The CSV format configuration within a storage plugin config specifies which form to expect for files accessed via that config.

Suppose we have a CSV file with headers:

{code}
a,b,c
10,foo,bar
{code}

Suppose we configure a storage plugin to use headers:

{code}
    TextFormatConfig csvFormat = new TextFormatConfig();
    csvFormat.fieldDelimiter = ',';
    csvFormat.skipFirstLine = false;
    csvFormat.extractHeader = true;
{code}

(The above can also be done using JSON when running Drill as a server.)

Suppose we execute this query:
{code}
SELECT columns FROM `dfs.data.example.csv`
{code}

The result is a single column, the special {{columns}} array, that contains all three fields.

Suppose we alter the query just a bit:
{code}
SELECT columns, a FROM `dfs.data.example.csv`
{code}

Now the result set is two non-nullable Varchar columns, and {{columns}} is an empty string rather than the array:

{code}
columns,a
,10
{code}

It seems that the meaning of {{columns}} shifts depending on whether it appears by itself or alongside other columns in the SELECT list.
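
To make the reported behavior concrete, here is a minimal Java sketch that models it. This is not Drill's reader code: the class {{ColumnsProjectionModel}} and its methods are hypothetical, and the rule it encodes (array when {{columns}} is the only projected name, otherwise an ordinary header lookup that yields an empty string when no header matches) is only an assumption inferred from the two results above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ColumnsProjectionModel {

  // Hypothetical model of the behavior reported in this issue; NOT Drill's actual code.
  static Map<String, Object> project(List<String> selectList, String[] headers, String[] row) {
    Map<String, Object> result = new LinkedHashMap<>();
    // Assumption: `columns` is treated as the special array only when projected alone.
    boolean columnsAlone =
        selectList.size() == 1 && selectList.get(0).equalsIgnoreCase("columns");
    for (String name : selectList) {
      if (columnsAlone) {
        result.put(name, Arrays.asList(row));
      } else {
        // Otherwise every name, including `columns`, is looked up among the headers.
        int index = indexOf(headers, name);
        // Assumption: an unmatched name yields an empty, non-nullable Varchar.
        result.put(name, index >= 0 ? row[index] : "");
      }
    }
    return result;
  }

  // Case-insensitive search for a header name; returns -1 when absent.
  static int indexOf(String[] headers, String name) {
    for (int i = 0; i < headers.length; i++) {
      if (headers[i].equalsIgnoreCase(name)) {
        return i;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    String[] headers = {"a", "b", "c"};
    String[] row = {"10", "foo", "bar"};
    // SELECT columns ...
    System.out.println(project(Arrays.asList("columns"), headers, row));
    // SELECT columns, a ...
    System.out.println(project(Arrays.asList("columns", "a"), headers, row));
  }
}
```

Under these assumptions the first call yields the three-element array and the second yields an empty {{columns}} value next to {{a = 10}}, matching the two results described above.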

Perhaps this handles the case of a file such as:

{code}
columns,values
a;b,10;10
c;d,20;30
{code}

That is fine, but what if I wanted just the first column:

{code}
SELECT columns FROM `dfs.data.strange.csv`
{code}

How would the code know if {{columns}} was the special column vs. the normal column called "columns"?
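
The ambiguity can at least be detected from the header line alone. The sketch below is a hypothetical helper (the class {{ColumnsNameClash}} and the method {{headerShadowsColumns}} are illustrative names, not part of Drill) that reports whether a header row declares a field named "columns". The case-insensitive comparison is an assumption about how names would be matched.

```java
import java.util.regex.Pattern;

public class ColumnsNameClash {

  // Hypothetical helper, NOT part of Drill: returns true when the header line
  // declares a field whose name collides with the special `columns` array.
  static boolean headerShadowsColumns(String headerLine, char delimiter) {
    // Quote the delimiter so characters such as '|' are not read as regex syntax.
    String regex = Pattern.quote(String.valueOf(delimiter));
    for (String name : headerLine.split(regex)) {
      // Assumption: names are compared case-insensitively.
      if (name.trim().equalsIgnoreCase("columns")) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Header of the first example file: no clash.
    System.out.println(headerShadowsColumns("a,b,c", ','));
    // Header of the "strange" file: `columns` is also an ordinary field name.
    System.out.println(headerShadowsColumns("columns,values", ','));
  }
}
```

Detection does not resolve the question of which meaning the query intended; it only identifies the files for which {{SELECT columns}} is ambiguous.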

Perhaps one long-term solution is to make {{columns}} into a table function (as has been proposed for the implicit columns):

{code}
SELECT columns(t) FROM `dfs.data.strange.csv` AS t
{code}


