Posted to issues@drill.apache.org by "Anton Gozhiy (JIRA)" <ji...@apache.org> on 2019/06/25 13:33:00 UTC

[jira] [Reopened] (DRILL-7083) Wrong data type for explicit partition column beyond file depth

     [ https://issues.apache.org/jira/browse/DRILL-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Gozhiy reopened DRILL-7083:
---------------------------------

> Wrong data type for explicit partition column beyond file depth
> ---------------------------------------------------------------
>
>                 Key: DRILL-7083
>                 URL: https://issues.apache.org/jira/browse/DRILL-7083
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.15.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> Consider the simple case in DRILL-7082. That ticket talks about implicit partition columns created by the wildcard. Consider a very similar case:
> {code:sql}
> SELECT a, b, c, dir0, dir1 FROM `myTable`
> {code}
> Where {{myTable}} is a directory of CSV files, each with schema {{(a, b, c)}}:
> {noformat}
> myTable
> |- file1.csv
> |- nested
>    |- file2.csv
> {noformat}
> If the query is run in "stock" Drill, the planner will place both files within a single scan operator as described in DRILL-7082. The result schema will be:
> {noformat}
> (a VARCHAR, b VARCHAR, c VARCHAR, dir0 VARCHAR, dir1 INT)
> {noformat}
> Notice the last column: why is "dir1" a (nullable) INT? The partition mechanism only recognizes partition levels that actually exist, leaving the Project operator to fill in, with a Nullable INT, any partition columns that do not exist (any directory levels not actually seen by the scan operator).
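> A possible workaround in the meantime (a sketch only, assuming nothing beyond standard CAST behavior) is to force the missing partition level to the expected type explicitly:
> {code:sql}
> SELECT a, b, c, dir0, CAST(dir1 AS VARCHAR) AS dir1 FROM `myTable`
> {code}
> With the cast, the column filled in by the Project operator should surface as a Nullable VARCHAR rather than a Nullable INT, matching the type used when the partition level actually exists.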
> Now, using the same trick as in DRILL-7082, try the query
> {code:sql}
> SELECT a, b, c, dir0 FROM `myTable`
> {code}
> Again, the trick causes Drill to read each file in a separate scan operator (simulating what happens when queries run at scale).
> The scan operator for {{file1.csv}} sees no partitions, so it omits "dir0" and the Project operator helpfully fills in a Nullable INT. The scan operator for {{file2.csv}} sees one level of partitioning, so it sets {{dir0}} to {{nested}} as a Nullable VARCHAR.
> What does the client see? Two records: one with "dir0" as a Nullable INT, the other as a Nullable VARCHAR. Clients such as JDBC and ODBC see a hard schema change between the two records.
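> In other words, the two batches carry incompatible schemas for the same column list:
> {noformat}
> batch from file1.csv: (a VARCHAR, b VARCHAR, c VARCHAR, dir0 INT)
> batch from file2.csv: (a VARCHAR, b VARCHAR, c VARCHAR, dir0 VARCHAR)
> {noformat}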
> The two cases described above are really two versions of the same issue. Clients expect that, if they use the "dir0", "dir1", ... columns, the type is always Nullable VARCHAR so that the schema stays consistent across batches.
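> That is, every batch of the second query should carry the same schema, with the partition column always a Nullable VARCHAR:
> {noformat}
> (a VARCHAR, b VARCHAR, c VARCHAR, dir0 VARCHAR)
> {noformat}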



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)