You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2019/03/10 00:36:00 UTC

[jira] [Assigned] (DRILL-7055) Revise SELECT * to exclude partitions

     [ https://issues.apache.org/jira/browse/DRILL-7055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogers reassigned DRILL-7055:
----------------------------------

    Assignee:     (was: Paul Rogers)

> Revise SELECT * to exclude partitions
> -------------------------------------
>
>                 Key: DRILL-7055
>                 URL: https://issues.apache.org/jira/browse/DRILL-7055
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.15.0
>            Reporter: Paul Rogers
>            Priority: Major
>
> Suppose you have a partitioned table:
> {noformat}
> myTable
>   2018
>     file1.csv
>   2019
>     file2.csv
> {noformat}
> For some time, Drill has included the partitions in a {{SELECT *}} query:
> {noformat}
> SELECT * FROM `myTable`;
> col1.1, col1.2, 2018
> col2.1, col2.2, 2019
> {noformat}
> There are two subtle issues with this behavior:
> * The behavior of partitions is not consistent with the other file metadata (implicit) columns.
> * Because directory depth can vary, and scan order is random, a hard schema change can occur if Drill starts scanning shallow files before deep files.
> This ticked proposes to change the partition behavior to be like that of other file metadata columns: they are included only when requested:
> {noformat}
> SELECT * FROM `myTable`;
> col1.1, col1.2
> col2.1, col2.2
> SELECT *, dir0, filename FROM `myTable`;
> col1.1, col1.2, 2018, file1.csv
> col2.1, col2.2, 2019, file2.csv
> {noformat}
> With this change, there is no possibility of a hard schema change: the user predefines the desired partitions.
> Unfortunately, with the existing readers, a reader that does not have given partition will omit that partition column and will instead leave it to the projection operator to fill in the column, which it will do with a Nullable INT. (The new row-set based scan mechanism handles this case correctly.)
> h4. Risks
> Note that this change does change user-visible behavior. If a user has been able to get {{SELECT *}} to work with partition columns, the query will have to change to include partition columns. However, it may be that the risk of such a breaking change is low because:
> * Users are generally discouraged from using {{SELECT *}} in a production query.
> * Use of {{SELECT *}} in a non-uniform partitioning structure would have caused failures due to the hard schema change noted above.
> For this reason, the benefits of the change appear to outweigh the risks.
> h4. Technical Background
> In the last year, Calcite appears to have added the ability to specify a wildcard plus extra columns. When used with implicit columns, we can now say:
> {code:sql}
> SELECT *, filename FROM myTable;
> {code}
> However, while the readers (at least the CSV reader) can handle this case, the {{ProjectRecordBatch}} cannot.
> Modify the {{TestCsv.java}} test case with the following test:
> {code:java}
>   @Test
>   public void testImplicitColWildcard() throws IOException {
>     String sql = "SELECT *, filename FROM `dfs.data`.`%s`";
>     RowSet actual = client.queryBuilder().sql(sql, CASE2_FILE_NAME).rowSet();
>     actual.print();
>     TupleMetadata expectedSchema = new SchemaBuilder()
>         .add("a", MinorType.VARCHAR)
>         .add("b", MinorType.VARCHAR)
>         .add("c", MinorType.VARCHAR)
>         .addNullable("filename", MinorType.VARCHAR)
>         .buildSchema();
>     RowSet expected = new RowSetBuilder(client.allocator(), expectedSchema)
>         .addRow("10", "foo", "bar", CASE2_FILE_NAME)
>         .build();
>     RowSetUtilities.verify(expected, actual);
>   }
> {code}
> The output of the {{actual.print()}} is:
> {noformat}
> #: a, b, c, filename
> 0: "10", "foo", "bar", "case2.csv"
> {noformat}
> Now, try the same thing, but substitute "dir0" for "filename". We would expect to see something like the above. What we actually see is:
> {noformat}
> #: a, b, c, dir0, dir00
> 0: "10", "foo", "bar", null, null
> {noformat}
> Note that I'm trying this on a "new" CSV reader that fills in "dir0". To see the same thing on the master branch, put the CSV file under a directory and query the directory.
> The problem is traced to [here|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java#L592]:
> {code:java}
>   private boolean isImplicitFileColumn(ValueVector vvIn) {
>     return ColumnExplorer.initImplicitFileColumns(context.getOptions()).get(vvIn.getField().getName()) != null;
>   }
> {code}
> This has two problems:
> 1. It creates a map of implicit column names, but does not handle parsing names like "dir0".
> 2. It creates the map over and over: once per column per schema change. Very inefficient.
> The solution is to modify the code to use the {{isPartitionColumn()}} method in {{ColumnExplorer}}. Plus, create the {{ColumnExplorer}} once per project operator instance and reuse it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)