Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2019/02/25 05:11:00 UTC
[jira] [Created] (DRILL-7055) Project operator cannot handle wildcard + implicit cols

Paul Rogers created DRILL-7055:
----------------------------------

             Summary: Project operator cannot handle wildcard + implicit cols
                 Key: DRILL-7055
                 URL: https://issues.apache.org/jira/browse/DRILL-7055
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.15.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers


In the last year, Calcite appears to have added the ability to specify a wildcard plus extra columns. When used with implicit columns, we can now say:

{code:sql}
SELECT *, filename FROM myTable;
{code}

However, while the readers (at least the CSV reader) can handle this case, the {{ProjectRecordBatch}} cannot.

To reproduce, add the following test to {{TestCsv.java}}:

{code:java}
  @Test
  public void testImplicitColWildcard() throws IOException {
    String sql = "SELECT *, filename FROM `dfs.data`.`%s`";
    RowSet actual = client.queryBuilder().sql(sql, CASE2_FILE_NAME).rowSet();
    actual.print();

    TupleMetadata expectedSchema = new SchemaBuilder()
        .add("a", MinorType.VARCHAR)
        .add("b", MinorType.VARCHAR)
        .add("c", MinorType.VARCHAR)
        .addNullable("filename", MinorType.VARCHAR)
        .buildSchema();

    RowSet expected = new RowSetBuilder(client.allocator(), expectedSchema)
        .addRow("10", "foo", "bar", CASE2_FILE_NAME)
        .build();
    RowSetUtilities.verify(expected, actual);
  }
{code}

The output of {{actual.print()}} is:

{noformat}
#: a, b, c, filename
0: "10", "foo", "bar", "case2.csv"
{noformat}

Now run the same test, but substitute "dir0" for "filename". We would expect output like the above, with a single "dir0" column. Instead, an extra "dir00" column appears:

{noformat}
#: a, b, c, dir0, dir00
0: "10", "foo", "bar", null, null
{noformat}

Note that I'm trying this on a "new" CSV reader that fills in "dir0". To see the same thing on the master branch, put the CSV file under a directory and query the directory.

The problem traces to [this method|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java#L592]:

{code:java}
  private boolean isImplicitFileColumn(ValueVector vvIn) {
    return ColumnExplorer.initImplicitFileColumns(context.getOptions()).get(vvIn.getField().getName()) != null;
  }
{code}

This has two problems:

1. It creates a map of implicit column names and looks the column up by exact name, so it does not handle names like "dir0" that must be parsed (a prefix plus an index).
2. It rebuilds the map over and over: once per column per schema change. This is very inefficient.

The solution is to modify the code to use the {{isPartitionColumn()}} method in {{ColumnExplorer}}, and to create the {{ColumnExplorer}} once per project operator instance and reuse it.
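
To illustrate the shape of the fix, here is a minimal standalone sketch. It is not Drill code: {{ImplicitColumnMatcher}} and its methods are hypothetical names, and the implicit column names and the "dir" prefix are passed in as plain values rather than read from the option manager. It shows only the two ideas above: build the lookup once, and match partition columns by pattern rather than by exact name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical stand-in for a per-operator helper; NOT part of Drill's API.
final class ImplicitColumnMatcher {
  // Fixed implicit column names, copied once at construction and reused.
  private final Set<String> implicitNames;
  // Matches partition columns: the prefix followed by one or more digits.
  private final Pattern partitionPattern;

  ImplicitColumnMatcher(Set<String> implicitNames, String partitionPrefix) {
    this.implicitNames = new HashSet<>(implicitNames);
    this.partitionPattern =
        Pattern.compile(Pattern.quote(partitionPrefix) + "\\d+");
  }

  // True for a fixed implicit column (exact match) or a partition column
  // (pattern match). Case-sensitive here, purely to keep the sketch short.
  boolean isImplicitOrPartition(String columnName) {
    return implicitNames.contains(columnName)
        || partitionPattern.matcher(columnName).matches();
  }

  public static void main(String[] args) {
    // Assumed example values; a real operator would take these from its options.
    Set<String> names =
        new HashSet<>(Arrays.asList("fqn", "filepath", "filename", "suffix"));
    ImplicitColumnMatcher matcher = new ImplicitColumnMatcher(names, "dir");

    for (String col : new String[] {"a", "filename", "dir0"}) {
      System.out.println(col + " -> " + matcher.isImplicitOrPartition(col));
    }
  }
}
```

In the project operator, an object like this would be created once (for example, when the operator is set up) and consulted for each incoming vector, instead of rebuilding the map inside {{isImplicitFileColumn()}}.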



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)