You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2018/04/08 05:49:00 UTC

[jira] [Comment Edited] (DRILL-6312) Enable pushing of cast expressions to the scanner for better schema discovery.

    [ https://issues.apache.org/jira/browse/DRILL-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429634#comment-16429634 ] 

Paul Rogers edited comment on DRILL-6312 at 4/8/18 5:48 AM:
------------------------------------------------------------

While type inference (using Cast and other hints) is a very good idea, it cannot be the full answer. Here is why:

* The only way to express a type is to include the column in a SELECT clause. If a column is not projected, no hint can be provided, and we can end up with possible read-time problems as discussed in the original e-mail thread ("Death of Schema on Read").
* The only way to express the type of a column is to explicitly include it in the SELECT clause. Using a wildcard ("*") query will bypass the type rules unless there is a view underneath that applies the rules.
* There is no way to type just the pesky, troublesome columns, leaving the others to be detected automatically. If we must use a view, and we have to, say, use a cast for column x, then we have to include all other columns in the SELECT clause or we end up projecting only x. We can't use a wildcard for the other columns.
* Putting the type information in the query puts the burden on the query writer (and, ultimately, something like Tableau.) But, the schema is a property of the data, not the query, so this is not good model of reality.

For this reason, the cast idea, though elegant, and a very good enhancement, cannot be the full answer, It will reduce the number of cases where type ambiguity occurs, but it is not a general-purpose solution.

A general-purpose solution would be to provide some means to explicitly apply type information. For example, in a view or query, provide explicit hint syntax:

{noformat}
SELECT * FROM myFunkyTable
  WITH HINTS (f: INT, m.x: BIGINT NOT NULL,  a[]: VARCHAR NULL)
{noformat}

The hints say that, if fields "f", "m.x" and "a" appear, they are of the type specified. If the fields don't appear, just ignore the hints.

Most systems put this information in metadata, but Drill is very hostile to metadata, so it must be in the query (or, equivalently, a view.)


was (Author: paul-rogers):
While type inference (using Cast and other hints) is a very good idea, it cannot be the full answer. Here is why:

* The only way to express a type is to include the column in a SELECT clause. If a column is not projected, no hint can be provided, and we can end up with possible read-time problems as discussed in the original e-mail thread ("Death of Schema on Read").
* The only way to express the type of a column is to explicitly include it in the SELECT clause. Using a wildcard ("*") query will bypass the type rules unless there is a view underneath that applies the rules.
* There is no way to type just the pesky, troublesome columns, leaving the others to be detected automatically. If we must use a view, and we have to, say, use a cast for column x, then we have to include all other columns in the SELECT clause or we end up projecting only x.

For this reason, the cast idea, though elegant, and a very good enhancement, cannot be the full answer, It will reduce the number of cases where type ambiguity occurs, but it is not a general-purpose solution.

A general-purpose solution would be to provide some means to explicitly apply type information. For example, in a view or query, provide explicit hint syntax:

{noformat}
SELECT * FROM myFunkyTable
  WITH HINTS (f: INT, m.x: BIGINT NOT NULL,  a[]: VARCHAR NULL)
{noformat}

The hints say that, if fields "f", "m.x" and "a" appear, they are of the type specified. If the fields don't appear, just ignore the hints.

Most systems put this information in metadata, but Drill is very hostile to metadata, so it must be in the query (or, equivalently, a view.)

> Enable pushing of cast expressions to the scanner for better schema discovery.
> ------------------------------------------------------------------------------
>
>                 Key: DRILL-6312
>                 URL: https://issues.apache.org/jira/browse/DRILL-6312
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators, Query Planning &amp; Optimization
>    Affects Versions: 1.13.0
>            Reporter: Hanumath Rao Maduri
>            Priority: Major
>
> Drill is a schema less engine which tries to infer the schema from disparate sources at the read time. Currently the scanners infer the schema for each batch depending upon the data for that column in the corresponding batch. This solves many uses cases but can error out when the data is too different between batches like int and array[int] etc... (There are other cases as well but just to give one example).
> There is also a mechanism to create a view by type casting the columns to appropriate type. This solves issues in some cases but fails in many other cases. This is due to the fact that cast expression is not being pushed down to the scanner but staying at the project or filter etc operators up the query plan.
> This JIRA is to fix this by propagating the type information embedded in the cast function to the scanners so that scanners can cast the incoming data appropriately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)