You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2018/05/03 17:32:00 UTC
[jira] [Comment Edited] (DRILL-6383) View column types, modes are plan-time guesses, not actual types

    [ https://issues.apache.org/jira/browse/DRILL-6383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16462828#comment-16462828 ] 

Paul Rogers edited comment on DRILL-6383 at 5/3/18 5:31 PM:
------------------------------------------------------------

Thanks [~amansinha100] for the explanation. It is clear *why* Drill works as it does. The issue here is that, by including columns, Drill appears to be making a statement that it can provide type information, which it can't. Also, by including nullability, Drill appears to be saying it can provide that information, when it can't.

The key issues is that Drill seems to be offering a feature on the one hand (schema information), then, in the fine print, retracting that offer.

Providing false and misleading schema information may be useful for some use cases, but is not super helpful in most cases.

One simple solution is to omit the column information from the view: that sends a clear message that Drill does not know the types or cardinality, that the view can't provide that information, and that it is available only by actually running the query against a specific input file (since types can differ when run against different files.)

More to the point, in explaining this to others, what is the value-add from partial and incorrect schema info? How does this help the Tableau user, say?


was (Author: paul-rogers):
Thanks [~amansinha100] for the explanation. It is clear *why* Drill works as it does. The issue here is that, by including columns, Drill appears to be making a statement that it can provide type information, which it can't. Also, by including nullability, Drill appears to be saying it can provide that information, when it can't.

The key issues is that Drill seems to be offering a feature on the one hand (schema information), then, in the fine print, retracting that offer.

Providing false and misleading schema information may be useful for some use cases, but is not super helpful in most cases.

One simple solution is to omit the column information from the view: that sends a clear message that Drill does not know the types or cardinality, that the view can't provide that information, and that it is available only by actually running the query against a specific input file (since types can differ when run against different files.)

> View column types, modes are plan-time guesses, not actual types
> ----------------------------------------------------------------
>
>                 Key: DRILL-6383
>                 URL: https://issues.apache.org/jira/browse/DRILL-6383
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.13.0
>            Reporter: Paul Rogers
>            Priority: Minor
>
> Create a view views and look at the list of columns within the view. You'll see that they are often wrong in name, type and mode.
> Consider a very simple CSV file with headers:
> {noformat}
> custId,name,balance,status
> 123,Fred,456.78
> 125,Betty,98.76,VIP
> 128,Barney,1.23,PAST DUE,30
> {noformat}
> Define the simplest possible view:
> {noformat}
> CREATE VIEW myView2 AS SELECT * FROM `csvh/cust.csvh`;
> {noformat}
> Then look at the view file:
> {noformat}
> {
>   "name" : "myView2",
>   "sql" : "SELECT *\nFROM `csvh/cust.csvh`",
>   "fields" : [ {
>     "name" : "**",
>     "type" : "DYNAMIC_STAR",
>     "isNullable" : true
>   } ],
>   "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> It is clear that the view simply captured the plan-time list of the new double-star for the wildcard. Since this is not a true type, it should not have an `isNullable` attribute.
> OK, we have to spell out the columns:
> {noformat}
> CREATE VIEW myView3 AS SELECT custId  FROM `csvh/cust.csvh`;
> {noformat}
> Let's look at the view file:
> {noformat}
> {
>   "name" : "myView3",
>   "sql" : "SELECT `custId`\nFROM `csvh/cust.csvh`",
>   "fields" : [ {
>     "name" : "custId",
>     "type" : "ANY",
>     "isNullable" : true
>   } ],
>   "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> The name is correct. The type is `ANY`, which is wrong. Since this is a CSV file, the column type is `VARCHAR`. Further, because this is a CSV file which headers, the mode is REQUIRED, but is listed as nullable. To verify:
> {noformat}
> SELECT sqlTypeOf(custId), modeOf(custId) FROM myView3 LIMIT 1;
> +--------------------+-----------+
> |       EXPR$0       |  EXPR$1   |
> +--------------------+-----------+
> | CHARACTER VARYING  | NOT NULL  |
> +--------------------+-----------+
> {noformat}
> Now, let's try a CSV file without headers:
> {noformat}
> 123,Fred,456.78
> 125,Betty,98.76,VIP
> {noformat}
> {noformat}
> CREATE VIEW myView4 AS SELECT columns FROM `csv/cust.csv`;
> SELECT * FROM myView4;
> +--------------------------------+
> |            columns             |
> +--------------------------------+
> | ["123","Fred","456.78"]        |
> | ["125","Betty","98.76","VIP"]  |
> +--------------------------------+
> {noformat}
> Let's look at the view file:
> {noformat}
> {
>   "name" : "myView4",
>   "sql" : "SELECT `columns`\nFROM `csv/cust.csv`",
>   "fields" : [ {
>     "name" : "columns",
>     "type" : "ANY",
>     "isNullable" : true
>   } ],
>   "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> This is almost non-sensical. `columns` is reported as type `ANY` and nullable. But, `columns` is Repeated `VARCHAR` and repeated types cannot be nullable.
> The conclusion is that the type information is virtually worthless and the `isNullable` information is worse than worthless: it is plain wrong.
> The type information is valid only if the planner can inver types:
> {noformat}
> CREATE VIEW myView5 AS
>   SELECT CAST(custId AS INTEGER) AS custId FROM `csvh/cust.csvh`;
> {noformat}
> The view file:
> {noformat}
> {
>   "name" : "myView5",
>   "sql" : "SELECT CAST(`custId` AS INTEGER) AS `custId`\nFROM `csvh/cust.csvh`",
>   "fields" : [ {
>     "name" : "custId",
>     "type" : "INTEGER",
>     "isNullable" : true
>   } ],
>   "workspaceSchemaPath" : [ "local", "data" ]
> }
> {noformat}
> Note that the `type` is inferred from the cast, but `isNullable` is wrong because the underlying column is non-nullable:
> {noformat}
> SELECT modeOf(custId) FROM myView5 LIMIT 1;
> +-----------+
> |  EXPR$0   |
> +-----------+
> | NOT NULL  |
> +-----------+
> {noformat}
> Expected that Drill would run the underlying query as a `LIMIT 0` query to extract the actual column types, and use that in the view.
> Or, expected that Drill would simply omit the column list from the view if the data is meaningless.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)