Posted to issues@drill.apache.org by "Hanumath Rao Maduri (JIRA)" <ji...@apache.org> on 2018/04/07 15:05:00 UTC

[jira] [Commented] (DRILL-6312) Enable pushing of cast expressions to the scanner for better schema discovery.

    [ https://issues.apache.org/jira/browse/DRILL-6312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429403#comment-16429403 ] 

Hanumath Rao Maduri commented on DRILL-6312:
--------------------------------------------

Please find below the mail thread which discusses various issues and approaches for dealing with schema discovery.

{noformat}
Hi Hanu,

The problem with views as is, even with casts, is that the casting comes too late to resolve the issues I highlighted in earlier messages. Ted's cast push-down idea causes the conversion to happen at read time so that we can, say, cast a string to an int, or cast a null to the proper type.

Today, if we use a cast, such as SELECT CAST(a AS INT) FROM myTable, then we get a DAG that has three parts (to keep things simple):

* Scan the data, using types inferred from the data itself
* In a Project operator, convert the type of data to INT
* In Screen, return the result to the user

If the type is ambiguous in the file, then the first step above fails; data never gets far enough for the Project to kick in and apply the cast. Also, if a file contains a run of nulls, the scanner will choose Nullable INT, then fail when it finds, say, a string.
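
For illustration, a minimal made-up file and query that hit this case (file name and data invented):

-- nulls.json (hypothetical): {"a": null} {"a": null} ... {"a": "123"}
SELECT CAST(a AS INT) FROM dfs.`nulls.json`;
-- The scan picks Nullable INT for the leading nulls, then hits the string
-- "123" and fails before the downstream cast ever runs.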

The key point is that the cast push-down means that the query will not fail due to dicey files: the cast resolves the ambiguity. If we push the cast down, then it is the SCAN operator that resolves the conflict and does the cast, avoiding the failures we've been discussing.

I like the idea you seem to be proposing: cascading views. Have a table view that cleans up each table. Then, these can be combined in higher-order views for specialized purposes.
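
Something like this, perhaps (workspace, view, file, and column names invented for illustration):

-- per-table "cleanup" view that pins the types with casts
CREATE VIEW dfs.tmp.`orders_clean` AS
SELECT CAST(order_id AS INT) AS order_id,
       CAST(amount AS DOUBLE) AS amount,
       region
FROM dfs.`orders.json`;

-- higher-order view built on top of the cleaned-up table view
CREATE VIEW dfs.tmp.`orders_by_region` AS
SELECT region, SUM(amount) AS total_amount
FROM dfs.tmp.`orders_clean`
GROUP BY region;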

The beauty of the cast push-down idea is that no metadata is needed other than the query. If the user wants metadata, they use existing views (that contain the casts and cause the cast push-down.)

This seems like such a simple, elegant solution that we could try it out quickly (if we get past the planner issues Aman mentioned.) In fact, the new scan operator code (done as part of the batch sizing work) already has a prototype mechanism for type hints. If type hints are provided to the scanner, it uses them; otherwise it infers the types. We'd just hook up the cast push-down data to that prototype and we could try out the result quickly. (The new scan operator is still in my private branch, in case anyone goes looking for it...)

Some of your discussion talks about automatically inferring the schema. I really don't think we need to do that. The hint (cast push-down) is sufficient to resolve ambiguities in the existing scan-time schema inference.

The syntax trick would be to find a way to provide hints just for those columns that are issues. If I have a table with columns a, b, ... z, but only b is a problem, I don't want to have to do:

SELECT a, CAST(b AS INT), c, ... z FROM myTable

Would be great if we could just do:

SELECT *, CAST(b AS INT) FROM myTable

I realize the above has issues; the key idea is: provide casts only for the problem fields without spelling out all fields.

If we really want to get fancy, we can do UDF push down for the complex cases you mentioned. Maybe:

SELECT *, CAST(b AS INT), parseCode(c) ...

We are diving into design here; maybe you can file a JIRA and we can shift detailed design discussion to that JIRA. Salim already has one related to schema change errors, which was why the "Death" article caught my eye.

Thanks,
- Paul





    On Friday, April 6, 2018, 4:59:40 PM PDT, Hanumath Rao Maduri <ha...@gmail.com> wrote:

 Hello,

Thanks for Ted & Paul for clarifying my questions.
Sorry for not being clear in my previous post, When I said create view I
was under the impression for simple views where we use cast expressions
currently to cast them to types. In this case planner can use this
information to force the scans to use this as the schema.

If the query fails, it fails at the scan and not after the scanner has
inferred the schema.

I know that views can get complicated with joins and expressions. For
schema hinting through views, I assume they should be created on single
tables, with the columns one wants to project from the table.


On the same question, we had a discussion with Aman today. Here a view
can be considered a "view" of the table with a schema in place.

We can adapt the syntax a bit for specifying a schema, something like
this:

create schema[optional] view(/virtual table) v1 as (a: int, b: int)
select a, b from t1

along with some other rules as to the conversion of scalar to complex
types.

Queries on this view (below) should then enable the scanner to use this
type information and convert the data into the appropriate types:

select * from v1

For the case where the schema is not known by the user, maybe use
something like this:

create schema[optional] view(/virtual table) v1 as select a, b from t1
infer schema.

When this view is used to query the table, it should trigger the logic of
inferring and consolidating the schema and attaching that inferred schema
to the view. In the future, when we use the same view, we should use the
inferred schema. This view can either be a local view pertaining to the
session or a global view so that queries across sessions can use it.


By default we can apply certain rules, such as converting simple scalar
values to other scalar values (like int to double). But we should also be
able to give the customer an option to enable rules such as scalar int to
array[int] when creating the view itself, as in the sketch below.
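
A tiny made-up example of why such a rule helps (file name and data
invented):

-- mixed.json (hypothetical): {"a": 1} {"a": [2, 3]}
-- Plain scan-time inference picks INT for row 1, then fails on row 2.
-- With a scalar-int-to-array[int] rule enabled on the view, both rows
-- could be read as array<int>: row 1 becomes [1], row 2 stays [2, 3].
SELECT a FROM v1;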


Thanks,
-Hanu


On Fri, Apr 6, 2018 at 3:10 PM, Paul Rogers <pa...@yahoo.com.invalid>
wrote:

> Ted, this is why your participation in Drill is such a gift: cast
> push-down is an elegant, simple solution that even works in views.
> Beautiful.
>
> Thanks,
> - Paul
>
>
>
>    On Friday, April 6, 2018, 11:35:37 AM PDT, Ted Dunning <
> ted.dunning@gmail.com> wrote:
>
>  On Thu, Apr 5, 2018 at 9:43 PM, Paul Rogers <pa...@yahoo.com.invalid>
> wrote:
>
> > Great discussion. Really appreciate the insight from the Drill users!
> >
> > To Ted's points: the simplest possible solution is to allow a table
> > function to express types. Just making stuff up:
> >
> > SELECT a FROM schema(myTable, (a: INT))
> >
>
> Why not just allow cast to be pushed down to the reader?
>
> Why invent new language features?
>
> Or, really ugly, a session option:
> >
> > ALTER SESSION SET schema.myTable="a: INT"
> >
>
> These are a big problem.
>
>
 

Paul Rogers <par0328@yahoo.com.invalid> via drill.apache.org wrote to dev:
Hi Aman,

As we get into details, I suggested to Hanu that we move the discussion into a JIRA ticket.

 >On the subject of CAST pushdown to Scans, there are potential drawbacks

 >  - In general, the planner will see a Scan-Project where the Project has  CAST functions.  But the Project can have arbitrary expressions,  e.g  CAST(a as INT) * 5

Suggestion: push the CAST(a AS INT) down to the scan, do the a * 5 in the Project operator.
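
Roughly, as a sketch of the intended split (not actual plan output; names taken from the example above):

-- user query
SELECT CAST(a AS INT) * 5 FROM myTable;
-- proposed split: the scan applies CAST(a AS INT) while reading column a,
-- and the Project keeps only the residual expression "a * 5" on the
-- already-typed column.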

>  or a combination of 2 CAST functions 

If the user does a two-stage cast, CAST(CAST(a AS INT) AS BIGINT), then one simple rule is to push only the innermost cast downwards.

> or non-CAST functions etc.

Just keep it in Project.

 >    It would be quite expensive to examine each expression (there could be hundreds) to determine whether it is eligible to be pushed to the Scan.

Just push CAST(<column_ref> AS <type>). Even that would be a huge win. Note, for CSV, it might have to be CAST(columns[2] AS INT), since "columns" is special for CSV.
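
For example, for a headerless CSV file (file name and column positions made up):

SELECT CAST(columns[0] AS INT) AS id,
       columns[1] AS name,
       CAST(columns[2] AS DOUBLE) AS price
FROM dfs.`items.csv`;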

>   - Expressing Nullability is not possible with CAST.  If a column should be tagged as  (not)nullable, CAST syntax does not allow that.

Can we just add keywords: CAST(a AS INT NULL), CAST(b AS VARCHAR NOT NULL) ?

 >  - Drill currently supports CASTing to a SQL data type, but not to the complex types such as arrays and maps.  We would have to add support for that from a language perspective as well as the run-time.  This would be non-trivial effort.

The term "complex type" is always confusing. Consider a map. The rules would apply recursively to the members of the map. (Problem: today, if I reference a map member, Drill pulls it to the top level: SELECT m.a creates a new top-level field, it does not select "a" within "m". We need to fix that anyway.  So, CAST(m.a AS INT) should imply the type of column "a" within map "m".

For arrays, the problem is more complex. Perhaps more syntax: CAST(a[] AS INT) to force array elements to INT. Maybe use CAST(a[][] AS INT) for a repeated list (2D array).

Unions don't need a solution as they are their own solution (they can hold multiple types.) Same for (non-repeated) lists. 

To resolve runs of nulls, maybe allow CAST(m AS MAP). Or we can imply that "m" is a Map from the expression CAST(m.a AS INT). For arrays, the previously suggested CAST(a[] AS INT). If columns "a" or "m" turn out to be a non-null scalar, then we have no good answer.

CAST cannot solve the nasty cases of JSON in which some fields are complex, some scalar. E.g. {a: 10} {a: [20]} or {m: "foo"} {m: {value: "foo"}}. I suppose no solution is perfect...

I'm sure that, if someone gets a chance to design this feature, they'll find lots more issues. Maybe cast push-down is only a partial solution. But it seems to solve so many of the JSON and CSV cases that I've seen that it seems too good to pass up.

Thanks,


- Paul 
{noformat}


> Enable pushing of cast expressions to the scanner for better schema discovery.
> ------------------------------------------------------------------------------
>
>                 Key: DRILL-6312
>                 URL: https://issues.apache.org/jira/browse/DRILL-6312
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators, Query Planning & Optimization
>    Affects Versions: 1.13.0
>            Reporter: Hanumath Rao Maduri
>            Priority: Major
>
> Drill is a schema-less engine which tries to infer the schema from disparate sources at read time. Currently the scanners infer the schema for each batch depending upon the data for that column in the corresponding batch. This solves many use cases but can error out when the data is too different between batches, such as int and array[int] (there are other cases as well; this is just one example).
> There is also a mechanism to create a view that type-casts the columns to the appropriate types. This solves issues in some cases but fails in many others. This is because the cast expression is not pushed down to the scanner but stays at the project or filter operators higher up in the query plan.
> This JIRA is to fix this by propagating the type information embedded in the cast function to the scanners so that the scanners can cast the incoming data appropriately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)