You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Paul Rogers (Jira)" <ji...@apache.org> on 2020/01/21 19:34:00 UTC

[jira] [Created] (DRILL-7545) Projection ambiguities in complex types

Paul Rogers created DRILL-7545:
----------------------------------

             Summary: Projection ambiguities in complex types
                 Key: DRILL-7545
                 URL: https://issues.apache.org/jira/browse/DRILL-7545
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.17.0
            Reporter: Paul Rogers


Summarized from an e-mail chain on the dev mailing list:

We recently introduced the DICT type. We also added the EVF framework. We have a bit of code which parses the projection list, then checks if a column from a reader is consistent with projection. The idea is to ensure that the columns produced by a Scan will be valid when a Project later tries to use them with the given project list. And, if the Scan says it can support Project-push-down, then the Scan is obligated to do the full check.

First we'll explain how I'll solve the projection problem given your explanation. Then we'll point out three potential ambiguities. Thanks to Bohdan for his explanations.

The problems here are not due to any one person. As explained below, they are due to trying to add concepts into SQL which SQL is not well-suited to support.

h4. Projection for DICT Types

Queries go through two major steps: planing and execution. At the planning stage we use SQL syntax for the project list. For example:

{code:sql}
explain plan for SELECT a, e.`map`.`member`, `dict`['key'], `array`[10]  FROM cp.`employee.json` e
{code}

The planner sends an execution plan to operators. The project list appears in JSON. For the above:

{code:json}
   "columns" : [ "`a`", "`map`.`member`", "`dict`.`key`", "`array`[10]" ],
{code}

We see that the JSON works as Bohdan described:

* The SQL map "map.member" syntax is converted to "`map`.`member`" in the JSON plan.
* The SQL DICT "`dict`['key']" syntax is converted to a form identical to maps: "`dict`.`key`".
* The SQL DICT/array "`array`[10]" syntax is converted to "`array`[10]" in JSON.

That is, on the execution side, we can't tell the difference between a MAP and a DICT request. We also can't tell the difference between an Array and DICT request. Apparently, because of this, the Schema Path parser does not recognize DICT syntax.

Given the way projection works, "a.b" and "a['b']" are identical: either works for both a map or a DICT with VARCHAR keys. That is, we just say that map and array projection are both compatible with a DICT column?

h4. Projection Checking in Scan

Mentioned above is that a Scan that supports Project-push-down must ensure that the output columns match the projection list. Doing that check is quite easy when the projection is simple: `a`. The column `a` can match a data column `a` of any type.

The task is bit harder when the projection is an array `a[0]`. Since this now means either array or DICT with an INT key, this projected column can match:

* Any REPEATED type
* A LIST
* A non-REPEATED DICT with INT, BIGINT, SMALLINT or TINYINT keys (ignoring the UINTx types)
* A REPEATED DICT with any type of key
* A UNION (because a union might contain a repeated type)

We can also handle a map projection: `a.b` which matches:

* A (possibly repeated) map
* A (possibly repeated) DICT with VARCHAR keys
* A UNION (because a union might contain a possibly-repeated map)
* A LIST (because the list can contain a union which might contain a possibly-repeated map)

Things get very complex indeed when we have multiple qualifiers such as `a[0][1].b` which matches:

* A LIST that contains a repeated map
* A REPEATED LIST that contains a (possibly-repeated) map
* A DICT with an INT key that has a value of a repeated map
* A REPEATED DICT that contains an INT key that contains a MAP
* (If we had sufficient metadata) A LIST that contains a REPEATED DICT with a VARCHAR key.

h4. DICT Projection Ambiguities

The DICT type introduces an ambiguity. Note above that `a.b` can refer to either a REPEATED or non-REPEATED MAP. If non-repeated, `a.b` means to get the one value for member `b` of map `a`. But, if the map is REPEATED, this means to project an array of `b` values obtained from the array of maps.

For a DICT, there is an ambiguity with `a[0][1]` if the DICT is repeated DICT of INT keys and REPEATED BIGINT values: that is ARRAY<DICT<INT, ARRAY<BIGINT>>>. Does `a[0][1]` mean to pull out the 0th element of the REPEATED DICT, then lookup where the key == 1? Or, does it mean to pull out all the DICT array values where the key == 0 and then pull out the 1st value of the INT array? That is, because we have an implied (in all members of the array) syntax, one can interpret this case as:

{noformat}
repeatedDict[0].valueOf(1) --> ARRAY<BIGINT>
-- All the values in the key=1 array of element 0
{noformat}

or

{noformat}
repeatedDict.valueOf(0)[1] --> ARRAY<BIGINT>
-- All the values in the key=0, element 1 positions across all DICT elements
{noformat}

It would seem to make sense to prefer the first interpretation. Unfortunately, MAPs already use the second interpretation.

h4. UNION and LIST Ambiguities

Then there are our loosey-goosey types like UNION and LIST which become hyper-difficult: it is impossible to validate a projection against a type that contains anything. We just have to wait until execution time to see what happens to appear inside the type. These are even highly ambiguous:

Is `a[0].b` in a UNION a reference to:

* The 0th element of a LIST which contains a DICT with VARCHAR keys?
* The 0th element of a LIST which contains a MAP that contains column b?
* The 0th element of a REPEATED MAP that contains column b?
* The 0th element of a REPEATED DICT which contains a UNION which contains a MAP?
* The slice though a REPEATED LIST where a REPEATED DICT has a VARCHAR key of 'b'.
* And so on...

That is, since we have no restrictions on what can go into a UNION or a LIST, it is possible for paths to be ambiguous if the UNION contains two subtypes that both support the requested path. At present our only recourse is to observe that no one is crazy enough to set up this scenario -- in part because UNION barely works. But, having a type system which is inherently ambiguous is not a Good Thing.

For now, I'm solving the UNION and LIST problem by doing nothing: these types don't fully work and no one uses them. So, I'll leave fixing this issue as an exercise for some future time.

h4. DICT Join Ambiguities

As I think about how DICT might work, I wonder about another case: where I want to do a lateral join of a DICT with its surrounding record. Given that key lookup looks the same as member lookup, how we differentiate the following two cases:

WHERE dict['key'] = 10 -- Filter

WHERE dict.key = parent.col -- Join between DICT keys and some other column

The answer is: we can't. We'd need different projection syntax to how that in the first case, 'key' is a key value, while in the second case, 'key' is the name of one of the two DICT fields.

And, if we change the syntax to differentiate map and DICT keys, then, I've got to rethink and recode the projection checking logic I just discussed.

h4. Solutions

In other engines, we have a known schema and we can just follow the projection path down through the types to see if the path is valid. Since the schema is known, the planner can do the projection planning. In Drill, things are backward; we generally don't know the shape of data until we've read it; then at some point, we have to match the projection against what we've read thus far. This leads to all kinds of special cases such as those we've been discussing.

Didn't we pick up DICT from the HIVE MAP type? How does HIVE handle this case (recognizing that HIVE has an up-front schema.) What path resolutions rules do you think we should use in Drill to avoid ambiguities?

This is, by the way, a reason that I think we've gone too far overboard in trying to fit complex types into SQL: the cases get ambiguous, SQL is not expressive enough to handle the complex references, and as a result our code gets exponentially complex and costly to maintain as we try to force-fit into SQL concepts that SQL is not designed to handle.

This stuff is *hard*; we're designing a complex data type and language system within a context that does not provide the required semantics.  We're leaving ambiguous (UNION, LIST) things which need to be defined for SQL to work. This is screaming at us: "we're doing it wrong!"



--
This message was sent by Atlassian Jira
(v8.3.4#803005)