You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/12/30 03:40:00 UTC
[jira] [Commented] (DRILL-5958) Revisit the List and RepeatedList vectors

    [ https://issues.apache.org/jira/browse/DRILL-5958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16306653#comment-16306653 ] 

Paul Rogers commented on DRILL-5958:
------------------------------------

A bit more investigation uncovered the following:

* The {{ListVector}} acts like a single repeated list, but in which each array can be null. That is, it supports the following JSON:

{noformat}
{a: [10, 20] } {a: null }
{noformat}

* The {{ListVector}} was not fully implemented, does not work, and is not used anywhere in Drill.
* The {{RepeatedListVector}} is, essentially, another offset vector layer on top of any repeated type. It *does not* allow nulls.
* The type/mode for {{ListVector}} is {{LIST / OPTIONAL}}, while the type/mode for a repeated list is {{LIST / REPEATED}}. This makes little sense because both lists are repeated, and they have different semantics. (One allows nulls, the other does not.)
* The {{RepeatedListVector}} appears to be used only in JSON to encode a 2+D arrays. However, the rest of Drill (except a few array functions) does not handle this type, nor does JDBC or ODBC.

In short, lists are more aspirational than operational: they are very cool promises that the code does not fully implement. Nor is it clear how we'd implement them given that JDBC and ODBC (Drill's primary interfaces) don't support these types.

> Revisit the List and RepeatedList vectors
> -----------------------------------------
>
>                 Key: DRILL-5958
>                 URL: https://issues.apache.org/jira/browse/DRILL-5958
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.11.0
>            Reporter: Paul Rogers
>
> Drill provides a List vector used when reading JSON data. The semantics of this vector are somewhat obscure and overly complex. This ticket asks to clean up the design and implementation of this vector.
> h4. Current Behavior
> Drill contains two kinds of repeated types:
> * Repeated vectors, which exist for all Drill types.
> * List vectors, which are, essentially, a repeated union, with special-case logic for handling a single type.
> Lists are rather hard to explain.
> Drill has 38 types. Each type comes in three cardinalities: Required (0), Optional (0, 1) or Repeated (0..n). Thus, there is an {{IntVector}}, a {{NullableIntVector}} and a {{RepeatedIntVector}}.
> Drill also has the union type. A union can be any combination of other types. The union itself can be null, which means that union instance has neither a type nor a value.
> Lists are, essentially, a repeated union. However, lists have special logic to handle the case where the list has only one member type. In this case, the list effectively mimics the repeated vector for that type. However, unlike repeated vectors, a list vectors carries a "bits" vector, allowing a list entry to be null. The repeated vectors indicate nulls by having no entries in the array for a given row. But, lists allow both the empty array and null array conditions. This is done, presumably, to better mimic JSON.
> A list is a semi-union: it starts as a list of something. As work progresses, we may realize that we need to store some other thing. At that point, the contents of the list morph from the original type to the union type. The result of this decision is that, when inspecting the metadata for a list (in the form of a {{MaterializedField}}), there is no information about list type. Presumably the code must inspect the "data vector" associated with the list to learn the type of the items within the list. (If done at write time, then, of course, the type can change as described above.)
> For this reason, the List type is closely associated with the Union type: a list is, essentially, a "repeated Union", though it starts of as a "repeated something", then evolves to become a repeated union. (Drill has no {{RepeatedUnionVector}}, presumably because {{ListVector}} is the repeated form of the union type.)
> Strangely, Drill also has a {{RepeatedListVector}}, which introduces all manner of ambiguity. Combining these, the cardinality hierarchy for unions is:
> * {{UnionVector}} (like an optional union type)
> * {{ListVector}} (repeated semi-union)
> * {{RepeatedListVector}} (a 2D union array)
> * {{RepeatedListVector}} which contains a {{ListVector}} (a 3D union grid. Note that this could also be implemented as a {{ListVector}} that contains a {{RepeatedListVector}} or even a {{ListVector}} which contains a {{ListVector}} which contains a {{ListVector}} which implicitly contains a {{UnionVector}}.)
> * {{RepeatedListVector}} which contains a {{RepeatedListVector}} (a 4D hyper grid.)
> * And so on.
> For a primitive type, such as Int, we have, for each value:
> * {{IntVector}} or {{NullableIntVector}} (cardinality of 1 or (0,1))
> * {{RepeatedIntVector}} (a 1D list of Int)
> * {{ListVector}} which contains an {{IntVector}} (another 1D list, but individual arrays can be null.)
> * {{RepeatedListVector}} which contains an {{IntVector}} (a 2D array of ints. Presumably could also have been a {{ListVector}} of {{ListVectors}} of {{IntVectors}}.)
> * {{RepeatedListVector}} which contains a {{ListVector} of {{IntVector}} (a 3D cube of ints. This could also be formed by three levels of {{ListVector}}.)
> h4. Examples of Current Behavior
> Lists and repeated types appeared to evolve to support JSON-like structures. For example:
> {code}
> {a: 10} {a: null}
> {code}
> Here, `a` is a nullable scalar and is represented as a {{NullableIntVector}}.
> {code}
> {a: [10, 20]}
> {code}
> Here, `a` is a list of Int and is represented as a {{RepeatedIntVector}}. 
> Drill does not allow nulls in such vectors, so we cannot represent:
> {code}
> {a: [10, null, 20]}
> {code}
> Once we go beyond 1D, we need lists:
> {code}
> {a: [[10, 20], [30, 40]]}
> {code}
> The above requires a {{RepeatedListVector}} that contains an {{IntVector}}.
> {code}
> {a: [[[110, 120], [130, 140]], [210, 220], [230, 240]]}
> {code}
> The above requires a {{RepeatedListVector}} that contains a {{ListVector}} of {{IntVector}}.
> Similarly, since lists can hold any type (just like a union), we can have repeated objects:
> {code}
> {a: [[{x: 0, y: 0}, {x: 1, y: 0}], [{x: 4, y: 0}, {x: 4, y: 1}]]}
> {code}
> The above would be represented as a {{RepeatedListVector}} that contains a {{MapVector}}.
> Because the List vector is a union type, it can also handle heterogeneous lists:
> {code}
> {a: [10, "fred", 123.45, null]}
> {code}
> Since unions support combinations of not just scalars, but also scalars and complex types, Drill can also support:
> {code}
> {a: [10, {b: "foo"}, null, [10, "bob"]]}
> {code}
> h4. Muddy Semantics
> The above show a number of problems that make lists (and unions) far more complex than necessary:
> * Ambiguity of when to use a {{ListVector}} of {{FooVector}} vs. a {{RepeatedFooVector}}.
> * Ambiguity of when to use a {{RepeatedListVector}} of {{FooVector}} vs. a {{ListVector}} of {{ListVector}} of {{FooVector}}.
> The same solution used to handle extra layers of repetition is used to handle variant types (DRILL-5955):
> * Lists can handle any combination of scalars.
> * Lists can handle any structure type (map, repeated map, list, repeated list).
> * Lists are thus not typed. They are not a "List of Int", they are just a List.
> h4. Mapping to SQL
> The above muddy semantics give rise to this question. Drill is a SQL engine, how do we map the List semantics to a relational schema? If we don't have a clean answer, then the List type, while clever, does not have a useful purpose and is instead distracting us from the larger question of how we map JSON-like structures to a relational schema.
> This is not a new problem; it has been with the community at least since the days of XML. There simply XML (and JSON) structures that do not map to a relational structure, though all relational structures can be mapped to JSON.
> Drill's job, then, is to encode any JSON structure which has a relational interpretation. IMHO, it is not Drill's job to encode non-relational structures. (In the same way, in years past, it was not the job of BI tools to encode an XML-encoded HTML document to produce charts or reports.)
> h4. Allowable JSON Structure
> Let us take a stab at the allowable JSON structures:
> * Objects (at any level) which map to either the top-level row, or nested tuples (called "maps" in Drill.)
> * Arrays of scalars which map to scalar arrays in Drill (which must be flattened into the global name space, using dotted names, to map to a JDBC or ODBC client.)
> * Arrays of objects, which map to tuple arrays in Drill (which must be flattened or explicitly joined to map to a JDBC or ODBC client.)
> * Fields with heterogeneous types, which map to union/variant types in Drill (and which must be reconciled to present as SQL to an ODBC/JDBD client.)
> * Array with heterogeneous types, which is one convention for efficiently storing tuples. These map to an array of union/variant types, and must be mapped in Drill to a true tuple, then flattened, to present to a JDBC or ODBC client.
> This definition starts with the relational model, then determines the many ways the relational model can be encoded in JSON. This analysis suggests that we need not handle extreme cases:
> * Multi-dimensional arrays (which can't be mapped to the relational model easily, though they could be mapped to an MDX-style multi-dimensional model.)
> * Heterogeneous collections of simple and structured types except in the tuple-as-array case.
> For the heterogeneous cases, Drill would need some up-front information to know that, say, an array represents a tuple and to expect the heterogeneous encoding. In this case, each array slot would have a single type; it would not make sense to have a tuple-as-array in which each slot is a variant. (There is no sane way to map such a structure back to a fixed-type relational model.)
> h4. Space Considerations
> The union vector is a simple collection of vectors of various types. Every row consumes space in very vector. If the union has n types, then (n-1) of the vectors have wasted space for every row. Since list vectors are, in many cases, repeated union vectors, the space wasted is much greater: every entry in every array consumes space in all member vectors. For example, if row 5 has an array of 100 ints, while row 6 has an array of 50 strings, then both vectors will consume 150 elements across the two rows.
> In short, the list concept is handy, and solves the semantic problem, but it does so at a large cost in wasted space.
> h4. Proposal: Generic Array Vector
> Given the above, we can see that the current List and Union types are far too general for use in a relational system. We can suggest simplifications.
> * The union/variant type handles mismatched data types. (But, as explained in DRILL-5955, a better solution is to dispense with the union/vector type and instead allow the user to specify a type to which the input values are mapped at read time.)
> * The map and repeated map vectors already handle nested tuples and nested tables (arrays of tuples).
> Given this, there is really no need for the union or list types. But, suppose we wanted to keep them anyway (because, say, we are not convinced by the above arguments, Drill already has them, and the community does not want to take the time to consider whether they should be removed.)
> One solution is to rationalize the type system:
> * Define a base (required) vector for each primitive type and for the map (tuple) type.
> * Define a single nullable vector type which holds one of the base vectors.
> * Define a repeated vector type that holds one of the base vectors.
> * If necessary, define a variant which can hold any primitive value and itself is considered a nullable vector.
> In this model, there is exactly one mapping from any array structure to a vector structure.
> || JSON || Vector Structure ||
> | &#123;a: 10&#125; | IntVector |
> | &#123;a: 10&#125;, &#123;a: null&#125; | NullableVector(IntVector) |
> | &#123;a: 10&#125;, &#123;a: "foo"&#125; | VariantVector |
> | &#123;a: \[10, 20]&#125; | ArrayVector(IntVector) |
> | &#123;a: \[\[10, 20], \[30, 40]]&#125; | ArrayVector(ArrayVector(IntVector)) |
> | &#123;a: &#123;b: 10&#125;&#125; | TupleVector |
> | &#123;a: \[&#123;b: 10&#125;, &#123;b: 20&#125;]&#125; | ArrayVector(TupleVector) |
> | &#123;a: \[10, "foo"]&#125; | ArrayVector(VariantVector) |
> | &#123;a: \[10, "foo"]&#125; | TupleVector (if hints provided) |
> h4. Arrow
> [Arrow|https://arrow.apache.org/docs/memory_layout.html] has adopted a solution similar to the proposed {{ArrayVector}}. In Arrow, a {{List}} vector is a list of some type: {{List<Int>}}, say.
> h4. Backward Compatibility
> Because Drill exposes internal vector representation through the network APIs, the above is a breaking change for the network API. A solution such as DRILL-5947 is required before we can implement a solution such as the above.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)