You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/11/13 01:54:00 UTC

[jira] [Updated] (DRILL-5958) Revisit the List and RepeatedList vectors

     [ https://issues.apache.org/jira/browse/DRILL-5958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogers updated DRILL-5958:
-------------------------------
    Description: 
Drill provides a List vector used when reading JSON data. The semantics of this vector are somewhat obscure and overly complex. This ticket asks to clean up the design and implementation of this vector.

h4. Current Behavior

Drill contains two kinds of repeated types:

* Repeated vectors, which exist for all Drill types.
* List vectors, which exist outside the repeated vector system.

Lists are rather hard to explain.

Drill has 38 types. Each type comes in three cardinalities: Required (0), Optional (0, 1) or Repeated (0..n). Thus, there is an {{IntVector}}, a {{NullableIntVector}} and a {{RepeatedIntVector}}.

Lists are an an odd duck and exist outside of this system. A list is not simply another level of repetition (a {{RepeatedRepeatedIntVector}}. Rather, a list is heterogeneous: it is just a list of something.

A list is a semi-union: it starts as a list of something. As work progresses, we may realize that we need to store some other thing. At that point, the contents of the list morph from the original type to the union type. The result of this decision is that, when inspecting the metadata for a list (in the form of a {{MaterializedField}}), there is no information about list type. Presumably the code must inspect the "data vector" associated with the list to learn the type of the items within the list. (If done at write time, then, of course, the type can change as described above.)

For this reason, the List type is closely associated with the Union type: a list is, essentially, a "repeated Union", though it starts of as a "repeated something", then evolves to become a repeated union. (Drill has no {{RepeatedUnionVector}}, presumably because {{ListVector}} is the repeated form of the union type.)

Strangely, Drill also has a {{RepeatedListVector}}, which introduces all manner of ambiguity. Combining these, the cardinality hierarchy for unions is:

* {{UnionVector}} (like an optional union type)
* {{ListVector}} (repeated semi-union)
* {{RepeatedListVector}} (a 2D union array)
* {{RepeatedListVector}} which contains a {{ListVector}} (a 3D union grid. Note that this could also be implemented as a {{ListVector}} that contains a {{RepeatedListVector}}.)
* {{RepeatedListVector}} which contains a {{RepeatedListVector}} (a 4D hyper grid.)
* And so on.

For a primitive type, such as Int, we have:

* {{IntVector}} or {{NullableIntVector}} (cardinality of 1 or (0,1))
* {{RepeatedIntVector}} (a 1D list of Int)
* {{ListVector}} which contains a {{RepeatedIntVector}} (a 2D array of ints. Not that this could have been a {{RepeatedListVector}} that stores only ints.)
* {{RepeatedListVector}} which contains a {{RepeatedIntVector}} (a 3D cube of ints. This could also be formed by a {{ListVector}} that contains a {{ListVector}} that contains a {{RepeatedIntVector}} along with several other combinations.)

h4. Examples of Current Behavior

Lists and repeated types appeared to evolve to support JSON-like structures. For example:

{code}
{a: 10} {a: null}
{code}

Here, `a` is a nullable scalar and is represented as a {{NullableIntVector}}.

{code}
{a: [10, 20]}
{code}

Here, `a` is a list of Int and is represented as a {{RepeatedIntVector}}. 

Drill does not allow nulls in such vectors, so we cannot represent:

{code}
{a: [10, null, 20]}
{code}

Once we go beyond 1D, we need lists:

{code}
{a: [[10, 20], [30, 40]]}
{code}

The above requires a {{ListVector}} that contains a {{RepeatedIntVector}}.

{code}
{a: [[[110, 120], [130, 140]], [210, 220], [230, 240]]}
{code}

The above requires a {{RepeatedListVector}} that contains a {{RepeatedIntVector}}.

Similarly, since lists can hold any type (just like a union), we can have repeated objects:

{code}
{a: [[{x: 0, y: 0}, {x: 1, y: 0}], [{x: 4, y: 0}, {x: 4, y: 1}]]}
{code}

The above would be represented as a {{ListVector}} that contains a {{RepeatedMapVector}}. (Or, equivalently, a {{RepeatedListVector}} that contains a {{MapVector}}.)

Because the List vector is a union type, it can (presumably) also handle heterogeneous lists (though this needs to be checked to see if the code actually supports this case):

{code}
{a: [10, "fred", 123.45, null]}
{code}

Since unions support combinations of not just scalars, but also scalars and complex types, Drill can also support:

{code}
{a: [10, {b: "foo"}, null, [10, "bob"]]}
{code}

h4. Muddy Semantics

The above show a number of problems that make lists (and unions) far more complex than necessary:

* Ambiguity of when to use a {{ListVector}} of {{FooVector}} vs. a {{RepeatedFooVector}}.
* Ambiguity of when to use a {{ListVector}} of {{RepeatedFooVector}} vs. a {{RepeatedListVector}} of {{FooVector}}.

The same solution used to handle extra layers of repetition is used to handle variant types (DRILL-5955):

* Lists can handle any combination of scalars.
* Lists can handle any structure type (map, repeated map, list, repeated list).
* Lists are thus not typed. They are not a "List of Int", they are just a List.

h4. Mapping to SQL

The above muddy semantics give rise to this question. Drill is a SQL engine, how do we map the List semantics to a relational schema? If we don't have a clean answer, then the List type, while clever, does not have a useful purpose and is instead distracting us from the larger question of how we map JSON-like structures to a relational schema.

This is not a new problem; it has been with the community at least since the days of XML. There simply XML (and JSON) structures that do not map to a relational structure, though all relational structures can be mapped to JSON.

Drill's job, then, is to encode any JSON structure which has a relational interpretation. IMHO, it is not Drill's job to encode non-relational structures. (In the same way, in years past, it was not the job of BI tools to encode an XML-encoded HTML document to produce charts or reports.)

h4. Allowable JSON Structure

Let us take a stab at the allowable JSON structures:

* Objects (at any level) which map to either the top-level row, or nested tuples (called "maps" in Drill.)
* Arrays of scalars which map to scalar arrays in Drill (which must be flattened into the global name space, using dotted names, to map to a JDBC or ODBC client.)
* Arrays of objects, which map to tuple arrays in Drill (which must be flattened or explicitly joined to map to a JDBC or ODBC client.)
* Fields with heterogeneous types, which map to union/variant types in Drill (and which must be reconciled to present as SQL to an ODBC/JDBD client.)
* Array with heterogeneous types, which is one convention for efficiently storing tuples. These map to an array of union/variant types, and must be mapped in Drill to a true tuple, then flattened, to present to a JDBC or ODBC client.

This definition starts with the relational model, then determines the many ways the relational model can be encoded in JSON. This analysis suggests that we need not handle extreme cases:

* Multi-dimensional arrays (which can't be mapped to the relational model easily, though they could be mapped to an MDX-style multi-dimensional model.)
* Heterogeneous collections of simple and structured types except in the tuple-as-array case.

For the heterogeneous cases, Drill would need some up-front information to know that, say, an array represents a tuple and to expect the heterogeneous encoding. In this case, each array slot would have a single type; it would not make sense to have a tuple-as-array in which each slot is a variant. (There is no sane way to map such a structure back to a fixed-type relational model.)

h4. Proposal: Generic Array Vector

Given the above, we can see that the current List and Union types are far too general for use in a relational system. We can suggest simplifications.

* The union/variant type handles mismatched data types. (But, as explained in DRILL-5955, a better solution is to dispense with the union/vector type and instead allow the user to specify a type to which the input values are mapped at read time.)
* The map and repeated map vectors already handle nested tuples and nested tables (arrays of tuples).

Given this, there is really no need for the union or list types. But, suppose we wanted to keep them anyway (because, say, we are not convinced by the above arguments, Drill already has them, and the community does not want to take the time to consider whether they should be removed.)

One solution is to rationalize the type system:

* Define a base (required) vector for each primitive type and for the map (tuple) type.
* Define a single nullable vector type which holds one of the base vectors.
* Define a repeated vector type that holds one of the base vectors.
* If necessary, define a variant which can hold any primitive value and itself is considered a nullable vector.

In this model, there is exactly one mapping from any array structure to a vector structure.

|| JSON || Vector Structure ||
| &#123;a: 10&#125; | IntVector |
| &#123;a: 10&#125;, &#123;a: null&#125; | NullableVector(IntVector) |
| &#123;a: 10&#125;, &#123;a: "foo"&#125; | VariantVector |
| &#123;a: \[10, 20]&#125; | ArrayVector(IntVector) |
| &#123;a: \[\[10, 20], \[30, 40]]&#125; | ArrayVector(ArrayVector(IntVector)) |
| &#123;a: &#123;b: 10&#125;&#125; | TupleVector |
| &#123;a: \[&#123;b: 10&#125;, &#123;b: 20&#125;]&#125; | ArrayVector(TupleVector) |
| &#123;a: \[10, "foo"]&#125; | ArrayVector(VariantVector) |
| &#123;a: \[10, "foo"]&#125; | TupleVector (if hints provided) |

h4. Arrow

[Arrow|https://arrow.apache.org/docs/memory_layout.html] has adopted a solution similar to the proposed {{ArrayVector}}. In Arrow, a {{List}} vector is a list of some type: {{List<Int>}}, say.

h4. Backward Compatibility

Because Drill exposes internal vector representation through the network APIs, the above is a breaking change for the network API. A solution such as DRILL-5947 is required before we can implement a solution such as the above.

  was:
Drill provides a List vector used when reading JSON data. The semantics of this vector are somewhat obscure and overly complex. This ticket asks to clean up the design and implementation of this vector.

h4. Current Behavior

Drill contains two kinds of repeated types:

* Repeated vectors, which exist for all Drill types.
* List vectors, which exist outside the repeated vector system.

Lists are rather hard to explain.

Drill has 38 types. Each type comes in three cardinalities: Required (0), Optional (0, 1) or Repeated (0..n). Thus, there is an {{IntVector}}, a {{NullableIntVector}} and a {{RepeatedIntVector}}.

Lists are an an odd duck and exist outside of this system. A list is not simply another level of repetition (a {{RepeatedRepeatedIntVector}}. Rather, a list is heterogeneous: it is just a list of something.

For this reason, the List type is closely associated with the Union type: a list is, essentially, a "repeated Union", though it is not implemented that way.

Strangely, Drill does have a {{RepeatedListVector}}, which introduces all manner of ambiguity. Combining these, the cardinality hierarchy for unions is:

* {{UnionVector}} (like an optional union type)
* {{ListVector}} (repeated union)
* {{RepeatedListVector}} (a 2D union array)
* {{RepeatedListVector}} which contains a {{ListVector}} (a 3D union grid. Note that this could also be implemented as a {{ListVector}} that contains a {{RepeatedListVector}}.)
* {{RepeatedListVector}} which contains a {{RepeatedListVector}} (a 4D hyper grid.)
* And so on.

For a primitive type, such as Int, we have:

* {{IntVector}} or {{NullableIntVector}} (cardinality of 1 or (0,1))
* {{RepeatedIntVector}} (a 1D list of Int)
* {{ListVector}} which contains a {{RepeatedIntVector}} (a 2D array of ints. Not that this could have been a {{RepeatedListVector}} that stores only ints.)
* {{RepeatedListVector}} which contains a {{RepeatedIntVector}} (a 3D cube of ints. This could also be formed by a {{ListVector}} that contains a {{ListVector}} that contains a {{RepeatedIntVector}} along with several other combinations.)

h4. Examples of Current Behavior

Lists and repeated types appeared to evolve to support JSON-like structures. For example:

{code}
{a: 10} {a: null}
{code}

Here, `a` is a nullable scalar and is represented as a {{NullableIntVector}}.

{code}
{a: [10, 20]}
{code}

Here, `a` is a list of Int and is represented as a {{RepeatedIntVector}}. 

Drill does not allow nulls in such vectors, so we cannot represent:

{code}
{a: [10, null, 20]}
{code}

Once we go beyond 1D, we need lists:

{code}
{a: [[10, 20], [30, 40]]}
{code}

The above requires a {{ListVector}} that contains a {{RepeatedIntVector}}.

{code}
{a: [[[110, 120], [130, 140]], [210, 220], [230, 240]]}
{code}

The above requires a {{RepeatedListVector}} that contains a {{RepeatedIntVector}}.

Similarly, since lists can hold any type (just like a union), we can have repeated objects:

{code}
{a: [[{x: 0, y: 0}, {x: 1, y: 0}], [{x: 4, y: 0}, {x: 4, y: 1}]]}
{code}

The above would be represented as a {{ListVector}} that contains a {{RepeatedMapVector}}. (Or, equivalently, a {{RepeatedListVector}} that contains a {{MapVector}}.)

Because the List vector is a union type, it can (presumably) also handle heterogeneous lists (though this needs to be checked to see if the code actually supports this case):

{code}
{a: [10, "fred", 123.45, null]}
{code}

Since unions support combinations of not just scalars, but also scalars and complex types, Drill can also support:

{code}
{a: [10, {b: "foo"}, null, [10, "bob"]]}
{code}

h4. Muddy Semantics

The above show a number of problems that make lists (and unions) far more complex than necessary:

* Ambiguity of when to use a {{ListVector}} of {{FooVector}} vs. a {{RepeatedFooVector}}.
* Ambiguity of when to use a {{ListVector}} of {{RepeatedFooVector}} vs. a {{RepeatedListVector}} of {{FooVector}}.

The same solution used to handle extra layers of repetition is used to handle variant types (DRILL-5955):

* Lists can handle any combination of scalars.
* Lists can handle any structure type (map, repeated map, list, repeated list).
* Lists are thus not typed. They are not a "List of Int", they are just a List.

h4. Mapping to SQL

The above muddy semantics give rise to this question. Drill is a SQL engine, how do we map the List semantics to a relational schema? If we don't have a clean answer, then the List type, while clever, does not have a useful purpose and is instead distracting us from the larger question of how we map JSON-like structures to a relational schema.

This is not a new problem; it has been with the community at least since the days of XML. There simply XML (and JSON) structures that do not map to a relational structure, though all relational structures can be mapped to JSON.

Drill's job, then, is to encode any JSON structure which has a relational interpretation. IMHO, it is not Drill's job to encode non-relational structures. (In the same way, in years past, it was not the job of BI tools to encode an XML-encoded HTML document to produce charts or reports.)

h4. Allowable JSON Structure

Let us take a stab at the allowable JSON structures:

* Objects (at any level) which map to either the top-level row, or nested tuples (called "maps" in Drill.)
* Arrays of scalars which map to scalar arrays in Drill (which must be flattened into the global name space, using dotted names, to map to a JDBC or ODBC client.)
* Arrays of objects, which map to tuple arrays in Drill (which must be flattened or explicitly joined to map to a JDBC or ODBC client.)
* Fields with heterogeneous types, which map to union/variant types in Drill (and which must be reconciled to present as SQL to an ODBC/JDBD client.)
* Array with heterogeneous types, which is one convention for efficiently storing tuples. These map to an array of union/variant types, and must be mapped in Drill to a true tuple, then flattened, to present to a JDBC or ODBC client.

This definition starts with the relational model, then determines the many ways the relational model can be encoded in JSON. This analysis suggests that we need not handle extreme cases:

* Multi-dimensional arrays (which can't be mapped to the relational model easily, though they could be mapped to an MDX-style multi-dimensional model.)
* Heterogeneous collections of simple and structured types except in the tuple-as-array case.

For the heterogeneous cases, Drill would need some up-front information to know that, say, an array represents a tuple and to expect the heterogeneous encoding. In this case, each array slot would have a single type; it would not make sense to have a tuple-as-array in which each slot is a variant. (There is no sane way to map such a structure back to a fixed-type relational model.)

h4. Proposal: Generic Array Vector

Given the above, we can see that the current List and Union types are far too general for use in a relational system. We can suggest simplifications.

* The union/variant type handles mismatched data types. (But, as explained in DRILL-5955, a better solution is to dispense with the union/vector type and instead allow the user to specify a type to which the input values are mapped at read time.)
* The map and repeated map vectors already handle nested tuples and nested tables (arrays of tuples).

Given this, there is really no need for the union or list types. But, suppose we wanted to keep them anyway (because, say, we are not convinced by the above arguments, Drill already has them, and the community does not want to take the time to consider whether they should be removed.)

One solution is to rationalize the type system:

* Define a base (required) vector for each primitive type and for the map (tuple) type.
* Define a single nullable vector type which holds one of the base vectors.
* Define a repeated vector type that holds one of the base vectors.
* If necessary, define a variant which can hold any primitive value and itself is considered a nullable vector.

In this model, there is exactly one mapping from any array structure to a vector structure.

|| JSON || Vector Structure ||
| &#123;a: 10&#125; | IntVector |
| &#123;a: 10&#125;, &#123;a: null&#125; | NullableVector(IntVector) |
| &#123;a: 10&#125;, &#123;a: "foo"&#125; | VariantVector |
| &#123;a: \[10, 20]&#125; | ArrayVector(IntVector) |
| &#123;a: \[\[10, 20], \[30, 40]]&#125; | ArrayVector(ArrayVector(IntVector)) |
| &#123;a: &#123;b: 10&#125;&#125; | TupleVector |
| &#123;a: \[&#123;b: 10&#125;, &#123;b: 20&#125;]&#125; | ArrayVector(TupleVector) |
| &#123;a: \[10, "foo"]&#125; | ArrayVector(VariantVector) |
| &#123;a: \[10, "foo"]&#125; | TupleVector (if hints provided) |

h4. Arrow

[Arrow|https://arrow.apache.org/docs/memory_layout.html] has adopted a solution similar to the proposed {{ArrayVector}}. In Arrow, a {{List}} vector is a list of some type: {{List<Int>}}, say.

h4. Backward Compatibility

Because Drill exposes internal vector representation through the network APIs, the above is a breaking change for the network API. A solution such as DRILL-5947 is required before we can implement a solution such as the above.


> Revisit the List and RepeatedList vectors
> -----------------------------------------
>
>                 Key: DRILL-5958
>                 URL: https://issues.apache.org/jira/browse/DRILL-5958
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.11.0
>            Reporter: Paul Rogers
>
> Drill provides a List vector used when reading JSON data. The semantics of this vector are somewhat obscure and overly complex. This ticket asks to clean up the design and implementation of this vector.
> h4. Current Behavior
> Drill contains two kinds of repeated types:
> * Repeated vectors, which exist for all Drill types.
> * List vectors, which exist outside the repeated vector system.
> Lists are rather hard to explain.
> Drill has 38 types. Each type comes in three cardinalities: Required (0), Optional (0, 1) or Repeated (0..n). Thus, there is an {{IntVector}}, a {{NullableIntVector}} and a {{RepeatedIntVector}}.
> Lists are an an odd duck and exist outside of this system. A list is not simply another level of repetition (a {{RepeatedRepeatedIntVector}}. Rather, a list is heterogeneous: it is just a list of something.
> A list is a semi-union: it starts as a list of something. As work progresses, we may realize that we need to store some other thing. At that point, the contents of the list morph from the original type to the union type. The result of this decision is that, when inspecting the metadata for a list (in the form of a {{MaterializedField}}), there is no information about list type. Presumably the code must inspect the "data vector" associated with the list to learn the type of the items within the list. (If done at write time, then, of course, the type can change as described above.)
> For this reason, the List type is closely associated with the Union type: a list is, essentially, a "repeated Union", though it starts of as a "repeated something", then evolves to become a repeated union. (Drill has no {{RepeatedUnionVector}}, presumably because {{ListVector}} is the repeated form of the union type.)
> Strangely, Drill also has a {{RepeatedListVector}}, which introduces all manner of ambiguity. Combining these, the cardinality hierarchy for unions is:
> * {{UnionVector}} (like an optional union type)
> * {{ListVector}} (repeated semi-union)
> * {{RepeatedListVector}} (a 2D union array)
> * {{RepeatedListVector}} which contains a {{ListVector}} (a 3D union grid. Note that this could also be implemented as a {{ListVector}} that contains a {{RepeatedListVector}}.)
> * {{RepeatedListVector}} which contains a {{RepeatedListVector}} (a 4D hyper grid.)
> * And so on.
> For a primitive type, such as Int, we have:
> * {{IntVector}} or {{NullableIntVector}} (cardinality of 1 or (0,1))
> * {{RepeatedIntVector}} (a 1D list of Int)
> * {{ListVector}} which contains a {{RepeatedIntVector}} (a 2D array of ints. Not that this could have been a {{RepeatedListVector}} that stores only ints.)
> * {{RepeatedListVector}} which contains a {{RepeatedIntVector}} (a 3D cube of ints. This could also be formed by a {{ListVector}} that contains a {{ListVector}} that contains a {{RepeatedIntVector}} along with several other combinations.)
> h4. Examples of Current Behavior
> Lists and repeated types appeared to evolve to support JSON-like structures. For example:
> {code}
> {a: 10} {a: null}
> {code}
> Here, `a` is a nullable scalar and is represented as a {{NullableIntVector}}.
> {code}
> {a: [10, 20]}
> {code}
> Here, `a` is a list of Int and is represented as a {{RepeatedIntVector}}. 
> Drill does not allow nulls in such vectors, so we cannot represent:
> {code}
> {a: [10, null, 20]}
> {code}
> Once we go beyond 1D, we need lists:
> {code}
> {a: [[10, 20], [30, 40]]}
> {code}
> The above requires a {{ListVector}} that contains a {{RepeatedIntVector}}.
> {code}
> {a: [[[110, 120], [130, 140]], [210, 220], [230, 240]]}
> {code}
> The above requires a {{RepeatedListVector}} that contains a {{RepeatedIntVector}}.
> Similarly, since lists can hold any type (just like a union), we can have repeated objects:
> {code}
> {a: [[{x: 0, y: 0}, {x: 1, y: 0}], [{x: 4, y: 0}, {x: 4, y: 1}]]}
> {code}
> The above would be represented as a {{ListVector}} that contains a {{RepeatedMapVector}}. (Or, equivalently, a {{RepeatedListVector}} that contains a {{MapVector}}.)
> Because the List vector is a union type, it can (presumably) also handle heterogeneous lists (though this needs to be checked to see if the code actually supports this case):
> {code}
> {a: [10, "fred", 123.45, null]}
> {code}
> Since unions support combinations of not just scalars, but also scalars and complex types, Drill can also support:
> {code}
> {a: [10, {b: "foo"}, null, [10, "bob"]]}
> {code}
> h4. Muddy Semantics
> The above show a number of problems that make lists (and unions) far more complex than necessary:
> * Ambiguity of when to use a {{ListVector}} of {{FooVector}} vs. a {{RepeatedFooVector}}.
> * Ambiguity of when to use a {{ListVector}} of {{RepeatedFooVector}} vs. a {{RepeatedListVector}} of {{FooVector}}.
> The same solution used to handle extra layers of repetition is used to handle variant types (DRILL-5955):
> * Lists can handle any combination of scalars.
> * Lists can handle any structure type (map, repeated map, list, repeated list).
> * Lists are thus not typed. They are not a "List of Int", they are just a List.
> h4. Mapping to SQL
> The above muddy semantics give rise to this question. Drill is a SQL engine, how do we map the List semantics to a relational schema? If we don't have a clean answer, then the List type, while clever, does not have a useful purpose and is instead distracting us from the larger question of how we map JSON-like structures to a relational schema.
> This is not a new problem; it has been with the community at least since the days of XML. There simply XML (and JSON) structures that do not map to a relational structure, though all relational structures can be mapped to JSON.
> Drill's job, then, is to encode any JSON structure which has a relational interpretation. IMHO, it is not Drill's job to encode non-relational structures. (In the same way, in years past, it was not the job of BI tools to encode an XML-encoded HTML document to produce charts or reports.)
> h4. Allowable JSON Structure
> Let us take a stab at the allowable JSON structures:
> * Objects (at any level) which map to either the top-level row, or nested tuples (called "maps" in Drill.)
> * Arrays of scalars which map to scalar arrays in Drill (which must be flattened into the global name space, using dotted names, to map to a JDBC or ODBC client.)
> * Arrays of objects, which map to tuple arrays in Drill (which must be flattened or explicitly joined to map to a JDBC or ODBC client.)
> * Fields with heterogeneous types, which map to union/variant types in Drill (and which must be reconciled to present as SQL to an ODBC/JDBD client.)
> * Array with heterogeneous types, which is one convention for efficiently storing tuples. These map to an array of union/variant types, and must be mapped in Drill to a true tuple, then flattened, to present to a JDBC or ODBC client.
> This definition starts with the relational model, then determines the many ways the relational model can be encoded in JSON. This analysis suggests that we need not handle extreme cases:
> * Multi-dimensional arrays (which can't be mapped to the relational model easily, though they could be mapped to an MDX-style multi-dimensional model.)
> * Heterogeneous collections of simple and structured types except in the tuple-as-array case.
> For the heterogeneous cases, Drill would need some up-front information to know that, say, an array represents a tuple and to expect the heterogeneous encoding. In this case, each array slot would have a single type; it would not make sense to have a tuple-as-array in which each slot is a variant. (There is no sane way to map such a structure back to a fixed-type relational model.)
> h4. Proposal: Generic Array Vector
> Given the above, we can see that the current List and Union types are far too general for use in a relational system. We can suggest simplifications.
> * The union/variant type handles mismatched data types. (But, as explained in DRILL-5955, a better solution is to dispense with the union/vector type and instead allow the user to specify a type to which the input values are mapped at read time.)
> * The map and repeated map vectors already handle nested tuples and nested tables (arrays of tuples).
> Given this, there is really no need for the union or list types. But, suppose we wanted to keep them anyway (because, say, we are not convinced by the above arguments, Drill already has them, and the community does not want to take the time to consider whether they should be removed.)
> One solution is to rationalize the type system:
> * Define a base (required) vector for each primitive type and for the map (tuple) type.
> * Define a single nullable vector type which holds one of the base vectors.
> * Define a repeated vector type that holds one of the base vectors.
> * If necessary, define a variant which can hold any primitive value and itself is considered a nullable vector.
> In this model, there is exactly one mapping from any array structure to a vector structure.
> || JSON || Vector Structure ||
> | &#123;a: 10&#125; | IntVector |
> | &#123;a: 10&#125;, &#123;a: null&#125; | NullableVector(IntVector) |
> | &#123;a: 10&#125;, &#123;a: "foo"&#125; | VariantVector |
> | &#123;a: \[10, 20]&#125; | ArrayVector(IntVector) |
> | &#123;a: \[\[10, 20], \[30, 40]]&#125; | ArrayVector(ArrayVector(IntVector)) |
> | &#123;a: &#123;b: 10&#125;&#125; | TupleVector |
> | &#123;a: \[&#123;b: 10&#125;, &#123;b: 20&#125;]&#125; | ArrayVector(TupleVector) |
> | &#123;a: \[10, "foo"]&#125; | ArrayVector(VariantVector) |
> | &#123;a: \[10, "foo"]&#125; | TupleVector (if hints provided) |
> h4. Arrow
> [Arrow|https://arrow.apache.org/docs/memory_layout.html] has adopted a solution similar to the proposed {{ArrayVector}}. In Arrow, a {{List}} vector is a list of some type: {{List<Int>}}, say.
> h4. Backward Compatibility
> Because Drill exposes internal vector representation through the network APIs, the above is a breaking change for the network API. A solution such as DRILL-5947 is required before we can implement a solution such as the above.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)