You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by GitBox <gi...@apache.org> on 2020/03/10 00:07:38 UTC

[GitHub] [drill] paul-rogers opened a new pull request #2018: DRILL-7633: Fixes for union and repeated list accessors

paul-rogers opened a new pull request #2018: DRILL-7633: Fixes for union and repeated list accessors
URL: https://github.com/apache/drill/pull/2018
 
 
   # [DRILL-7633](https://issues.apache.org/jira/browse/DRILL-7633): Fixes for union and repeated list accessors
   
   ## Description
   
   Minor fixes and cleanup for the obscure union and repeated list types in the column accesor framework.
   
   Fixed several minor issues in the row set test utilities: improved error message, proper comparing of floating-point values. Support for `Boolean` values in the `RowSetBuilder`.
   
   Fixes for building a repeated list or union schema using `SchemaBuilder`.
   
   Special handling to treat a `LIST` as an array, even though the mode is not `REPEATED`. (This is a very hacky corner of Drill!)
   
   ## Documentation
   
   None
   
   ## Testing
   
   Added unit tests for union and repeated list types. Code comes from the project to improve the JSON reader which is where the issues were discovered and fixes tested in the context of reading JSON.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] paul-rogers commented on issue #2018: DRILL-7633: Fixes for union and repeated list accessors

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on issue #2018: DRILL-7633: Fixes for union and repeated list accessors
URL: https://github.com/apache/drill/pull/2018#issuecomment-603075717
 
 
   Squashed commits and rebased on the latest master.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] arina-ielchiieva commented on a change in pull request #2018: DRILL-7633: Fixes for union and repeated list accessors

Posted by GitBox <gi...@apache.org>.
arina-ielchiieva commented on a change in pull request #2018: DRILL-7633: Fixes for union and repeated list accessors
URL: https://github.com/apache/drill/pull/2018#discussion_r390199915
 
 

 ##########
 File path: exec/vector/src/main/java/org/apache/drill/exec/record/metadata/VariantColumnMetadata.java
 ##########
 @@ -96,11 +134,13 @@ public StructureType structureType() {
   public boolean isVariant() { return true; }
 
   @Override
-  public boolean isArray() { return type() == MinorType.LIST; }
+  public boolean isArray() {
+    return super.isArray() || type() == MinorType.LIST;
+  }
 
   @Override
   public ColumnMetadata cloneEmpty() {
-    return new VariantColumnMetadata(name, type, variantSchema.cloneEmpty());
+    return new VariantColumnMetadata(name, type, mode, new VariantSchema());
 
 Review comment:
   Could you please also implement `public ColumnMetadata copy() {` below since yo are touching this class?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] paul-rogers commented on a change in pull request #2018: DRILL-7633: Fixes for union and repeated list accessors

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on a change in pull request #2018: DRILL-7633: Fixes for union and repeated list accessors
URL: https://github.com/apache/drill/pull/2018#discussion_r396947139
 
 

 ##########
 File path: exec/vector/src/main/java/org/apache/drill/exec/record/metadata/VariantColumnMetadata.java
 ##########
 @@ -96,11 +134,13 @@ public StructureType structureType() {
   public boolean isVariant() { return true; }
 
   @Override
-  public boolean isArray() { return type() == MinorType.LIST; }
+  public boolean isArray() {
+    return super.isArray() || type() == MinorType.LIST;
+  }
 
   @Override
   public ColumnMetadata cloneEmpty() {
-    return new VariantColumnMetadata(name, type, variantSchema.cloneEmpty());
+    return new VariantColumnMetadata(name, type, mode, new VariantSchema());
 
 Review comment:
   @arina-ielchiieva, thanks for the reminder; I did miss your response.
   
   I think I see where we had a misunderstanding. I thought the `typeString()` code already worked because I saw this in `AbstractColumnMetadata`:
   
   ```
     public String typeString() {
       return majorType().toString();
     }
   ```
   
    `PrimitiveColumnMetadata` uses a different set of type names.
   
   It seems this area has gotten a bit muddled: the  `AbstractColumnMetadata` version seems to never be called, except for `VariantColumnMetadata`. (This confused the heck out of me on an earlier attempt to clean up this area, but I didn't clue into the problem then.)
   
   So, now that I understand what you are asking for, I did this:
   
   * Removed the `AbstractColumnMetadata` version.
   * Added a  `VariantColumnMetadata` that produces either `UNION` or `ARRAY<UNION>`.
   
   Until I see a good reason otherwise, I think we should treat `UNION` (what I call a "variant") as an opaque type. I guess I don't see the contents of a variant as something we want to specify, either in the metastore or in a provided schema. For example, can the metastore compute an NDV across an `INT`, `VARCHAR` and `MAP`? Probably not. Nor do the min/max values or other stats make sense for a variant. As a result, a variant is an opaque type for the metastore.
   
   Similarly, in a provided schema, I can't convince myself that the user wants not only to say, "the type of this field can vary", but also that "the type can be an `INT`, `DOUBLE` or `VARCHAR`, but not a `BIGINT`.) Just does not seem super-useful.
   
   An additional concern, which I think I mentioned somewhere else, is that as soon as we start serializing complex types, the SQL-like text format becomes far too cumbersome. We'd be much better of with a JSON format that can capture the complex tree structure. Compare:
   
   ```
   (A UNION<INT, MAP(B UNION<VARCHAR, MAP(C BIGINT)>)>)
   ```
   With something like:
   ```
   { schema: [
     {name: "a",
     type: {
        name: "UNION",
        nullable: true,
        subtypes: [
         {name: INT, nullable: false},
         {name: MAP, nullable: false},
         members: [ ...
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] arina-ielchiieva commented on a change in pull request #2018: DRILL-7633: Fixes for union and repeated list accessors

Posted by GitBox <gi...@apache.org>.
arina-ielchiieva commented on a change in pull request #2018: DRILL-7633: Fixes for union and repeated list accessors
URL: https://github.com/apache/drill/pull/2018#discussion_r396333194
 
 

 ##########
 File path: exec/vector/src/main/java/org/apache/drill/exec/record/metadata/VariantColumnMetadata.java
 ##########
 @@ -96,11 +134,13 @@ public StructureType structureType() {
   public boolean isVariant() { return true; }
 
   @Override
-  public boolean isArray() { return type() == MinorType.LIST; }
+  public boolean isArray() {
+    return super.isArray() || type() == MinorType.LIST;
+  }
 
   @Override
   public ColumnMetadata cloneEmpty() {
-    return new VariantColumnMetadata(name, type, variantSchema.cloneEmpty());
+    return new VariantColumnMetadata(name, type, mode, new VariantSchema());
 
 Review comment:
   @paul-rogers did you have a chance to address code review comment?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] arina-ielchiieva commented on a change in pull request #2018: DRILL-7633: Fixes for union and repeated list accessors

Posted by GitBox <gi...@apache.org>.
arina-ielchiieva commented on a change in pull request #2018: DRILL-7633: Fixes for union and repeated list accessors
URL: https://github.com/apache/drill/pull/2018#discussion_r390202680
 
 

 ##########
 File path: exec/vector/src/main/java/org/apache/drill/exec/record/metadata/VariantColumnMetadata.java
 ##########
 @@ -96,11 +134,13 @@ public StructureType structureType() {
   public boolean isVariant() { return true; }
 
   @Override
-  public boolean isArray() { return type() == MinorType.LIST; }
+  public boolean isArray() {
+    return super.isArray() || type() == MinorType.LIST;
+  }
 
   @Override
   public ColumnMetadata cloneEmpty() {
-    return new VariantColumnMetadata(name, type, variantSchema.cloneEmpty());
+    return new VariantColumnMetadata(name, type, mode, new VariantSchema());
 
 Review comment:
   Also this class should override `public String typeString()` which is used during ser / de.
   For example, `UNION<INT, VARCHAR, BOOLEAN>`. Not sure how to describe LIST though...

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] paul-rogers commented on a change in pull request #2018: DRILL-7633: Fixes for union and repeated list accessors

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on a change in pull request #2018: DRILL-7633: Fixes for union and repeated list accessors
URL: https://github.com/apache/drill/pull/2018#discussion_r391431754
 
 

 ##########
 File path: exec/vector/src/main/java/org/apache/drill/exec/record/metadata/VariantColumnMetadata.java
 ##########
 @@ -96,11 +134,13 @@ public StructureType structureType() {
   public boolean isVariant() { return true; }
 
   @Override
-  public boolean isArray() { return type() == MinorType.LIST; }
+  public boolean isArray() {
+    return super.isArray() || type() == MinorType.LIST;
+  }
 
   @Override
   public ColumnMetadata cloneEmpty() {
-    return new VariantColumnMetadata(name, type, variantSchema.cloneEmpty());
+    return new VariantColumnMetadata(name, type, mode, new VariantSchema());
 
 Review comment:
   Nicely done! We discussed the problem with `UNION` a while back and realized it was a mess. Now I gotta deal with it...
   
   The `typeString()` version is used, I presume, in a provided schema. Correct?
   
   If so, then then the base class implementation is fine, the output will be just `UNION`. If we think of Drill's `UNION` as like a Visual Basic or PostgreSQL variant (or a Python variable), then the `UNION` type is sufficient: a `UNION` can hold anything; no need to restrict what it might or might not hold.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] arina-ielchiieva commented on issue #2018: DRILL-7633: Fixes for union and repeated list accessors

Posted by GitBox <gi...@apache.org>.
arina-ielchiieva commented on issue #2018: DRILL-7633: Fixes for union and repeated list accessors
URL: https://github.com/apache/drill/pull/2018#issuecomment-603842339
 
 
   +1, thanks for making the changes.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] arina-ielchiieva commented on a change in pull request #2018: DRILL-7633: Fixes for union and repeated list accessors

Posted by GitBox <gi...@apache.org>.
arina-ielchiieva commented on a change in pull request #2018: DRILL-7633: Fixes for union and repeated list accessors
URL: https://github.com/apache/drill/pull/2018#discussion_r391479369
 
 

 ##########
 File path: exec/vector/src/main/java/org/apache/drill/exec/record/metadata/VariantColumnMetadata.java
 ##########
 @@ -96,11 +134,13 @@ public StructureType structureType() {
   public boolean isVariant() { return true; }
 
   @Override
-  public boolean isArray() { return type() == MinorType.LIST; }
+  public boolean isArray() {
+    return super.isArray() || type() == MinorType.LIST;
+  }
 
   @Override
   public ColumnMetadata cloneEmpty() {
-    return new VariantColumnMetadata(name, type, variantSchema.cloneEmpty());
+    return new VariantColumnMetadata(name, type, mode, new VariantSchema());
 
 Review comment:
   Well, it turned out the `typeString` is also used during column metadata serialization / deserialization. Type is serialized in schematic type to consume less space, plus it's also stored in the same way in Metastore (see org.apache.drill.exec.record.metadata.AbstractColumnMetadata#createColumnMetadata). So currently ser / de for `VariantColumnMetadata` will fail. Since you are planning to use it in JSON, we need to ensure this class can be properly serialized and deserialized. To ensure this we need to do two things:
   1. update `typeString` (I propose we do this in this PR).
   2. update schema parser (I can do it in separate PR).
   
   Back to `typeString`:
   1. If `VariantColumnMetadata` is based on `UNION`, we can describe it as `UNION` without types but we'll lose all types defined in `VariantSchema` or we can describe it as UNION<INT, VARCHAR> and preserve types. The question is if we need types during ser / de?
   2. Similar case when `VariantColumnMetadata` is based on `LIST`,  we can describe it as ARRAY<UNION> without types or with types ARRAY<UNION<INT, VARCHAR>>.
   Since you know more about unions, I'll leave to you the final decision on how schematically union types should be represented, afterwards I'll adjust the schema parser.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] paul-rogers commented on issue #2018: DRILL-7633: Fixes for union and repeated list accessors

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on issue #2018: DRILL-7633: Fixes for union and repeated list accessors
URL: https://github.com/apache/drill/pull/2018#issuecomment-603078801
 
 
   After rebasing I'm seeing a number of metastore test failures:
   ```
   [ERROR] Errors: 
   [ERROR]   TestTableMetadataUnitConversion.testBaseTableMetadata:145 » IllegalArgument Un...
   [ERROR]   TestTableMetadataUnitConversion.testFileMetadata:280 » IllegalArgument Unable ...
   [ERROR]   TestTableMetadataUnitConversion.testPartitionMetadata:383 » IllegalArgument Un...
   [ERROR]   TestTableMetadataUnitConversion.testRowGroupMetadata:332 » IllegalArgument Una...
   [ERROR]   TestTableMetadataUnitConversion.testSegmentMetadata:237 » IllegalArgument Unab...
   ```
   
   It is not immediately obvious what went wrong; I'll poke around tomorrow to see if I can find the cause.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] arina-ielchiieva commented on issue #2018: DRILL-7633: Fixes for union and repeated list accessors

Posted by GitBox <gi...@apache.org>.
arina-ielchiieva commented on issue #2018: DRILL-7633: Fixes for union and repeated list accessors
URL: https://github.com/apache/drill/pull/2018#issuecomment-603206563
 
 
   @paul-rogers thanks for addressing code review comment, I have left comment how to fix unit test failures, please apply them and squash the commits.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] paul-rogers commented on issue #2018: DRILL-7633: Fixes for union and repeated list accessors

Posted by GitBox <gi...@apache.org>.
paul-rogers commented on issue #2018: DRILL-7633: Fixes for union and repeated list accessors
URL: https://github.com/apache/drill/pull/2018#issuecomment-603434987
 
 
   @arina-ielchiieva, thanks for the suggestions, they solved the problem. I suspect we have some confusion with those methods based on earlier testing, but we'll fix those issues later.
   
   Squashed commits and reran all unit tests. We should be good to go. Thanks again for your help.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] arina-ielchiieva commented on a change in pull request #2018: DRILL-7633: Fixes for union and repeated list accessors

Posted by GitBox <gi...@apache.org>.
arina-ielchiieva commented on a change in pull request #2018: DRILL-7633: Fixes for union and repeated list accessors
URL: https://github.com/apache/drill/pull/2018#discussion_r397109112
 
 

 ##########
 File path: exec/vector/src/main/java/org/apache/drill/exec/record/metadata/AbstractColumnMetadata.java
 ##########
 @@ -295,12 +295,6 @@ public String toString() {
         .toString();
   }
 
-  @JsonProperty("type")
-  @Override
-  public String typeString() {
-    return majorType().toString();
-  }
-
 
 Review comment:
   Please don't remove this method but leave it as abstract with json property annotation:
   ```
     @JsonProperty("type")
     @Override
     public abstract String typeString();
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] arina-ielchiieva edited a comment on issue #2018: DRILL-7633: Fixes for union and repeated list accessors

Posted by GitBox <gi...@apache.org>.
arina-ielchiieva edited a comment on issue #2018: DRILL-7633: Fixes for union and repeated list accessors
URL: https://github.com/apache/drill/pull/2018#issuecomment-603206563
 
 
   @paul-rogers thanks for addressing code review comment, I have left comments how to fix unit test failures, please apply them and squash the commits.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] arina-ielchiieva merged pull request #2018: DRILL-7633: Fixes for union and repeated list accessors

Posted by GitBox <gi...@apache.org>.
arina-ielchiieva merged pull request #2018: DRILL-7633: Fixes for union and repeated list accessors
URL: https://github.com/apache/drill/pull/2018
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [drill] arina-ielchiieva commented on a change in pull request #2018: DRILL-7633: Fixes for union and repeated list accessors

Posted by GitBox <gi...@apache.org>.
arina-ielchiieva commented on a change in pull request #2018: DRILL-7633: Fixes for union and repeated list accessors
URL: https://github.com/apache/drill/pull/2018#discussion_r397109336
 
 

 ##########
 File path: exec/vector/src/main/java/org/apache/drill/exec/record/metadata/VariantColumnMetadata.java
 ##########
 @@ -145,16 +147,28 @@ public ColumnMetadata cloneEmpty() {
 
   @Override
   public ColumnMetadata copy() {
-    // TODO Auto-generated method stub
-    assert false;
-    return null;
+    return new VariantColumnMetadata(name, type, mode, variantSchema.copy());
   }
 
   @Override
   public VariantMetadata variantSchema() {
     return variantSchema;
   }
 
+  @JsonProperty("type")
 
 Review comment:
   Remove json property annotation as it will be present in parent class. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services