You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/10/10 23:45:42 UTC

[GitHub] [druid] josephglanville opened a new pull request #10505: WIP: Avro union support

josephglanville opened a new pull request #10505:
URL: https://github.com/apache/druid/pull/10505


   Fixes #10493 
   
   ### Description
   
   Implements better support for Avro unions in the Avro extensions.
   Currently when ingesting data that contains unions unless the union is always the same value it will return mixed type results.
   This PR addresses the problem by exploding union fields into maps keyed by the union member type or type name in the case of named types (enums, fixed, records).
   The method was chosen as it's similar to what is done on other systems that support ingesting Avro data, such as Google BigQuery which details their Avro compatibility here: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro#avro_conversions
   
   <hr>
   
   This PR has:
   - [ ] been self-reviewed.
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   
   <hr>
   
   ##### Key changed/added classes in this PR
    * `AvroFlattenerMaker`
    * `GenericAvroJsonProvider`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] josephglanville commented on a change in pull request #10505: Avro union support

Posted by GitBox <gi...@apache.org>.
josephglanville commented on a change in pull request #10505:
URL: https://github.com/apache/druid/pull/10505#discussion_r649537774



##########
File path: docs/development/extensions-core/avro.md
##########
@@ -39,13 +39,24 @@ Make sure to [include](../../development/extensions.md#loading-extensions) `drui
 
 Druid supports most Avro types natively, there are however some exceptions which are detailed here.
 
-`union` types which aren't of the form `[null, otherType]` aren't supported at this time.
+#### Unions
+Druid has two modes for supporting `union` types. The original legacy mode can only support unions of the form `[null, otherType]`.
+The newer mode can be enabled by setting `extractUnions` on the Avro parser in which case unions will be expanded according to the following rules:

Review comment:
       No problem, I think you are right and I will make these changes next week.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] josephglanville commented on pull request #10505: Avro union support

Posted by GitBox <gi...@apache.org>.
josephglanville commented on pull request #10505:
URL: https://github.com/apache/druid/pull/10505#issuecomment-819225513


   @clintropolis I rebased this and added support in the new InputFormat and extended the tests to cover all the union member types, ready for another review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] josephglanville commented on pull request #10505: Avro union support

Posted by GitBox <gi...@apache.org>.
josephglanville commented on pull request #10505:
URL: https://github.com/apache/druid/pull/10505#issuecomment-853628651


   I don't feel strongly about the name. I'm happy with `extractUnions`. Will make the changes later today.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] stale[bot] commented on pull request #10505: WIP: Avro union support

Posted by GitBox <gi...@apache.org>.
stale[bot] commented on pull request #10505:
URL: https://github.com/apache/druid/pull/10505#issuecomment-743877573


   This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If you think that's incorrect or this pull request should instead be reviewed, please simply write any comment. Even if closed, you can still revive the PR at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on pull request #10505: Avro union support

Posted by GitBox <gi...@apache.org>.
clintropolis commented on pull request #10505:
URL: https://github.com/apache/druid/pull/10505#issuecomment-818535155


   This seems like a reasonable addition :+1:, could you also add support to the avro streaming input format that was added by #11040?
   
   It does look like it might be hard to make the coverage bot happy without testing a lot of different union types, but maybe that is a good thing in this case if not too tedious.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] stale[bot] commented on pull request #10505: Avro union support

Posted by GitBox <gi...@apache.org>.
stale[bot] commented on pull request #10505:
URL: https://github.com/apache/druid/pull/10505#issuecomment-803968877


   This pull request/issue is no longer marked as stale.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] josephglanville commented on pull request #10505: Avro union support

Posted by GitBox <gi...@apache.org>.
josephglanville commented on pull request #10505:
URL: https://github.com/apache/druid/pull/10505#issuecomment-819914228


   Looks like only the spell check failed. Added to the spelling file and also tweaked the docs a little bit to correct a missing type and be more specific about how named types are handled.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis merged pull request #10505: Avro union support

Posted by GitBox <gi...@apache.org>.
clintropolis merged pull request #10505:
URL: https://github.com/apache/druid/pull/10505


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] josephglanville commented on pull request #10505: Avro union support

Posted by GitBox <gi...@apache.org>.
josephglanville commented on pull request #10505:
URL: https://github.com/apache/druid/pull/10505#issuecomment-835564529


   @clintropolis would you be able to take another look at this? I think it's ready to merge.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] josephglanville commented on pull request #10505: Avro union support

Posted by GitBox <gi...@apache.org>.
josephglanville commented on pull request #10505:
URL: https://github.com/apache/druid/pull/10505#issuecomment-853628651


   I don't feel strongly about the name. I'm happy with `extractUnions`. Will make the changes later today.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on a change in pull request #10505: Avro union support

Posted by GitBox <gi...@apache.org>.
clintropolis commented on a change in pull request #10505:
URL: https://github.com/apache/druid/pull/10505#discussion_r648836460



##########
File path: docs/development/extensions-core/avro.md
##########
@@ -39,13 +39,24 @@ Make sure to [include](../../development/extensions.md#loading-extensions) `drui
 
 Druid supports most Avro types natively, there are however some exceptions which are detailed here.
 
-`union` types which aren't of the form `[null, otherType]` aren't supported at this time.
+#### Unions
+Druid has two modes for supporting `union` types. The original legacy mode can only support unions of the form `[null, otherType]`.
+The newer mode can be enabled by setting `extractUnions` on the Avro parser in which case unions will be expanded according to the following rules:

Review comment:
       Hmm, I looked a lot closer at this than I did on a previous pass, and I think the old docs were sort of wrong. Using the example `someMultiMemberUnion` type in this PR, I can still use a flatten spec to extract the values of any type with the existing code, apparently even including for record types, where the extraction path is `$.someMultiMemberUnion.subString` (instead of `$.someMultiMemberUnion.UnionSubRecord.subString` as in the mode added in this PR).
   
   As such, with my better understanding I think it makes sense to instead call this new property `extractUnionsByType` or something similar, and clarify that this new mode _requires_ using a flatten spec to extract the values, but with the benefit that you can selectively extract values of only a certain type so that they can be mapped to separate Druid columns or whatever. I also don't think it necessarily makes sense to refer to the other mode as legacy, since I guess it still has a use if the union is composed mainly of primitive types and all are able to be coerced into a common Druid type, or if it is a simple union type of the legacy form, which the new mode does not effect (since the isUnion code checks for more than 1 non-null type).
   
   Sorry I didn't look closer into this previously and for the review churn, my bad.
   
   Also because the new mode needs to be link to the flatten spec docs, maybe it makes sense to just move the unions description entirely into the complex types section, which also seems to mirror the Avro specification docs https://avro.apache.org/docs/current/spec.html#schema_complex

##########
File path: docs/development/extensions-core/avro.md
##########
@@ -39,13 +39,24 @@ Make sure to [include](../../development/extensions.md#loading-extensions) `drui
 
 Druid supports most Avro types natively, there are however some exceptions which are detailed here.
 
-`union` types which aren't of the form `[null, otherType]` aren't supported at this time.
+#### Unions
+Druid has two modes for supporting `union` types. The original legacy mode can only support unions of the form `[null, otherType]`.
+The newer mode can be enabled by setting `extractUnions` on the Avro parser in which case unions will be expanded according to the following rules:
+* Primitive types and unnamed complex types are keyed their type name. i.e `int`, `string`
+* Complex named types are keyed by their names, this includes `record`, `fixed` and `enum`.
+* The Avro null type is elided as it's value can only ever be null

Review comment:
       super nit: i think it's -> its

##########
File path: extensions-core/avro-extensions/src/main/java/org/apache/druid/data/input/avro/GenericAvroJsonProvider.java
##########
@@ -195,4 +210,46 @@ public Object unwrap(final Object o)
   {
     return o;
   }
+
+  private boolean isExplodableUnion(final Schema.Field field)

Review comment:
       nit: still using explode terminology




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] josephglanville commented on pull request #10505: Avro union support

Posted by GitBox <gi...@apache.org>.
josephglanville commented on pull request #10505:
URL: https://github.com/apache/druid/pull/10505#issuecomment-819116522


   @clintropolis sure, I will get around to it sometime this week.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org