You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/08/03 16:51:51 UTC

[GitHub] [druid] josephglanville opened a new pull request #10232: Fix Avro support in Web Console

josephglanville opened a new pull request #10232:
URL: https://github.com/apache/druid/pull/10232


   Fixes #10229 
   
   ### Description
   
   This PR includes 2 sets of fixes to make working with Avro OCF files in the Web Console work seamlessly.
   
   The first is fixing the format detection by first changing the detection logic to run against the "raw" input rows and also patching the Avro OCF check to only check for the ASCII `Obj` prefix.
   
   The second corrects some more fundamental issues that wouldn't have been noticed without the web UI as manual ingestion specs that don't use the sampler API or rely on root field enumeration wouldn't have run into.
   First of which is that Enum and Fixed types were not properly supported in the `AvroFlattenerMaker`, they wouldn't be listed as root fields and were not properly converted into primitive types.
   Additionally the `AvroFlattenerMaker` didn't override the `ObjectFlatteners#finalizeConversionForMap` method so raw Avro types leaked into the `SamplerResponse`. The raw Avro types aren't configured to be mapped correctly so were causing errors when serialising `SamplerResponse` whilst using the Web UI. This is what led to the discovery of Enum and Fixed not being supported correctly.
   
   A follow up to this PR will likely include tests for `AvroOCFReader#sample` in a similar vein to what exists for `ParquetReader`.
   
   <hr>
   
   This PR has:
   - [ ] been self-reviewed.
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [ ] been tested in a test Druid cluster.
   
   <hr>
   
   ##### Key changed/added classes in this PR
    * `AvroFlattenerMaker`
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] josephglanville edited a comment on pull request #10232: Fix Avro support in Web Console

Posted by GitBox <gi...@apache.org>.
josephglanville edited a comment on pull request #10232:
URL: https://github.com/apache/druid/pull/10232#issuecomment-669656388


   The Wikipedia files will demonstrate Avro format detection working but they don't exhibit the more egregious problems because they contain neither Fixed nor Enum fields in the schema.
   
   I have created a gist that contains both the schema and the binary `someAvroDatum.avro` file I used for testing: https://gist.github.com/josephglanville/8c2c07384963b649facf99ef905db95e
   In case you have trouble here is a direct link to the file: https://gist.github.com/josephglanville/8c2c07384963b649facf99ef905db95e/raw/25206acbc1584432c326bdeee9a163291f0f2790/someAvroDatum.avro


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on pull request #10232: Fix Avro support in Web Console

Posted by GitBox <gi...@apache.org>.
gianm commented on pull request #10232:
URL: https://github.com/apache/druid/pull/10232#issuecomment-705234409


   I think we really want to have TS tests for this, because it's the kind of feature that's likely to break accidentally if we don't have tests. @vogievetsky — any suggestions on the best way to add them?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] vogievetsky commented on pull request #10232: Fix Avro support in Web Console

Posted by GitBox <gi...@apache.org>.
vogievetsky commented on pull request #10232:
URL: https://github.com/apache/druid/pull/10232#issuecomment-668728379


   Hi @josephglanville! Thank you for the contribution. I am more than happy to have a look at this PR - at least the web console (TS) part of it.
   
   The `Obj1` -> `Obj` change is great - that must have been a mess up on my part.
   I am reviewing the other (lines) detection change.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] vogievetsky commented on pull request #10232: Fix Avro support in Web Console

Posted by GitBox <gi...@apache.org>.
vogievetsky commented on pull request #10232:
URL: https://github.com/apache/druid/pull/10232#issuecomment-668744409


   Oh wait, `Obj1` was you originally - I forget what parts of the console I added.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis merged pull request #10232: Fix Avro support in Web Console

Posted by GitBox <gi...@apache.org>.
clintropolis merged pull request #10232:
URL: https://github.com/apache/druid/pull/10232


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] josephglanville commented on pull request #10232: Fix Avro support in Web Console

Posted by GitBox <gi...@apache.org>.
josephglanville commented on pull request #10232:
URL: https://github.com/apache/druid/pull/10232#issuecomment-668915468


   @vogievetsky I used the SomeAvroDatum file generated by the Druid Avro tests. Has to be lifted from a tmp directory though by pausing the tests so it will probably be easier if I upload it somewhere.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] vogievetsky commented on pull request #10232: Fix Avro support in Web Console

Posted by GitBox <gi...@apache.org>.
vogievetsky commented on pull request #10232:
URL: https://github.com/apache/druid/pull/10232#issuecomment-705259550


   I would suggest:
   
   Change `function guessInputFormat(sampleData: string[]): InputFormat {` to `export function guessInputFormat(sampleData: string[]): InputFormat {`
   
   Also in `ingestion-spec.spec.ts` change `import { cleanSpec, downgradeSpec, upgradeSpec } from './ingestion-spec';` to `import { cleanSpec, downgradeSpec, guessInputFormat, upgradeSpec } from './ingestion-spec';`
   
   And add this just before the final `});` of the file:
   
   ```
     describe('guessInputFormat', () => {
       it('works for parquet', () => {
         expect(guessInputFormat(['PAR1lol']).type).toEqual('parquet');
       });
   
       it('works for orc', () => {
         expect(guessInputFormat(['ORClol']).type).toEqual('orc');
       });
   
       it('works for AVRO', () => {
         expect(guessInputFormat(['Obj\x01lol']).type).toEqual('avro_ocf');
         expect(guessInputFormat(['Obj1lol']).type).toEqual('regex');
       });
   
       it('works for JSON', () => {
         expect(guessInputFormat(['{"a":1}']).type).toEqual('json');
       });
   
       it('works for TSV', () => {
         expect(guessInputFormat(['A\tB\tX\tY']).type).toEqual('tsv');
       });
   
       it('works for CSV', () => {
         expect(guessInputFormat(['A,B,X,Y']).type).toEqual('csv');
       });
   
       it('works for regex', () => {
         expect(guessInputFormat(['A|B|X|Y']).type).toEqual('regex');
       });
     });
   ```
   
   That should do it.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] vogievetsky commented on a change in pull request #10232: Fix Avro support in Web Console

Posted by GitBox <gi...@apache.org>.
vogievetsky commented on a change in pull request #10232:
URL: https://github.com/apache/druid/pull/10232#discussion_r469375416



##########
File path: web-console/src/utils/ingestion-spec.tsx
##########
@@ -2663,7 +2663,7 @@ function guessInputFormat(sampleData: string[]): InputFormat {
       return inputFormatFromType('orc');
     }
     // Avro OCF 4 byte magic header: https://avro.apache.org/docs/current/spec.html#Object+Container+Files
-    if (sampleDatum.startsWith('Obj1')) {
+    if (sampleDatum.startsWith('Obj')) {

Review comment:
       sure thing. just do `if (sampleDatum.startsWith('Obj') && sampleDatum.charCodeAt(3) === 1)` 
   
   ![image](https://user-images.githubusercontent.com/177816/90039386-6746c200-dc7b-11ea-88cc-f0755fb79287.png)
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] josephglanville commented on pull request #10232: Fix Avro support in Web Console

Posted by GitBox <gi...@apache.org>.
josephglanville commented on pull request #10232:
URL: https://github.com/apache/druid/pull/10232#issuecomment-680069502


   @gianm @vogievetsky added the docs regarding the type handling and made the change to check the Avro OCF version byte. Additionally I added some tests that ensure we don't regress on the sampler API by ensuring output from `sample()` is properly serialisable to JSON.
   
   Though I think more can be done to make this better I think this should probably be merged to unbreak Avro support in the Web UI for now.
   
   I intend to submit some further changes to support Avro Unions so if you have additional things you want changed in the Avro support I can look into it as part of that.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] josephglanville commented on pull request #10232: Fix Avro support in Web Console

Posted by GitBox <gi...@apache.org>.
josephglanville commented on pull request #10232:
URL: https://github.com/apache/druid/pull/10232#issuecomment-669656388


   The Wikipedia files will demonstrate Avro format detection working but they don't exhibit the more egregious problems because they contain neither Fixed nor Enum fields in the schema.
   
   I have created a gist that contains both the schema and the binary `someAvroDatum.avro` file I used for testing: https://gist.github.com/josephglanville/8c2c07384963b649facf99ef905db95e
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] josephglanville commented on pull request #10232: Fix Avro support in Web Console

Posted by GitBox <gi...@apache.org>.
josephglanville commented on pull request #10232:
URL: https://github.com/apache/druid/pull/10232#issuecomment-668129539


   @vogievetsky could you take a look at the web console detection fix? I think this is probably what was originally intended but I'm not 100% sure.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] josephglanville commented on pull request #10232: Fix Avro support in Web Console

Posted by GitBox <gi...@apache.org>.
josephglanville commented on pull request #10232:
URL: https://github.com/apache/druid/pull/10232#issuecomment-668917061


   And yes `Obj` vs `Obj1` is my bad. I added the 1 without properly testing based on a quick read of the spec while adding comments. Arguably we should parse the following byte as uint8 to determine the OCF format version but probably overkill.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on pull request #10232: Fix Avro support in Web Console

Posted by GitBox <gi...@apache.org>.
clintropolis commented on pull request #10232:
URL: https://github.com/apache/druid/pull/10232#issuecomment-668916304


   @vogievetsky you can find some avro files used by integration tests, which will probably work for testing this, here: https://github.com/apache/druid/tree/master/integration-tests/src/test/resources/data/batch_index/avro


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] josephglanville commented on pull request #10232: Fix Avro support in Web Console

Posted by GitBox <gi...@apache.org>.
josephglanville commented on pull request #10232:
URL: https://github.com/apache/druid/pull/10232#issuecomment-673265507


   @gianm @vogievetsky part of the problem is the sampler API doesn't actually ever return the actual raw data.
   It returns 2 different types of data currently, "raw" and parsed.
   Raw in this case is actually the result of calling toMap() on the reader, which for the default inputFormat/inputReader used for detection is `csv`.
   So while it works for now (after switching from `parsed` to `raw`) that is rather janky.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #10232: Fix Avro support in Web Console

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #10232:
URL: https://github.com/apache/druid/pull/10232#discussion_r469298104



##########
File path: web-console/src/utils/ingestion-spec.tsx
##########
@@ -2663,7 +2663,7 @@ function guessInputFormat(sampleData: string[]): InputFormat {
       return inputFormatFromType('orc');
     }
     // Avro OCF 4 byte magic header: https://avro.apache.org/docs/current/spec.html#Object+Container+Files
-    if (sampleDatum.startsWith('Obj1')) {
+    if (sampleDatum.startsWith('Obj')) {

Review comment:
       It'd be really great if we could test the proper thing here, but I'm not sure how either.
   
   @vogievetsky Any ideas about how we could actually check all 4 bytes here? (See https://github.com/apache/druid/issues/10229; the "1" should actually be the unprintable character 0x01.)
   
   Also, are there tests for this logic block in general? Where could we add one for this particular case?

##########
File path: extensions-core/avro-extensions/src/test/java/org/apache/druid/data/input/avro/AvroFlattenerMakerTest.java
##########
@@ -78,7 +78,8 @@ public void getRootField()
         flattener.getRootField(record, "someNull")
     );
     Assert.assertEquals(
-        record.getSomeFixed(),
+        // Casted to an array by transformValue
+        record.getSomeFixed().bytes(),

Review comment:
       What data type is "someFixed"? This code is expecting it to be translated as a `byte[]`. I think most downstream users of these parsed datums will be interpreting them using Rows.objectToStrings, which is going to translate a `byte[]` into a base64 encoded string. Is that sensible behavior?
   
   Would you be able to add some documentation to this page: https://druid.apache.org/docs/latest/ingestion/data-formats.html#avro-ocf about how the various Avro types that we support will be handled?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] josephglanville commented on pull request #10232: Fix Avro support in Web Console

Posted by GitBox <gi...@apache.org>.
josephglanville commented on pull request #10232:
URL: https://github.com/apache/druid/pull/10232#issuecomment-705262306


   Ok I will get that sorted out today.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] josephglanville commented on a change in pull request #10232: Fix Avro support in Web Console

Posted by GitBox <gi...@apache.org>.
josephglanville commented on a change in pull request #10232:
URL: https://github.com/apache/druid/pull/10232#discussion_r469704324



##########
File path: extensions-core/avro-extensions/src/test/java/org/apache/druid/data/input/avro/AvroFlattenerMakerTest.java
##########
@@ -78,7 +78,8 @@ public void getRootField()
         flattener.getRootField(record, "someNull")
     );
     Assert.assertEquals(
-        record.getSomeFixed(),
+        // Casted to an array by transformValue
+        record.getSomeFixed().bytes(),

Review comment:
       Yes so `fixed` and `bytes` are encoded equivalently in Avro, `fixed` just has an enforced size. There is a `binaryAsString` option for decoding these as strings if need be. It would be nice to support that as a per-field transformation though rather than applying to all binary types as you may only want some to be interpreted as strings and others to be returned verbatim.
   
   If docs were to be added I would probably put them on the `druid-avro-extensions` page and link to them from the various places Avro parsing it used.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] vogievetsky commented on pull request #10232: Fix Avro support in Web Console

Posted by GitBox <gi...@apache.org>.
vogievetsky commented on pull request #10232:
URL: https://github.com/apache/druid/pull/10232#issuecomment-668745358


   Could please you share an Avro file you used to test this? I want to run through the flow.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] josephglanville commented on pull request #10232: Fix Avro support in Web Console

Posted by GitBox <gi...@apache.org>.
josephglanville commented on pull request #10232:
URL: https://github.com/apache/druid/pull/10232#issuecomment-680069934


   I attempted to add the Typescript tests but it was a bit beyond my TS/JS-fu for now.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] josephglanville commented on pull request #10232: Fix Avro support in Web Console

Posted by GitBox <gi...@apache.org>.
josephglanville commented on pull request #10232:
URL: https://github.com/apache/druid/pull/10232#issuecomment-703162344


   @vogievetsky could you take another look here and see if there is anything that should be changed or can this go in as-is?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] vogievetsky commented on pull request #10232: Fix Avro support in Web Console

Posted by GitBox <gi...@apache.org>.
vogievetsky commented on pull request #10232:
URL: https://github.com/apache/druid/pull/10232#issuecomment-672979708


   It could be cool to add a unit test for the `guessInputFormat` in `/src/utils/ingestion-spec.spec.ts` and an integration test that is basically a copy of `/e2e-tests/tutorial-batch.spec.ts` which does a text file load but with a test avro file


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] josephglanville commented on a change in pull request #10232: Fix Avro support in Web Console

Posted by GitBox <gi...@apache.org>.
josephglanville commented on a change in pull request #10232:
URL: https://github.com/apache/druid/pull/10232#discussion_r475195893



##########
File path: web-console/src/utils/ingestion-spec.tsx
##########
@@ -2663,7 +2663,7 @@ function guessInputFormat(sampleData: string[]): InputFormat {
       return inputFormatFromType('orc');
     }
     // Avro OCF 4 byte magic header: https://avro.apache.org/docs/current/spec.html#Object+Container+Files
-    if (sampleDatum.startsWith('Obj1')) {
+    if (sampleDatum.startsWith('Obj')) {

Review comment:
       Done.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org