Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/08/01 11:21:36 UTC

[GitHub] [druid] josephglanville opened a new issue #10229: Indexing Avro OCF files with local input source in web-console is broken

josephglanville opened a new issue #10229:
URL: https://github.com/apache/druid/issues/10229


   When attempting to ingest Avro OCF files from local files via the web console, I encountered two problems.
   The first is the sampler API failing with the following error message when ingesting the SomeAvroDatum test file used in the InputFormat tests:
   ```
   2020-08-01T10:02:21,468 WARN [qtp1391890442-156] org.eclipse.jetty.server.HttpChannel - handleException /druid/indexer/v1/sampler com.fasterxml.jackson.databind.exc.InvalidDefinitionException: No serializer found for class org.apache.avro.generic.GenericData$Fixed and no properties discovered to create BeanSerializer (to avoid exception, disable SerializationFeature.FAIL_ON_EMPTY_BEANS) (through reference chain: org.apache.druid.indexing.overlord.sampler.SamplerResponse["data"]->java.util.ArrayList[0]->org.apache.druid.indexing.overlord.sampler.SamplerResponse$SamplerResponseRow["input"]->java.util.HashMap["someFixed"])
   ```
   
   I was able to work around this by disabling that feature on the default object mapper (I couldn't find where the mapper for the sampler is initialised), but I don't know whether that is a reasonable fix or whether something more narrowly scoped can be done.
   
   Additionally, there are two bugs with format detection right now. The first is my fault: the byte prefix used to match Avro OCF files is wrong. It attempts to match `Obj1` entirely as ASCII char codes, but the trailing 1 is actually the byte value 0x01, not the character `1` (0x31).
   However, this doesn't actually come into play because, for some reason, the first few bytes of the sample datum are missing by the time the selection logic runs.
   The file is correct as can be seen with the hexdump:
   ```
   00000000  4f 62 6a 01 02 16 61 76  72 6f 2e 73 63 68 65 6d  |Obj...avro.schem|
   00000010  61 f4 12 7b 22 74 79 70  65 22 3a 22 72 65 63 6f  |a..{"type":"reco|
   ```
   It is also loaded correctly after fixing the mapper serialisation settings.
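
   To make the prefix mismatch concrete, here is a minimal TypeScript sketch. It is not the web console's detection code; the variable names are made up for illustration. It compares the first four bytes from the hexdump with the char codes of the text `Obj1`:

   ```typescript
   // First four bytes of the sample file, copied from the hexdump above.
   const headerBytes: number[] = [0x4f, 0x62, 0x6a, 0x01];

   // Char codes of the text "Obj1" (illustrative stand-in for the prefix being matched).
   const asciiPrefix: number[] = "Obj1".split("").map((ch) => ch.charCodeAt(0));

   // Find the first index at which the two sequences differ (-1 if they match).
   let firstMismatch = -1;
   for (let i = 0; i < headerBytes.length; i++) {
     if (headerBytes[i] !== asciiPrefix[i]) {
       firstMismatch = i;
       break;
     }
   }

   console.log(asciiPrefix.map((b) => b.toString(16)).join(" ")); // 4f 62 6a 31
   console.log(firstMismatch); // 3
   ```

   The first three bytes agree; only the fourth differs (0x01 in the file versus 0x31 for the character `1`).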
   
   In summary:
   - Format detection for Avro OCF is wrong because the 4th byte should be 0x01, not 0x31 (ASCII `1`)
   - Even if the above check were correct, it would fail because something is causing data to go missing ahead of format detection
   - Serialisation of the SamplerResponse fails on Avro classes such as `GenericData$Fixed` unless `FAIL_ON_EMPTY_BEANS` is disabled.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] josephglanville commented on issue #10229: Indexing Avro OCF files with local input source in web-console is broken

Posted by GitBox <gi...@apache.org>.
josephglanville commented on issue #10229:
URL: https://github.com/apache/druid/issues/10229#issuecomment-667806461


   Unfortunately my JS/TS-fu is pretty limited, so I can't work out how to reasonably do a byte literal comparison. I propose just changing the tested prefix to `Obj`, which doesn't conflict with any existing supported format, and omitting the OCF version number from the test. Will open a PR for this change.
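
   For reference, one possible way to express a byte literal comparison in TypeScript is sketched below. This is only an illustration under assumptions: `AVRO_OCF_MAGIC`, `startsWithBytes` and `startsWithAvroMagic` are hypothetical names, not functions from the web console, and the string variant assumes the sample text was decoded so that byte 0x01 maps to the character U+0001.

   ```typescript
   // Avro OCF magic: ASCII "Obj" followed by the byte 0x01 (see the hexdump in the issue).
   const AVRO_OCF_MAGIC: number[] = [0x4f, 0x62, 0x6a, 0x01];

   // Hypothetical helper: true when `data` begins with exactly the bytes in `prefix`.
   function startsWithBytes(data: Uint8Array, prefix: number[]): boolean {
     if (data.length < prefix.length) {
       return false;
     }
     for (let i = 0; i < prefix.length; i++) {
       if (data[i] !== prefix[i]) {
         return false;
       }
     }
     return true;
   }

   // Hypothetical helper: the same check when the sample is only available as a string.
   // "\x01" is a string escape for the character with code 1, not the digit "1".
   function startsWithAvroMagic(text: string): boolean {
     return text.slice(0, 4) === "Obj\x01";
   }
   ```

   With these helpers, `startsWithAvroMagic("Obj1")` is false while `startsWithAvroMagic("Obj\x01...")` is true, so the version byte would not need to be dropped from the test.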
   
   I also looked into why data appeared to be missing. It turns out it isn't actually missing: format detection is applied after the data has been sent to the sampler API, and is performed on the response. The sampler uses an inputFormat to parse the data into rows, so even the "raw" data in the response is unsuitable for format detection.




[GitHub] [druid] clintropolis closed issue #10229: Indexing Avro OCF files with local input source in web-console is broken

Posted by GitBox <gi...@apache.org>.
clintropolis closed issue #10229:
URL: https://github.com/apache/druid/issues/10229


   

