Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/08/17 01:48:31 UTC

[GitHub] [hudi] sathyaprakashg opened a new issue #1972: Deltasteamer with Transformation has schema issue

sathyaprakashg opened a new issue #1972:
URL: https://github.com/apache/hudi/issues/1972


   I am getting the error below when running DeltaStreamer with a transformer (`--transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer`) against a Kafka source.
   
   ```
    ERROR io.HoodieWriteHandle: Error writing record HoodieRecord{key=HoodieKey { recordKey=123 partitionPath=CA}, currentLocation='null', newLocation='null'}
   java.lang.ArrayIndexOutOfBoundsException: 25
           at org.apache.avro.io.parsing.Symbol$Alternative.getSymbol(Symbol.java:424)
           at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:290)
           at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
           at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:267)
           at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:179)
           at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
           at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:232)
           at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:222)
           at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:175)
           at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:153)
           at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:145)
           at org.apache.hudi.avro.HoodieAvroUtils.bytesToAvro(HoodieAvroUtils.java:119)
           at org.apache.hudi.common.model.OverwriteWithLatestAvroPayload.getInsertValue(OverwriteWithLatestAvroPayload.java:70)
           at org.apache.hudi.execution.LazyInsertIterable$HoodieInsertValueGenResult.<init>(LazyInsertIterable.java:92)
           at org.apache.hudi.execution.LazyInsertIterable.lambda$getTransformFunction$0(LazyInsertIterable.java:105)
           at org.apache.hudi.common.util.queue.BoundedInMemoryQueue.insertRecord(BoundedInMemoryQueue.java:170)
           at org.apache.hudi.common.util.queue.IteratorBasedQueueProducer.produce(IteratorBasedQueueProducer.java:46)
           at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$0(BoundedInMemoryExecutor.java:92)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:748)
   ```
   
   I believe this is caused by an incompatibility between the schema used to convert Avro to bytes (`HoodieAvroUtils.avroToBytes`) and the schema used to convert bytes back to Avro (`HoodieAvroUtils.bytesToAvro`).
   
   For example, below is the schema used to convert Avro to bytes. It is created in this [method](https://github.com/apache/hudi/blob/master/hudi-spark/src/main/scala/org/apache/hudi/AvroConversionHelper.scala#L342) from the Spark data type. The method uses [SchemaConverters](https://github.com/apache/spark/blob/master/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala#L188), which always seems to add null as the second type in the union: `Schema.createUnion(schema, nullSchema)`.
   
   ```
   {
     "type": "record",
     "name": "hoodie_source",
     "namespace": "hoodie.source",
     "fields": [
       {
         "name": "field1",
         "type": [
           "double",
           "null"
         ]
       },
       {
         "name": "field2",
         "type": [
           "double",
           "null"
         ]
       }
     ]
   }
   ```
   
   Below is the schema used to convert bytes to Avro. This schema is retrieved from the registry configured in `hoodie.deltastreamer.schemaprovider.registry.url`.
   ```
   {
     "type": "record",
     "name": "MyRecord",
     "namespace": "com.xyz.abc",
     "fields": [
       {
         "name": "field1",
         "type": [
           "null",
           "double"
         ],
         "default": null
       },
       {
         "name": "field2",
         "type": [
           "null",
           "double"
         ],
         "default": null
       }
     ]
   }
   ```
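The difference between the two schemas can also be checked mechanically. The following is a minimal sketch, not part of Hudi: `union_order_mismatches` is a hypothetical helper that only inspects top-level fields and reports unions that contain the same branches in a different order. The field definitions are abbreviated copies of the two schemas above.

```python
import json

def union_order_mismatches(writer_schema, reader_schema):
    """Return (field, writer_union, reader_union) for every top-level field
    whose union has the same branches in both schemas but in a different order."""
    reader_types = {f["name"]: f["type"] for f in reader_schema["fields"]}
    mismatches = []
    for field in writer_schema["fields"]:
        w, r = field["type"], reader_types.get(field["name"])
        if not (isinstance(w, list) and isinstance(r, list)):
            continue  # only unions are of interest here
        # json.dumps gives a sortable key even for nested (dict) branch types
        same_branches = sorted(map(json.dumps, w)) == sorted(map(json.dumps, r))
        if same_branches and w != r:
            mismatches.append((field["name"], w, r))
    return mismatches

# Schema derived from the Spark data type (null is the second branch)
spark_schema = {"type": "record", "name": "hoodie_source", "fields": [
    {"name": "field1", "type": ["double", "null"]},
    {"name": "field2", "type": ["double", "null"]},
]}

# Schema fetched from the schema registry (null is the first branch)
registry_schema = {"type": "record", "name": "MyRecord", "fields": [
    {"name": "field1", "type": ["null", "double"], "default": None},
    {"name": "field2", "type": ["null", "double"], "default": None},
]}

for name, w, r in union_order_mismatches(spark_schema, registry_schema):
    print(f"{name}: written as {w}, read as {r}")
```

Both `field1` and `field2` are reported, since each carries the same two branches in opposite order.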
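   As we can see, the first schema declares each field as a union of double and null, while the second declares it as null and double. Avro's binary encoding writes a union value as the zero-based index of its branch followed by the value itself, so the bytes must be decoded with a schema whose union branches are in the same order as in the schema that wrote them. Since the order differs here, Hudi is not able to convert the bytes back to Avro.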
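To illustrate why the branch order matters, here is a minimal, hand-rolled sketch of how Avro's binary format lays out a record of nullable doubles. It does not use the Avro library or any Hudi code. The helpers `write_index`, `read_index`, `encode` and `decode` are hypothetical names for this illustration, and the sample values are made up. It assumes only what the Avro specification states: a union is written as a zigzag varint branch index followed by the value, a double is 8 little-endian bytes, and null occupies zero bytes.

```python
import struct

def write_index(n):
    """Zigzag + base-128 varint, the encoding Avro uses for int/long values."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte, z = z & 0x7F, z >> 7
        if z:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def read_index(buf, pos):
    """Read one zigzag varint starting at pos; return (value, next position)."""
    z = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            return (z >> 1) ^ -(z & 1), pos

def encode(values, branches):
    """Write each field as <branch index><value>, using the writer's branch order."""
    out = bytearray()
    for v in values:
        out += write_index(branches.index("null" if v is None else "double"))
        if v is not None:
            out += struct.pack("<d", v)
    return bytes(out)

def decode(buf, branches, n_fields):
    """Read n_fields union values, interpreting indexes with the given branch order."""
    pos, values = 0, []
    for _ in range(n_fields):
        idx, pos = read_index(buf, pos)
        if not 0 <= idx < len(branches):
            raise IndexError(f"union branch index {idx} out of range")
        if branches[idx] == "null":
            values.append(None)  # null has no payload bytes
        else:
            values.append(struct.unpack_from("<d", buf, pos)[0])
            pos += 8
    return values

WRITER = ["double", "null"]  # order in the schema derived from Spark
READER = ["null", "double"]  # order in the schema from the registry

data = encode([0.1, 2.5], WRITER)
print(decode(data, WRITER, 2))  # same order: values round-trip
try:
    decode(data, READER, 2)     # different order: the stream is misread
except IndexError as e:
    print("decode failed:", e)
```

With the reversed order, the reader takes index 0 to mean null, consumes no payload, and then interprets the bytes of the first double as the next branch index. Depending on the data, this either fails with an out-of-range index, as in this sketch, or silently yields wrong values (for example, a single field holding `1.5` decodes as null).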
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] bvaradar commented on issue #1972: Deltasteamer with Transformation has schema issue

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1972:
URL: https://github.com/apache/hudi/issues/1972#issuecomment-691773018


   Closing this as a duplicate.





[GitHub] [hudi] bvaradar commented on issue #1972: Deltasteamer with Transformation has schema issue

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1972:
URL: https://github.com/apache/hudi/issues/1972#issuecomment-678252457


   This is the same as https://issues.apache.org/jira/browse/HUDI-1128. Please see my comment in https://github.com/apache/hudi/issues/1971





[GitHub] [hudi] bvaradar closed issue #1972: Deltasteamer with Transformation has schema issue

Posted by GitBox <gi...@apache.org>.
bvaradar closed issue #1972:
URL: https://github.com/apache/hudi/issues/1972


   





[GitHub] [hudi] sathyaprakashg commented on issue #1972: Deltasteamer with Transformation has schema issue

Posted by GitBox <gi...@apache.org>.
sathyaprakashg commented on issue #1972:
URL: https://github.com/apache/hudi/issues/1972#issuecomment-675366954


   If I pass the schema provider as `NullTargetSchemaRegistryProvider` instead of `SchemaRegistryProvider`, it works fine. But I think we should still fix this issue so that it works seamlessly even when someone passes a schema provider that exposes a target schema.





[GitHub] [hudi] nsivabalan commented on issue #1972: Deltasteamer with Transformation has schema issue

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #1972:
URL: https://github.com/apache/hudi/issues/1972#issuecomment-675097346


   @sathyaprakashg : thanks for reporting the issue. I will let @bvaradar handle this along with #1971 

