Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/08/29 07:39:50 UTC

[GitHub] [hudi] khiem291 opened a new issue #3558: [SUPPORT] Schema evolution error: promoted data type from integer to double

khiem291 opened a new issue #3558:
URL: https://github.com/apache/hudi/issues/3558


   **Description**
   
   Hi all,
   I tested schema evolution by changing a column's data type from int to double, but it did not work with Hudi DeltaStream
   (Hudi doc: https://github.com/apache/hudi/blob/asf-site/website/docs/schema_evolution.md).
   I also tried changing the data type from float to double and got the same error.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Initialize the Spark context
   ```
   ./spark-shell \
     --packages org.apache.spark:spark-avro_2.12:3.1.2,org.apache.hadoop:hadoop-aws:3.2.0,org.apache.hudi:hudi-spark3-bundle_2.12:0.8.0 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'\
     --conf 'spark.hadoop.fs.s3a.access.key=xx'\
     --conf 'spark.hadoop.fs.s3a.secret.key=xx'\
     --conf 'spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem'\
     --conf 'spark.hadoop.fs.s3a.endpoint=s3.amazonaws.com'\
     --conf 'spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.profile.ProfileCredentialsProvider'\
     --conf 'spark.hadoop.fs.s3a.fast.upload=true'\
     --conf 'spark.hadoop.fs.s3a.multiobjectdelete.enable=false'\
     --conf 'spark.sql.parquet.filterPushdown=true'\
     --conf 'spark.sql.parquet.mergeSchema=false'\
     --conf 'spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2'\
     --conf 'spark.speculation=false'\
     --conf 'hive.metastore.schema.verification=false'\
     --conf 'hive.metastore.schema.verification.record.version=false'\
     --conf spark.sql.hive.convertMetastoreParquet=false
   ```
   
   2. Create the base Hudi table
   ```
   import org.apache.hudi.QuickstartUtils._
   import scala.collection.JavaConversions._
   import org.apache.spark.sql.SaveMode._
   import org.apache.hudi.DataSourceReadOptions._
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig._
   import org.apache.spark.sql.types._
   import org.apache.spark.sql.Row
   
   val tableName = "hudi_trips_cow"
   val basePath = "s3a://data-lake/hudi_test/hudi_trips_cow_schema_change"
   val schema = StructType( Array(
        StructField("rowId", StringType,true),
        StructField("partitionId", StringType,true),
        StructField("preComb", LongType,true),
        StructField("name", StringType,true),
        StructField("versionId", StringType,true),
        StructField("intToLong", IntegerType,true),//ok
        StructField("intToDouble", IntegerType,true),
        StructField("longToFloat", LongType,true),//ok
        // StructField("longToDouble", IntegerType,true),
        StructField("floatToDouble", FloatType,true)
        )) // 9 cols
       
   val data1 = Seq(Row("row_1", "part_0", 0L, "bob", "v_0", 0, 1, 1L, 1.1f),
                   Row("row_2", "part_0", 0L, "john", "v_0", 0, 1, 2L, 1.2f),
                   Row("row_3", "part_3", 0L, "tom", "v_0", 0, 1, 3L, 1.3f))
   
   var dfFromData1 = spark.createDataFrame(data1, schema)
   dfFromData1.write.format("hudi").
          options(getQuickstartWriteConfigs).
          option(PRECOMBINE_FIELD_OPT_KEY, "preComb").
          option(RECORDKEY_FIELD_OPT_KEY, "rowId").
          option(PARTITIONPATH_FIELD_OPT_KEY, "partitionId").
          option("hoodie.index.type","SIMPLE").
          option("hoodie.datasource.write.hive_style_partitioning", true).
          option(TABLE_NAME, tableName).
          mode(Overwrite).
          save(basePath)
   ```
   
   3. Change the data type of column `intToDouble` from int to double and append new data to the existing table.
   ```
   // Int to double
   val newSchema = StructType( Array(
        StructField("rowId", StringType,true),
        StructField("partitionId", StringType,true),
        StructField("preComb", LongType,true),
        StructField("name", StringType,true),
        StructField("versionId", StringType,true),
        StructField("intToLong", IntegerType,true),
        StructField("intToDouble", DoubleType,true),
        StructField("longToFloat", LongType,true),
        // StructField("longToDouble", IntegerType,true),
        StructField("floatToDouble", FloatType,true)
        )) // 9 col
   
   val data2 = Seq(Row("row_2", "part_0", 5L, "john", "v_3", 3, 1D, 2L, 1.8f),
                   Row("row_5", "part_0", 5L, "maroon", "v_2", 2, 1D, 2L, 1.8f),
                   Row("row_9", "part_9", 5L, "michael", "v_2", 2, 1D, 2L, 1.8f))
   
   var dfFromData2 = spark.createDataFrame(data2, newSchema)
   
   dfFromData2.write.format("hudi").
          options(getQuickstartWriteConfigs).
          option(PRECOMBINE_FIELD_OPT_KEY, "preComb").
          option(RECORDKEY_FIELD_OPT_KEY, "rowId").
          option(PARTITIONPATH_FIELD_OPT_KEY, "partitionId").
          option("hoodie.datasource.write.hive_style_partitioning", true).
          option("hoodie.index.type","SIMPLE").
          option(TABLE_NAME, tableName).
          mode(Append).
          save(basePath)
   ```
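
   To make explicit which columns differ between the schemas in steps 2 and 3, a small comparison can help. This is only a sketch: `SchemaDiff`, `changedTypes`, and the string type names are hypothetical, and plain `(name, type)` pairs stand in for Spark's `StructField` so that the snippet runs without Spark.

   ```scala
   // Hypothetical helper, not part of Hudi or Spark.
   object SchemaDiff {
     type Field = (String, String) // (column name, type name)

     // Returns (column, oldType, newType) for every column that exists in
     // both schemas but has a different type in the new one.
     def changedTypes(oldSchema: Seq[Field], newSchema: Seq[Field]): Seq[(String, String, String)] = {
       val oldTypes = oldSchema.toMap
       newSchema.collect {
         case (name, newType) if oldTypes.get(name).exists(_ != newType) =>
           (name, oldTypes(name), newType)
       }
     }

     // The nine columns from step 2, with type names written as strings.
     val oldSchema: Seq[Field] = Seq(
       "rowId" -> "string", "partitionId" -> "string", "preComb" -> "long",
       "name" -> "string", "versionId" -> "string", "intToLong" -> "int",
       "intToDouble" -> "int", "longToFloat" -> "long", "floatToDouble" -> "float")

     // Step 3 changes only intToDouble.
     val newSchema: Seq[Field] = oldSchema.map {
       case ("intToDouble", _) => "intToDouble" -> "double"
       case other              => other
     }
   }
   ```

   Calling `SchemaDiff.changedTypes(SchemaDiff.oldSchema, SchemaDiff.newSchema)` reports the single int-to-double change on `intToDouble`, which is the column named in the read error.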
   
   4. Reading the Hudi table fails
   ```
   scala> val tripsSnapshotDF1 = spark.read.format("hudi").load(basePath + "/*/*")
   scala> tripsSnapshotDF1.show()
   21/08/29 12:47:42 ERROR Executor: Exception in task 0.0 in stage 59.0 (TID 447)]
   org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file s3a://de-dev-sb-data-lake/hudi_test/hudi_trips_cow_schema_change_full_hudi0.9/partitionId=part_3/aa852039-ee90-4365-8da9-9ec824b0d872-0_1-19-214_20210829123247.parquet. Column: [intToDouble], Expected: double, Found: INT32
   ```
   
   **Expected behavior**
   
   The Hudi table should be readable with Spark.
   
   **Environment Description**
   
   * Hudi version : 0.8.0
   
   * Spark version : 3.1.2
   
   * Hive version : no
   
   * Hadoop version : 3.2
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   **Stacktrace**
   
   ```
   scala> tripsSnapshotDF1.show()
   21/08/29 12:47:42 ERROR Executor: Exception in task 0.0 in stage 59.0 (TID 447)]
   org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file s3a://de-dev-sb-data-lake/hudi_test/hudi_trips_cow_schema_change_full_hudi0.9/partitionId=part_3/aa852039-ee90-4365-8da9-9ec824b0d872-0_1-19-214_20210829123247.parquet. Column: [intToDouble], Expected: double, Found: INT32
   	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
   	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
   	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
   	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
   	at org.apache.spark.scheduler.Task.run(Task.scala:131)
   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:748)
   Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
   	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:339)
   	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readIntBatch(VectorizedColumnReader.java:571)
   	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:294)
   	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:283)
   	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:181)
   	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
   	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
   	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
   	... 20 more
   21/08/29 12:47:42 WARN TaskSetManager: Lost task 0.0 in stage 59.0 (TID 447) (192.168.1.5 executor driver): org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file s3a://de-dev-sb-data-lake/hudi_test/hudi_trips_cow_schema_change_full_hudi0.9/partitionId=part_3/aa852039-ee90-4365-8da9-9ec824b0d872-0_1-19-214_20210829123247.parquet. Column: [intToDouble], Expected: double, Found: INT32
   	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
   	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
   	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
   	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
   	at org.apache.spark.scheduler.Task.run(Task.scala:131)
   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:748)
   Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
   	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:339)
   	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readIntBatch(VectorizedColumnReader.java:571)
   	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:294)
   	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:283)
   	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:181)
   	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
   	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
   	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
   	... 20 more
   
   21/08/29 12:47:42 ERROR TaskSetManager: Task 0 in stage 59.0 failed 1 times; aborting job
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 59.0 failed 1 times, most recent failure: Lost task 0.0 in stage 59.0 (TID 447) (192.168.1.5 executor driver): org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file s3a://de-dev-sb-data-lake/hudi_test/hudi_trips_cow_schema_change_full_hudi0.9/partitionId=part_3/aa852039-ee90-4365-8da9-9ec824b0d872-0_1-19-214_20210829123247.parquet. Column: [intToDouble], Expected: double, Found: INT32
   	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
   	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
   	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
   	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
   	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
   	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
   	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
   	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
   	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
   	at org.apache.spark.scheduler.Task.run(Task.scala:131)
   	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:748)
   Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
   	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:339)
   	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readIntBatch(VectorizedColumnReader.java:571)
   	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:294)
   	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:283)
   	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:181)
   	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
   	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
   	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
   	... 20 more
   
   Driver stacktrace:
     at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2258)
     at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2207)
     at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2206)
     at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
     at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2206)
     at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1079)
     at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1079)
     at scala.Option.foreach(Option.scala:407)
     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1079)
     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2445)
     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2387)
     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2376)
     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
     at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2196)
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2217)
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2236)
     at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:472)
     at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:425)
     at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
     at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3696)
     at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2722)
     at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3687)
     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
     at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
     at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3685)
     at org.apache.spark.sql.Dataset.head(Dataset.scala:2722)
     at org.apache.spark.sql.Dataset.take(Dataset.scala:2929)
     at org.apache.spark.sql.Dataset.getRows(Dataset.scala:301)
     at org.apache.spark.sql.Dataset.showString(Dataset.scala:338)
     at org.apache.spark.sql.Dataset.show(Dataset.scala:825)
     at org.apache.spark.sql.Dataset.show(Dataset.scala:784)
     at org.apache.spark.sql.Dataset.show(Dataset.scala:793)
     ... 61 elided
   Caused by: org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file s3a://de-dev-sb-data-lake/hudi_test/hudi_trips_cow_schema_change_full_hudi0.9/partitionId=part_3/aa852039-ee90-4365-8da9-9ec824b0d872-0_1-19-214_20210829123247.parquet. Column: [intToDouble], Expected: double, Found: INT32
     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
     at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
     at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
     at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
     at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
     at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
     at org.apache.spark.scheduler.Task.run(Task.scala:131)
     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
     at java.lang.Thread.run(Thread.java:748)
   Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
     at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:339)
     at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readIntBatch(VectorizedColumnReader.java:571)
     at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:294)
     at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:283)
     at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:181)
     at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
     ... 20 more
   
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [hudi] jadireddi commented on issue #3558: [SUPPORT] Schema evolution error: promoted data type from integer to double

Posted by GitBox <gi...@apache.org>.

jadireddi commented on issue #3558:
URL: https://github.com/apache/hudi/issues/3558#issuecomment-957726431


   @nsivabalan , Hudi uses `parquet-avro`, so primitive type conversion differs slightly between avro and parquet-avro.
   
   AFAIK, in parquet-avro:
   an int field accepts int
   a long field accepts int, long
   a double field accepts int, long, float, double.
   
   Ref: https://github.com/apache/parquet-mr/blob/59e9f78b8b3a30073db202eb6432071ff71df0ec/parquet-avro/src/main/java/org/apache/parquet/avro/AvroConverters.java#L146
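
   The list in this comment can be written down as a lookup table. The sketch below is a transcription of the comment only, not of the parquet-avro source; `ParquetAvroAccepts` and `canRead` are hypothetical names, and types not mentioned in the comment (such as float as a reader type) are deliberately left out.

   ```scala
   // Hypothetical illustration; not an API of parquet-avro or Hudi.
   object ParquetAvroAccepts {
     // Key: type of the field in the reader schema.
     // Value: stored primitive types that field can be read from,
     // as listed in the comment.
     val accepted: Map[String, Set[String]] = Map(
       "int"    -> Set("int"),
       "long"   -> Set("int", "long"),
       "double" -> Set("int", "long", "float", "double")
     )

     // True if a value stored as `storedType` can be read into a field
     // declared as `readerType` according to the table above.
     def canRead(storedType: String, readerType: String): Boolean =
       accepted.get(readerType).exists(_.contains(storedType))
   }
   ```

   Under this table `canRead("int", "double")` is true, while `canRead("double", "long")` is false. The latter is consistent with the `UnsupportedOperationException` from `FieldLongConverter` via `PrimitiveConverter.addDouble` reported elsewhere in this thread.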
   



[GitHub] [hudi] novakov-alexey edited a comment on issue #3558: [SUPPORT] Schema evolution error: promoted data type from integer to double

Posted by GitBox <gi...@apache.org>.

novakov-alexey edited a comment on issue #3558:
URL: https://github.com/apache/hudi/issues/3558#issuecomment-957620142


   @kenny291 Sorry, I made a mistake and have just edited my last 2 messages. I am trying to insert
   int -> double
   long -> double,
   
   which should be supported by Avro:
   
   ```
   the writer's schema may be promoted to the reader's as follows:
   
       int is promotable to long, float, or double
       long is promotable to float or double
   
   ```
   http://avro.apache.org/docs/current/spec#Schema+Resolution
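
   The two quoted rules can be expressed as a small check. This is a sketch under stated assumptions: `AvroPromotion` and `resolves` are hypothetical names, the type names are plain strings rather than Avro `Schema.Type` values, and only the two rules quoted above are encoded (the Avro spec lists further promotions that are omitted here).

   ```scala
   // Hypothetical illustration of the quoted rules; not the Avro library API.
   object AvroPromotion {
     // Writer type -> reader types it may be promoted to.
     val promotableTo: Map[String, Set[String]] = Map(
       "int"  -> Set("long", "float", "double"),
       "long" -> Set("float", "double")
     )

     // A writer type resolves against a reader type if the two are equal
     // or one of the promotions above applies.
     def resolves(writerType: String, readerType: String): Boolean =
       writerType == readerType ||
         promotableTo.getOrElse(writerType, Set.empty[String]).contains(readerType)
   }
   ```

   Both cases tried here, `resolves("int", "double")` and `resolves("long", "double")`, come out true, whereas the reverse direction `resolves("double", "int")` does not.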
   
   Besides the Hive sync issue, I also get this exception:
   
   ```scala
           at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
           at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
           at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
           at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:49)
           at org.apache.hudi.common.util.queue.IteratorBasedQueueProducer.produce(IteratorBasedQueueProducer.java:45)
           at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$0(BoundedInMemoryExecutor.java:92)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
           ... 4 more
   Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.avro.AvroConverters$FieldLongConverter
           at org.apache.parquet.io.api.PrimitiveConverter.addDouble(PrimitiveConverter.java:84)
           at org.apache.parquet.column.impl.ColumnReaderImpl$2$2.writeValue(ColumnReaderImpl.java:228)
           at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:367)
           at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
           at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226)
           ... 11 more
   ```



[GitHub] [hudi] novakov-alexey edited a comment on issue #3558: [SUPPORT] Schema evolution error: promoted data type from integer to double

Posted by GitBox <gi...@apache.org>.

novakov-alexey edited a comment on issue #3558:
URL: https://github.com/apache/hudi/issues/3558#issuecomment-957334067


   I got the same error when trying to upsert an Int column as Double into a Hudi table:
   
   ```scala
           at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
           at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
           at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
           at java.lang.reflect.Method.invoke(Method.java:498)
           at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
           at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:959)
           at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
           at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
           at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
           at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1047)
           at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1056)
           at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
   Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Could not convert field Type from INT to double for field new_column
           at org.apache.hudi.hive.util.HiveSchemaUtil.getSchemaDifference(HiveSchemaUtil.java:103)
           at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:242)
           at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:182)
           at org.apache.hudi.hive.HiveSyncTool.doSync(HiveSyncTool.java:131)
           at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:117)
           ... 90 more
   ``` 
   
   Spark: 3.1.2
   Hudi: 0.9



[GitHub] [hudi] nsivabalan edited a comment on issue #3558: [SUPPORT] Schema evolution error: promoted data type from integer to double

Posted by GitBox <gi...@apache.org>.

nsivabalan edited a comment on issue #3558:
URL: https://github.com/apache/hudi/issues/3558#issuecomment-993939947


   I was able to reproduce what @novakov-alexey reported. Using just a local Spark shell, I added one batch of inserts to part_0 and part_1 with a column of integer data type, then did an upsert with an evolved schema in which that column is a double. Reading the Hudi table then throws the following exception:
   ```
   scala> spark.sql("select * from hudi_trips_snapshot2").show()
   21/12/14 12:00:40 ERROR Executor: Exception in task 0.0 in stage 67.0 (TID 645)
   java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
   	at org.apache.parquet.column.Dictionary.decodeToDouble(Dictionary.java:57)
   	at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToDouble(ParquetDictionary.java:46)
   	at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getDouble(OnHeapColumnVector.java:484)
   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
   	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
   	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
   	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
   	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
   	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
   
   ```
   
   
   I encountered the same error when I tried integer -> long. 



[GitHub] [hudi] novakov-alexey commented on issue #3558: [SUPPORT] Schema evolution error: promoted data type from integer to double

Posted by GitBox <gi...@apache.org>.

novakov-alexey commented on issue #3558:
URL: https://github.com/apache/hudi/issues/3558#issuecomment-957620142


   @kenny291 sorry, I made a mistake. I have just edited my last 2 messages. I am trying to insert 
   int -> double
   long -> double.
   
   Besides the Hive sync issue, I also get this exception:
   
   ```scala
           at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
           at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
           at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
           at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:49)
           at org.apache.hudi.common.util.queue.IteratorBasedQueueProducer.produce(IteratorBasedQueueProducer.java:45)
           at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$0(BoundedInMemoryExecutor.java:92)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
           ... 4 more
   Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.avro.AvroConverters$FieldLongConverter
           at org.apache.parquet.io.api.PrimitiveConverter.addDouble(PrimitiveConverter.java:84)
           at org.apache.parquet.column.impl.ColumnReaderImpl$2$2.writeValue(ColumnReaderImpl.java:228)
           at org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:367)
           at org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:406)
           at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:226)
           ... 11 more
   ```



[GitHub] [hudi] jadireddi commented on issue #3558: [SUPPORT] Schema evolution error: promoted data type from integer to double

Posted by GitBox <gi...@apache.org>.

jadireddi commented on issue #3558:
URL: https://github.com/apache/hudi/issues/3558#issuecomment-957726431


   @nsivabalan , Hudi uses `parquet-avro`, so primitive type conversion differs slightly between avro and parquet-avro.
   
   AFAIK,
   int is promoted to int
   long is promoted to int, long
   double is promoted to int, long, float, double.
   
   Ref: https://github.com/apache/parquet-mr/blob/59e9f78b8b3a30073db202eb6432071ff71df0ec/parquet-avro/src/main/java/org/apache/parquet/avro/AvroConverters.java#L146
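
The list above can be written down as a small lookup table. This is only a sketch of one reading of the comment: it assumes each entry names the type a converter reads into, followed by the stored Parquet types it accepts, which is consistent with the `FieldLongConverter` / `addDouble` failure reported in this thread. `ParquetAvroPromotion`, `accepted` and `canRead` are hypothetical names for illustration, not part of parquet-avro.

```scala
// Illustration only: encodes the list from the comment as a lookup table.
// These names are hypothetical and are not parquet-avro's API.
object ParquetAvroPromotion {
  // Key: type the reader converts into. Value: stored Parquet types it accepts.
  val accepted: Map[String, Set[String]] = Map(
    "int"    -> Set("int"),
    "long"   -> Set("int", "long"),
    "double" -> Set("int", "long", "float", "double")
  )

  // True when a column stored as `stored` can be read into `target`.
  def canRead(stored: String, target: String): Boolean =
    accepted.get(target).exists(_.contains(stored))
}
```

Under this reading, `ParquetAvroPromotion.canRead("double", "long")` is false, which would explain a long converter rejecting a stored double.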
   



[GitHub] [hudi] vinothchandar commented on issue #3558: [SUPPORT] Schema evolution error: promoted data type from integer to double

Posted by GitBox <gi...@apache.org>.

vinothchandar commented on issue #3558:
URL: https://github.com/apache/hudi/issues/3558#issuecomment-910392433


   cc @codope could you please take a look at this?



[GitHub] [hudi] nsivabalan closed issue #3558: [SUPPORT] Schema evolution error: promoted data type from integer to double

Posted by GitBox <gi...@apache.org>.

nsivabalan closed issue #3558:
URL: https://github.com/apache/hudi/issues/3558


   



[GitHub] [hudi] kenny291 commented on issue #3558: [SUPPORT] Schema evolution error: promoted data type from integer to double

Posted by GitBox <gi...@apache.org>.

kenny291 commented on issue #3558:
URL: https://github.com/apache/hudi/issues/3558#issuecomment-957484205


   @novakov-alexey, double to int is not a compatible change under Avro schema resolution: 
   http://avro.apache.org/docs/current/spec#Schema+Resolution
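
For reference, the primitive promotions listed in the Schema Resolution section of the Avro spec can be sketched as a lookup from the writer's type to the reader types it may be promoted to. `AvroPromotion`, `promotableTo` and `isReadable` are hypothetical names used only for this illustration; they are not part of the Avro library.

```scala
// Sketch of the primitive promotions listed in the Avro spec's Schema Resolution section.
// AvroPromotion and isReadable are hypothetical names, not part of the Avro library.
object AvroPromotion {
  // Key: writer's type. Value: reader types it may be promoted to.
  val promotableTo: Map[String, Set[String]] = Map(
    "int"    -> Set("long", "float", "double"),
    "long"   -> Set("float", "double"),
    "float"  -> Set("double"),
    "string" -> Set("bytes"),
    "bytes"  -> Set("string")
  )

  // A reader can resolve a writer's primitive if the types match or a promotion exists.
  def isReadable(writer: String, reader: String): Boolean =
    writer == reader || promotableTo.getOrElse(writer, Set.empty[String]).contains(reader)
}
```

Here `AvroPromotion.isReadable("int", "double")` holds while `AvroPromotion.isReadable("double", "int")` does not, which is the asymmetry the comment points out.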



[GitHub] [hudi] novakov-alexey commented on issue #3558: [SUPPORT] Schema evolution error: promoted data type from integer to double

Posted by GitBox <gi...@apache.org>.

novakov-alexey commented on issue #3558:
URL: https://github.com/apache/hudi/issues/3558#issuecomment-957334067


   I got the same error when trying to upsert Double to Int into a Hudi table:
   
   ```scala
           at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
           at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
           at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
           at java.lang.reflect.Method.invoke(Method.java:498)
           at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
           at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:959)
           at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
           at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
           at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
           at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1047)
           at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1056)
           at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
   Caused by: org.apache.hudi.hive.HoodieHiveSyncException: Could not convert field Type from INT to double for field new_column
           at org.apache.hudi.hive.util.HiveSchemaUtil.getSchemaDifference(HiveSchemaUtil.java:103)
           at org.apache.hudi.hive.HiveSyncTool.syncSchema(HiveSyncTool.java:242)
           at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:182)
           at org.apache.hudi.hive.HiveSyncTool.doSync(HiveSyncTool.java:131)
           at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:117)
           ... 90 more
   ``` 
   
   Spark: 3.1.2
   Hudi: 0.9



[GitHub] [hudi] codope commented on issue #3558: [SUPPORT] Schema evolution error: promoted data type from integer to double

Posted by GitBox <gi...@apache.org>.

codope commented on issue #3558:
URL: https://github.com/apache/hudi/issues/3558#issuecomment-911683633


   @kenny291 I ran a [quick test locally](https://gist.github.com/codope/0a8945ff46337a6eff1ead9a9f881616) (Spark 3.1.2, Hudi 0.9.0) and I did not see the error. Do you have hive/glue metastore sync enabled? If yes, could you check the schema of the corresponding table registered in hive/glue?



[GitHub] [hudi] novakov-alexey edited a comment on issue #3558: [SUPPORT] Schema evolution error: promoted data type from integer to double

Posted by GitBox <gi...@apache.org>.

novakov-alexey edited a comment on issue #3558:
URL: https://github.com/apache/hudi/issues/3558#issuecomment-957466015


   @codope it seems the issue is in the HiveSync feature. I am trying to upsert column A with double type into an existing Hudi table where column A has Int type. The upsert works, but the Hudi sync to Hive fails with the exception above. 
   Should I try to recreate Hive table via DDL manually? 



[GitHub] [hudi] novakov-alexey edited a comment on issue #3558: [SUPPORT] Schema evolution error: promoted data type from integer to double

Posted by GitBox <gi...@apache.org>.

novakov-alexey edited a comment on issue #3558:
URL: https://github.com/apache/hudi/issues/3558#issuecomment-957466015


   @codope it seems the issue is in the HiveSync feature. I am trying to upsert column A with int type into an existing Hudi table where column A has Double type. The upsert works, but the Hudi sync to Hive fails with the exception above. 
   Should I try to recreate Hive table via DDL manually? 



[GitHub] [hudi] novakov-alexey edited a comment on issue #3558: [SUPPORT] Schema evolution error: promoted data type from integer to double

Posted by GitBox <gi...@apache.org>.

novakov-alexey edited a comment on issue #3558:
URL: https://github.com/apache/hudi/issues/3558#issuecomment-957860501


   I was also using `ENABLE_ROW_WRITER.key -> "true"`, which causes the error below: 
   
   ```scala
           at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
           at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
           at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:136)
           at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:49)
           at org.apache.hudi.common.util.queue.IteratorBasedQueueProducer.produce(IteratorBasedQueueProducer.java:45)
           at org.apache.hudi.common.util.queue.BoundedInMemoryExecutor.lambda$null$0(BoundedInMemoryExecutor.java:92)
           at java.util.concurrent.FutureTask.run(FutureTask.java:266)
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
           ... 4 more
   Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainIntegerDictionary
           at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
           at org.apache.parquet.avro.AvroConverters$BinaryConverter.setDictionary(AvroConverters.java:75)
           at org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:341)
           at org.apache.parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:80)
           at org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:75)
           at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:271)
           at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
           at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
           at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:165)
           at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
           at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
           at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
           ... 11 more
   ```
   
   If I set this flag to false, then the errors posted earlier arise.
   
   All in all, it seems we have 3 issues here.



[GitHub] [hudi] nsivabalan commented on issue #3558: [SUPPORT] Schema evolution error: promoted data type from integer to double

Posted by GitBox <gi...@apache.org>.

nsivabalan commented on issue #3558:
URL: https://github.com/apache/hudi/issues/3558#issuecomment-993948978


   I have filed a tracking jira [here](https://issues.apache.org/jira/browse/HUDI-3019). We are working towards comprehensive schema evolution support, and I will ensure this issue gets addressed as part of that dev work. As of now, since we rely on external parquet-avro libs, I don't think we can do much here until we have comprehensive schema evolution. 
   
   Thanks for bringing it to our attention. I will close out the github issue as we have a tracking jira. Feel free to add more details to the jira
   and re-open this issue if there is anything more to be addressed. 



[GitHub] [hudi] kenny291 commented on issue #3558: [SUPPORT] Schema evolution error: promoted data type from integer to double

Posted by GitBox <gi...@apache.org>.

kenny291 commented on issue #3558:
URL: https://github.com/apache/hudi/issues/3558#issuecomment-912213385


   Hi @codope,
   Thanks for rechecking it. I disabled Hive/Glue sync.
   
   I tested with your example in [quick test locally](https://gist.github.com/codope/0a8945ff46337a6eff1ead9a9f881616).
   It worked, but your data is a little different: it has only 1 partition, `part_0`.
   
   Did you recheck with the data I mentioned?
   data1: 
   partition: part_0 and part_3
   
   then data2:
   partition: part_0 and part_9 (data type changed to double, no new data coming for part_3)
   
   As we saw in the hudi-spark exception, the data in partition `part_3` still has data type `INT`, so Spark cannot read it with the latest schema.
   
   Thanks
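
The scenario described above can be followed with a toy model. This is a sketch under the assumption stated in the comment, namely that an upsert only rewrites the partitions that receive records, so untouched partitions keep the column type they were written with. It does not call Hudi; `PartitionSchemaModel`, `upsert`, `stale` and `Scenario` are hypothetical names for illustration.

```scala
// Toy model only: tracks the physical type of one column per partition.
// It does not call Hudi; all names here are hypothetical.
object PartitionSchemaModel {
  // An upsert rewrites only the partitions that receive records in this batch.
  def upsert(table: Map[String, String], touched: Set[String], newType: String): Map[String, String] =
    table ++ touched.map(p => p -> newType)

  // Partitions whose files still hold a type other than the table's latest schema type.
  def stale(table: Map[String, String], latestType: String): Set[String] =
    table.collect { case (p, t) if t != latestType => p }.toSet
}

object Scenario {
  // data1: part_0 and part_3 written with an int column.
  val afterData1: Map[String, String] =
    PartitionSchemaModel.upsert(Map.empty[String, String], Set("part_0", "part_3"), "int")
  // data2: part_0 and part_9 written with the column evolved to double.
  val afterData2: Map[String, String] =
    PartitionSchemaModel.upsert(afterData1, Set("part_0", "part_9"), "double")
  // Partitions that still hold the old type after data2.
  val stalePartitions: Set[String] = PartitionSchemaModel.stale(afterData2, "double")
}
```

In this model `Scenario.stalePartitions` contains only `part_3`, the partition that received no records in the second batch and is therefore still stored as int.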

