Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/06/02 08:24:13 UTC

[GitHub] [hudi] YuweiXiao opened a new issue, #5740: [SUPPORT] Failed to read hudi table with decimal type using spark

YuweiXiao opened a new issue, #5740:
URL: https://github.com/apache/hudi/issues/5740

   **Describe the problem you faced**
   
   Currently, Spark may have problems reading Parquet files with decimal-typed values [[link](https://stackoverflow.com/questions/63578928/spark-unable-to-read-decimal-columns-in-parquet-files-written-by-avroparquetwrit)]. A workaround is to set `spark.sql.parquet.enableVectorizedReader=false`.
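   
   For reference, a minimal sketch of that workaround when reading (the table path is a placeholder; note that, as described below, some Hudi read paths override this setting):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   val spark = SparkSession.builder()
     .master("local[*]")
     // Fall back to the row-based Parquet reader, which handles the
     // binary-encoded decimals that the vectorized reader rejects.
     .config("spark.sql.parquet.enableVectorizedReader", "false")
     .getOrCreate()
   
   // Placeholder path to the Hudi table.
   val df = spark.read.format("hudi").load("/path/to/hudi_table")
   df.show()
   ```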
   
   However, some Hudi Spark read paths explicitly set this config to true [[commit](https://github.com/apache/hudi/pull/5168)].
   
   Though it might sound hacky, we may need to auto-set this config based on the schema, e.g., turn vectorization off if the schema contains a decimal type, as sketched below. We may also need to respect the Spark configs set by users, rather than overriding them directly.
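   
   A rough sketch of what such auto-detection could look like (this is a hypothetical helper, not existing Hudi code; it only inspects top-level fields for brevity):
   
   ```scala
   import org.apache.spark.sql.types.{DecimalType, StructType}
   
   // Hypothetical helper: decide the reader config, preferring any value the
   // user supplied (e.g., via datasource options) over schema-based detection.
   def resolveVectorizedReader(userValue: Option[String], schema: StructType): String = {
     val hasDecimal = schema.exists(_.dataType.isInstanceOf[DecimalType])
     (userValue, hasDecimal) match {
       case (Some(v), _)  => v       // respect the user's explicit setting
       case (None, true)  => "false" // decimals present: take the safe path
       case (None, false) => "true"  // no decimals: vectorization is safe
     }
   }
   
   // e.g.: spark.conf.set("spark.sql.parquet.enableVectorizedReader",
   //                      resolveVectorizedReader(userValue, df.schema))
   ```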
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. create a hudi table with decimal type column
   2. use spark to read the table
   
   **Expected behavior**
   
   
   **Environment Description**
   
   * Hudi version : master
   
   * Spark version : 2.4.4
   
   * Hive version : -
   
   * Hadoop version : -
   
   * Storage (HDFS/S3/GCS..) : local
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   
   **Stacktrace**
   
   ```
   Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in stage 4.0 (TID 5, localhost, executor driver): org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
   	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:250)
   	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:497)
   	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:220)
   	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
   	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
   	at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
   	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
   	at org.apache.hudi.HoodieMergeOnReadRDD$RecordMergingFileIterator.hasNextInternal(HoodieMergeOnReadRDD.scala:273)
   	at org.apache.hudi.HoodieMergeOnReadRDD$RecordMergingFileIterator.hasNext(HoodieMergeOnReadRDD.scala:267)
   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown Source)
   	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
   	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
   	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
   	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
   	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
   	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
   	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
   	at org.apache.spark.scheduler.Task.run(Task.scala:123)
   	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
   	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
   	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:748)
   ```
   
   




[GitHub] [hudi] YuweiXiao commented on issue #5740: [SUPPORT] Failed to read hudi table with decimal type using spark

Posted by GitBox <gi...@apache.org>.
YuweiXiao commented on issue #5740:
URL: https://github.com/apache/hudi/issues/5740#issuecomment-1157468880

   If you want the demo table data, I can prepare it.
   
   ```
   message triprec {
     optional binary _hoodie_commit_time (STRING);
     optional binary _hoodie_commit_seqno (STRING);
     optional binary _hoodie_record_key (STRING);
     optional binary _hoodie_partition_path (STRING);
     optional binary _hoodie_file_name (STRING);
     required int64 ts;
     required binary uuid (STRING);
     required binary rider (STRING);
     required binary driver (STRING);
     required double begin_lat;
     required double begin_lon;
     required double end_lat;
     required double end_lon;
     required double fare;
     required float float_value;
     required boolean boolean_value;
     required binary decimal_value (DECIMAL(10,4));
     required int64 timestamp_value (TIMESTAMP(MILLIS,true));
     required int32 date_value (DATE);
     required binary partitionpath (STRING);
   }
   ```
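   
   Note the `binary (DECIMAL(10,4))` encoding of `decimal_value` above, which is what the vectorized reader trips over. For reference, a minimal sketch of dumping a file's footer schema with the parquet-mr API to check this (the file path is a placeholder):
   
   ```scala
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.Path
   import org.apache.parquet.hadoop.ParquetFileReader
   import org.apache.parquet.hadoop.util.HadoopInputFile
   
   // Print the Parquet schema stored in the file footer, to see whether
   // decimal columns are encoded as binary or as int32/int64.
   val file = HadoopInputFile.fromPath(
     new Path("/path/to/table/partition/some-file.parquet"), new Configuration())
   val reader = ParquetFileReader.open(file)
   println(reader.getFooter.getFileMetaData.getSchema)
   reader.close()
   ```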




[GitHub] [hudi] minihippo commented on issue #5740: [SUPPORT] Failed to read hudi table with decimal type using spark

Posted by GitBox <gi...@apache.org>.
minihippo commented on issue #5740:
URL: https://github.com/apache/hudi/issues/5740#issuecomment-1156643082

   Could you provide the Parquet schema?




[GitHub] [hudi] minihippo commented on issue #5740: [SUPPORT] Failed to read hudi table with decimal type using spark

Posted by GitBox <gi...@apache.org>.
minihippo commented on issue #5740:
URL: https://github.com/apache/hudi/issues/5740#issuecomment-1160653283

   @YuweiXiao The Parquet schema is enough. I'll follow up on this problem.




[GitHub] [hudi] YuweiXiao closed issue #5740: [SUPPORT] Failed to read hudi table with decimal type using spark

Posted by GitBox <gi...@apache.org>.
YuweiXiao closed issue #5740: [SUPPORT] Failed to read hudi table with decimal type using spark
URL: https://github.com/apache/hudi/issues/5740




[GitHub] [hudi] minihippo commented on issue #5740: [SUPPORT] Failed to read hudi table with decimal type using spark

Posted by GitBox <gi...@apache.org>.
minihippo commented on issue #5740:
URL: https://github.com/apache/hudi/issues/5740#issuecomment-1161554490

   I'd suggest aligning with Spark. Consider a Hudi table with two write paths: Flink for real-time writes and Spark for batch backfill. Flink is already aligned with Spark (see HUDI-3096). Therefore, when writing with Spark, it is recommended to represent decimals with precision smaller than 18 as int64/int32.
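   
   For reference, this matches the rule Spark applies when `spark.sql.parquet.writeLegacyFormat=false` (the default), sketched here as a hypothetical helper:
   
   ```scala
   // Parquet physical type Spark chooses for DECIMAL(p, s) with
   // spark.sql.parquet.writeLegacyFormat=false (the default):
   def parquetTypeForDecimal(precision: Int): String =
     if (precision <= 9) "int32"           // fits in 4 bytes
     else if (precision <= 18) "int64"     // fits in 8 bytes
     else "fixed_len_byte_array"           // wider decimals
   
   // e.g. parquetTypeForDecimal(10) == "int64", as for DECIMAL(10,4) above.
   ```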




[GitHub] [hudi] YuweiXiao commented on issue #5740: [SUPPORT] Failed to read hudi table with decimal type using spark

Posted by GitBox <gi...@apache.org>.
YuweiXiao commented on issue #5740:
URL: https://github.com/apache/hudi/issues/5740#issuecomment-1161333925

   > @YuweiXiao The root cause is that Avro serializes the decimal type to bytes. However, for DECIMAL(10,4), Spark writes it as int64.
   > 
   > ```
   > message spark_schema {
   >   optional int64 id (DECIMAL(10,4));
   > }
   > ```
   > 
   > For more details, see [sksamuel/avro4s#271](https://github.com/sksamuel/avro4s/issues/271).
   
   Yes, disabling vectorized reads is a workaround. What do you think about the solution I mentioned above, i.e., auto-disabling vectorization based on the schema and respecting user-defined configs?




[GitHub] [hudi] minihippo commented on issue #5740: [SUPPORT] Failed to read hudi table with decimal type using spark

Posted by GitBox <gi...@apache.org>.
minihippo commented on issue #5740:
URL: https://github.com/apache/hudi/issues/5740#issuecomment-1160667998

   @YuweiXiao The root cause is that Avro serializes the decimal type to bytes. However, for DECIMAL(10,4), Spark writes it as int64.
   For more details, see https://github.com/sksamuel/avro4s/issues/271.
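   
   A minimal sketch to observe this with vanilla Spark (the output path is a placeholder):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.types.DecimalType
   
   val spark = SparkSession.builder().master("local[*]").getOrCreate()
   import spark.implicits._
   
   // Write a DECIMAL(10,4) column and inspect the resulting Parquet schema.
   val df = Seq("5.0000").toDF("raw")
     .select($"raw".cast(DecimalType(10, 4)).as("decimal_value"))
   df.write.mode("overwrite").parquet("/tmp/decimal_check")
   
   // `parquet-tools meta /tmp/decimal_check/part-*.parquet` should then show
   // something like: optional int64 decimal_value (DECIMAL(10,4))
   ```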




[GitHub] [hudi] xiarixiaoyao commented on issue #5740: [SUPPORT] Failed to read hudi table with decimal type using spark

Posted by GitBox <gi...@apache.org>.
xiarixiaoyao commented on issue #5740:
URL: https://github.com/apache/hudi/issues/5740#issuecomment-1164148570

   @YuweiXiao How did you produce the Parquet file you provided? I tried the following:
   ```
      spark.sql(
         s"""
            |create table hhx (
            |  id int,
            |  name string,
            |  price double,
            |  qi decimal(10, 4),
            |  ts long
            |) using hudi
            | location '/tmp/default/hhx'
            | options (
            |  type = 'cow',
            |  primaryKey = 'id',
            |  preCombineField = 'ts'
            | )
                """.stripMargin)
   
       spark.sql("insert into hhx values(1, 'meng', 3.4, 5.00, 898)")
   
       spark.sql("select * from hhx").show(false)
   ```
   This code works fine on Hudi 0.11.
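   
   For anyone trying this, a typical way to launch a shell for it (the bundle coordinates and configs below are assumptions for a Spark 3.1 / Hudi 0.11 setup):
   
   ```
   spark-shell \
     --packages org.apache.hudi:hudi-spark3.1-bundle_2.12:0.11.0 \
     --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
     --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
   ```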
   
   




[GitHub] [hudi] YuweiXiao commented on issue #5740: [SUPPORT] Failed to read hudi table with decimal type using spark

Posted by GitBox <gi...@apache.org>.
YuweiXiao commented on issue #5740:
URL: https://github.com/apache/hudi/issues/5740#issuecomment-1207829417

   Hey @xiarixiaoyao, thanks for the script; it works fine. The dataset I was testing was created manually by myself, and it treated decimal values as binary.
   
   To sum up this issue, the root cause is what @minihippo posted: a decimal type with small precision should be represented as a fixed-length type rather than binary to work with Spark. The default behavior of Hudi (on both the Flink and Spark engines) is already aligned with this 'fixed-length' rule.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org