Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2023/01/06 14:26:59 UTC

[GitHub] [hudi] MichaelUryukin opened a new issue, #7617: [SUPPORT] Hudi "write" command doesn't fail when on incompatible partition type, but "read" command fails.

MichaelUryukin opened a new issue, #7617:
URL: https://github.com/apache/hudi/issues/7617

   **Describe the problem you faced**
   
   
   When we write a DataFrame to a Hudi table that is partitioned by a column of type "date", and the value of that column is NULL in one of the rows, Hudi substitutes the configured "default" partition value (https://hudi.apache.org/docs/0.10.1/configurations#partitiondefault_name). The write command (`df.write.format("hudi")....`) **succeeds**, but the read command (`spark.read.format("hudi")...`) **fails** when casting the value `default` to `DateType` for the partition column.
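   To illustrate the mismatch (a standalone sketch, not actual Hudi code; `partition_path` and `DEFAULT_PARTITION_NAME` are hypothetical stand-ins): with hive-style partitioning, a NULL partition value is written out as the literal string `default`, which Spark cannot later parse back into a `DateType`:

```python
import datetime

# Hypothetical stand-in for Hudi's hive-style partition path handling:
# a NULL partition value is replaced by the configured default name.
DEFAULT_PARTITION_NAME = "default"

def partition_path(column, value):
    """Build a hive-style partition path segment, e.g. 'birth_date=2000-01-01'."""
    if value is None:
        return f"{column}={DEFAULT_PARTITION_NAME}"
    return f"{column}={value.isoformat()}"

path = partition_path("birth_date", None)
print(path)  # birth_date=default

# On read, Spark tries to cast the path value back to DateType -- this is
# roughly where 'Failed to cast value `default` to `DateType`' comes from.
try:
    datetime.date.fromisoformat(path.split("=", 1)[1])
except ValueError as exc:
    print("cast failed:", exc)
```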
   
   
   Steps to reproduce the behaviour:
   1. Create a sample DataFrame with at least one row where `birth_date` is NULL:
   ```
   import datetime

   from pyspark.sql.types import StructType, StructField, StringType, DateType, IntegerType

   data = [
       ("James", "", "Smith", "36636", datetime.date(2000, 1, 1), 3000),
       ("Michael", "Rose", "", "40288", None, 4000),
       ("Robert", "", "Williams", "42114", None, 4000),
       ("Maria", "Anne", "Jones", "39192", None, 4000)
   ]

   schema = StructType([
       StructField("firstname", StringType(), True),
       StructField("middlename", StringType(), True),
       StructField("lastname", StringType(), True),
       StructField("id", StringType(), True),
       StructField("birth_date", DateType(), True),
       StructField("salary", IntegerType(), True)])

   df = spark.createDataFrame(data=data, schema=schema)
   ```
   2. Set up the Hudi table configs and write to the table:
   
   ```
   table_name = 'glue_hudi_null_date_partition_issue'
   hudi_options = {
       'className': 'org.apache.hudi',
       'hoodie.datasource.write.precombine.field': 'id',
       'hoodie.datasource.write.recordkey.field': 'id',
       'hoodie.table.name': table_name,
       'hoodie.consistency.check.enabled': 'true',
       'hive_sync.ignore_exceptions': 'false',
       'hoodie.insert.shuffle.parallelism': '200',
       'hoodie.bulkinsert.shuffle.parallelism': '200',
       'hoodie.upsert.shuffle.parallelism': '200',
       'hoodie.datasource.write.partitionpath.field': 'birth_date',
       'hoodie.datasource.write.hive_style_partitioning': 'true',
       'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator'
   }
   
   df.write.format("hudi").options(**hudi_options).mode("overwrite").save(f"s3://bucket-name/{table_name}/")
   ```
   3. Read from this table:
   ```
   spark.read.format("hudi").options(**hudi_options).load(f"s3://bucket-name/{table_name}/").show()
   ```
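   Until the write path rejects NULL partition values, one possible workaround (a sketch, not an official Hudi fix; the sentinel date is arbitrary) is to replace NULLs in the partition column with a sentinel date before writing, so every partition path stays castable to `DateType`. The pure-Python equivalent of the coalesce logic:

```python
import datetime

# Hypothetical sentinel; any date your consumers treat as "unknown" works.
SENTINEL = datetime.date(1900, 1, 1)

def coalesce_date(value, default=SENTINEL):
    """Mirror of F.coalesce(col, lit(default)): keep the value unless it is None."""
    return value if value is not None else default

rows = [
    ("James", datetime.date(2000, 1, 1)),
    ("Michael", None),
]
cleaned = [(name, coalesce_date(d)) for name, d in rows]
print(cleaned)
```

   In PySpark this corresponds to `df.withColumn("birth_date", F.coalesce(F.col("birth_date"), F.lit(datetime.date(1900, 1, 1))))` (with `from pyspark.sql import functions as F`) applied before the `df.write` call above.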
   
   
   **Expected behavior**
    I would expect the "write" command to fail instead of producing a table that cannot be read.
   
   **Environment Description**
   
   * Hudi version : 0.10.1
   
   * Spark version : 3.1.2
   
   * Hive version : 
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   
   **Stacktrace**
   
   ```
   Py4JJavaError: An error occurred while calling o444.load.
   : java.lang.RuntimeException: Failed to cast value `default` to `DateType` for partition column `birth_date`
   	at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitionColumn(PartitioningUtils.scala:313)
   	at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartition(PartitioningUtils.scala:251)
   	at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.parsePartition(Spark3ParsePartitionUtil.scala:37)
   	at org.apache.hudi.HoodieFileIndex.$anonfun$getAllQueryPartitionPaths$3(HoodieFileIndex.scala:586)
   	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:233)
   	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:58)
   	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:51)
   	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   	at scala.collection.TraversableLike.map(TraversableLike.scala:233)
   	at scala.collection.TraversableLike.map$(TraversableLike.scala:226)
   	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
   	at org.apache.hudi.HoodieFileIndex.getAllQueryPartitionPaths(HoodieFileIndex.scala:538)
   	at org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:602)
   	at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:387)
   	at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:184)
   	at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:119)
   	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:69)
   	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
   	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:326)
   	at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:308)
   	at scala.Option.getOrElse(Option.scala:121)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:308)
   	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:240)
   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   	at java.lang.reflect.Method.invoke(Method.java:498)
   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
   	at py4j.Gateway.invoke(Gateway.java:282)
   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
   	at java.lang.Thread.run(Thread.java:750)
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] ad1happy2go commented on issue #7617: [SUPPORT] Hudi "write" command doesn't fail when on incompatible partition type, but "read" command fails.

Posted by "ad1happy2go (via GitHub)" <gi...@apache.org>.
ad1happy2go commented on issue #7617:
URL: https://github.com/apache/hudi/issues/7617#issuecomment-1529395358

   @MichaelUryukin Closing this, as it works with version 0.12.2 and above. Please reopen if you see any issues.




[GitHub] [hudi] jonvex commented on issue #7617: [SUPPORT] Hudi "write" command doesn't fail when on incompatible partition type, but "read" command fails.

Posted by GitBox <gi...@apache.org>.
jonvex commented on issue #7617:
URL: https://github.com/apache/hudi/issues/7617#issuecomment-1376342443

   @xushiyan I tested this and now it succeeds. I didn't test on 0.10.1 to verify it was failing, because I'd need to spin up and set up an EC2 instance, but I can do that as well if you want me to.




[GitHub] [hudi] xushiyan commented on issue #7617: [SUPPORT] Hudi "write" command doesn't fail when on incompatible partition type, but "read" command fails.

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #7617:
URL: https://github.com/apache/hudi/issues/7617#issuecomment-1374501572

   @jonvex Can you help verify this with 0.12.2 and master, please? Just to confirm the behavior was fixed.




[GitHub] [hudi] codope closed issue #7617: [SUPPORT] Hudi "write" command doesn't fail when on incompatible partition type, but "read" command fails.

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope closed issue #7617: [SUPPORT] Hudi "write" command doesn't fail when on incompatible partition type, but "read" command fails.
URL: https://github.com/apache/hudi/issues/7617

