Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/04/01 09:04:15 UTC

[GitHub] [hudi] babumahesh-koo opened a new issue #5198: [SUPPORT] Querying data generated by TimestampBasedKeyGenerator failed to parse timestamp in EPOCHMILLISECONDS column to date format

babumahesh-koo opened a new issue #5198:
URL: https://github.com/apache/hudi/issues/5198


   Hey guys,
   
   I am trying to use DeltaStreamer with CustomKeyGenerator so that I can combine ComplexKeyGenerator and TimestampBasedKeyGenerator.
   
   When the timestamp column (EPOCHMILLISECONDS) is used as a partition field with the output format "yyyy-MM-dd hh", ingestion works fine but querying fails with casting exceptions.
   
   These are my config properties for the partitioning:
   
   hudiOptions = {
       "hoodie.table.name": "my_hudi_table",
       "hoodie.datasource.write.recordkey.field": "id",
       "hoodie.datasource.write.partitionpath.field": "created_at:timestamp,city:simple",
       "hoodie.datasource.write.precombine.field": "last_update_time",
       "hoodie.index.type": "GLOBAL_BLOOM",
       "hoodie.bloom.index.update.partition.path": "true",
       "hoodie.datasource.write.keygenerator.class":"org.apache.hudi.keygen.CustomKeyGenerator",
       "hoodie.deltastreamer.keygen.timebased.timestamp.type":"EPOCHMILLISECONDS",
       "hoodie.deltastreamer.keygen.timebased.output.dateformat":"yyyy-MM-dd hh",
       "hoodie.deltastreamer.keygen.timebased.timezone":"GMT",
       "hoodie.datasource.write.hive_style_partitioning":"true"
       }
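   
   For context, this is roughly how the options above would be applied in a
   plain PySpark write (the S3 path and the "input_df" DataFrame are
   placeholders; my actual ingestion goes through DeltaStreamer):
   
       # Sketch only: apply hudiOptions in a datasource write.
       # "input_df" and the S3 path are placeholders.
       input_df.write.format("hudi") \
           .options(**hudiOptions) \
           .mode("append") \
           .save("s3://my-bucket/my_hudi_table/")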
   
   Expected behavior
   
   After ingestion, I should be able to query data
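   
   The failing read is a plain Hudi datasource load along these lines (the
   S3 path is a placeholder):
   
       # Sketch only: the load that raises the cast error below.
       df = spark.read.format("hudi").load("s3://my-bucket/my_hudi_table/")
       df.show()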
   
   Environment Description
   
   Hudi version : 0.10.0
   
   Spark version : 3.1.1
   
   Hive version : 3.1.2
   
   Hadoop version : 3.1.1
   
   Storage (HDFS/S3/GCS..) : S3
   
   Running on Docker? (yes/no) : no
   
   Steps to reproduce the behavior:
   
   https://gist.github.com/babumahesh-koo/c7fe9dd70e1e4f59ecb6fd34925553e9
   
   Stacktrace
   
   File "/home/ubuntu/hadoop/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
   py4j.protocol.Py4JJavaError: An error occurred while calling o37.load.
   : java.lang.RuntimeException: Failed to cast value `2014-12-31 09` to `LongType` for partition column `created_at`
   	at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitionColumn(PartitioningUtils.scala:313)
   	at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartition(PartitioningUtils.scala:251)
   	at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.parsePartition(Spark3ParsePartitionUtil.scala:37)
   	at org.apache.hudi.HoodieFileIndex.$anonfun$getAllQueryPartitionPaths$3(HoodieFileIndex.scala:586)
   
   Additional context
   After the data ingestion, the parquet files have the column type LongType, while the partition column value is a string (as expected, since we used the output format "yyyy-MM-dd hh"). Because of this mismatch, the cast fails.
   
   But if I remove all the spaces and hyphens from the output format while ingesting, then querying succeeds, because the partition column value is 2022033010 rather than "2022-03-30 10" (see the sketch below).
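   
   A minimal sketch of that workaround, changing only the output format
   property (all other options stay as in hudiOptions above):
   
       # Sketch only: a separator-free output format, so the partition
       # path value (e.g. 2022033010) parses without the failing cast.
       hudiOptions["hoodie.deltastreamer.keygen.timebased.output.dateformat"] = "yyyyMMddhh"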
   

