Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/03/03 16:30:17 UTC

[GitHub] [hudi] chrischnweiss opened a new issue #4946: [SUPPORT] TimestampBasedKeyGenerator failed to parse date_string column

chrischnweiss opened a new issue #4946:
URL: https://github.com/apache/hudi/issues/4946


   **Describe the problem you faced**
   
   Hey guys, 
   I am trying to use DeltaStreamer with a CustomKeyGenerator so that I can combine ComplexKeyGenerator and TimestampBasedKeyGenerator. I am now running into a date format exception and I can't figure out why. Maybe you can help me?
   
   These are my config properties for the partitioning:
   
   ```
   hoodie.datasource.write.recordkey.field=id_1,id_2
   hoodie.datasource.write.partitionpath.field=timestamp_col:timestamp
   hoodie.datasource.write.hive_style_partitioning=true
   hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
   hoodie.deltastreamer.keygen.timebased.timestamp.type=DATE_STRING
   hoodie.deltastreamer.keygen.timebased.input.dateformat="yyyy-MM-dd HH:mm:ss.SSS Z"
   hoodie.deltastreamer.keygen.timebased.output.dateformat="yyyy/MM/dd"
   ```
   
   **Expected behavior**
   
   I want the date string in timestamp_col to be parsed so that it can be used to partition the target dataset.
   My timestamp_col looks like this: `2022-01-13 16:57:05.659 +01:00`
   
   **Environment Description**
   
   Hudi version : 0.10.0
   
   Spark version : 3.1.1
   
   Hive version : 3.1.3000
   
   Hadoop version : 3.1.1
   
   Storage (HDFS/S3/GCS..) : HDFS
   
   Running on Docker? (yes/no) : no
   
   
   **Stacktrace**
   
   Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 11) (t-worker.node.asdasd.de executor 1): org.apache.hudi.exception.HoodieKeyGeneratorException: Unable to parse input partition field :2022-01-13 16:57:05.659 +01:00
           at org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:135)
           at org.apache.hudi.keygen.CustomAvroKeyGenerator.getPartitionPath(CustomAvroKeyGenerator.java:89)
           at org.apache.hudi.keygen.CustomKeyGenerator.getPartitionPath(CustomKeyGenerator.java:68)
           at org.apache.hudi.keygen.BaseKeyGenerator.getKey(BaseKeyGenerator.java:62)
           at org.apache.hudi.utilities.deltastreamer.DeltaSync.lambda$readFromSource$d62e16$1(DeltaSync.java:453)
           at org.apache.spark.api.java.JavaPairRDD$.$anonfun$toScalaFunction$1(JavaPairRDD.scala:1070)
           at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
           at org.apache.spark.storage.memory.MemoryStore.putIterator(MemoryStore.scala:222)
           at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:349)
           at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1440)
           at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
           at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
           at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
           at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
           at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
           at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
           at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
           at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
           at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
           at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
           at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
           at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
           at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
           at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
           at org.apache.spark.scheduler.Task.run(Task.scala:131)
           at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
           at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
           at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
           at java.lang.Thread.run(Thread.java:748)
   Caused by: java.lang.IllegalArgumentException: Invalid format: "2022-01-13 16:57:05.659 +01:00"
           at org.joda.time.format.DateTimeFormatter.parseDateTime(DateTimeFormatter.java:945)
           at org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:202)
           at org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:133)
           ... 30 more
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] xushiyan commented on issue #4946: [SUPPORT] TimestampBasedKeyGenerator failed to parse date_string column

Posted by GitBox <gi...@apache.org>.
xushiyan commented on issue #4946:
URL: https://github.com/apache/hudi/issues/4946#issuecomment-1059916802


   @chrischnweiss have you tried verifying the input datetime string works with the datetime format you set?
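
One way to run that check outside of Spark is a small standalone program. The sketch below is only an illustration and makes several assumptions: it uses `java.time` so that it has no dependencies, whereas the stack trace shows that Hudi 0.10.0 parses with Joda-Time, and the two libraries do not treat every pattern letter identically. The class name `DateFormatCheck`, the helper names, and the `XXX` offset letter are hypothetical choices for this sketch, not Hudi settings. The second check wraps the pattern in double quotes because `java.util.Properties` keeps quotes as part of a value; whether Hudi's property loader behaves the same way is an assumption worth verifying.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class DateFormatCheck {

    // Returns true if `input` parses as a date-time with offset under `pattern`.
    static boolean canParse(String pattern, String input) {
        try {
            OffsetDateTime.parse(input, DateTimeFormatter.ofPattern(pattern));
            return true;
        } catch (DateTimeParseException | IllegalArgumentException e) {
            // Either the pattern itself is invalid or the input does not match it.
            return false;
        }
    }

    // Parses `input` with `inputPattern` and renders it with `outputPattern`,
    // similar in spirit to deriving a partition path from a timestamp column.
    static String partitionPath(String inputPattern, String outputPattern, String input) {
        OffsetDateTime parsed =
                OffsetDateTime.parse(input, DateTimeFormatter.ofPattern(inputPattern));
        return parsed.format(DateTimeFormatter.ofPattern(outputPattern));
    }

    public static void main(String[] args) {
        String input = "2022-01-13 16:57:05.659 +01:00";  // sample value from the issue
        String pattern = "yyyy-MM-dd HH:mm:ss.SSS XXX";   // java.time letters, an assumption

        // Pattern without surrounding quotes.
        System.out.println(canParse(pattern, input));

        // Same pattern with literal double quotes kept as part of the value.
        System.out.println(canParse("\"" + pattern + "\"", input));

        // Output format taken from the issue's config, without quotes.
        System.out.println(partitionPath(pattern, "yyyy/MM/dd", input));
    }
}
```

The same experiment can be repeated against Joda-Time with `DateTimeFormat.forPattern(...).parseDateTime(...)`, which is the call that appears in the stack trace, to see how the configured pattern behaves there.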

