Posted to commits@hudi.apache.org by "berniedurfee-renaissance (via GitHub)" <gi...@apache.org> on 2023/03/17 20:39:08 UTC

[GitHub] [hudi] berniedurfee-renaissance opened a new issue, #8224: [SUPPORT] Deltastreamer ignoring hoodie.datasource.write.precombine.field

berniedurfee-renaissance opened a new issue, #8224:
URL: https://github.com/apache/hudi/issues/8224

   I'm running Deltastreamer in AWS EMR Serverless and it seems that Deltastreamer is ignoring `hoodie.datasource.write.precombine.field` in my config file.
    
   **To Reproduce**
   
   Set up a source bucket of Parquet files (mine are from AWS DMS): `s3://my-lakehouse/dms-output-raw/schema_1/table_1/`
   
   Add a properties file (`s3://my-lakehouse/deltastreamer-config/deltastreamer.properties`): 
   ```
   hoodie.schema.on.read.enable = true
   hoodie.datasource.write.recordkey.field = origin_schema,id
   hoodie.datasource.write.precombine.field = updated_at
   hoodie.datasource.write.partitionpath.field = origin_schema
   hoodie.datasource.write.keygenerator.class = org.apache.hudi.keygen.ComplexKeyGenerator
   hoodie.deltastreamer.source.dfs.root=s3://my-lakehouse/dms-output-raw/schema_1/table_1/
   ```
   
   Submit a job to EMR Serverless 6.10.0
   - Script Location: `s3://my-lakehouse/deltastreamer-jar/hudi-utilities-slim-bundle_2.12-0.13.0.jar`
   - Main Class: `org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer`
   - Script arguments
   ```
   ["--table-type","COPY_ON_WRITE","--target-base-path","s3://my-lakehouse/deltastreamer-out","--target-table","table1","--source-class","org.apache.hudi.utilities.sources.ParquetDFSSource","--props","s3://my-lakehouse/deltastreamer-config/deltastreamer.properties"]
   ```
   - Properties key 1: `spark.serializer` = `org.apache.spark.serializer.KryoSerializer`
   - Properties key 2: `spark.jars` = `s3://ec-lakehouse-qa/deltastreamer-jar/hudi-spark3-bundle_2.12-0.13.0.jar`
   
   Everything else is left at its default. I tried Hudi 0.12 and 0.13 and got the same result.
   
   **Expected behavior**
   
   Rows from source are upserted to destination.
   
   Instead, Deltastreamer fails, and when I look at `/my-lakehouse/deltastreamer-out/mytable1/.hoodie/hoodie.properties` I can see `hoodie.table.precombine.field=ts`. Shouldn't that be `hoodie.table.precombine.field=updated_at`, since that is what the properties file specifies?
   
   **Environment Description**
   
   * Hudi version : 0.13
   
   * Spark version : 3.3.1
   
   * Hive version : Not sure for EMR Serverless 6.10.0
   
   * Hadoop version : Not sure for EMR Serverless 6.10.0
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : EMR Serverless 6.10.0
   
   
   **Stacktrace**
   
   ```
   Job failed, please check complete logs in configured logging destination. ExitCode: 1. Last few exceptions:
   Caused by: org.apache.hudi.exception.HoodieException: ts(Part -ts) field not found in record. Acceptable fields were :[xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, created_at, updated_at, origin_schema, origin_table]
   Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5) ([2600:1f13:4df:c101:e7b0:42bd:e2f6:f474] executor 1): org.apache.hudi.exception.HoodieException: ts(Part -ts) field not found in record. Acceptable fields were :[xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, created_at, updated_at, origin_schema, origin_table]
   Caused by: org.apache.hudi.exception.HoodieException: ts(Part -ts) field not found in record. Acceptable fields were :[xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, created_at, updated_at, origin_schema, origin_table]
   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5) ([2600:1f13:4df:c101:e7b0:42bd:e2f6:f474] executor 1): org.apache.hudi.exception.HoodieException: ts(Part -ts) field not found in record. Acceptable fields were :[xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, created_at, updated_at, origin_schema, origin_table]
   23/03/17 20:13:39 ERROR HoodieDeltaStreamer: Got error running delta sync once. Shutting down...
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] berniedurfee-renaissance closed issue #8224: [SUPPORT] Deltastreamer ignoring hoodie.datasource.write.precombine.field

Posted by "berniedurfee-renaissance (via GitHub)" <gi...@apache.org>.
berniedurfee-renaissance closed issue #8224: [SUPPORT] Deltastreamer ignoring hoodie.datasource.write.precombine.field
URL: https://github.com/apache/hudi/issues/8224



[GitHub] [hudi] Zouxxyy commented on issue #8224: [SUPPORT] Deltastreamer ignoring hoodie.datasource.write.precombine.field

Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on issue #8224:
URL: https://github.com/apache/hudi/issues/8224#issuecomment-1474840519

   You can try the `--source-ordering-field` option.
   ![image](https://user-images.githubusercontent.com/37108074/226106704-24315c47-dd47-44fe-895e-b51344195050.png)
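
A minimal sketch of the suggested change, assuming the job is otherwise submitted exactly as in the original report. Deltastreamer's `--source-ordering-field` option defaults to `ts`, which is consistent with the `ts` value seen in `hoodie.properties` and in the error message. The helper name `with_source_ordering_field` is hypothetical and exists only for illustration; the argument values are copied from the report, and `updated_at` is the column the reporter intended as the precombine field.

```python
import json

# Script arguments copied from the original report.
original_args = [
    "--table-type", "COPY_ON_WRITE",
    "--target-base-path", "s3://my-lakehouse/deltastreamer-out",
    "--target-table", "table1",
    "--source-class", "org.apache.hudi.utilities.sources.ParquetDFSSource",
    "--props", "s3://my-lakehouse/deltastreamer-config/deltastreamer.properties",
]


def with_source_ordering_field(args, field):
    """Return a copy of args with --source-ordering-field set to field.

    Hypothetical helper for illustration. It assumes that, if the flag is
    already present, it is immediately followed by its value.
    """
    flag = "--source-ordering-field"
    updated = list(args)  # copy, so the caller's list is not modified
    if flag in updated:
        updated[updated.index(flag) + 1] = field
    else:
        updated += [flag, field]
    return updated


# Use the same column that the properties file names as the precombine field.
fixed_args = with_source_ordering_field(original_args, "updated_at")

# Print the list as a JSON array, the format used for the script arguments above.
print(json.dumps(fixed_args))
```

The printed JSON array can be used as the EMR Serverless script arguments in place of the original list.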
   



[GitHub] [hudi] berniedurfee-renaissance commented on issue #8224: [SUPPORT] Deltastreamer ignoring hoodie.datasource.write.precombine.field

Posted by "berniedurfee-renaissance (via GitHub)" <gi...@apache.org>.
berniedurfee-renaissance commented on issue #8224:
URL: https://github.com/apache/hudi/issues/8224#issuecomment-1474387765

   I tried this on `emr-6.9.0` and got the same result.
   
   The other settings from my properties file are being propagated over to the file in `.hoodie` like `hoodie.table.partition.fields=origin_schema` and `hoodie.table.recordkey.fields=origin_schema,id`.
   
   It's just that the `hoodie.datasource.write.precombine.field = updated_at` setting in my properties file ends up as `hoodie.table.precombine.field=ts` in the table properties file.



[GitHub] [hudi] berniedurfee-renaissance commented on issue #8224: [SUPPORT] Deltastreamer ignoring hoodie.datasource.write.precombine.field

Posted by "berniedurfee-renaissance (via GitHub)" <gi...@apache.org>.
berniedurfee-renaissance commented on issue #8224:
URL: https://github.com/apache/hudi/issues/8224#issuecomment-1474389849

   Also, no data is written to the table.



[GitHub] [hudi] berniedurfee-renaissance commented on issue #8224: [SUPPORT] Deltastreamer ignoring hoodie.datasource.write.precombine.field

Posted by "berniedurfee-renaissance (via GitHub)" <gi...@apache.org>.
berniedurfee-renaissance commented on issue #8224:
URL: https://github.com/apache/hudi/issues/8224#issuecomment-1474390249

   Also, also, the target table doesn't exist before the run.



[GitHub] [hudi] berniedurfee-renaissance commented on issue #8224: [SUPPORT] Deltastreamer ignoring hoodie.datasource.write.precombine.field

Posted by "berniedurfee-renaissance (via GitHub)" <gi...@apache.org>.
berniedurfee-renaissance commented on issue #8224:
URL: https://github.com/apache/hudi/issues/8224#issuecomment-1475533413

   That worked, thank you!

