Posted to commits@hudi.apache.org by "berniedurfee-renaissance (via GitHub)" <gi...@apache.org> on 2023/03/17 20:39:08 UTC
[GitHub] [hudi] berniedurfee-renaissance opened a new issue, #8224: [SUPPORT] Deltastreamer ignoring hoodie.datasource.write.precombine.field
berniedurfee-renaissance opened a new issue, #8224:
URL: https://github.com/apache/hudi/issues/8224
I'm running Deltastreamer in AWS EMR Serverless and it seems that Deltastreamer is ignoring `hoodie.datasource.write.precombine.field` in my config file.
**To Reproduce**
Set up a source bucket of parquet files (mine are from AWS DMS): `s3://my-lakehouse/dms-output-raw/schema_1/table_1/`
Add a properties file (`s3://my-lakehouse/deltastreamer-config/deltastreamer.properties`):
```
hoodie.schema.on.read.enable = true
hoodie.datasource.write.recordkey.field = origin_schema,id
hoodie.datasource.write.precombine.field = updated_at
hoodie.datasource.write.partitionpath.field = origin_schema
hoodie.datasource.write.keygenerator.class = org.apache.hudi.keygen.ComplexKeyGenerator
hoodie.deltastreamer.source.dfs.root=s3://my-lakehouse/dms-output-raw/schema_1/table_1/
```
Submit a job to EMR Serverless 6.10.0
- Script Location: `s3://my-lakehouse/deltastreamer-jar/hudi-utilities-slim-bundle_2.12-0.13.0.jar`
- Main Class: `org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer`
- Script arguments
```
["--table-type","COPY_ON_WRITE","--target-base-path","s3://my-lakehouse/deltastreamer-out","--target-table","table1","--source-class","org.apache.hudi.utilities.sources.ParquetDFSSource","--props","s3://my-lakehouse/deltastreamer-config/deltastreamer.properties"]
```
- Properties key 1: `spark.serializer` = `org.apache.spark.serializer.KryoSerializer`
- Properties key 2: `spark.jars` = `s3://ec-lakehouse-qa/deltastreamer-jar/hudi-spark3-bundle_2.12-0.13.0.jar`
Everything else is default. I tried 0.12 and 0.13 and got the same result.
**Expected behavior**
Rows from source are upserted to destination.
Instead, Deltastreamer fails, and when I look at `/my-lakehouse/deltastreamer-out/mytable1/.hoodie/hoodie.properties` I can see that `hoodie.table.precombine.field=ts`. That should be `hoodie.table.precombine.field=updated_at`, because that's what's in the properties file, right?
**Environment Description**
* Hudi version : 0.13
* Spark version : 3.3.1
* Hive version : Not sure for EMR Serverless 6.10.0
* Hadoop version : Not sure for EMR Serverless 6.10.0
* Storage (HDFS/S3/GCS..) : S3
* Running on Docker? (yes/no) : EMR Serverless 6.10.0
**Stacktrace**
```
Job failed, please check complete logs in configured logging destination. ExitCode: 1. Last few exceptions:
Caused by: org.apache.hudi.exception.HoodieException: ts(Part -ts) field not found in record. Acceptable fields were :[xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, created_at, updated_at, origin_schema, origin_table]
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5) ([2600:1f13:4df:c101:e7b0:42bd:e2f6:f474] executor 1): org.apache.hudi.exception.HoodieException: ts(Part -ts) field not found in record. Acceptable fields were :[xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, created_at, updated_at, origin_schema, origin_table]
Caused by: org.apache.hudi.exception.HoodieException: ts(Part -ts) field not found in record. Acceptable fields were :[xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, created_at, updated_at, origin_schema, origin_table]
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 5) ([2600:1f13:4df:c101:e7b0:42bd:e2f6:f474] executor 1): org.apache.hudi.exception.HoodieException: ts(Part -ts) field not found in record. Acceptable fields were :[xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, xxx, created_at, updated_at, origin_schema, origin_table]
23/03/17 20:13:39 ERROR HoodieDeltaStreamer: Got error running delta sync once. Shutting down...
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [hudi] berniedurfee-renaissance closed issue #8224: [SUPPORT] Deltastreamer ignoring hoodie.datasource.write.precombine.field
Posted by "berniedurfee-renaissance (via GitHub)" <gi...@apache.org>.
berniedurfee-renaissance closed issue #8224: [SUPPORT] Deltastreamer ignoring hoodie.datasource.write.precombine.field
URL: https://github.com/apache/hudi/issues/8224
[GitHub] [hudi] Zouxxyy commented on issue #8224: [SUPPORT] Deltastreamer ignoring hoodie.datasource.write.precombine.field
Posted by "Zouxxyy (via GitHub)" <gi...@apache.org>.
Zouxxyy commented on issue #8224:
URL: https://github.com/apache/hudi/issues/8224#issuecomment-1474840519
You can try `--source-ordering-field`.
![image](https://user-images.githubusercontent.com/37108074/226106704-24315c47-dd47-44fe-895e-b51344195050.png)
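The `ts` in the error message appears to match the default value of Deltastreamer's `--source-ordering-field` option, which would explain why the table ends up with `hoodie.table.precombine.field=ts` when the flag is not passed. A minimal sketch of the adjusted script arguments, assuming the same arguments as in the original report and `updated_at` as the ordering column; the helper name `with_source_ordering_field` is hypothetical and only there for illustration:

```python
import json

# Script arguments copied from the original report (unchanged).
base_args = [
    "--table-type", "COPY_ON_WRITE",
    "--target-base-path", "s3://my-lakehouse/deltastreamer-out",
    "--target-table", "table1",
    "--source-class", "org.apache.hudi.utilities.sources.ParquetDFSSource",
    "--props", "s3://my-lakehouse/deltastreamer-config/deltastreamer.properties",
]


def with_source_ordering_field(args, field):
    """Hypothetical helper: return a copy of args with --source-ordering-field set.

    Assumes that, if the flag is already present, it is followed by its value.
    """
    flag = "--source-ordering-field"
    result = list(args)
    if flag in result:
        # Overwrite the existing value instead of adding the flag twice.
        result[result.index(flag) + 1] = field
    else:
        result += [flag, field]
    return result


script_args = with_source_ordering_field(base_args, "updated_at")

# EMR Serverless takes the script arguments as a JSON array.
print(json.dumps(script_args))
```

The printed array can be pasted into the "Script arguments" field of the job. If the target path already contains a `.hoodie/hoodie.properties` from a failed run, it may still carry the old precombine value, so starting from a clean target path is probably the safest way to verify the change.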
[GitHub] [hudi] berniedurfee-renaissance commented on issue #8224: [SUPPORT] Deltastreamer ignoring hoodie.datasource.write.precombine.field
Posted by "berniedurfee-renaissance (via GitHub)" <gi...@apache.org>.
berniedurfee-renaissance commented on issue #8224:
URL: https://github.com/apache/hudi/issues/8224#issuecomment-1474387765
I tried this on `emr-6.9.0` and got the same result.
The other settings from my properties file are being propagated over to the file in `.hoodie` like `hoodie.table.partition.fields=origin_schema` and `hoodie.table.recordkey.fields=origin_schema,id`.
It's just that the `hoodie.datasource.write.precombine.field = updated_at` setting in my properties file ends up as `hoodie.table.precombine.field=ts` in the table properties file.
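A rough way to spot this kind of mismatch is to compare the two files key by key. The sketch below assumes both files have been copied locally as plain `key = value` text, and the mapping between writer keys and table keys is only what was observed in this thread, not an official list; `parse_properties`, `KEY_MAP`, and `find_mismatches` are hypothetical names:

```python
def parse_properties(text):
    """Parse simple `key = value` lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


# Writer-side key -> table-side key, as observed in this thread (assumption).
KEY_MAP = {
    "hoodie.datasource.write.recordkey.field": "hoodie.table.recordkey.fields",
    "hoodie.datasource.write.partitionpath.field": "hoodie.table.partition.fields",
    "hoodie.datasource.write.precombine.field": "hoodie.table.precombine.field",
}


def find_mismatches(writer_props, table_props):
    """Return {writer_key: (writer_value, table_value)} for values that differ."""
    return {
        wkey: (writer_props[wkey], table_props.get(tkey))
        for wkey, tkey in KEY_MAP.items()
        if wkey in writer_props and writer_props[wkey] != table_props.get(tkey)
    }


# Values taken from the properties file and the table properties in this issue.
writer = parse_properties("""
hoodie.datasource.write.recordkey.field = origin_schema,id
hoodie.datasource.write.precombine.field = updated_at
hoodie.datasource.write.partitionpath.field = origin_schema
""")
table = parse_properties("""
hoodie.table.recordkey.fields=origin_schema,id
hoodie.table.partition.fields=origin_schema
hoodie.table.precombine.field=ts
""")

mismatches = find_mismatches(writer, table)
print(mismatches)
```

With the values from this issue, only the precombine field is reported as different (`updated_at` in the properties file, `ts` in the table).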
[GitHub] [hudi] berniedurfee-renaissance commented on issue #8224: [SUPPORT] Deltastreamer ignoring hoodie.datasource.write.precombine.field
Posted by "berniedurfee-renaissance (via GitHub)" <gi...@apache.org>.
berniedurfee-renaissance commented on issue #8224:
URL: https://github.com/apache/hudi/issues/8224#issuecomment-1474389849
Also, no data is written to the table.
[GitHub] [hudi] berniedurfee-renaissance commented on issue #8224: [SUPPORT] Deltastreamer ignoring hoodie.datasource.write.precombine.field
Posted by "berniedurfee-renaissance (via GitHub)" <gi...@apache.org>.
berniedurfee-renaissance commented on issue #8224:
URL: https://github.com/apache/hudi/issues/8224#issuecomment-1474390249
Also, also, the target table doesn't exist before the run.
[GitHub] [hudi] berniedurfee-renaissance commented on issue #8224: [SUPPORT] Deltastreamer ignoring hoodie.datasource.write.precombine.field
Posted by "berniedurfee-renaissance (via GitHub)" <gi...@apache.org>.
berniedurfee-renaissance commented on issue #8224:
URL: https://github.com/apache/hudi/issues/8224#issuecomment-1475533413
That worked, thank you!