Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/07/06 21:17:02 UTC

[GitHub] [hudi] joaqs190 opened a new issue #1803: [SUPPORT] hoodie.datasource.write.precombine.field is ignored

joaqs190 opened a new issue #1803:
URL: https://github.com/apache/hudi/issues/1803


   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://cwiki.apache.org/confluence/display/HUDI/FAQ)?
   
   - Join the mailing list to engage in conversations and get faster support at dev-subscribe@hudi.apache.org.
   
   - If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   Hi Hudi team! 
   
   The records in my use case need to leverage hoodie.datasource.write.precombine.field. The records have a composite key, and there are often multiple records with the same key and the same timestamp; the precombine field is used to break those ties.
   
   In tests with 0.5.2 and 0.6.0, the precombine field is not taken into account, and the last update written is an intermediate value. See the example below.
   
   Example:
   
   Output of the records in S3 generated by AWS DMS:
   
   Record 1:
    "Op": "U",
    "timestamp": "2020-07-06 18:57:47.000000",
    "items": 61
   
   Record 2:
    "Op": "U",
    "timestamp": "2020-07-06 18:57:48.000000",
    "items": 62
   
   Record 3:
    "Op": "U",
    "timestamp": "2020-07-06 18:57:52.000000",
    "items": 63
   
   Record 4:
    "Op": "U",
    "timestamp": "2020-07-06 18:57:52.000000",
    "items": 64
   
   Record 5:
    "Op": "U",
    "timestamp": "2020-07-06 18:57:52.000000",
    "items": 65
   
   If I inspect the Hudi dataset from within Spark, Record 3 ("items" set to 63) was written to the dataset, but not Record 5 ("items" set to 65).
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Follow https://cwiki.apache.org/confluence/display/HUDI/2020/01/20/Change+Capture+Using+AWS+Database+Migration+Service+and+Hudi
   2. Add a SQL transform that extracts a unique number from the input file (this number already exists in a column in the dataset and is unique; the transform only copies it into its own column)
   
   
   **Expected behavior**
   
   
   Record 5 from the example above should have been the final value for the record key. I expected Deltastreamer to order records with the same record key and timestamp by the precombine field. Instead, Deltastreamer uses the first record for that specific timestamp and record key and ignores subsequent records with higher precombine field values.
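   The expected tie-breaking described above can be sketched in a few lines. This is a minimal illustration, not Hudi code: it mimics the latest-wins semantics of Hudi's default payload (`OverwriteWithLatestAvroPayload.preCombine`, which keeps the record with the greater ordering value). The field name `seq` is hypothetical, standing in for the unique number extracted by the SQL transform.

   ```python
   from functools import reduce

   # Hypothetical records sharing the same record key and timestamp;
   # "seq" is the precombine/ordering field (names are illustrative).
   records = [
       {"timestamp": "2020-07-06 18:57:52.000000", "items": 63, "seq": 3},
       {"timestamp": "2020-07-06 18:57:52.000000", "items": 64, "seq": 4},
       {"timestamp": "2020-07-06 18:57:52.000000", "items": 65, "seq": 5},
   ]

   def precombine(a, b):
       # Keep the record with the larger ordering value (latest wins).
       return a if a["seq"] >= b["seq"] else b

   winner = reduce(precombine, records)
   print(winner["items"])  # 65 -- Record 5 should win the tie
   ```

   With the ordering field correctly configured, Record 5 (seq = 5) wins; the reported behavior (Record 3 winning) is what you get when no ordering field is applied.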
   
   **Environment Description**
   EMR
   * Hudi version :
   0.5.2 and 0.6.0
   * Spark version :
   2.4.5
   * Hive version :
   x
   * Hadoop version :
   x
   * Storage (HDFS/S3/GCS..) :
   S3
   * Running on Docker? (yes/no) :
   no
   
   **Additional context**
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] joaqs190 closed issue #1803: [SUPPORT] hoodie.datasource.write.precombine.field is ignored

Posted by GitBox <gi...@apache.org>.
joaqs190 closed issue #1803:
URL: https://github.com/apache/hudi/issues/1803


   





[GitHub] [hudi] joaqs190 commented on issue #1803: [SUPPORT] hoodie.datasource.write.precombine.field is ignored

Posted by GitBox <gi...@apache.org>.
joaqs190 commented on issue #1803:
URL: https://github.com/apache/hudi/issues/1803#issuecomment-654768311


   Thank you @bhasudha. That explains it; all set, closing.





[GitHub] [hudi] bhasudha commented on issue #1803: [SUPPORT] hoodie.datasource.write.precombine.field is ignored

Posted by GitBox <gi...@apache.org>.
bhasudha commented on issue #1803:
URL: https://github.com/apache/hudi/issues/1803#issuecomment-654521251


   @joaqs190 quick questions:
   
   1. Could you describe what the precombine field is here?
   2. Hudi has two ways of writing: the Spark datasource writer and Deltastreamer. For Deltastreamer, the precombine field is configured via `--source-ordering-field`. Can you confirm that this is what you are configuring?
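   To make the distinction concrete, here is a hedged sketch of where the ordering field is set on each write path. The config keys and flag name come from the Hudi docs; the field values (`id`, `seq`) are hypothetical placeholders for this dataset.

   ```python
   # Spark datasource writer: the precombine field is a write option.
   datasource_opts = {
       "hoodie.datasource.write.recordkey.field": "id",    # hypothetical key column
       "hoodie.datasource.write.precombine.field": "seq",  # hypothetical ordering column
   }

   # Deltastreamer: the equivalent is a command-line flag, not the
   # datasource option above (which Deltastreamer does not read).
   deltastreamer_args = ["--source-ordering-field", "seq"]
   ```

   In other words, setting `hoodie.datasource.write.precombine.field` has no effect on a Deltastreamer run; `--source-ordering-field` must be passed instead.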
     

