Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/02/06 20:16:33 UTC

[GitHub] [hudi] josuexcxc opened a new issue #4754: upsert operation removes duplicates

josuexcxc opened a new issue #4754:
URL: https://github.com/apache/hudi/issues/4754


   My initial Hudi table contains duplicates for several record keys. When I write updates to these duplicate records, Hudi keeps only a single record per key, but I need it to keep the same number of duplicate records.
    
   initial table
   `+---------+-----------+------------+------+------------+
   |AccountID|CreatedDate|ModifiedDate|Amount|CurrencyCode|
   +---------+-----------+------------+------+------------+
   |      500|   22/10/21|    22/10/21|   502|         MXN|
   |      500|   22/10/21|    22/10/21|   502|         MXN|
   |      501|   22/10/21|    22/10/21|  1969|         MXN|
   |      502|   22/10/21|    22/10/21|  1612|         MXN|
   |      503|   22/10/21|    22/10/21|  1559|         MXN|
   |      504|   22/10/21|    22/10/21|  1494|         MXN|
   |      505|   22/10/21|    22/10/21|  1448|         MXN|
   |      506|   22/10/21|    22/10/21|  1059|         USD|
   |      507|   22/10/21|    22/10/21|   795|         USD|
   |      508|   22/10/21|    22/10/21|   822|         USD|
   |      509|   22/10/21|    22/10/21|  1612|         MXN|
   |      510|   22/10/21|    22/10/21|  1578|         MXN|
   |      510|   22/10/21|    22/10/21|  1578|         MXN|
   |      511|   22/10/21|    22/10/21|   709|         USD|
   +---------+-----------+------------+------+------------+`
   upsertDF
   `+---------+-----------+------------+------+------------+
   |AccountID|CreatedDate|ModifiedDate|Amount|CurrencyCode|
   +---------+-----------+------------+------+------------+
   |520      |22/10/21   |22/10/21    |713   |USD         |
   |520      |22/10/21   |22/10/21    |713   |USD         |
   |510      |22/10/21   |22/10/21    |1578  |MXN         |
   |510      |22/10/21   |22/10/21    |1578  |MXN         |
   |500      |22/10/21   |22/10/21    |502   |MXN         |
   |500      |22/10/21   |22/10/21    |502   |MXN         |
   |515      |22/10/21   |22/10/21    |1803  |MXN         |
   +---------+-----------+------------+------+------------+`
   
   Hudi table after applying the upsert operation
   
   `+---------+-----------+------------+------+------------+
   |AccountID|CreatedDate|ModifiedDate|Amount|CurrencyCode|
   +---------+-----------+------------+------+------------+
   |501      |22/10/21   |22/10/21    |1969  |MXN         |
   |502      |22/10/21   |22/10/21    |1612  |MXN         |
   |503      |22/10/21   |22/10/21    |1559  |MXN         |
   |504      |22/10/21   |22/10/21    |1494  |MXN         |
   |505      |22/10/21   |22/10/21    |1448  |MXN         |
   |506      |22/10/21   |22/10/21    |1059  |USD         |
   |507      |22/10/21   |22/10/21    |795   |USD         |
   |508      |22/10/21   |22/10/21    |822   |USD         |
   |509      |22/10/21   |22/10/21    |1612  |MXN         |
   |510      |22/10/21   |23/10/21    |1600  |MXN         |
   |511      |22/10/21   |22/10/21    |709   |USD         |
   +---------+-----------+------------+------+------------+`
   
   As you can see, the Hudi table has only one record with AccountID 510, when there should be two.
   
   **Hudi Configuration**
   table_name = "foto"
   localPath = f"s3://dev-vol-model-zone/mdl_com_ancillaries/public.bt_ancillaries_final/hudi_tables/{table_name}/"
   key_fields = "AccountID,Amount,ModifiedDate"
   precombine_fields = "ModifiedDate"
   partition_fields = ""
   
   hudiOptions = {
       'hoodie.table.name': table_name,
       'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
       'hoodie.datasource.write.recordkey.field': key_fields,
       'hoodie.datasource.write.partitionpath.field': partition_fields,
       'hoodie.datasource.write.precombine.field': precombine_fields,
       "hoodie.datasource.write.table.type": "MERGE_ON_READ",
       }
   
   foto1.write.format('org.apache.hudi') \
       .option('hoodie.datasource.write.operation', 'insert') \
       .options(**hudiOptions) \
       .mode('overwrite') \
       .save(localPath)
   
   # No write operation is set here, so this write uses Hudi's default operation, "upsert".
   upserts.write.format("org.apache.hudi").options(**hudiOptions).mode("append").save(localPath)
   
   Spark: 2.4.7
   EMR: 5.33.0
   HUDI: 0.7.0-amzn-1
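
The behaviour reported above can be pictured with a small pure-Python model. This is only a conceptual sketch, not Hudi's implementation: the helper names `record_key`, `model_upsert` and `model_insert` are hypothetical, the table is modelled as a plain list of dicts, and the precombine field is ignored. It assumes the record key is built from `AccountID`, `Amount` and `ModifiedDate`, as in the configuration above.

```python
# Conceptual model only: these helpers are hypothetical and do not call Hudi.
KEY_FIELDS = ("AccountID", "Amount", "ModifiedDate")  # from key_fields in the issue


def record_key(row):
    """Build the composite key used to decide whether two rows are 'the same'."""
    return tuple(row[field] for field in KEY_FIELDS)


def model_upsert(existing, incoming):
    """Keep exactly one row per record key; later rows replace earlier ones."""
    merged = {}
    for row in existing + incoming:
        merged[record_key(row)] = row
    return list(merged.values())


def model_insert(existing, incoming):
    """Append incoming rows without looking at record keys at all."""
    return existing + incoming


row_510 = {
    "AccountID": 510,
    "CreatedDate": "22/10/21",
    "ModifiedDate": "22/10/21",
    "Amount": 1578,
    "CurrencyCode": "MXN",
}
existing = [dict(row_510), dict(row_510)]  # two identical rows in the initial table
incoming = [dict(row_510), dict(row_510)]  # two identical rows in upsertDF

upserted = model_upsert(existing, incoming)
inserted = model_insert(existing, incoming)

print(len(upserted))  # 1: all four rows share one record key
print(len(inserted))  # 4: nothing is merged, so every row is kept
```

In this model, rows that are identical in every key field are indistinguishable to a key-based merge, so only one survives. A plain append keeps every row, including the copies that were already in the table.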
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan edited a comment on issue #4754: upsert operation removes duplicates

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #4754:
URL: https://github.com/apache/hudi/issues/4754#issuecomment-1032007487


   @josuexcxc : That's how "upsert" works. If you do not want to de-duplicate records or update older versions, you can switch the operation type to "insert". 
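
A minimal sketch of what switching the operation type could look like for the second write in the issue. The helper `with_operation` is hypothetical and only copies the options dict; `hoodie.datasource.write.operation` is the same option key the issue already uses for the initial load. The Spark call is left as a comment because it needs the `upserts` DataFrame and `localPath` from the issue.

```python
def with_operation(base_options, operation):
    """Return a copy of the Hudi options with the write operation set explicitly."""
    options = dict(base_options)  # copy so the shared options dict is not mutated
    options["hoodie.datasource.write.operation"] = operation
    return options


# Same options as in the issue (table_name and key fields taken from there).
hudi_options = {
    "hoodie.table.name": "foto",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.recordkey.field": "AccountID,Amount,ModifiedDate",
    "hoodie.datasource.write.partitionpath.field": "",
    "hoodie.datasource.write.precombine.field": "ModifiedDate",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
}

append_options = with_operation(hudi_options, "insert")

# With Spark available, the second write from the issue would then become:
# upserts.write.format("org.apache.hudi") \
#     .options(**append_options) \
#     .mode("append") \
#     .save(localPath)
```

Setting the operation explicitly on every write also makes it visible which writes rely on key-based merging and which do not.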
   





[GitHub] [hudi] nsivabalan commented on issue #4754: upsert operation removes duplicates

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4754:
URL: https://github.com/apache/hudi/issues/4754#issuecomment-1032007487


   @josuexcxc : That's how "upsert" works. If you wish to not de-dup or update older versions, you can switch the operation type to "insert". 
   





[GitHub] [hudi] nsivabalan commented on issue #4754: upsert operation removes duplicates

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4754:
URL: https://github.com/apache/hudi/issues/4754#issuecomment-1039523557


   @josuexcxc : If the above suggestion worked for you, let us know so that we can close the issue. If not, we can dive deeper. 
   





[GitHub] [hudi] nsivabalan commented on issue #4754: upsert operation removes duplicates

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4754:
URL: https://github.com/apache/hudi/issues/4754#issuecomment-1047218763


   Feel free to re-open if you are looking for any more assistance. 





[GitHub] [hudi] nsivabalan closed issue #4754: upsert operation removes duplicates

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #4754:
URL: https://github.com/apache/hudi/issues/4754


   




