Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/02/09 17:29:13 UTC

[GitHub] [hudi] geoffroyatkwiff opened a new issue #4778: [SUPPORT] Row with _hoodie_is_deleted=True stored into target table

geoffroyatkwiff opened a new issue #4778:
URL: https://github.com/apache/hudi/issues/4778


   **Describe the problem you faced**
   
   I am using the `_hoodie_is_deleted` column, but some rows that shouldn't be written to the target table are written anyway.
   When both the "Insert" and the "Delete" for the same row are in the same "source" parquet file, the "Delete" is stored in the target table.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create a table using one parquet file (containing only Inserts) as the source
   2. Generate a parquet file that stores the following incremental changes:
       - a Delete for one of the rows currently in the target table
       - an Insert of a new row
       - a Delete of this new row that was just inserted
   3. Set the `_hoodie_is_deleted` column accordingly and process/write the dataframe to the target table. Following the above, the rows will have these values in the column, respectively: True, False, True
   4. The row that already was in the target table and deleted in the last update is indeed deleted
   5. This other row whose `Insert` and `Delete` operations were stored in the same source parquet file (and so the Insert and Delete are in the same dataframe that was just processed) is present in the target table.
   
   **Expected behavior**
   
   The row whose `Insert` and `Delete` operations were stored in the same source parquet file shouldn't be written to the target table.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] geoffroyatkwiff commented on issue #4778: [SUPPORT] Row with _hoodie_is_deleted=True stored into target table

Posted by GitBox <gi...@apache.org>.
geoffroyatkwiff commented on issue #4778:
URL: https://github.com/apache/hudi/issues/4778#issuecomment-1036073470


   Hey @nsivabalan , I'm using `upsert`. I'm using a timestamp for `hoodie.datasource.write.precombine.field`.
   So, when the dataframe being processed contains an Insert and a Delete for the same record (same key value), only the Delete is taken into account, and since there is nothing to delete in the target table, it looks as if the Delete itself is being inserted.
   Any advice on the best way to handle this case?
   Thank you
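
One possible way to handle this case is to cancel such pairs out before the upsert. The sketch below is only an illustration in plain Python, not a Hudi feature: it assumes the source rows carry a hypothetical `op` field (`"I"`, `"U"`, `"D"`), a hypothetical key column `id`, and a timestamp `ts` used as the precombine field. A key is dropped entirely when it was inserted in this batch and its latest record is a delete.

```python
from itertools import groupby
from operator import itemgetter


def cancel_insert_delete_pairs(batch):
    """Return one row per key to upsert, dropping keys that were both
    inserted and finally deleted inside this same batch."""
    out = []
    # Sort by key, then by timestamp, so each group is in event order.
    rows = sorted(batch, key=itemgetter("id", "ts"))
    for _, group in groupby(rows, key=itemgetter("id")):
        group = list(group)
        latest = group[-1]
        inserted_here = any(r["op"] == "I" for r in group)
        if inserted_here and latest["_hoodie_is_deleted"]:
            continue  # insert and delete cancel out: write nothing
        out.append(latest)
    return out


batch = [
    {"id": 1, "ts": 10, "op": "D", "_hoodie_is_deleted": True},   # delete of an existing row
    {"id": 2, "ts": 11, "op": "I", "_hoodie_is_deleted": False},  # insert of a new row
    {"id": 2, "ts": 12, "op": "D", "_hoodie_is_deleted": True},   # delete of that new row
]
result = cancel_insert_delete_pairs(batch)
```

Here only the delete for `id` 1 survives and would be passed on to the upsert. The same filter could be expressed on a dataframe with a window over the key; the `op` field is needed because, without it, a delete that follows an update of an existing row cannot be told apart from a delete that follows a fresh insert.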





[GitHub] [hudi] nsivabalan commented on issue #4778: [SUPPORT] Row with _hoodie_is_deleted=True stored into target table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4778:
URL: https://github.com/apache/hudi/issues/4778#issuecomment-1039519922


   Yes. Unfortunately, this behavior can't be fixed easily.
   For example:
   1. insert rec1 at time t0
   2. delete rec1 at time t100
   3. re-insert rec1 at time t100000
   
   Let's say that between (2) and (3) there were hundreds of commits. Hudi may not keep remembering every record it has ever seen. So after (2), whenever the next update happens, Hudi will remove rec1 from its storage, and if rec1 is ingested again later, Hudi will consider it an insert record.
   
   Otherwise, Hudi would have to keep track of every record that was ever inserted and deleted, forever.
   I don't think that makes sense for a large analytical storage system.
   
   
   Regarding your statement "But the target table will contain a row with the values from the delete, whereas this row should not be inserted into the target table in any way (same as above).": the delete record will be ingested into Hudi, but with `_hoodie_is_deleted` set to true. During the next merge or compaction, the record will be removed, so it will be part of storage only for a short duration. Also, there are other ways to trigger deletes; please check out the details [here](https://hudi.apache.org/blog/2020/01/15/delete-support-in-hudi/). Not all of them store the value with `_hoodie_is_deleted` set to true.
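
The three-step timeline above can be sketched with a toy model. This is not Hudi's implementation: `ToyTable` is a hypothetical class that keeps only live rows and no history of deleted keys, and it removes a deleted row immediately rather than at the next merge or compaction. It only shows why a re-ingested key looks like a plain insert.

```python
class ToyTable:
    """Toy key-value table that keeps no memory of deleted keys."""

    def __init__(self):
        self.rows = {}

    def upsert(self, batch):
        """Apply a batch and report how each key was treated."""
        actions = {}
        for row in batch:
            key = row["id"]
            if row["_hoodie_is_deleted"]:
                existed = self.rows.pop(key, None) is not None
                actions[key] = "delete" if existed else "delete of unknown key"
            else:
                actions[key] = "update" if key in self.rows else "insert"
                self.rows[key] = row
        return actions


table = ToyTable()
first = table.upsert([{"id": "rec1", "ts": 0, "_hoodie_is_deleted": False}])
second = table.upsert([{"id": "rec1", "ts": 100, "_hoodie_is_deleted": True}])
third = table.upsert([{"id": "rec1", "ts": 100000, "_hoodie_is_deleted": False}])
```

In this model `third` reports `rec1` as an insert, because nothing about the earlier delete was retained. A delete for a key the table has never seen has nothing to remove, which corresponds to the case raised in this issue.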
   
   
   





[GitHub] [hudi] geoffroyatkwiff commented on issue #4778: [SUPPORT] Row with _hoodie_is_deleted=True stored into target table

Posted by GitBox <gi...@apache.org>.
geoffroyatkwiff commented on issue #4778:
URL: https://github.com/apache/hudi/issues/4778#issuecomment-1038874131


   Yes, it does seem to take precedence, but doesn't work as I would expect:
   - if you process a file that contains a delete for a row that's already in the target table, it is simply going to delete this row, which is what I would expect so that's totally fine.
   - if you process a file that contains an insert, and the delete for this same row, then as you said, the delete takes precedence, yes. But the target table will contain a row with the values from the delete, whereas this row should not be inserted into the target table in any way (same as above).





[GitHub] [hudi] geoffroyatkwiff commented on issue #4778: [SUPPORT] Row with _hoodie_is_deleted=True stored into target table

Posted by GitBox <gi...@apache.org>.
geoffroyatkwiff commented on issue #4778:
URL: https://github.com/apache/hudi/issues/4778#issuecomment-1036325464


   > Sorry, is your requirement that if we have an insert and a delete for the same record within one batch that's being ingested, you prefer the final snapshot in Hudi to show the insert record and not the delete record?
   
   Hi @nsivabalan , no, I would expect the two to cancel each other out. For example: say someone creates a new user in a table, but realises they already had them under another name. They delete this user that was just created 1 minute before, and wouldn't expect the target table to show a row related to this duplicate user.





[GitHub] [hudi] nsivabalan commented on issue #4778: [SUPPORT] Row with _hoodie_is_deleted=True stored into target table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4778:
URL: https://github.com/apache/hudi/issues/4778#issuecomment-1036303695


   Sorry, is your requirement that if we have an insert and a delete for the same record within one batch that's being ingested, you prefer the final snapshot in Hudi to show the insert record and not the delete record?





[GitHub] [hudi] nsivabalan commented on issue #4778: [SUPPORT] Row with _hoodie_is_deleted=True stored into target table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4778:
URL: https://github.com/apache/hudi/issues/4778#issuecomment-1047252512


   If you don't have any follow-up questions, I will go ahead and close the issue for now. Do let us know if you need any more clarifications. Thanks!





[GitHub] [hudi] nsivabalan commented on issue #4778: [SUPPORT] Row with _hoodie_is_deleted=True stored into target table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4778:
URL: https://github.com/apache/hudi/issues/4778#issuecomment-1035624092


   Which operation are you using in step 2 above: "insert" or "upsert"? For upsert, records are deduplicated based on the preCombine value before being ingested into Hudi. For inserts, you have to enable a config for this: https://hudi.apache.org/docs/configurations/#hoodiecombinebeforeinsert
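
As a rough illustration of the two setups, the write options might look like the following. The table name and the column names `id` and `ts` are placeholders, and the option behind the linked anchor is assumed to be `hoodie.combine.before.insert`.

```python
# Options for an upsert: records in the batch are deduplicated on the
# precombine field before being written.
upsert_options = {
    "hoodie.table.name": "users",                     # placeholder table name
    "hoodie.datasource.write.recordkey.field": "id",  # placeholder key column
    "hoodie.datasource.write.precombine.field": "ts", # placeholder timestamp column
    "hoodie.datasource.write.operation": "upsert",
}

# Options for an insert: deduplication has to be switched on explicitly.
insert_options = {
    **upsert_options,
    "hoodie.datasource.write.operation": "insert",
    "hoodie.combine.before.insert": "true",
}

# With a Spark dataframe `df` and a table path `base_path` (both assumed),
# the options would typically be passed like this:
# df.write.format("hudi").options(**insert_options).mode("append").save(base_path)
```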
   





[GitHub] [hudi] nsivabalan closed issue #4778: [SUPPORT] Row with _hoodie_is_deleted=True stored into target table

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #4778:
URL: https://github.com/apache/hudi/issues/4778


   





[GitHub] [hudi] nsivabalan commented on issue #4778: [SUPPORT] Row with _hoodie_is_deleted=True stored into target table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4778:
URL: https://github.com/apache/hudi/issues/4778#issuecomment-1036305248


   I guess that during dedup, Hudi goes by the preCombine field value: whichever record has the higher preCombine value wins.





[GitHub] [hudi] nsivabalan commented on issue #4778: [SUPPORT] Row with _hoodie_is_deleted=True stored into target table

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #4778:
URL: https://github.com/apache/hudi/issues/4778#issuecomment-1036526112


   If the precombine value in the delete record is higher than in the insert record, then the delete will take precedence. If not, the insert will take precedence.
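
The rule above can be sketched in plain Python. This is a simplified model, not Hudi's code: the field names `id` and `ts` are placeholders, and the tie handling (the record seen first wins on equal values) is an assumption.

```python
def precombine(batch, key="id", order="ts"):
    """Keep one record per key: the one with the highest `order` value.
    On equal values the record seen first is kept (an assumption)."""
    winners = {}
    for row in batch:
        current = winners.get(row[key])
        if current is None or row[order] > current[order]:
            winners[row[key]] = row
    return list(winners.values())


insert = {"id": 7, "ts": 1, "_hoodie_is_deleted": False}
delete = {"id": 7, "ts": 2, "_hoodie_is_deleted": True}

# The delete has the higher precombine value, so it wins.
delete_wins = precombine([insert, delete])

# If the insert carries the higher value instead, the insert wins.
insert_wins = precombine([{**insert, "ts": 3}, delete])
```

In the first case the single surviving record is the delete, which is the record that then reaches the table, as discussed in this issue.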
   
   

