Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/12/19 16:19:05 UTC

[GitHub] [iceberg] arunb2w commented on issue #6453: Iceberg delete-append causing snapshot error

arunb2w commented on issue #6453:
URL: https://github.com/apache/iceberg/issues/6453#issuecomment-1357910955

   `Input_df` is used in both delete_sql and inflate_sql. Basically, I am just creating a temp view from my input events that need to be applied to the Iceberg table; I aliased that view as **source**.
   Instead of a merge, I am performing a delete and an insert for the update events I receive (which are in input_df).
   
   So I am joining input_df with the target to delete the matching rows in the target, and then I will insert the deleted rows using append with the latest values, so that the end result is equivalent to a merge.
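   A hedged sketch of what such a join-based delete could look like in Spark SQL (the table and column names here are placeholders, since the actual delete_sql is not shown):
   
   ```
   -- delete every target row that has a matching update event in source
   DELETE FROM target t
   WHERE EXISTS (SELECT 1 FROM source s WHERE s.id = t.id)
   ```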
   
   Before the delete, I prepare a dataframe holding the full-row representation of the target with the latest values. The inflate_sql portion does that by joining with the target again; whether a field value is taken from the target or the source is decided by the changed_cols field, which tracks the list of fields that changed for a particular record.
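   A rough sketch of how such an inflate query might pick each field based on changed_cols (all names are illustrative; the actual inflate_sql is not shown in the post):
   
   ```
   SELECT t.id,
          CASE WHEN array_contains(s.changed_cols, 'name')
               THEN s.name ELSE t.name END AS name,
          CASE WHEN array_contains(s.changed_cols, 'key')
               THEN s.key ELSE t.key END AS key,
          CASE WHEN array_contains(s.changed_cols, 'value')
               THEN s.value ELSE t.value END AS value
   FROM target t
   JOIN source s
     ON s.id = t.id
   ```
   
   This assumes changed_cols is an array of column names carried on the source side; the real precedence logic may differ.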
   
   For example, if my target Iceberg table has
   ```
   id, name, key, value
   1 name1 key1 value1
   2 name2 key2 value2
   3 name3 key3 value3
   ```
   
   And my input_df has
   ```
   1 newname1 newvalue1
   2 newvalue2
   ```
   
   Using inflate_sql, I prepare a dataframe (inflated_df) like this
   ```
   1 newname1 key1 newvalue1
   2 name2 key2 newvalue2
   ```
   
   Then I delete those same two rows with ids 1 and 2, and afterwards call append with the inflated_df dataframe, which adds the records with the latest values so that the end result is equivalent to a merge.
   
   The problem I am facing is that the rows get deleted but are never inserted back, and the likely reason is Spark's lazy execution. Even though I prepare inflated_df well before the delete, it appears to be evaluated only at the final append call, and by then the matching rows have already been deleted from the target, so the insert cannot pick up their values.
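   The same pitfall can be shown without Spark, using a lazy Python generator as a stand-in for the unevaluated DataFrame plan (pure illustration, no Spark APIs):
   
   ```python
   # Stand-in for the target table.
   table = {1: "value1", 2: "value2", 3: "value3"}
   
   # Lazy "inflated_df": defined before the delete, but nothing is read
   # from `table` until the generator is consumed (like a Spark plan).
   inflated_lazy = ((k, "new" + v) for k, v in table.items() if k in (1, 2))
   
   # Eager variant: materialized immediately, so it captures the
   # pre-delete values (the equivalent of forcing the DataFrame to be
   # evaluated before the delete).
   inflated_eager = [(k, "new" + v) for k, v in table.items() if k in (1, 2)]
   
   # Delete rows 1 and 2 from the "table".
   for k in (1, 2):
       del table[k]
   
   # The lazy version now sees the post-delete table: the rows are gone.
   print(list(inflated_lazy))   # []
   # The eager version still has the pre-delete values.
   print(inflated_eager)        # [(1, 'newvalue1'), (2, 'newvalue2')]
   ```
   
   In Spark terms, one common way to force that evaluation is something like `inflated_df.persist()` followed by `inflated_df.count()` before running the delete, so the plan is computed against the pre-delete data.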
   
   To overcome this, I used time travel while preparing inflated_df, reasoning that even though the current version of the table no longer has those records, the version before the delete definitely does. But that is what triggers the snapshot error.
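   For reference, an Iceberg time-travel read in Spark SQL can be written like this (the snapshot id and timestamp are placeholders, not the actual values from my job):
   
   ```
   -- Spark 3.3+ syntax: read the table as of an older snapshot
   SELECT * FROM target VERSION AS OF 1234567890
   -- or as of a point in time
   SELECT * FROM target TIMESTAMP AS OF '2022-12-19 00:00:00'
   ```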
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

