Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/12/19 15:12:57 UTC

[GitHub] [iceberg] arunb2w opened a new issue, #6453: Iceberg delete-append causing snapshot error

arunb2w opened a new issue, #6453:
URL: https://github.com/apache/iceberg/issues/6453

   ### Apache Iceberg version
   
   1.1.0 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   I am working with a table that is around 200 GB (2B rows) in size and contains more than 100 columns.
   My incoming incremental batch contains 6M events, which I am trying to merge using a MERGE statement. But the merge took around 30 minutes on a 10-node cluster of r5.16xlarge instances.
   
   So I tried a delete-append approach with **'write.delete.mode'='merge-on-read'** to see how it performs, but I am seeing a discrepancy.
   Below are the steps I am doing:
   
   1. Creating temp view for the incoming events.
   ```
   # register the incremental events as the "source" view; note that cache() is lazy
   input_df.createOrReplaceTempView("source")
   input_df.cache()
   ```
   2. Getting the full-row representation to append after the delete
   ```
   inflate_sql = """
       select
           if(array_contains(source.changed_cols, 'ID'), source.ID, target.ID) as ID,
           if(array_contains(source.changed_cols, '_ETL_RUN_ID_'), source._ETL_RUN_ID_, target._ETL_RUN_ID_) as _ETL_RUN_ID_,
           if(array_contains(source.changed_cols, '_ETL_MODIFIED_'), source._ETL_MODIFIED_, target._ETL_MODIFIED_) as _ETL_MODIFIED_,
           if(array_contains(deflate.changed_cols, '_EXTRACTED_'), deflate._EXTRACTED_, target._EXTRACTED_) as _EXTRACTED_,
           if(array_contains(source.changed_cols, '_SOURCE_EXTRACTED_'), source._SOURCE_EXTRACTED_, target._SOURCE_EXTRACTED_) as _SOURCE_EXTRACTED_,
           if(array_contains(source.changed_cols, '_IS_DELETED_'), source._IS_DELETED_, target._IS_DELETED_) as _IS_DELETED_,
           if(array_contains(source.changed_cols, '_LAST_MODIFIED_SEQ_'), source._LAST_MODIFIED_SEQ_, target._LAST_MODIFIED_SEQ_) as _LAST_MODIFIED_SEQ_,
           if(array_contains(source.changed_cols, '_SCHEMA_CLASS_'), source._SCHEMA_CLASS_, target._SCHEMA_CLASS_) as _SCHEMA_CLASS_,
           if(array_contains(source.changed_cols, '_CONTEXT_ID_'), source._CONTEXT_ID_, target._CONTEXT_ID_) as _CONTEXT_ID_
       from glue_dev.datalakectxsort.clinicalprescription target
       join source
           on target._context_id_ = cast(source._context_id_ as decimal(12, 0))
           and target.id = cast(source.id as decimal(12, 0))
   """
   
   inflated_df = spark.sql(inflate_sql)
   ```
   3. Deleting from the target
   ```
   delete_sql = """
       delete from glue_dev.datalakectxsort.clinicalprescription as target
       where exists (select 1 from deflate_table source
                     where target._context_id_ = source._context_id_
                       and target.id = source.id)
   """
   spark.sql(delete_sql)
   ```
   4. Appending the deleted records using the full-row dataframe prepared earlier
   `inflated_df.writeTo("glue_dev.datalakectxsort.clinicalprescription").append()`
   
   The problem here is that the deleted rows are not getting re-inserted, even though I prepared the full-row dataframe (inflated_df) before executing the delete. I believe the reason is Spark's lazy execution: the full-row query is only evaluated during the append call, by which point the matching rows have already been deleted.
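   
   One way to avoid this (a minimal sketch, assuming inflated_df fits comfortably on the cluster) is to force Spark to materialize the full-row dataframe before issuing the delete; note that cache()/persist() alone stays lazy until an action runs, and evicted partitions could in principle be recomputed against the already-modified table:
   ```
   from pyspark import StorageLevel
   
   # Sketch: pin the pre-delete full-row result and trigger an action so the
   # plan is evaluated *before* the delete runs against the target table.
   inflated_df = spark.sql(inflate_sql)
   inflated_df.persist(StorageLevel.MEMORY_AND_DISK)
   print("materialized rows:", inflated_df.count())  # forces evaluation
   
   spark.sql(delete_sql)  # delete the matching rows from the target
   inflated_df.writeTo("glue_dev.datalakectxsort.clinicalprescription").append()
   inflated_df.unpersist()
   ```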
   
   To overcome this, I also tried a time-travel query to prepare the full row against a version of the table from before the delete, using the TIMESTAMP AS OF syntax. Basically, the target in the above select query would look like `(select * from glue_dev.datalakectxsort.clinicalprescription timestamp as of '2022-12-19 06:13:02') target`, but this causes the error below.
   `Cannot write to table at a specific snapshot: 7305533148640597772`
   
   I am only using the snapshot in the SELECT query while preparing the full row for the append; during the append I am writing to the target table itself, so I am not sure why the append throws this error.
   
   Please advise on how to overcome this. Thanks in advance! 




[GitHub] [iceberg] nastra closed issue #6453: Iceberg delete-append causing snapshot error

Posted by GitBox <gi...@apache.org>.
nastra closed issue #6453: Iceberg delete-append causing snapshot error
URL: https://github.com/apache/iceberg/issues/6453




[GitHub] [iceberg] RussellSpitzer commented on issue #6453: Iceberg delete-append causing snapshot error

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #6453:
URL: https://github.com/apache/iceberg/issues/6453#issuecomment-1357835565

   It's not clear to me what you are trying to do; `input_df` is unused. I'm not sure how `inflate_df` is being used either. The DELETE statement appears to be looking for rows in a deflate table?
   
   I am most confused about what 
   ```inflated_df.writeTo("glue_dev.datalakectxsort.clinicalprescription").append()``` 
   
   is meant to accomplish.
   
   The time travel error seems to be a bug in the "as of" functionality, though, since you shouldn't get an error just from reading at an old timestamp. It may be because you are both reading from and writing to the same table, and there is a bug there.




[GitHub] [iceberg] nastra commented on issue #6453: Iceberg delete-append causing snapshot error

Posted by GitBox <gi...@apache.org>.
nastra commented on issue #6453:
URL: https://github.com/apache/iceberg/issues/6453#issuecomment-1376020697

   @arunb2w is there anything outstanding in this issue that needs to be addressed or can it be closed?




[GitHub] [iceberg] nastra commented on issue #6453: Iceberg delete-append causing snapshot error

Posted by GitBox <gi...@apache.org>.
nastra commented on issue #6453:
URL: https://github.com/apache/iceberg/issues/6453#issuecomment-1385303484

   Closing this for now. Please feel free to re-open if necessary.




[GitHub] [iceberg] arunb2w commented on issue #6453: Iceberg delete-append causing snapshot error

Posted by GitBox <gi...@apache.org>.
arunb2w commented on issue #6453:
URL: https://github.com/apache/iceberg/issues/6453#issuecomment-1357910955

   `input_df` is used in both delete_sql and inflate_sql. Basically, I am just creating a temp view from the input events that need to be applied to the Iceberg table, which I aliased as **source**.
   Instead of a merge, I am trying to perform a delete plus insert for the update events I am receiving (which are in input_df).
   
   So I am joining input_df with the target to delete the matching rows in the target, and then I insert the deleted rows using an append with the latest values, so that the end result is similar to a merge.
   
   Before the delete, I prepare a dataframe with the full-row representation of the target with the latest values. The inflate_sql portion does that by joining with the target again; whether a field value is taken from the target or from the source is decided based on the changed_cols field, which keeps track of the list of fields that changed for a particular record.
   
   For example, if my target Iceberg table has
   ```
   id  name   key   value
   1   name1  key1  value1
   2   name2  key2  value2
   3   name3  key3  value3
   ```
   
   and my input_df has (only the changed columns populated, as tracked by changed_cols)
   ```
   id  name      value
   1   newname1  newvalue1
   2             newvalue2
   ```
   
   then, using the inflate_sql, I will prepare a dataframe (inflated_df) like this
   ```
   id  name      key   value
   1   newname1  key1  newvalue1
   2   name2     key2  newvalue2
   ```
   
   Then I will delete the same two rows with ids 1 and 2, and then I will call append with the inflated_df dataframe, which adds the records back with the latest values so that the end result is similar to a merge.
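   
   Since the real table has 100+ columns, a hypothetical sketch of generating this changed_cols-based projection programmatically (assuming, as in the original inflate_sql, that the source view exposes the same column names as the target plus the changed_cols array) could look like this:
   ```
   # Hypothetical helper: for every target column, take the source value when the
   # column appears in changed_cols, otherwise keep the target value.
   target_cols = spark.table("glue_dev.datalakectxsort.clinicalprescription").columns
   
   projection = ",\n    ".join(
       f"if(array_contains(source.changed_cols, '{c}'), source.{c}, target.{c}) as {c}"
       for c in target_cols
   )
   
   inflate_sql = f"""
       select {projection}
       from glue_dev.datalakectxsort.clinicalprescription target
       join source
           on target._context_id_ = cast(source._context_id_ as decimal(12, 0))
           and target.id = cast(source.id as decimal(12, 0))
   """
   inflated_df = spark.sql(inflate_sql)
   ```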
   
   The problem I am facing is that the rows get deleted but are not re-inserted, and the reason might be Spark's lazy execution. Even though I prepare inflated_df well before the delete, it looks like the query is only executed during the final append call, and that causes the insert to miss the records because they have already been deleted.
   
   To overcome this, I used time travel while preparing inflated_df, so that even though the current version of the table no longer has those records, the version before the delete definitely does, but that is what causes the snapshot error.
   




[GitHub] [iceberg] kmozaid commented on issue #6453: Iceberg delete-append causing snapshot error

Posted by GitBox <gi...@apache.org>.
kmozaid commented on issue #6453:
URL: https://github.com/apache/iceberg/issues/6453#issuecomment-1361381099

   I think you should create `inflated_df` like this:
   ```
   # In PySpark the read option is "as-of-timestamp" and it expects a timestamp in
   # milliseconds since the epoch (e.g. 1671430382000 for 2022-12-19 06:13:02 UTC).
   spark.read.format("iceberg") \
       .option("as-of-timestamp", "1671430382000") \
       .load("glue_dev.datalakectxsort.clinicalprescription") \
       .createOrReplaceTempView("clinicalprescription_before_delete")
   
   inflate_sql = """
       select <inflated_projection>
       from clinicalprescription_before_delete target
       join source
           on target._context_id_ = cast(source._context_id_ as decimal(12, 0))
           and target.id = cast(source.id as decimal(12, 0))
   """
   
   inflated_df = spark.sql(inflate_sql)
   ```
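   
   For completeness, a sketch of how the rest of the sequence might then look with the pinned view (names reused from earlier in the thread):
   ```
   # With inflated_df built from the pre-delete snapshot, lazy evaluation of the
   # append no longer depends on the current state of the target table.
   spark.sql(delete_sql)  # remove the matching rows from the target
   inflated_df.writeTo("glue_dev.datalakectxsort.clinicalprescription").append()
   ```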




[GitHub] [iceberg] arunb2w commented on issue #6453: Iceberg delete-append causing snapshot error

Posted by GitBox <gi...@apache.org>.
arunb2w commented on issue #6453:
URL: https://github.com/apache/iceberg/issues/6453#issuecomment-1361473026

   Thanks for the input. I also ended up doing more or less the same thing by materializing my inflated_df dataframe and using it after the delete.
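   
   A minimal sketch of that materialization, assuming the dataframe fits on the executors, is an eager local checkpoint, which evaluates the plan right away and truncates the lineage so the later append no longer re-reads the modified table:
   ```
   # Sketch of "materializing" inflated_df before the delete.
   inflated_df = spark.sql(inflate_sql).localCheckpoint(eager=True)
   
   spark.sql(delete_sql)
   inflated_df.writeTo("glue_dev.datalakectxsort.clinicalprescription").append()
   ```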

