Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/02/11 13:24:35 UTC

[GitHub] [iceberg] akghbti opened a new issue #4093: Merge (CopyOnWrite) Not Efficient as compared to equivalent Delete/Update Operation

akghbti opened a new issue #4093:
URL: https://github.com/apache/iceberg/issues/4093


   Spark Version - 3.2.0
   Iceberg Version - 0.13
   
   There is a table 'table1', partitioned by '_c1':
   
   +----------+---+---+
   |       _c0|_c1|_c2|
   +----------+---+---+
   |1225526400|  1|  a|
   |1228118400| 10|  j|
   |1228377600| 11|  k|
   |1228809600| 12|  l|
   |1228982400| 13|  m|
   |1229673600| 14|  n|
   |1230019200| 15|  o|
   |1230278400| 16|  p|
   |1230451200| 17|  q|
   |1230624000| 18|  r|
   |1230710400| 19|  s|
   |1225699200|  2|  b|
   |1225785600|  3|  c|
   |1226476800|  4|  d|
   |1226908800|  5|  e|
   |1226995200|  6|  f|
   |1227513600|  7|  g|
   |1227772800|  8|  h|
   |1228032000|  9|  i|
   |1230796800| 20|  t|
   +----------+---+---+
   
   The table consists of 25 part files.
   
   
   There is also a second table 'table2', partitioned by '_c1', which is used as the source of the merge:
   
   +----------+---+---+
   |       _c0|_c1|_c2|
   +----------+---+---+
   |1228377600| 11|  k|
   |1228809600| 12|  l|
   |1228982400| 13|  m|
   +----------+---+---+
   
   
   Now, if I run the following query in Spark:
   
   sparkSession.sql("MERGE INTO local.db.table1 t USING (SELECT * FROM local.db.table2) u ON t._c1=u._c1 "
   + "WHEN MATCHED AND t._c1='13' THEN DELETE");
   
   the summary in the manifest list output is:
   
   "summary" : {
   "operation" : "overwrite",
   "spark.app.id" : "local-1644584660016",
   "added-data-files" : "2",
   "deleted-data-files" : "3",
   "added-records" : "2",
   "deleted-records" : "3",
   "added-files-size" : "1836",
   "removed-files-size" : "2754",
   "changed-partition-count" : "3",
   "total-records" : "25",
   "total-files-size" : "22883",
   "total-data-files" : "25",
   "total-delete-files" : "0",
   "total-position-deletes" : "0",
   "total-equality-deletes" : "0"
   }
   This shows that 3 part files were rewritten. Ideally, only 1 part file should have been rewritten, because the merge condition (WHEN MATCHED AND t._c1='13') affects only 1 part file.
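
   The numbers in the summary above can be reproduced with a toy model. This is only a sketch of the suspected behaviour, not Iceberg's actual implementation: it assumes one data file per '_c1' value (using the 20 values shown for 'table1'), and the class and method names are hypothetical. The merge picks files by the ON condition alone and then writes back whichever of those files still hold rows; the plain delete applies its whole predicate to the partition column.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class MergeRewriteModel {

    // Toy copy-on-write MERGE (assumption: files are selected by the ON condition only).
    // Each entry in `targetFiles` is the _c1 value of a single-row data file.
    // Returns {deleted data files, added data files}.
    static int[] mergeCopyOnWrite(List<String> targetFiles, Set<String> sourceKeys, String deleteKey) {
        // Every target file whose _c1 matches t._c1 = u._c1 is selected for rewrite.
        List<String> touched = targetFiles.stream()
            .filter(sourceKeys::contains)
            .collect(Collectors.toList());
        // Files whose row is not removed by WHEN MATCHED AND t._c1 = deleteKey are written back.
        long rewritten = touched.stream().filter(c1 -> !c1.equals(deleteKey)).count();
        return new int[] { touched.size(), (int) rewritten };
    }

    // Toy plain DELETE: the full predicate (_c1 IN inList AND _c1 = equalsKey) selects the files.
    // Returns the number of deleted data files.
    static int plainDelete(List<String> targetFiles, Set<String> inList, String equalsKey) {
        return (int) targetFiles.stream()
            .filter(c1 -> inList.contains(c1) && c1.equals(equalsKey))
            .count();
    }

    public static void main(String[] args) {
        // _c1 values 1..20, as in the table1 listing.
        List<String> table1 = IntStream.rangeClosed(1, 20)
            .mapToObj(String::valueOf)
            .collect(Collectors.toList());
        Set<String> table2Keys = Set.of("11", "12", "13");

        int[] merge = mergeCopyOnWrite(table1, table2Keys, "13");
        System.out.println("merge:  deleted-data-files=" + merge[0] + ", added-data-files=" + merge[1]);
        System.out.println("delete: deleted-data-files=" + plainDelete(table1, table2Keys, "13"));
    }
}
```

   Under these assumptions the model gives 3 deleted and 2 added data files for the merge, matching the summary above, and 1 deleted data file for a delete on _c1 = '13'.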
   
   If the same operation is run as a plain delete query (shown below), the manifest list summary reflects what is expected.
   
   Equivalent plain delete query:
   
   sparkSession.sql("DELETE FROM local.db.table1 WHERE _c1 IN ('11','12','13') AND _c1 = '13'");
   
   Here is the summary in the manifest list:
   
   
   "summary" : {
   "operation" : "delete",
   "spark.app.id" : "local-1644585674579",
   "deleted-data-files" : "1",
   "deleted-records" : "1",
   "removed-files-size" : "918",
   "changed-partition-count" : "1",
   "total-records" : "25",
   "total-files-size" : "22883",
   "total-data-files" : "25",
   "total-delete-files" : "0",
   "total-position-deletes" : "0",
   "total-equality-deletes" : "0"
   }
   
   So, comparing the two operations above, the plain delete is more efficient than the merge with delete: it removes 1 data file, whereas the merge rewrites 3.
   
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] akghbti edited a comment on issue #4093: Merge (CopyOnWrite) Not Efficient as compared to equivalent Delete/Update Operation

Posted by GitBox <gi...@apache.org>.
akghbti edited a comment on issue #4093:
URL: https://github.com/apache/iceberg/issues/4093#issuecomment-1038629845


   @RussellSpitzer My point is: why is the Merge command (as shown in the first case in this issue) overwriting 3 input part files when the search clause " t._c1='13' " (part of the merge clause) matches only 1 input part file? The search clause (on the input table) is not being pushed down.
   




[GitHub] [iceberg] RussellSpitzer commented on issue #4093: Merge (CopyOnWrite) Not Efficient as compared to equivalent Delete/Update Operation

Posted by GitBox <gi...@apache.org>.
RussellSpitzer commented on issue #4093:
URL: https://github.com/apache/iceberg/issues/4093#issuecomment-1036399159


   Metadata deletes are always going to be faster than a MERGE command (or any row level operation). Are you asking for the MERGE Command to be automatically converted into the equivalent metadata delete? That seems like it may be difficult.




[GitHub] [iceberg] akghbti commented on issue #4093: Merge (CopyOnWrite) Not Efficient as compared to equivalent Delete/Update Operation

Posted by GitBox <gi...@apache.org>.
akghbti commented on issue #4093:
URL: https://github.com/apache/iceberg/issues/4093#issuecomment-1038629845


   @RussellSpitzer My point is: why is the Merge command (as shown in the case in this issue) overwriting 3 input part files when the search clause " t._c1='13' " (part of the merge clause) matches only 1 input part file? The search clause (on the input table) is not being pushed down.
   

