Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/07/14 10:47:00 UTC

[GitHub] [iceberg] GrigorievNick opened a new issue #1202: [Question] Do Spark Iceberg Implement Copy On Write Delete and Update?

GrigorievNick opened a new issue #1202:
URL: https://github.com/apache/iceberg/issues/1202


   Hi,
   I have a question:
   Does the Iceberg Spark integration implement copy-on-write delete and update?
   * I found [this PR](https://github.com/apache/iceberg/pull/351), so Iceberg itself supports it.
     But what about the implementation in Spark?
     Looking at [this code](https://github.com/apache/iceberg/blob/master/spark3/src/main/java/org/apache/iceberg/spark/source/SparkWriteBuilder.java#L87), I would say that at least Spark 3 supports it. Am I right?
    * But in Spark itself, the only usages of this builder method are in the delete sections of org.apache.spark.sql.execution.datasources.v2.OverwriteByExpressionExecV1 and OverwriteByExpression.
    * So is it possible to implement an eager update (CopyOnWriteUpdate), or are there known issues?
   
   Does the Iceberg integration for Spark 2 implement copy-on-write delete and update?
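
For readers unfamiliar with the term, the following is a minimal sketch of what a copy-on-write delete or update means at file granularity. It is a plain-Python model, not the Iceberg or Spark API: the `Row` and `Files` types, the `copy_on_write` function, and the `.rewritten` naming scheme are all hypothetical and exist only for illustration. Files without matching rows are carried over untouched, and every file with at least one matching row is replaced as a whole by a rewritten copy.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Tuple[int, str]          # hypothetical row shape: (id, value)
Files = Dict[str, List[Row]]   # path -> rows; stands in for immutable data files


def copy_on_write(
    files: Files,
    matches: Callable[[Row], bool],
    update: Optional[Callable[[Row], Row]] = None,
) -> Tuple[Files, List[str], List[str]]:
    """Apply a delete (update=None) or an update to matching rows.

    Returns (new_files, removed_paths, added_paths); `files` is not mutated.
    """
    new_files: Files = {}
    removed: List[str] = []
    added: List[str] = []
    for path, rows in files.items():
        if not any(matches(r) for r in rows):
            new_files[path] = rows  # untouched files are carried over as-is
            continue
        # The file holds at least one matching row, so it is replaced as a whole.
        removed.append(path)
        if update is None:
            rewritten = [r for r in rows if not matches(r)]
        else:
            rewritten = [update(r) if matches(r) else r for r in rows]
        if rewritten:
            new_path = path + ".rewritten"  # hypothetical naming scheme
            new_files[new_path] = rewritten
            added.append(new_path)
    return new_files, removed, added


files = {
    "a.parquet": [(1, "x"), (2, "y")],
    "b.parquet": [(3, "z")],
}
# Delete the row with id 2: only a.parquet is rewritten.
after_delete, removed, added = copy_on_write(files, lambda r: r[0] == 2)
# Update the row with id 3: only b.parquet is rewritten.
after_update, _, _ = copy_on_write(files, lambda r: r[0] == 3, lambda r: (r[0], "Z"))
```

In a real table format the `removed` and `added` lists would be committed together in one atomic operation, so readers see either the old files or the new ones, never a mix.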


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] GrigorievNick commented on issue #1202: [Question] Do Spark Iceberg module implement Copy On Write Delete and Update?

Posted by GitBox <gi...@apache.org>.
GrigorievNick commented on issue #1202:
URL: https://github.com/apache/iceberg/issues/1202#issuecomment-659308492


   > Both Spark 2.4 and Spark 3.0 support dynamic partition overwrite. Spark 3.0 also supports overwrite by expression, although the expression must match all rows in a data file or no rows of a data file, or else it will cause an exception because the granularity of delete is a whole data file.
   
   But the `Overwrite` that is implemented for delete is much smarter than overwriting all data in the partition.
   It changes only the files that contain changes, while a simple overwrite rewrites the whole partition.
   Of course, I can read all the data from a partition -> manipulate it -> overwrite the partition.
   But I can do that with any code. What I am looking for is a way to update only the files that match the changes.
   So, as I understand it, there is no such solution right now, correct?
   
   I can implement it manually using the low-level (Java core) API.
   But in that case I have one more question, which I can't find answered in the docs.
   Is it possible to run concurrent [Table Operation](https://iceberg.apache.org/api/#table-metadata) -> `newRewrite` calls?
    A short explanation: I will have different Spark partitions, each of which overwrites one or a few data files.
   And of course, each partition is processed idempotently, and the partitions run in parallel.
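
The concurrency question above can be explored with a small model. Iceberg documents its commits as optimistic: a writer prepares new table metadata and atomically swaps it in, retrying if another writer committed first. The sketch below models only that idea in plain Python. It does not call the Iceberg API, and `ToyTable`, `compare_and_swap`, `rewrite`, and `CommitConflict` are hypothetical names. It also assumes that a rewrite is rejected when a file it wants to replace is no longer live, which is a simplification of whatever validation the real `newRewrite` operation performs.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import FrozenSet, Iterable, Set, Tuple


class CommitConflict(Exception):
    """Raised when a rewrite can no longer be applied to the current snapshot."""


class ToyTable:
    """Hypothetical in-memory table: one immutable snapshot, swapped atomically."""

    def __init__(self, files: Iterable[str]) -> None:
        self._lock = threading.Lock()
        self._snapshot: Tuple[int, FrozenSet[str]] = (0, frozenset(files))

    def current(self) -> Tuple[int, FrozenSet[str]]:
        return self._snapshot

    def compare_and_swap(self, expected_version: int, new_files: Set[str]) -> bool:
        """Install a new snapshot only if nobody committed since `expected_version`."""
        with self._lock:
            version, _ = self._snapshot
            if version != expected_version:
                return False
            self._snapshot = (version + 1, frozenset(new_files))
            return True


def rewrite(table: ToyTable, to_delete: Set[str], to_add: Set[str], max_retries: int = 10) -> int:
    """Replace `to_delete` with `to_add`; return the number of attempts used."""
    for attempt in range(1, max_retries + 1):
        version, live = table.current()       # refresh to the latest snapshot
        missing = to_delete - live
        if missing:                           # someone else already replaced these files
            raise CommitConflict(f"files no longer live: {sorted(missing)}")
        new_files = (live - to_delete) | to_add
        if table.compare_and_swap(version, new_files):
            return attempt                    # commit succeeded
    raise CommitConflict("gave up after too many retries")


table = ToyTable({"f1", "f2", "f3", "f4"})

# Four parallel writers, each replacing a different file.
with ThreadPoolExecutor(max_workers=4) as pool:
    attempts = list(pool.map(lambda i: rewrite(table, {f"f{i}"}, {f"f{i}-v2"}), range(1, 5)))

version, live = table.current()
```

In this model, writers that touch disjoint files all succeed after at most a few retries, while a second rewrite of an already replaced file raises `CommitConflict`. Whether the same holds for a given Iceberg version and catalog should be checked against the project's documentation. A common alternative is to let the parallel tasks only write new data files and have a single coordinator commit them all in one operation.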




[GitHub] [iceberg] GrigorievNick closed issue #1202: [Question] Do Spark Iceberg module implement Copy On Write Delete and Update?

Posted by GitBox <gi...@apache.org>.
GrigorievNick closed issue #1202:
URL: https://github.com/apache/iceberg/issues/1202


   




[GitHub] [iceberg] rdblue commented on issue #1202: [Question] Do Spark Iceberg module implement Copy On Write Delete and Update?

Posted by GitBox <gi...@apache.org>.
rdblue commented on issue #1202:
URL: https://github.com/apache/iceberg/issues/1202#issuecomment-659015742


   Both Spark 2.4 and Spark 3.0 support dynamic partition overwrite. Spark 3.0 also supports overwrite by expression, although the expression must match all rows in a data file or no rows of a data file, or else it will cause an exception because the granularity of a delete is a whole data file.
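
The file-granularity constraint described above can be illustrated with a small model. This is a sketch under stated assumptions, not Iceberg's implementation: `FileStats`, `PartialMatchError`, and `files_to_delete` are hypothetical names, each file is summarized only by the minimum and maximum of one integer column, and the overwrite expression is fixed to `value >= threshold`. A file can be dropped only if every row in it matches, and it is left alone only if no row can match. Anything in between is an error, because a delete cannot remove part of a file.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file summary: path plus min/max of one integer column."""
    path: str
    min_value: int
    max_value: int


class PartialMatchError(Exception):
    """Raised when a filter matches some, but not all, rows of a file."""


def files_to_delete(files: List[FileStats], threshold: int) -> List[str]:
    """Return the files wholly covered by the filter `value >= threshold`."""
    doomed: List[str] = []
    for f in files:
        if f.min_value >= threshold:
            doomed.append(f.path)          # every row matches: drop the whole file
        elif f.max_value < threshold:
            continue                       # no row matches: keep the file
        else:
            raise PartialMatchError(
                f"{f.path} holds values on both sides of {threshold}"
            )
    return doomed


files = [
    FileStats("a.parquet", 0, 9),
    FileStats("b.parquet", 10, 19),
    FileStats("c.parquet", 20, 29),
]
aligned = files_to_delete(files, threshold=10)   # filter lines up with file boundaries
```

With `threshold=15` the same call raises `PartialMatchError`, because `b.parquet` contains values both below and above the threshold.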





