Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/12/11 08:37:26 UTC

[GitHub] [iceberg] borislitvak opened a new issue #1912: Concurrent writes from multiple Spark drivers to S3 support

borislitvak opened a new issue #1912:
URL: https://github.com/apache/iceberg/issues/1912


   Can we perform concurrent writes from multiple Spark drivers to S3 with Iceberg without data loss/corruption?
   
   Background: 
   
   - Delta.io [s3 documentation states](https://docs.delta.io/latest/delta-storage.html) that
   "Concurrent writes to the same Delta table from multiple Spark drivers can lead to data loss."
   - With the AWS announcement on [S3 consistency](https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/), does this relax the requirements in any way?
   
   Thanks, Boris


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] vinothchandar commented on issue #1912: Concurrent writes from multiple Spark drivers to S3 support

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1912:
URL: https://github.com/apache/iceberg/issues/1912#issuecomment-743268138


   FWIW, the answers I gave for Hudi are for the case that does not need an external server like Hive. So it's not quite apples-to-apples.




[GitHub] [iceberg] aokolnychyi edited a comment on issue #1912: Concurrent writes from multiple Spark drivers to S3 support

Posted by GitBox <gi...@apache.org>.
aokolnychyi edited a comment on issue #1912:
URL: https://github.com/apache/iceberg/issues/1912#issuecomment-743115379


   Iceberg avoids listing and renames by design and does not have this limitation as long as you use the Iceberg [Hive catalog](http://iceberg.apache.org/java-api-quickstart/#using-a-hive-catalog).




[GitHub] [iceberg] aokolnychyi commented on issue #1912: Concurrent writes from multiple Spark drivers to S3 support

Posted by GitBox <gi...@apache.org>.
aokolnychyi commented on issue #1912:
URL: https://github.com/apache/iceberg/issues/1912#issuecomment-743115379


   Iceberg avoids listing and renames by design and does not have this limitation as long as you use the Iceberg [Hive catalog](http://iceberg.apache.org/java-api-quickstart/#using-a-hive-catalog).




[GitHub] [iceberg] borislitvak edited a comment on issue #1912: Concurrent writes from multiple Spark drivers to S3 support

Posted by GitBox <gi...@apache.org>.
borislitvak edited a comment on issue #1912:
URL: https://github.com/apache/iceberg/issues/1912#issuecomment-743157280


   For future visitors looking to compare S3 ACID data lake solutions: this scenario over S3 is not supported in Hudi/Delta, and there does not seem to be a path to such support.
   So Iceberg is a clear winner in this regard!
   
   Sources:
   [Delta.io](https://github.com/delta-io/delta/issues/564)
   [Hudi](https://github.com/apache/hudi/issues/2330)
   
   How did you accomplish this, @aokolnychyi ?
   
   UPDATE: As @vinothchandar mentioned, Iceberg's solution must go through a Hive Metastore, which provides the synchronization primitives over S3. Hudi is considering providing a similar or better solution. I am not sure about the performance impact here.
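
To illustrate what "synchronization primitives" can mean here, the following is a minimal sketch of optimistic concurrency: each writer builds a new table snapshot and asks the catalog to swap it in atomically, retrying if another writer committed first. This is a toy model under stated assumptions, not Iceberg's or Hive Metastore's API: `ToyCatalog`, `commit_append`, `run_concurrent_writers`, and the `driver-N.parquet` file names are all hypothetical, and an in-process lock stands in for whatever atomic operation the real catalog provides.

```python
import threading
from dataclasses import dataclass, field
from typing import Tuple

# An immutable tuple of data file names stands in for a table snapshot.
Snapshot = Tuple[str, ...]


@dataclass
class ToyCatalog:
    """Hypothetical stand-in for a catalog; not an Iceberg or Hive class."""

    _snapshot: Snapshot = ()
    _lock: threading.Lock = field(default_factory=threading.Lock)

    def current(self) -> Snapshot:
        with self._lock:
            return self._snapshot

    def compare_and_swap(self, expected: Snapshot, new: Snapshot) -> bool:
        """Atomically install `new` only if the table still points at `expected`."""
        with self._lock:
            if self._snapshot != expected:
                return False  # another writer committed first
            self._snapshot = new
            return True


def commit_append(catalog: ToyCatalog, data_file: str, max_attempts: int = 100) -> int:
    """Append one data file, retrying on conflict. Returns the attempts used."""
    for attempt in range(1, max_attempts + 1):
        base = catalog.current()          # read the current snapshot
        candidate = base + (data_file,)   # build a new snapshot on top of it
        if catalog.compare_and_swap(base, candidate):
            return attempt
        # Conflict: loop around and rebuild the snapshot on the newer base.
    raise RuntimeError(f"gave up committing {data_file} after {max_attempts} attempts")


def run_concurrent_writers(n_writers: int) -> Snapshot:
    """Simulate several drivers committing to the same table at once."""
    catalog = ToyCatalog()
    threads = [
        threading.Thread(target=commit_append, args=(catalog, f"driver-{i}.parquet"))
        for i in range(n_writers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return catalog.current()
```

Calling `run_concurrent_writers(8)` should yield a snapshot containing all eight files in some order: a writer that loses the race rebuilds its snapshot on the newer base instead of overwriting it, so no commit is silently dropped.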
   




[GitHub] [iceberg] borislitvak commented on issue #1912: Concurrent writes from multiple Spark drivers to S3 support

Posted by GitBox <gi...@apache.org>.
borislitvak commented on issue #1912:
URL: https://github.com/apache/iceberg/issues/1912#issuecomment-743157280


   For future visitors looking to compare S3 ACID data lake solutions: this scenario over S3 is not supported in Hudi/Delta, and there does not seem to be a path to such support.
   So Iceberg is a clear winner in this regard!
   
   Sources:
   [Delta.io](https://github.com/delta-io/delta/issues/564)
   [Hudi](https://github.com/apache/hudi/issues/2330)
   
   How did you accomplish this, @aokolnychyi ?
   
   




[GitHub] [iceberg] HeartSaVioR commented on issue #1912: Concurrent writes from multiple Spark drivers to S3 support

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on issue #1912:
URL: https://github.com/apache/iceberg/issues/1912#issuecomment-757573819


   @borislitvak 
   I'd suggest closing the issue, as the answer is already provided and no further discussion is expected.




[GitHub] [iceberg] borislitvak closed issue #1912: Concurrent writes from multiple Spark drivers to S3 support

Posted by GitBox <gi...@apache.org>.
borislitvak closed issue #1912:
URL: https://github.com/apache/iceberg/issues/1912


   

