Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/11 08:35:55 UTC

[GitHub] [spark] steveloughran commented on pull request #37474: [SPARK-40039][SS][WIP] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

steveloughran commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1211699163

   Hey @HeartSaVioR. Yes, this is exactly what the API we worked on was designed for.
   
   There is no need to initiate an MPU (multipart upload) when writing small files; the OutputStream simply doesn't upload the data anymore. You can check this by calling toString() on the stream; all its IO stats are there. This means that the cost is as normal: one PUT for data <= the block size; beyond that, one POST to initiate, one POST per block, and one POST in close() to finalize. Block uploads are parallelised, though you do need enough HTTPS connections for this.
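
The request accounting described above can be sketched as a small calculation. This is only an illustrative model of the numbers quoted in the comment, not Hadoop code: the function name `upload_requests` and its parameters are hypothetical, and it assumes every block except possibly the last is exactly one block size long.

```python
def upload_requests(data_bytes: int, block_bytes: int) -> dict:
    """Count the store requests for writing and closing one stream.

    Hypothetical helper: models the costs quoted in the comment,
    not the behaviour of any real Hadoop class.
    """
    if data_bytes < 0 or block_bytes <= 0:
        raise ValueError("data_bytes must be >= 0 and block_bytes > 0")

    if data_bytes <= block_bytes:
        # Small file: a single PUT, no multipart upload is initiated.
        return {"put": 1, "initiate": 0, "blocks": 0, "complete": 0, "total": 1}

    # Larger file: one initiate, one request per block, one complete in close().
    blocks = -(-data_bytes // block_bytes)  # integer ceiling division
    return {
        "put": 0,
        "initiate": 1,
        "blocks": blocks,
        "complete": 1,
        "total": blocks + 2,
    }


if __name__ == "__main__":
    mib = 1024 * 1024
    # Example sizes are arbitrary; a 64 MiB block size is an assumption.
    for size in (1 * mib, 64 * mib, 65 * mib, 200 * mib):
        print(size // mib, "MiB ->", upload_requests(size, 64 * mib))
```

Under this model a checkpoint file smaller than one block costs the same single request as any ordinary small write, which is the point made above.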
   
   It's no more expensive than a normal write; upload performance will be the same, except when you call abort(), when it is faster.
   
   That said, let me review the code to confirm this.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

