
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/27 07:46:26 UTC

[GitHub] [spark] attilapiros opened a new pull request, #37687: [SPARK-40039][SS][WIP] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

attilapiros opened a new pull request, #37687:
URL: https://github.com/apache/spark/pull/37687

   ### What changes were proposed in this pull request?
   
   Currently the checkpoint file manager available on S3 is `FileContextBasedCheckpointFileManager`, which is based on the rename operation: when a file is opened for an atomic stream, a temporary file is used behind the scenes, and when the stream is committed the file is renamed to its final location.
   
   But on S3 a rename is implemented as a file copy, so it has serious performance implications.
   
   Hadoop 3 introduced a new interface called [Abortable](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/Abortable.html), and `S3AFileSystem` has this capability. When the file is small (<= the block size), the commit is a single PUT and an abort is a no-op. When the file is bigger, S3's multipart upload is used: on commit [a POST is sent](https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html), and on abort [a DELETE is sent](https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html) (asynchronously).
   
   This avoids the file copying altogether.
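   
   The difference between the two commit paths can be sketched with a toy model. This is not the code in this PR and it uses no Hadoop API: `ToyObjectStore`, `RenameBasedStream` and `AbortableStream` are hypothetical names, and an in-memory map stands in for S3. It only illustrates that the rename path pays for a copy on commit, while the abortable path publishes directly and an abort leaves nothing visible.
   
   ```scala
   import java.io.ByteArrayOutputStream
   import java.nio.charset.StandardCharsets.UTF_8
   import scala.collection.mutable
   
   // Hypothetical in-memory stand-in for an object store (assumption, not S3A).
   final class ToyObjectStore {
     private val objects = mutable.Map.empty[String, Array[Byte]]
     var copies = 0 // number of copies performed; a rename costs one copy
     def put(key: String, data: Array[Byte]): Unit = objects.update(key, data)
     def copy(src: String, dst: String): Unit = { copies += 1; objects.update(dst, objects(src)) }
     def delete(key: String): Unit = { objects.remove(key); () }
     def get(key: String): Option[String] = objects.get(key).map(bytes => new String(bytes, UTF_8))
   }
   
   // Rename-style atomic write: commit = upload temp object, copy it, delete the temp.
   final class RenameBasedStream(store: ToyObjectStore, finalKey: String) {
     private val tmpKey = s"$finalKey.tmp"
     private val buf = new ByteArrayOutputStream()
     def write(s: String): Unit = buf.write(s.getBytes(UTF_8))
     def commit(): Unit = {
       store.put(tmpKey, buf.toByteArray)
       store.copy(tmpKey, finalKey) // the expensive step on an object store
       store.delete(tmpKey)
     }
   }
   
   // Abortable-style atomic write: nothing is visible until commit; abort discards the data.
   final class AbortableStream(store: ToyObjectStore, finalKey: String) {
     private val buf = new ByteArrayOutputStream()
     private var closed = false
     def write(s: String): Unit = buf.write(s.getBytes(UTF_8))
     def commit(): Unit = if (!closed) { store.put(finalKey, buf.toByteArray); closed = true }
     def abort(): Unit = { buf.reset(); closed = true }
   }
   
   object AbortableSketch {
     def main(args: Array[String]): Unit = {
       val store = new ToyObjectStore
       val viaRename = new RenameBasedStream(store, "commits/0")
       viaRename.write("v1"); viaRename.commit()
       val viaAbortable = new AbortableStream(store, "commits/1")
       viaAbortable.write("v1"); viaAbortable.commit()
       val cancelled = new AbortableStream(store, "commits/2")
       cancelled.write("partial"); cancelled.abort()
       val cancelledVisible = store.get("commits/2").isDefined
       println(s"copies=${store.copies}, cancelled object visible=$cancelledVisible")
     }
   }
   ```
   
   In this model only the rename-based stream increments `copies`; the aborted stream never creates an object.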
   
   ### Why are the changes needed?
   
   To improve streaming performance.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   
   #### Unit test
   
   I have refactored the existing `CheckpointFileManagerTests` and run it against a test filesystem that supports the `Abortable` interface (see `AbortableFileSystem`, which is based on `RawLocalFileSystem`).
   This gives us a unit test.
   
   #### Integration test
   
   Moreover, the same test can be run against AWS S3 as an integration test (see `AwsS3AbortableStreamBasedCheckpointFileManagerSuite`):
   
   ```
   -> S3_PATH=<..> AWS_ACCESS_KEY_ID=<..> AWS_SECRET_ACCESS_KEY=<..> AWS_SESSION_TOKEN=<..>  ./build/mvn install -pl hadoop-cloud  -Phadoop-cloud,hadoop-3,integration-test
   
   Discovery starting.
   Discovery completed in 346 milliseconds.
   Run starting. Expected test count is: 1
   AwsS3AbortableStreamBasedCheckpointFileManagerSuite:
   - mkdirs, list, createAtomic, open, delete, exists
   CommitterBindingSuite:
   AbortableStreamBasedCheckpointFileManagerSuite:
   Run completed in 14 seconds, 407 milliseconds.
   Total number of tests run: 1
   Suites: completed 4, aborted 0
   Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
   All tests passed.
   ```
   
   #### Performance test
   
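   I have run a [small performance app](https://github.com/attilapiros/spark-ss-perf/blob/ab4c6004caffc38a218fa81fd5482a6cc07ca14f/src/main/scala/perf.scala) which uses a rate stream and a foreach sink with an empty body. The results for the `walCommit` time (columns in the order passed to `datamash`: max, min, mean, median, and the 90th, 95th and 99th percentiles):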
   
   ```
   ➜  spark git:(SPARK-40039) ✗ ./bin/spark-submit ../spark-ss-perf/target/scala-2.12/performance-spark-ss_2.12-0.1.jar s3a://mybucket org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager  2>&1 | grep "walCommit took" | awk '{print $7}' |  datamash max 1 min 1 mean 1 median 1 perc:90 1 perc:95 1 perc:99 1
   4143    3286    3528.6  3500    3742.8  3840    4076.04
   
   ➜  spark git:(SPARK-40039) ✗ ./bin/spark-submit ../spark-ss-perf/target/scala-2.12/performance-spark-ss_2.12-0.1.jar s3a://mybucket org.apache.spark.internal.io.cloud.AbortableStreamBasedCheckpointFileManager  2>&1 | grep "walCommit took" | awk '{print $7}' |  datamash max 1 min 1 mean 1 median 1 perc:90 1 perc:95 1 perc:99 1
   3765    1447    2187.0217391304 1844.5  2867    2976.5  3437.85 
   ```
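   
   To make the two rows easier to compare, here is a small sketch that computes the relative reduction per column. The values are copied from the output above; the object name `WalCommitComparison` and the helper `reductionPercent` are made up for illustration and are not part of the performance app.
   
   ```scala
   object WalCommitComparison {
     // Column order follows the datamash invocation quoted above.
     val columns = Seq("max", "min", "mean", "median", "p90", "p95", "p99")
     // Values copied from the two result rows above.
     val renameBased = Seq(4143.0, 3286.0, 3528.6, 3500.0, 3742.8, 3840.0, 4076.04)
     val abortableBased = Seq(3765.0, 1447.0, 2187.0217391304, 1844.5, 2867.0, 2976.5, 3437.85)
   
     // Relative reduction in percent: (before - after) / before * 100.
     def reductionPercent(before: Double, after: Double): Double =
       (before - after) / before * 100.0
   
     def main(args: Array[String]): Unit =
       columns.zip(renameBased.zip(abortableBased)).foreach { case (name, (before, after)) =>
         val pct = reductionPercent(before, after)
         println(f"$name%-6s $before%9.2f -> $after%9.2f  ($pct%.1f%% lower)")
       }
   }
   ```
   
   For this single run the mean is roughly 38% lower and the median roughly 47% lower with the abortable-stream based manager.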


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] attilapiros commented on pull request #37687: [SPARK-40039][SS][WIP] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.

attilapiros commented on PR #37687:
URL: https://github.com/apache/spark/pull/37687#issuecomment-1229145114

   This is basically https://github.com/apache/spark/pull/37687 (including the Scala style fix), and the [fix build issue](https://github.com/apache/spark/pull/37687/commits/0e0110e44e9eddf78b49ae91a867352a1c6d7037) commit contains the fix for the build issue with hadoop-2.



[GitHub] [spark] attilapiros commented on pull request #37687: [SPARK-40039][SS][WIP] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.

attilapiros commented on PR #37687:
URL: https://github.com/apache/spark/pull/37687#issuecomment-1229146222

   ```
   $ ./build/mvn -Phadoop-2 -Phadoop-cloud -pl hadoop-cloud test 
   ...
   Discovery starting.
   Discovery completed in 62 milliseconds.
   Run starting. Expected test count is: 0
   DiscoverySuite:
   Run completed in 77 milliseconds.
   Total number of tests run: 0
   Suites: completed 1, aborted 0
   Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
   ```
   
   ```
   $ ./build/mvn -Phadoop-3 -Phadoop-cloud -pl hadoop-cloud test
   ...
   Discovery starting.
   Discovery completed in 354 milliseconds.
   Run starting. Expected test count is: 5
   CommitterBindingSuite:
   - BindingParquetOutputCommitter binds to the inner committer
   - committer protocol can be serialized and deserialized
   - local filesystem instantiation
   - reject dynamic partitioning
   AbortableStreamBasedCheckpointFileManagerSuite:
   - mkdirs, list, createAtomic, open, delete, exists
   AwsS3AbortableStreamBasedCheckpointFileManagerSuite:
   Run completed in 711 milliseconds.
   Total number of tests run: 5
   Suites: completed 4, aborted 0
   Tests: succeeded 5, failed 0, canceled 0, ignored 0, pending 0
   All tests passed.
   ```
   
   ```
   $ ./build/sbt -Phadoop-3 -Phadoop-cloud "project hadoop-cloud;testOnly"
   [info] CommitterBindingSuite:
   [info] - BindingParquetOutputCommitter binds to the inner committer (160 milliseconds)
   [info] - committer protocol can be serialized and deserialized (9 milliseconds)
   [info] - local filesystem instantiation (2 milliseconds)
   [info] - reject dynamic partitioning (1 millisecond)
   [info] AwsS3AbortableStreamBasedCheckpointFileManagerSuite:
   [info] AbortableStreamBasedCheckpointFileManagerSuite:
   [info] - mkdirs, list, createAtomic, open, delete, exists (161 milliseconds)
   [info] Run completed in 1 second, 389 milliseconds.
   [info] Total number of tests run: 5
   [info] Suites: completed 3, aborted 0
   [info] Tests: succeeded 5, failed 0, canceled 0, ignored 0, pending 0
   [info] All tests passed.
   [success] Total time: 21 s, completed Aug 27, 2022 12:52:30 AM
   ```
   
   ```
   $  ./build/sbt -Phadoop-2 -Phadoop-cloud "project hadoop-cloud;testOnly"
   
   ```



[GitHub] [spark] HyukjinKwon commented on pull request #37687: [SPARK-40039][SS] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.

HyukjinKwon commented on PR #37687:
URL: https://github.com/apache/spark/pull/37687#issuecomment-1229643211

   Merged to master.



[GitHub] [spark] attilapiros commented on pull request #37687: [SPARK-40039][SS][WIP] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.

attilapiros commented on PR #37687:
URL: https://github.com/apache/spark/pull/37687#issuecomment-1229147101

   cc @HeartSaVioR 



[GitHub] [spark] HyukjinKwon closed pull request #37687: [SPARK-40039][SS] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.

HyukjinKwon closed pull request #37687: [SPARK-40039][SS] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface
URL: https://github.com/apache/spark/pull/37687

