Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/11 00:58:49 UTC

[GitHub] [spark] attilapiros opened a new pull request, #37474: [SPARK-40039][Streaming][WIP] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

attilapiros opened a new pull request, #37474:
URL: https://github.com/apache/spark/pull/37474

   
   ### What changes were proposed in this pull request?
   
   Currently, on S3, the checkpoint file manager (`FileContextBasedCheckpointFileManager`) is based on the rename operation: when a file is opened for an atomic stream, a temporary file is used behind the scenes, and when the stream is committed the file is renamed to its final location.
   
   But on S3 a rename is actually a file copy, so this has serious performance implications.
   
   Hadoop 3 introduced a new interface called [Abortable](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/fs/Abortable.html), and `S3AFileSystem` has this capability, implemented on top of S3's multipart upload. When the file is committed [a POST is sent](https://docs.aws.amazon.com/AmazonS3/latest/API/API_CompleteMultipartUpload.html), and when it is aborted [a DELETE is sent](https://docs.aws.amazon.com/AmazonS3/latest/API/API_AbortMultipartUpload.html).
   
   This avoids the file copying altogether.
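   A minimal sketch of the idea, assuming Hadoop 3.3.1+ and an abortable-capable filesystem such as s3a (the path and payload here are illustrative, not from the PR):
   
   ```scala
   import java.util.EnumSet
   
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.{CommonPathCapabilities, CreateFlag, FileContext, Path}
   
   val path = new Path("s3a://bucket/checkpoint/offsets/42")
   val fc = FileContext.getFileContext(path.toUri, new Configuration())
   
   // only filesystems advertising ABORTABLE_STREAM can back this manager
   require(fc.hasPathCapability(path, CommonPathCapabilities.ABORTABLE_STREAM))
   
   val out = fc.create(path, EnumSet.of(CreateFlag.CREATE))
   try {
     out.write("checkpoint data".getBytes("UTF-8"))
     out.close()   // commit: completes the multipart upload (the POST)
   } catch {
     case e: Throwable =>
       out.abort() // abort: drops the uploaded parts (the DELETE)
       throw e
   }
   ```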
   
   
   ### Why are the changes needed?
   
   To improve streaming performance.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   
   I have refactored the existing `CheckpointFileManagerTests` and run them against a test filesystem which supports the `Abortable` interface (see `AbortableFileSystem`, which is based on `RawLocalFileSystem`).
   This way we have a unit test.
   
   Moreover, the same tests can be run against AWS S3 via an integration test (see `AwsAbortableStreamBasedCheckpointFileManagerSuite`):
   
   ```
   -> S3_PATH=<..> AWS_ACCESS_KEY_ID=<..> AWS_SECRET_ACCESS_KEY=<..> AWS_SESSION_TOKEN=<..>  ./build/mvn install -pl hadoop-cloud  -Phadoop-cloud,hadoop-3,integration-test
   
   Discovery starting.
   Discovery completed in 346 milliseconds.
   Run starting. Expected test count is: 1
   AwsAbortableStreamBasedCheckpointFileManagerSuite:
   - mkdirs, list, createAtomic, open, delete, exists
   CommitterBindingSuite:
   AbortableStreamBasedCheckpointFileManagerSuite:
   Run completed in 14 seconds, 407 milliseconds.
   Total number of tests run: 1
   Suites: completed 4, aborted 0
   Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0
   All tests passed.
   ```




[GitHub] [spark] HeartSaVioR commented on pull request #37474: [SPARK-40039][SS][WIP] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1211461886

   Refer to https://github.com/apache/spark/pull/31495 - I don't have an example query to benchmark here, but you can see a graph there showing how much we benefit from the change.




[GitHub] [spark] HeartSaVioR closed pull request #37474: [SPARK-40039][SS] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
HeartSaVioR closed pull request #37474: [SPARK-40039][SS] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface
URL: https://github.com/apache/spark/pull/37474




[GitHub] [spark] steveloughran commented on a diff in pull request #37474: [SPARK-40039][SS] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
steveloughran commented on code in PR #37474:
URL: https://github.com/apache/spark/pull/37474#discussion_r953592317


##########
docs/cloud-integration.md:
##########
@@ -231,9 +231,15 @@ The size of the window needs to be set to handle this.
 is no need for a workflow of write-then-rename to ensure that files aren't picked up
 while they are still being written. Applications can write straight to the monitored directory.
 
-1. Streams should only be checkpointed to a store implementing a fast and
-atomic `rename()` operation.
-Otherwise the checkpointing may be slow and potentially unreliable.
+1. In case of the default checkpoint file manager called `FileContextBasedCheckpointFileManager`
+streams should only be checkpointed to a store implementing a fast and
+atomic `rename()` operation. Otherwise the checkpointing may be slow and potentially unreliable.
+On AWS S3 with Hadoop 3.3.1 or later the abortable stream based checkpoint file manager

Review Comment:
   Need to be specific about the s3a connector, as EMR doesn't have it (yet).
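   A possible rewording along those lines (a suggestion only, not final doc text):
   
   ```
   On AWS S3 with the S3A connector (Hadoop 3.3.1 or later) the abortable stream
   based checkpoint file manager can be used instead.
   ```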





[GitHub] [spark] attilapiros commented on a diff in pull request #37474: [SPARK-40039][SS][WIP] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
attilapiros commented on code in PR #37474:
URL: https://github.com/apache/spark/pull/37474#discussion_r943034872


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/CheckpointFileManagerSuite.scala:
##########
@@ -58,50 +69,40 @@ abstract class CheckpointFileManagerTests extends SparkFunSuite with SQLHelper {
       // Create atomic without overwrite
       var path = new Path(s"$dir/file")
       assert(!fm.exists(path))
-      fm.createAtomic(path, overwriteIfPossible = false).cancel()
+      fm.createAtomic(path, overwriteIfPossible = false).writeContent(1).cancel()

Review Comment:
   I have extended this test to write some content into those streams and to check their contents, to make sure which operation really wins if two try to write to the same location.





[GitHub] [spark] steveloughran commented on a diff in pull request #37474: [SPARK-40039][SS][WIP] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
steveloughran commented on code in PR #37474:
URL: https://github.com/apache/spark/pull/37474#discussion_r943267264


##########
hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/AbortableStreamBasedCheckpointFileManager.scala:
##########
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.internal.io.cloud
+
+import java.nio.file.FileAlreadyExistsException
+import java.util.EnumSet
+
+import scala.util.control.NonFatal
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs._
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.execution.streaming.AbstractFileContextBasedCheckpointFileManager
+import org.apache.spark.sql.execution.streaming.CheckpointFileManager.CancellableFSDataOutputStream
+
+class AbortableStreamBasedCheckpointFileManager(path: Path, hadoopConf: Configuration)
+  extends AbstractFileContextBasedCheckpointFileManager(path, hadoopConf) with Logging {
+
+  if (!fc.hasPathCapability(path, CommonPathCapabilities.ABORTABLE_STREAM)) {
+    throw new UnsupportedFileSystemException("AbortableStreamBasedCheckpointFileManager requires" +
+      " an fs with abortable stream support")
+  }
+
+  logInfo(s"Writing atomically to $path based on abortable stream")
+
+  class AbortableStreamBasedFSDataOutputStream(
+      fsDataOutputStream: FSDataOutputStream,
+      fc: FileContext,
+      path: Path,
+      overwriteIfPossible: Boolean) extends CancellableFSDataOutputStream(fsDataOutputStream) {
+
+    @volatile private var terminated = false
+
+    override def cancel(): Unit = synchronized {
+      if (terminated) return
+      try {
+        fsDataOutputStream.abort()
+        fsDataOutputStream.close()
+      } catch {
+          case NonFatal(e) =>
+            logWarning(s"Error cancelling write to $path", e)
+      } finally {
+        terminated = true
+      }
+    }
+
+    override def close(): Unit = synchronized {
+      if (terminated) return
+      try {
+        if (!overwriteIfPossible && fc.util().exists(path)) {
+          throw new FileAlreadyExistsException(
+            s"Failed to close atomic stream $path as destination already exists")
+        }
+        fsDataOutputStream.close()
+      } catch {
+          case NonFatal(e) =>
+            logWarning(s"Error closing $path", e)

Review Comment:
   Again: log the output stream.



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/CheckpointFileManagerSuite.scala:
##########
@@ -58,50 +69,40 @@ abstract class CheckpointFileManagerTests extends SparkFunSuite with SQLHelper {
       // Create atomic without overwrite
       var path = new Path(s"$dir/file")
       assert(!fm.exists(path))
-      fm.createAtomic(path, overwriteIfPossible = false).cancel()
+      fm.createAtomic(path, overwriteIfPossible = false).writeContent(1).cancel()
       assert(!fm.exists(path))
-      fm.createAtomic(path, overwriteIfPossible = false).close()
+      fm.createAtomic(path, overwriteIfPossible = false).writeContent(2).close()
       assert(fm.exists(path))
+      assert(fm.open(path).readInt() == 2)
       quietly {
         intercept[IOException] {
           // should throw exception since file exists and overwrite is false
-          fm.createAtomic(path, overwriteIfPossible = false).close()
+          fm.createAtomic(path, overwriteIfPossible = false).writeContent(3).close()
         }
       }
+      assert(fm.open(path).readInt() == 2)
 
       // Create atomic with overwrite if possible
       path = new Path(s"$dir/file2")
       assert(!fm.exists(path))
-      fm.createAtomic(path, overwriteIfPossible = true).cancel()
+      fm.createAtomic(path, overwriteIfPossible = true).writeContent(4).cancel()
       assert(!fm.exists(path))
-      fm.createAtomic(path, overwriteIfPossible = true).close()
+      fm.createAtomic(path, overwriteIfPossible = true).writeContent(5).close()
       assert(fm.exists(path))
-      fm.createAtomic(path, overwriteIfPossible = true).close()  // should not throw exception
-
-      // crc file should not be leaked when origin file doesn't exist.
-      // The implementation of Hadoop filesystem may filter out checksum file, so
-      // listing files from local filesystem.
-      val fileNames = new File(path.getParent.toString).listFiles().toSeq
-        .filter(p => p.isFile).map(p => p.getName)
-      val crcFiles = fileNames.filter(n => n.startsWith(".") && n.endsWith(".crc"))
-      val originFileNamesForExistingCrcFiles = crcFiles.map { name =>
-        // remove first "." and last ".crc"
-        name.substring(1, name.length - 4)
-      }
-
-      // Check all origin files exist for all crc files.
-      assert(originFileNamesForExistingCrcFiles.toSet.subsetOf(fileNames.toSet),
-        s"Some of origin files for crc files don't exist - crc files: $crcFiles / " +
-          s"expected origin files: $originFileNamesForExistingCrcFiles / actual files: $fileNames")
+      assert(fm.open(path).readInt() == 5)
+      // should not throw exception
+      fm.createAtomic(path, overwriteIfPossible = true).writeContent(6).close()
+      assert(fm.open(path).readInt() == 6)

Review Comment:
   Would be good to close this; it may use up process resources for the rest of the test run.
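   For example, something like the following sketch (using the `fm` and `path` of the test above):
   
   ```scala
   val in = fm.open(path)
   try {
     assert(in.readInt() == 6)
   } finally {
     in.close()  // release the stream promptly rather than leaking it
   }
   ```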



##########
hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/AbortableStreamBasedCheckpointFileManager.scala:
##########
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.internal.io.cloud
+
+import java.nio.file.FileAlreadyExistsException
+import java.util.EnumSet
+
+import scala.util.control.NonFatal
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs._
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.execution.streaming.AbstractFileContextBasedCheckpointFileManager
+import org.apache.spark.sql.execution.streaming.CheckpointFileManager.CancellableFSDataOutputStream
+
+class AbortableStreamBasedCheckpointFileManager(path: Path, hadoopConf: Configuration)
+  extends AbstractFileContextBasedCheckpointFileManager(path, hadoopConf) with Logging {
+
+  if (!fc.hasPathCapability(path, CommonPathCapabilities.ABORTABLE_STREAM)) {
+    throw new UnsupportedFileSystemException("AbortableStreamBasedCheckpointFileManager requires" +
+      " an fs with abortable stream support")
+  }
+
+  logInfo(s"Writing atomically to $path based on abortable stream")
+
+  class AbortableStreamBasedFSDataOutputStream(
+      fsDataOutputStream: FSDataOutputStream,
+      fc: FileContext,
+      path: Path,
+      overwriteIfPossible: Boolean) extends CancellableFSDataOutputStream(fsDataOutputStream) {
+
+    @volatile private var terminated = false
+
+    override def cancel(): Unit = synchronized {
+      if (terminated) return
+      try {
+        fsDataOutputStream.abort()
+        fsDataOutputStream.close()
+      } catch {
+          case NonFatal(e) =>
+            logWarning(s"Error cancelling write to $path", e)
+      } finally {
+        terminated = true
+      }
+    }
+
+    override def close(): Unit = synchronized {
+      if (terminated) return
+      try {
+        if (!overwriteIfPossible && fc.util().exists(path)) {
+          throw new FileAlreadyExistsException(

Review Comment:
   Maybe call abort() here for maximum rigor.
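   A sketch of that suggestion against the close() above (error handling omitted; the final code may differ):
   
   ```scala
   override def close(): Unit = synchronized {
     if (terminated) return
     try {
       if (!overwriteIfPossible && fc.util().exists(path)) {
         // abort the in-progress upload before failing, so no parts linger
         fsDataOutputStream.abort()
         throw new FileAlreadyExistsException(
           s"Failed to close atomic stream $path as destination already exists")
       }
       fsDataOutputStream.close()
     } finally {
       terminated = true
     }
   }
   ```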



##########
hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/AbortableStreamBasedCheckpointFileManager.scala:
##########
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.internal.io.cloud
+
+import java.nio.file.FileAlreadyExistsException
+import java.util.EnumSet
+
+import scala.util.control.NonFatal
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs._
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.execution.streaming.AbstractFileContextBasedCheckpointFileManager
+import org.apache.spark.sql.execution.streaming.CheckpointFileManager.CancellableFSDataOutputStream
+
+class AbortableStreamBasedCheckpointFileManager(path: Path, hadoopConf: Configuration)
+  extends AbstractFileContextBasedCheckpointFileManager(path, hadoopConf) with Logging {
+
+  if (!fc.hasPathCapability(path, CommonPathCapabilities.ABORTABLE_STREAM)) {
+    throw new UnsupportedFileSystemException("AbortableStreamBasedCheckpointFileManager requires" +
+      " an fs with abortable stream support")
+  }
+
+  logInfo(s"Writing atomically to $path based on abortable stream")
+
+  class AbortableStreamBasedFSDataOutputStream(
+      fsDataOutputStream: FSDataOutputStream,
+      fc: FileContext,
+      path: Path,
+      overwriteIfPossible: Boolean) extends CancellableFSDataOutputStream(fsDataOutputStream) {
+
+    @volatile private var terminated = false
+
+    override def cancel(): Unit = synchronized {
+      if (terminated) return
+      try {
+        fsDataOutputStream.abort()
+        fsDataOutputStream.close()
+      } catch {
+          case NonFatal(e) =>
+            logWarning(s"Error cancelling write to $path", e)

Review Comment:
   Include the FSDataOutputStream in the log to get any info from it.
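   For instance (a sketch; the stream's toString typically carries its upload state):
   
   ```scala
   case NonFatal(e) =>
     logWarning(s"Error cancelling write to $path (stream: $fsDataOutputStream)", e)
   ```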





[GitHub] [spark] attilapiros commented on pull request #37474: [SPARK-40039][SS][WIP] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
attilapiros commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1211458229

   I guess it could be because of the extra REST call (the POST), but that extra REST call applies to the rename as well.




[GitHub] [spark] attilapiros commented on pull request #37474: [SPARK-40039][SS][WIP] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
attilapiros commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1211460105

   @HeartSaVioR Do we have a Structured Streaming benchmark? I ran a quick grep but found nothing.




[GitHub] [spark] attilapiros commented on pull request #37474: [SPARK-40039][SS][WIP] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
attilapiros commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1215630207

   Thanks @steveloughran for the review and for confirming how the abortable stream works for small files! I have updated the PR description accordingly.




[GitHub] [spark] attilapiros commented on a diff in pull request #37474: [SPARK-40039][SS][WIP] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
attilapiros commented on code in PR #37474:
URL: https://github.com/apache/spark/pull/37474#discussion_r944030204


##########
hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/AbortableStreamBasedCheckpointFileManager.scala:
##########
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.internal.io.cloud
+
+import java.nio.file.FileAlreadyExistsException
+import java.util.EnumSet
+
+import scala.util.control.NonFatal
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs._
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.execution.streaming.AbstractFileContextBasedCheckpointFileManager
+import org.apache.spark.sql.execution.streaming.CheckpointFileManager.CancellableFSDataOutputStream
+
+class AbortableStreamBasedCheckpointFileManager(path: Path, hadoopConf: Configuration)
+  extends AbstractFileContextBasedCheckpointFileManager(path, hadoopConf) with Logging {
+
+  if (!fc.hasPathCapability(path, CommonPathCapabilities.ABORTABLE_STREAM)) {
+    throw new UnsupportedFileSystemException("AbortableStreamBasedCheckpointFileManager requires" +
+      " an fs with abortable stream support")
+  }
+
+  logInfo(s"Writing atomically to $path based on abortable stream")
+
+  class AbortableStreamBasedFSDataOutputStream(
+      fsDataOutputStream: FSDataOutputStream,
+      fc: FileContext,
+      path: Path,
+      overwriteIfPossible: Boolean) extends CancellableFSDataOutputStream(fsDataOutputStream) {
+
+    @volatile private var terminated = false
+
+    override def cancel(): Unit = synchronized {
+      if (terminated) return
+      try {
+        fsDataOutputStream.abort()
+        fsDataOutputStream.close()
+      } catch {
+          case NonFatal(e) =>
+            logWarning(s"Error cancelling write to $path", e)
+      } finally {
+        terminated = true
+      }
+    }
+
+    override def close(): Unit = synchronized {
+      if (terminated) return
+      try {
+        if (!overwriteIfPossible && fc.util().exists(path)) {
+          throw new FileAlreadyExistsException(

Review Comment:
   Great catch! 





[GitHub] [spark] steveloughran commented on pull request #37474: [SPARK-40039][SS][WIP] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
steveloughran commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1211704141

   Update: when an MPU (multipart upload) is active, the upload is aborted in the abort() call. Any active uploads in separate threads are interrupted/cancelled.
   
   




[GitHub] [spark] HyukjinKwon commented on pull request #37474: [SPARK-40039][SS] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1229123426

   This actually breaks the Hadoop 2 compilation, and I believe we still release with the Hadoop 2 profile too (cc @tgravescs FYI); see https://github.com/apache/spark/actions/runs/2933948860.
   
   I am reverting this for now to keep the build alive.




[GitHub] [spark] attilapiros commented on pull request #37474: [SPARK-40039][SS][WIP] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
attilapiros commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1211455973

   Oh, we have an outdated description on Spark's contribution site:
   
   > The PR title should be of the form [SPARK-xxxx][COMPONENT] Title, where SPARK-xxxx is the relevant JIRA number, COMPONENT is one of the PR categories shown at [spark-prs.appspot.com](https://spark-prs.appspot.com/) and Title may be the JIRA’s title or a more specific title describing the PR itself.
   
   Or better, spark-prs.appspot.com should be updated!
   
   @HeartSaVioR no, I have not run a performance test yet. Do you happen to know why multipart upload is slower for tiny files? It really surprises me. 




[GitHub] [spark] HeartSaVioR commented on pull request #37474: [SPARK-40039][SS][WIP] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1211458224

   No, I'm just thinking based on physics: multipart uploads require more ping-pong against AWS S3.




[GitHub] [spark] attilapiros commented on a diff in pull request #37474: [SPARK-40039][SS] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
attilapiros commented on code in PR #37474:
URL: https://github.com/apache/spark/pull/37474#discussion_r950764250


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/CheckpointFileManagerSuite.scala:
##########
@@ -58,50 +77,40 @@ abstract class CheckpointFileManagerTests extends SparkFunSuite with SQLHelper {
       // Create atomic without overwrite
       var path = new Path(s"$dir/file")
       assert(!fm.exists(path))
-      fm.createAtomic(path, overwriteIfPossible = false).cancel()
+      fm.createAtomic(path, overwriteIfPossible = false).writeContent(1).cancel()
       assert(!fm.exists(path))
-      fm.createAtomic(path, overwriteIfPossible = false).close()
+      fm.createAtomic(path, overwriteIfPossible = false).writeContent(2).close()
       assert(fm.exists(path))
+      assert(fm.open(path).readContent() == 2)
       quietly {
         intercept[IOException] {
           // should throw exception since file exists and overwrite is false
-          fm.createAtomic(path, overwriteIfPossible = false).close()
+          fm.createAtomic(path, overwriteIfPossible = false).writeContent(3).close()
         }
       }
+      assert(fm.open(path).readContent() == 2)
 
       // Create atomic with overwrite if possible
       path = new Path(s"$dir/file2")
       assert(!fm.exists(path))
-      fm.createAtomic(path, overwriteIfPossible = true).cancel()
+      fm.createAtomic(path, overwriteIfPossible = true).writeContent(4).cancel()
       assert(!fm.exists(path))
-      fm.createAtomic(path, overwriteIfPossible = true).close()
+      fm.createAtomic(path, overwriteIfPossible = true).writeContent(5).close()
       assert(fm.exists(path))
-      fm.createAtomic(path, overwriteIfPossible = true).close()  // should not throw exception
-
-      // crc file should not be leaked when origin file doesn't exist.
-      // The implementation of Hadoop filesystem may filter out checksum file, so
-      // listing files from local filesystem.
-      val fileNames = new File(path.getParent.toString).listFiles().toSeq
-        .filter(p => p.isFile).map(p => p.getName)
-      val crcFiles = fileNames.filter(n => n.startsWith(".") && n.endsWith(".crc"))
-      val originFileNamesForExistingCrcFiles = crcFiles.map { name =>
-        // remove first "." and last ".crc"
-        name.substring(1, name.length - 4)
-      }
-
-      // Check all origin files exist for all crc files.
-      assert(originFileNamesForExistingCrcFiles.toSet.subsetOf(fileNames.toSet),
-        s"Some of origin files for crc files don't exist - crc files: $crcFiles / " +
-          s"expected origin files: $originFileNamesForExistingCrcFiles / actual files: $fileNames")
+      assert(fm.open(path).readContent() == 5)
+      // should not throw exception
+      fm.createAtomic(path, overwriteIfPossible = true).writeContent(6).close()
+      assert(fm.open(path).readContent() == 6)
 
+      checkLeakingCrcFiles(dir)
       // Open and delete
       fm.open(path).close()
       fm.delete(path)
       assert(!fm.exists(path))
       intercept[IOException] {
         fm.open(path)
       }
-      fm.delete(path) // should not throw exception
+      fm.delete(dir) // should not throw exception

Review Comment:
   Yes, we can revert this, as `afterAll()` will clean the dir anyway.





[GitHub] [spark] attilapiros commented on pull request #37474: [SPARK-40039][SS] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
attilapiros commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1229127015

   I see: my test code changes are used for the Hadoop 2 profile. We can fix this easily next week.




[GitHub] [spark] HeartSaVioR commented on pull request #37474: [SPARK-40039][SS] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1229143104

   Looks like all code changes in hadoop-cloud except IntegrationTestSuite should be moved to hadoop-3, and then it would work. Thanks @HyukjinKwon for handling the issue.




[GitHub] [spark] attilapiros commented on pull request #37474: [SPARK-40039][SS] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
attilapiros commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1227645281

   Thanks again for all the effort made on this PR @steveloughran and @HeartSaVioR!
   
   @steveloughran I hope that, now that `AbortableStreamBasedCheckpointFileManager` is available, it will turn out those hiccups were worth it!




[GitHub] [spark] steveloughran commented on a diff in pull request #37474: [SPARK-40039][SS][WIP] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
steveloughran commented on code in PR #37474:
URL: https://github.com/apache/spark/pull/37474#discussion_r943251576


##########
hadoop-cloud/src/hadoop-3/test/scala/org/apache/spark/internal/io/cloud/AbortableStreamBasedCheckpointFileManagerSuite.scala:
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.internal.io.cloud
+
+import java.io.File
+
+import scala.util.Properties
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs._
+import org.scalatest.BeforeAndAfter
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.internal.io.cloud.abortable.AbortableFileSystem
+import org.apache.spark.sql.execution.streaming.CheckpointFileManager
+import org.apache.spark.sql.execution.streaming.CheckpointFileManagerTests
+
+class AbortableStreamBasedCheckpointFileManagerSuite
+  extends CheckpointFileManagerTests with Logging {
+
+  override def withTempHadoopPath(p: Path => Unit): Unit = {
+    withTempDir { f: File =>
+      val basePath = new Path(AbortableFileSystem.ABORTABLE_FS_SCHEME, null, f.getAbsolutePath)
+      p(basePath)
+    }
+  }
+
+  override def checkLeakingCrcFiles(path: Path): Unit = { }
+
+  override def createManager(path: Path): CheckpointFileManager = {
+    val conf = new Configuration()
+    conf.set(s"fs.AbstractFileSystem.${AbortableFileSystem.ABORTABLE_FS_SCHEME}.impl",
+      "org.apache.spark.internal.io.cloud.abortable.AbstractAbortableFileSystem")
+    new AbortableStreamBasedCheckpointFileManager(path, conf)
+  }
+}
+
+@IntegrationTestSuite
+class AwsAbortableStreamBasedCheckpointFileManagerSuite
+    extends AbortableStreamBasedCheckpointFileManagerSuite with BeforeAndAfter {
+
+  val s3aPath = Properties.envOrNone("S3A_PATH")
+
+  override protected def beforeAll(): Unit = {
+    assert(s3aPath.isDefined, "S3A_PATH must be defined!")
+  }
+
+  override def withTempHadoopPath(p: Path => Unit): Unit = {
+    p(new Path(s3aPath.get))
+  }
+
+  override def createManager(path: Path): CheckpointFileManager = {
+    val conf = new Configuration()
+    conf.set("fs.s3a.aws.credentials.provider",

Review Comment:
   This is always set in the default chain. Leaving that list alone will ensure that, when run in EC2, it will also check the IAM role provider and so pick up the credentials the VM/container was granted.
   
   Propose: cut this line.
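   With that cut, the suite's createManager would rely on the default provider chain (a sketch):
   
   ```scala
   override def createManager(path: Path): CheckpointFileManager = {
     val conf = new Configuration()
     // no fs.s3a.aws.credentials.provider override: the default chain already
     // checks env vars and, on EC2, the IAM instance role provider
     new AbortableStreamBasedCheckpointFileManager(path, conf)
   }
   ```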



##########
hadoop-cloud/src/hadoop-3/test/scala/org/apache/spark/internal/io/cloud/AbortableStreamBasedCheckpointFileManagerSuite.scala:
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.internal.io.cloud
+
+import java.io.File
+
+import scala.util.Properties
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs._
+import org.scalatest.BeforeAndAfter
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.internal.io.cloud.abortable.AbortableFileSystem
+import org.apache.spark.sql.execution.streaming.CheckpointFileManager
+import org.apache.spark.sql.execution.streaming.CheckpointFileManagerTests
+
+class AbortableStreamBasedCheckpointFileManagerSuite
+  extends CheckpointFileManagerTests with Logging {
+
+  override def withTempHadoopPath(p: Path => Unit): Unit = {
+    withTempDir { f: File =>
+      val basePath = new Path(AbortableFileSystem.ABORTABLE_FS_SCHEME, null, f.getAbsolutePath)
+      p(basePath)
+    }
+  }
+
+  override def checkLeakingCrcFiles(path: Path): Unit = { }
+
+  override def createManager(path: Path): CheckpointFileManager = {
+    val conf = new Configuration()
+    conf.set(s"fs.AbstractFileSystem.${AbortableFileSystem.ABORTABLE_FS_SCHEME}.impl",
+      "org.apache.spark.internal.io.cloud.abortable.AbstractAbortableFileSystem")
+    new AbortableStreamBasedCheckpointFileManager(path, conf)
+  }
+}
+
+@IntegrationTestSuite
+class AwsAbortableStreamBasedCheckpointFileManagerSuite
+    extends AbortableStreamBasedCheckpointFileManagerSuite with BeforeAndAfter {

Review Comment:
   Propose an afterAll() which deletes the path if set and swallows any exceptions. It keeps costs down, as no data is retained in the bucket after the run.
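   A sketch of such a cleanup, reusing the `s3aPath` val of the suite above:
   
   ```scala
   override protected def afterAll(): Unit = {
     try {
       s3aPath.foreach { p =>
         val path = new Path(p)
         // recursive delete; swallow failures so cleanup never fails the build
         path.getFileSystem(new Configuration()).delete(path, true)
       }
     } catch {
       case scala.util.control.NonFatal(e) => logWarning(s"Cleanup of $s3aPath failed", e)
     } finally {
       super.afterAll()
     }
   }
   ```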



##########
hadoop-cloud/src/hadoop-3/test/scala/org/apache/spark/internal/io/cloud/AbortableStreamBasedCheckpointFileManagerSuite.scala:
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.internal.io.cloud
+
+import java.io.File
+
+import scala.util.Properties
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs._
+import org.scalatest.BeforeAndAfter
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.internal.io.cloud.abortable.AbortableFileSystem
+import org.apache.spark.sql.execution.streaming.CheckpointFileManager
+import org.apache.spark.sql.execution.streaming.CheckpointFileManagerTests
+
+class AbortableStreamBasedCheckpointFileManagerSuite
+  extends CheckpointFileManagerTests with Logging {
+
+  override def withTempHadoopPath(p: Path => Unit): Unit = {
+    withTempDir { f: File =>
+      val basePath = new Path(AbortableFileSystem.ABORTABLE_FS_SCHEME, null, f.getAbsolutePath)
+      p(basePath)
+    }
+  }
+
+  override def checkLeakingCrcFiles(path: Path): Unit = { }
+
+  override def createManager(path: Path): CheckpointFileManager = {
+    val conf = new Configuration()
+    conf.set(s"fs.AbstractFileSystem.${AbortableFileSystem.ABORTABLE_FS_SCHEME}.impl",
+      "org.apache.spark.internal.io.cloud.abortable.AbstractAbortableFileSystem")
+    new AbortableStreamBasedCheckpointFileManager(path, conf)
+  }
+}
+
+@IntegrationTestSuite
+class AwsAbortableStreamBasedCheckpointFileManagerSuite

Review Comment:
   Maybe use s3 in the title, as it is s3-only.



##########
hadoop-cloud/README.md:
##########
@@ -0,0 +1,20 @@
+---
+layout: global
+title: Spark Hadoop3 Integration Tests
+---
+
+# Running the Integration Tests
+
+As mocking of an external systems (like AWS S3) is not always perfect the unit testing should be
+extended with integration testing. This is why the build profile `integration-test` has been
+introduced here. When it is given (`-pintegration-test`) for testing then only those tests are

Review Comment:
   `-P` rather than `-p`?
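   For reference, the capitalized profile flag as used elsewhere in this PR:
   
   ```
   ./build/mvn install -pl hadoop-cloud -Phadoop-cloud,hadoop-3,integration-test
   ```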





[GitHub] [spark] attilapiros commented on pull request #37474: [SPARK-40039][SS] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
attilapiros commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1221197541

   @HeartSaVioR Sure, please find the "Performance test" section in the description.




[GitHub] [spark] HeartSaVioR commented on pull request #37474: [SPARK-40039][SS][WIP] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1211452718

   Duplicating my comment in JIRA ticket here:
   
   Very interesting one to see! (Disclaimer: Abortable was something I worked on with Steve.)
   
   Have you gone through some benchmarks to figure out how this works with small to big files? One thing I wonder is whether multipart upload performs well with tiny files. We have lots of tiny files in a checkpoint, and all files could be pretty tiny for a stateless query.




[GitHub] [spark] attilapiros commented on a diff in pull request #37474: [SPARK-40039][SS][WIP] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
attilapiros commented on code in PR #37474:
URL: https://github.com/apache/spark/pull/37474#discussion_r943979385


##########
hadoop-cloud/src/hadoop-3/test/scala/org/apache/spark/internal/io/cloud/AbortableStreamBasedCheckpointFileManagerSuite.scala:
##########
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.internal.io.cloud
+
+import java.io.File
+
+import scala.util.Properties
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs._
+import org.scalatest.BeforeAndAfter
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.internal.io.cloud.abortable.AbortableFileSystem
+import org.apache.spark.sql.execution.streaming.CheckpointFileManager
+import org.apache.spark.sql.execution.streaming.CheckpointFileManagerTests
+
+class AbortableStreamBasedCheckpointFileManagerSuite
+  extends CheckpointFileManagerTests with Logging {
+
+  override def withTempHadoopPath(p: Path => Unit): Unit = {
+    withTempDir { f: File =>
+      val basePath = new Path(AbortableFileSystem.ABORTABLE_FS_SCHEME, null, f.getAbsolutePath)
+      p(basePath)
+    }
+  }
+
+  override def checkLeakingCrcFiles(path: Path): Unit = { }
+
+  override def createManager(path: Path): CheckpointFileManager = {
+    val conf = new Configuration()
+    conf.set(s"fs.AbstractFileSystem.${AbortableFileSystem.ABORTABLE_FS_SCHEME}.impl",
+      "org.apache.spark.internal.io.cloud.abortable.AbstractAbortableFileSystem")
+    new AbortableStreamBasedCheckpointFileManager(path, conf)
+  }
+}
+
+@IntegrationTestSuite
+class AwsAbortableStreamBasedCheckpointFileManagerSuite
+    extends AbortableStreamBasedCheckpointFileManagerSuite with BeforeAndAfter {

Review Comment:
   OK. But to make it safe and avoid deleting valuable data, before the tests I first have to check whether S3A_PATH is really empty.
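   Something like this in beforeAll() could guard against that (a sketch):
   
   ```scala
   override protected def beforeAll(): Unit = {
     super.beforeAll()
     assert(s3aPath.isDefined, "S3A_PATH must be defined!")
     val path = new Path(s3aPath.get)
     val fs = path.getFileSystem(new Configuration())
     // refuse to run against a non-empty location which we would later delete
     assert(!fs.exists(path) || fs.listStatus(path).isEmpty,
       s"$path must be empty so it is safe to delete after the run")
   }
   ```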





[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37474: [SPARK-40039][SS] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on code in PR #37474:
URL: https://github.com/apache/spark/pull/37474#discussion_r950751083


##########
hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/AbortableStreamBasedCheckpointFileManager.scala:
##########
@@ -0,0 +1,95 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.internal.io.cloud
+
+import java.nio.file.FileAlreadyExistsException
+import java.util.EnumSet
+
+import scala.util.control.NonFatal
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs._
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.execution.streaming.AbstractFileContextBasedCheckpointFileManager
+import org.apache.spark.sql.execution.streaming.CheckpointFileManager.CancellableFSDataOutputStream
+
+class AbortableStreamBasedCheckpointFileManager(path: Path, hadoopConf: Configuration)

Review Comment:
   Would be nice to have a classdoc describing the rationale (briefly explaining what this does and how it helps), and the requirement on the path. End users wouldn't know about the capability we define in Hadoop FileSystem, so it is better to just list the schemes directly (e.g. s3a).
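   E.g. a classdoc sketch along those lines:
   
   ```scala
   /**
    * A checkpoint file manager which writes checkpoint files atomically via
    * Hadoop's Abortable stream support: close() commits the file, cancel()
    * aborts the in-progress upload, so no rename (a copy on S3) is needed.
    * The checkpoint path must be on a filesystem whose output streams are
    * abortable, e.g. s3a on Hadoop 3.3.1 or later.
    */
   ```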



##########
sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/CheckpointFileManagerSuite.scala:
##########
@@ -58,50 +77,40 @@ abstract class CheckpointFileManagerTests extends SparkFunSuite with SQLHelper {
       // Create atomic without overwrite
       var path = new Path(s"$dir/file")
       assert(!fm.exists(path))
-      fm.createAtomic(path, overwriteIfPossible = false).cancel()
+      fm.createAtomic(path, overwriteIfPossible = false).writeContent(1).cancel()
       assert(!fm.exists(path))
-      fm.createAtomic(path, overwriteIfPossible = false).close()
+      fm.createAtomic(path, overwriteIfPossible = false).writeContent(2).close()
       assert(fm.exists(path))
+      assert(fm.open(path).readContent() == 2)
       quietly {
         intercept[IOException] {
           // should throw exception since file exists and overwrite is false
-          fm.createAtomic(path, overwriteIfPossible = false).close()
+          fm.createAtomic(path, overwriteIfPossible = false).writeContent(3).close()
         }
       }
+      assert(fm.open(path).readContent() == 2)
 
       // Create atomic with overwrite if possible
       path = new Path(s"$dir/file2")
       assert(!fm.exists(path))
-      fm.createAtomic(path, overwriteIfPossible = true).cancel()
+      fm.createAtomic(path, overwriteIfPossible = true).writeContent(4).cancel()
       assert(!fm.exists(path))
-      fm.createAtomic(path, overwriteIfPossible = true).close()
+      fm.createAtomic(path, overwriteIfPossible = true).writeContent(5).close()
       assert(fm.exists(path))
-      fm.createAtomic(path, overwriteIfPossible = true).close()  // should not throw exception
-
-      // crc file should not be leaked when origin file doesn't exist.
-      // The implementation of Hadoop filesystem may filter out checksum file, so
-      // listing files from local filesystem.
-      val fileNames = new File(path.getParent.toString).listFiles().toSeq
-        .filter(p => p.isFile).map(p => p.getName)
-      val crcFiles = fileNames.filter(n => n.startsWith(".") && n.endsWith(".crc"))
-      val originFileNamesForExistingCrcFiles = crcFiles.map { name =>
-        // remove first "." and last ".crc"
-        name.substring(1, name.length - 4)
-      }
-
-      // Check all origin files exist for all crc files.
-      assert(originFileNamesForExistingCrcFiles.toSet.subsetOf(fileNames.toSet),
-        s"Some of origin files for crc files don't exist - crc files: $crcFiles / " +
-          s"expected origin files: $originFileNamesForExistingCrcFiles / actual files: $fileNames")
+      assert(fm.open(path).readContent() == 5)
+      // should not throw exception
+      fm.createAtomic(path, overwriteIfPossible = true).writeContent(6).close()
+      assert(fm.open(path).readContent() == 6)
 
+      checkLeakingCrcFiles(dir)
       // Open and delete
       fm.open(path).close()
       fm.delete(path)
       assert(!fm.exists(path))
       intercept[IOException] {
         fm.open(path)
       }
-      fm.delete(path) // should not throw exception
+      fm.delete(dir) // should not throw exception

Review Comment:
   Isn't `path` correct here, as we test against a file rather than a directory?



##########
hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/AbortableStreamBasedCheckpointFileManager.scala:
##########
@@ -0,0 +1,95 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.internal.io.cloud
+
+import java.nio.file.FileAlreadyExistsException
+import java.util.EnumSet
+
+import scala.util.control.NonFatal
+
+import org.apache.hadoop.conf.Configuration
+import org.apache.hadoop.fs._
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.execution.streaming.AbstractFileContextBasedCheckpointFileManager
+import org.apache.spark.sql.execution.streaming.CheckpointFileManager.CancellableFSDataOutputStream
+
+class AbortableStreamBasedCheckpointFileManager(path: Path, hadoopConf: Configuration)

Review Comment:
   It would also be good to describe the limitation. Assuming the path is on S3, if my understanding is correct, concurrent puts against the same file will all succeed. Although we check the file with exists before closing the stream (the put), there is a race condition which could mess up the checkpoint, so end users are encouraged to take serious care to prevent multiple queries from accessing the same checkpoint.
   
   The default checkpoint file manager, with a file system supporting atomic rename, would not suffer from this concurrency issue.
   
   cc. @steveloughran in case you get a chance to correct me if I'm mistaken.
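   
   To make the race concrete, an illustrative interleaving (`fm` is a checkpoint file manager over an S3 path, as in the test suite above):
   
   ```scala
   // Both writers pass the existence check before either upload completes:
   val a = fm.createAtomic(path, overwriteIfPossible = false) // exists(path) == false
   val b = fm.createAtomic(path, overwriteIfPossible = false) // still false: no upload is visible yet
   a.close() // completes A's multipart upload - the put succeeds
   b.close() // also succeeds on S3, silently replacing A's data
   ```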





[GitHub] [spark] HyukjinKwon commented on pull request #37474: [SPARK-40039][SS] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1229130755

   Thanks for taking a look.
   
   (Just a bit of context: there are multiple other commits that caused other build failures, so I just decided to revert them all together.)




[GitHub] [spark] HeartSaVioR commented on pull request #37474: [SPARK-40039][SS] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1226802090

   Looks like Steve's comment has been addressed. Thanks! Merging to master.




[GitHub] [spark] HeartSaVioR commented on a diff in pull request #37474: [SPARK-40039][SS] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on code in PR #37474:
URL: https://github.com/apache/spark/pull/37474#discussion_r950767972


##########
sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/CheckpointFileManagerSuite.scala:
##########
@@ -58,50 +77,40 @@ abstract class CheckpointFileManagerTests extends SparkFunSuite with SQLHelper {
       // Create atomic without overwrite
       var path = new Path(s"$dir/file")
       assert(!fm.exists(path))
-      fm.createAtomic(path, overwriteIfPossible = false).cancel()
+      fm.createAtomic(path, overwriteIfPossible = false).writeContent(1).cancel()
       assert(!fm.exists(path))
-      fm.createAtomic(path, overwriteIfPossible = false).close()
+      fm.createAtomic(path, overwriteIfPossible = false).writeContent(2).close()
       assert(fm.exists(path))
+      assert(fm.open(path).readContent() == 2)
       quietly {
         intercept[IOException] {
           // should throw exception since file exists and overwrite is false
-          fm.createAtomic(path, overwriteIfPossible = false).close()
+          fm.createAtomic(path, overwriteIfPossible = false).writeContent(3).close()
         }
       }
+      assert(fm.open(path).readContent() == 2)
 
       // Create atomic with overwrite if possible
       path = new Path(s"$dir/file2")
       assert(!fm.exists(path))
-      fm.createAtomic(path, overwriteIfPossible = true).cancel()
+      fm.createAtomic(path, overwriteIfPossible = true).writeContent(4).cancel()
       assert(!fm.exists(path))
-      fm.createAtomic(path, overwriteIfPossible = true).close()
+      fm.createAtomic(path, overwriteIfPossible = true).writeContent(5).close()
       assert(fm.exists(path))
-      fm.createAtomic(path, overwriteIfPossible = true).close()  // should not throw exception
-
-      // crc file should not be leaked when origin file doesn't exist.
-      // The implementation of Hadoop filesystem may filter out checksum file, so
-      // listing files from local filesystem.
-      val fileNames = new File(path.getParent.toString).listFiles().toSeq
-        .filter(p => p.isFile).map(p => p.getName)
-      val crcFiles = fileNames.filter(n => n.startsWith(".") && n.endsWith(".crc"))
-      val originFileNamesForExistingCrcFiles = crcFiles.map { name =>
-        // remove first "." and last ".crc"
-        name.substring(1, name.length - 4)
-      }
-
-      // Check all origin files exist for all crc files.
-      assert(originFileNamesForExistingCrcFiles.toSet.subsetOf(fileNames.toSet),
-        s"Some of origin files for crc files don't exist - crc files: $crcFiles / " +
-          s"expected origin files: $originFileNamesForExistingCrcFiles / actual files: $fileNames")
+      assert(fm.open(path).readContent() == 5)
+      // should not throw exception
+      fm.createAtomic(path, overwriteIfPossible = true).writeContent(6).close()
+      assert(fm.open(path).readContent() == 6)
 
+      checkLeakingCrcFiles(dir)
       // Open and delete
       fm.open(path).close()
       fm.delete(path)
       assert(!fm.exists(path))
       intercept[IOException] {
         fm.open(path)
       }
-      fm.delete(path) // should not throw exception
+      fm.delete(dir) // should not throw exception

Review Comment:
   Yeah, actually this intends to "test" the behavior (triggering delete against a non-existing path does not throw an exception), not to clean up resources.
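   
   In other words (illustrative; `fm` and `path` refer to the manager and file from the test above):
   
   ```scala
   fm.delete(path)           // removes the existing file
   assert(!fm.exists(path))
   fm.delete(path)           // deleting a non-existing path must be a silent no-op
   ```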





[GitHub] [spark] steveloughran commented on pull request #37474: [SPARK-40039][SS] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
steveloughran commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1217004467

   LGTM. The test failure is in the Python streaming tests... could it somehow be related?




[GitHub] [spark] HeartSaVioR commented on pull request #37474: [SPARK-40039][SS][WIP] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1211461085

   No, I guess your best bet is to construct a no-op query that reads from the rate stream, does nothing, and writes to the no-op sink, and to run it in EC2. Looking into the latency of the WAL commit phase would show the difference.
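   
   A sketch of such a benchmark query; the bucket name, rate, and trigger interval are placeholders, and the checkpoint file manager under test would be switched via `spark.sql.streaming.checkpointFileManagerClass`:
   
   ```scala
   import org.apache.spark.sql.SparkSession
   import org.apache.spark.sql.streaming.Trigger
   
   val spark = SparkSession.builder().appName("checkpoint-wal-bench").getOrCreate()
   
   val query = spark.readStream
     .format("rate")                       // synthetic source, no external dependency
     .option("rowsPerSecond", 10)
     .load()
     .writeStream
     .format("noop")                       // discards every row
     .option("checkpointLocation", "s3a://some-bucket/bench-checkpoint")
     .trigger(Trigger.ProcessingTime("1 second"))
     .start()
   
   // While the query runs, query.lastProgress.durationMs exposes a "walCommit"
   // entry; compare it across the two checkpoint file managers.
   query.awaitTermination()
   ```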




[GitHub] [spark] steveloughran commented on pull request #37474: [SPARK-40039][SS][WIP] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
steveloughran commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1211699163

   Hey @HeartSaVioR. Yes, this is exactly what the API we worked on was designed for.
   
   There is no need to initiate an MPU when writing small files; the OutputStream simply doesn't upload the data anymore. You can check this by calling toString() on the stream - all of its IO stats are there. This means that the cost is as normal: one PUT for data <= the block size; beyond that, one POST to initiate, one POST per block, and one POST in close() to finalize. Block uploads are parallelised, though you do need enough HTTPS connections for this.
   
   It's no more expensive than a normal write; upload performance will be the same, except when you call abort(), when it is faster.
   
   That said, let me review the code to confirm this.
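   
   A rough sketch of that check, assuming an s3a path; `getWrappedStream` unwraps the `FSDataOutputStream`, and the exact toString() contents are S3A-specific:
   
   ```scala
   import org.apache.hadoop.conf.Configuration
   import org.apache.hadoop.fs.{FileSystem, Path}
   
   val path = new Path("s3a://some-bucket/checkpoint/probe")   // placeholder path
   val fs = FileSystem.get(path.toUri, new Configuration())
   
   val out = fs.create(path, true)
   out.write("small payload".getBytes("UTF-8"))
   // The S3A block output stream includes its IO statistics in toString(), so
   // this shows whether a multipart upload was initiated for this small write.
   println(out.getWrappedStream.toString)
   out.close()
   ```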
   




[GitHub] [spark] HeartSaVioR commented on pull request #37474: [SPARK-40039][SS][WIP] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1211719349

   Thanks for confirming! That is exactly my understanding as well.
   My other understanding is that this does not provide "fail if exists" semantics, right? So while we achieve an atomic put with much better performance (it was quite inefficient for us to upload the temp file to S3), it still doesn't capture the case where two concurrent streaming queries pick up the same checkpoint and run. It will always overwrite the existing one. Could you please confirm?




[GitHub] [spark] HeartSaVioR commented on pull request #37474: [SPARK-40039][SS][WIP] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1211458861

   Yeah, that's also true. Btw, when I worked on abort with Steve it didn't always go through multipart upload, but my memory could be incorrect. Could you please double-check this?




[GitHub] [spark] steveloughran commented on pull request #37474: [SPARK-40039][SS] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
steveloughran commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1230138117

   It's really time the hadoop-2 branch was laid to rest. If someone wants to run new code on a hadoop-2 cluster they'll need to be on Java 7; most of the stable production Hadoop clusters run on that. Only Hadoop 3.2+ makes any commitment to work with Java 11 (i.e. problems with Java 11 will not be closed as WONTFIX). Time to kill it, cut testing in half, etc.




[GitHub] [spark] steveloughran commented on pull request #37474: [SPARK-40039][SS] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
steveloughran commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1227014957

   Nice to see the feature @HeartSaVioR and I worked on being used. During that work I changed the S3A output stream to fail if anyone called flush(), which was an interesting experiment to see what code expected it to work on S3. Ranger, for example....




[GitHub] [spark] attilapiros commented on pull request #37474: [SPARK-40039][SS][WIP] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
attilapiros commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1215869382

   The PySpark errors are unrelated.




[GitHub] [spark] HeartSaVioR commented on pull request #37474: [SPARK-40039][SS] Introducing a streaming checkpoint file manager based on Hadoop's Abortable interface

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on PR #37474:
URL: https://github.com/apache/spark/pull/37474#issuecomment-1220155962

   @attilapiros Could you please run a performance benchmark and put the observations in the PR description, as I mentioned in previous comments? Thanks in advance.

