Posted to github@arrow.apache.org by "wjones127 (via GitHub)" <gi...@apache.org> on 2023/05/28 22:33:30 UTC

[GitHub] [arrow] wjones127 opened a new pull request, #35808: GH-34363: [C++] Use equal size parts in S3 upload for R2 compatibility

wjones127 opened a new pull request, #35808:
URL: https://github.com/apache/arrow/pull/35808

   ### Rationale for this change
   
   S3 and Minio allow varied part sizes, but R2 doesn't yet. So for now we'll change to only write equal-sized parts.
   
   ### What changes are included in this PR?
   
   Alters the `DoWrite` implementation to use fixed-size parts.
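
   For illustration, the new write path buffers incoming bytes into a fixed-size part and flushes each part once it is full. A minimal sketch of that chunking logic (hypothetical names, not the actual `s3fs.cc` members):

   ```cpp
   // Illustrative sketch only; the real logic lives in
   // arrow::fs::ObjectOutputStream::DoWrite (cpp/src/arrow/filesystem/s3fs.cc).
   #include <algorithm>
   #include <cstdint>
   #include <vector>

   class FixedPartWriter {
    public:
     explicit FixedPartWriter(int64_t part_size) : part_size_(part_size) {}

     void Write(const uint8_t* data, int64_t nbytes) {
       while (nbytes > 0) {
         // Copy at most enough bytes to fill the current part.
         const int64_t to_copy =
             std::min(nbytes, part_size_ - static_cast<int64_t>(buffer_.size()));
         buffer_.insert(buffer_.end(), data, data + to_copy);
         data += to_copy;
         nbytes -= to_copy;
         // Every uploaded part (except the last, flushed on close) is exactly
         // part_size_ bytes, which is what R2 requires.
         if (static_cast<int64_t>(buffer_.size()) == part_size_) {
           UploadPart();
           buffer_.clear();
         }
       }
     }

    private:
     void UploadPart() { /* placeholder for the S3 UploadPart call */ }

     const int64_t part_size_;
     std::vector<uint8_t> buffer_;
   };
   ```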
   
   ### Are these changes tested?
   
   * [ ] Tested manually against R2
   * [ ] Tested manually against S3
   
   ### Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please uncomment the line below and explain which changes are breaking.
   -->
   <!-- **This PR includes breaking changes to public APIs.** -->
   
   <!--
   Please uncomment the line below (and provide explanation) if the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld). We use this to highlight fixes to issues that may affect users without their knowledge. For this reason, fixing bugs that cause errors doesn't count, since those are usually obvious.
   -->
   <!-- **This PR contains a "Critical Fix".** -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #35808: GH-34363: [C++] Use equal size parts in S3 upload for R2 compatibility

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35808:
URL: https://github.com/apache/arrow/pull/35808#issuecomment-1574895109

   Thanks @wjones127. This confirms that piece size doesn't seem to have much of an impact.




[GitHub] [arrow] pitrou commented on pull request #35808: GH-34363: [C++] Use equal size parts in S3 upload for R2 compatibility

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35808:
URL: https://github.com/apache/arrow/pull/35808#issuecomment-1569782985

   I suspect this shouldn't affect performance, but have you tried a large upload against a remote endpoint?




[GitHub] [arrow] wjones127 commented on pull request #35808: GH-34363: [C++] Use equal size parts in S3 upload for R2 compatibility

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on PR #35808:
URL: https://github.com/apache/arrow/pull/35808#issuecomment-1574637403

   > I suspect this shouldn't affect performance, but have you tried a large upload against a remote endpoint?
   
   Just tested on an EC2 instance in the same region as an S3 bucket. They seem comparable.
   
   Before:
   ```
   Upload 200MB data in 25 pieces (8.0 MB per piece).
   Avg: 2.4996164005998254 Min: 1.839938310999969 Max: 6.8072016670002995
   
   Upload 200MB data in 4 pieces (50.0 MB per piece).
   Avg: 2.3683339380998403 Min: 2.1181160359992646 Max: 3.710967674999665
   ```
   
   After:
   
   ```
   Upload 200MB data in 25 pieces (8.0 MB per piece).
   Avg: 1.989240413999869 Min: 1.9223971040009928 Max: 2.2479570469986356
   
   Upload 200MB data in 4 pieces (50.0 MB per piece).
   Avg: 2.1779588877003335 Min: 2.1119931640005234 Max: 2.3420757010007947
   ```
   <details>
   <summary>Benchmark script</summary>
   
   ```python
   import pyarrow.fs as pa_fs
   import random
   import os
   import time
   
    # 200 MB across 25 pieces
    pieces = 25
    piece_size = 200 * 1024 * 1024 // pieces
   
   bucket_name = os.environ["OBJECT_STORE_BUCKET"]
   
   fs = pa_fs.S3FileSystem(
       access_key=os.environ["AWS_ACCESS_KEY_ID"],
       secret_key=os.environ["AWS_SECRET_ACCESS_KEY"],
       region="us-west-2",
       background_writes=True, # Also test false
   )
   
   iterations = 10
   times = []
   
   for _ in range(iterations):
       start = time.monotonic()
       with fs.open_output_stream(f"{bucket_name}/my_test_file.bin") as f:
           for i in range(pieces):
               data = random.randbytes(piece_size)
               f.write(data)
       end = time.monotonic()
       times.append(end - start)
   
   info = fs.get_file_info(f"{bucket_name}/my_test_file.bin")
   assert info.size == piece_size * pieces
   
   avg_time = sum(times) / len(times)
   print(f"Upload 200MB data in {pieces} pieces ({piece_size / 1024 / 1024} MB per piece).")
   print(f"Avg: {avg_time} Min: {min(times)} Max: {max(times)}")
   ```
   </details>




[GitHub] [arrow] wjones127 commented on pull request #35808: GH-34363: [C++] Use equal size parts in S3 upload for R2 compatibility

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on PR #35808:
URL: https://github.com/apache/arrow/pull/35808#issuecomment-1566303323

   ✅ I've tested this locally against S3 and R2 with the following script:
   
   ```python
   import pyarrow.fs as pa_fs
   import os
   
   # 25 MB across 7 pieces
   pieces = 7
   piece_size = 25 * 1024 * 1024 // pieces
   
   bucket_name = os.environ["OBJECT_STORE_BUCKET"]
   
   fs = pa_fs.S3FileSystem(
       access_key=os.environ["AWS_ACCESS_KEY_ID"],
       secret_key=os.environ["AWS_SECRET_ACCESS_KEY"],
       endpoint_override=os.environ.get("OBJECT_STORE_AWS_ENDPOINT"),
       region="us-west-2",
   )
   
   with fs.open_output_stream(f"{bucket_name}/my_test_file.txt") as f:
       for i in range(pieces):
           f.write(b"0" * piece_size)
   
   info = fs.get_file_info(f"{bucket_name}/my_test_file.txt")
   assert info.size == piece_size * pieces
   ```




[GitHub] [arrow] ursabot commented on pull request #35808: GH-34363: [C++] Use equal size parts in S3 upload for R2 compatibility

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #35808:
URL: https://github.com/apache/arrow/pull/35808#issuecomment-1575318455

   Benchmark runs are scheduled for baseline = 5a55fb4f66cbd316568085129054b5ff1d2eb089 and contender = 1ecb04f39851a709a0382df908043b453c42e281. 1ecb04f39851a709a0382df908043b453c42e281 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Failed] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/e1ff17fb62e44e7ea4a6175ca79d7b8c...6ef9ea32095a423e958d33e45f933e23/)
   [Failed] [test-mac-arm](https://conbench.ursa.dev/compare/runs/da7c54b4018b4bb8ac380bc60e28fb92...f2a0556110884f97adabe53bb2ed908c/)
   [Failed :arrow_down:0.97% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/ff329c4fbe724445aea287b9877a1493...1563cce1c423429e880fff7dcad75d86/)
   [Failed] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/8268f4f539df44698261be23a43f8309...e7227ebe62f64ad39af56a0155949868/)
   Buildkite builds:
   [Failed] [`1ecb04f3` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2976)
   [Failed] [`1ecb04f3` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/3012)
   [Finished] [`1ecb04f3` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2977)
   [Failed] [`1ecb04f3` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/3002)
   [Finished] [`5a55fb4f` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2975)
   [Finished] [`5a55fb4f` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/3011)
   [Failed] [`5a55fb4f` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2976)
   [Failed] [`5a55fb4f` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/3001)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   




[GitHub] [arrow] pitrou commented on a diff in pull request #35808: GH-34363: [C++] Use equal size parts in S3 upload for R2 compatibility

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35808:
URL: https://github.com/apache/arrow/pull/35808#discussion_r1210549409


##########
cpp/src/arrow/filesystem/s3fs.cc:
##########
@@ -1154,7 +1154,7 @@ class ObjectInputFile final : public io::RandomAccessFile {
 // AWS doc says "5 MB" but it's not clear whether those are MB or MiB,
 // so I chose the safer value.
 // (see https://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadUploadPart.html)
-static constexpr int64_t kMinimumPartUpload = 5 * 1024 * 1024;
+static constexpr int64_t kMinimumPartUpload = 10 * 1024 * 1024;

Review Comment:
   It seems OK to me; it just means that we're using a bit more RAM to buffer outgoing bytes.





[GitHub] [arrow] pitrou commented on a diff in pull request #35808: GH-34363: [C++] Use equal size parts in S3 upload for R2 compatibility

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35808:
URL: https://github.com/apache/arrow/pull/35808#discussion_r1210555986


##########
cpp/src/arrow/filesystem/s3fs.cc:
##########
@@ -1304,27 +1304,43 @@ class ObjectOutputStream final : public io::OutputStream {
       return Status::Invalid("Operation on closed stream");
     }
 
-    if (!current_part_ && nbytes >= part_upload_threshold_) {
-      // No current part and data large enough, upload it directly
-      // (without copying if the buffer is owned)
-      RETURN_NOT_OK(UploadPart(data, nbytes, owned_buffer));
-      pos_ += nbytes;
-      return Status::OK();
+    const int8_t* data_ptr = reinterpret_cast<const int8_t*>(data);
+    int64_t offset = 0;

Review Comment:
   Instead of maintaining a separate offset, you can simply bump `data_ptr` and decrease `nbytes` as needed.
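
   Something like the following shape, for example (a sketch only; in the real code these would be `ObjectOutputStream` members rather than parameters):

   ```cpp
   // Hypothetical free-function sketch of the suggested rewrite.
   #include <algorithm>
   #include <cstdint>

   void DoWriteSketch(const int8_t* data_ptr, int64_t nbytes,
                      int64_t part_size, int64_t& current_part_size) {
     while (nbytes > 0) {
       const int64_t to_copy = std::min(nbytes, part_size - current_part_size);
       // ... append [data_ptr, data_ptr + to_copy) to the current part buffer ...
       data_ptr += to_copy;  // bump the pointer instead of tracking an offset
       nbytes -= to_copy;    // shrink the remaining byte count to match
       current_part_size += to_copy;
       if (current_part_size == part_size) {
         // ... upload the completed part ...
         current_part_size = 0;
       }
     }
   }
   ```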





[GitHub] [arrow] pitrou commented on pull request #35808: GH-34363: [C++] Use equal size parts in S3 upload for R2 compatibility

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35808:
URL: https://github.com/apache/arrow/pull/35808#issuecomment-1574904728

   Rebased to get the latest CI fixes.




[GitHub] [arrow] pitrou merged pull request #35808: GH-34363: [C++] Use equal size parts in S3 upload for R2 compatibility

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou merged PR #35808:
URL: https://github.com/apache/arrow/pull/35808




[GitHub] [arrow] wjones127 commented on a diff in pull request #35808: GH-34363: [C++] Use equal size parts in S3 upload for R2 compatibility

Posted by "wjones127 (via GitHub)" <gi...@apache.org>.
wjones127 commented on code in PR #35808:
URL: https://github.com/apache/arrow/pull/35808#discussion_r1208683874


##########
cpp/src/arrow/filesystem/s3fs.cc:
##########
@@ -1154,7 +1154,7 @@ class ObjectInputFile final : public io::RandomAccessFile {
 // AWS doc says "5 MB" but it's not clear whether those are MB or MiB,
 // so I chose the safer value.
 // (see https://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadUploadPart.html)
-static constexpr int64_t kMinimumPartUpload = 5 * 1024 * 1024;
+static constexpr int64_t kMinimumPartUpload = 10 * 1024 * 1024;

Review Comment:
   Previously our limit was ~2.4TB. In the new design, it is 100GB, since S3 allows at most 10,000 parts per upload and each part is now a fixed 10 MiB (it would be only 50GB if the part size were left at 5MB). I feel like this is an okay balance, but I'm willing to hear others' opinions.





[GitHub] [arrow] github-actions[bot] commented on pull request #35808: GH-34363: [C++] Use equal size parts in S3 upload for R2 compatibility

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35808:
URL: https://github.com/apache/arrow/pull/35808#issuecomment-1566287800

   * Closes: #34363




[GitHub] [arrow] pitrou commented on a diff in pull request #35808: GH-34363: [C++] Use equal size parts in S3 upload for R2 compatibility

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35808:
URL: https://github.com/apache/arrow/pull/35808#discussion_r1210557387


##########
cpp/src/arrow/filesystem/s3fs.cc:
##########
@@ -1304,27 +1304,43 @@ class ObjectOutputStream final : public io::OutputStream {
       return Status::Invalid("Operation on closed stream");
     }
 
-    if (!current_part_ && nbytes >= part_upload_threshold_) {
-      // No current part and data large enough, upload it directly
-      // (without copying if the buffer is owned)
-      RETURN_NOT_OK(UploadPart(data, nbytes, owned_buffer));
-      pos_ += nbytes;
-      return Status::OK();
+    const int8_t* data_ptr = reinterpret_cast<const int8_t*>(data);
+    int64_t offset = 0;
+
+    // Handle case where we have some bytes buffered from prior calls.
+    if (current_part_size_ > 0) {
+      // Try to fill current buffer
+      const int64_t to_copy =
+          std::min(nbytes - offset, part_upload_threshold_ - current_part_size_);

Review Comment:
   If `part_upload_threshold_` isn't a threshold anymore, perhaps rename it and make it `const`?
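
   For instance (a sketch with a hypothetical replacement name, not the actual member):

   ```cpp
   #include <cstdint>

   // Hypothetical sketch of the rename-and-const suggestion.
   class ObjectOutputStreamSketch {
    public:
     explicit ObjectOutputStreamSketch(int64_t part_size)
         : part_upload_size_(part_size) {}  // fixed for the stream's lifetime

    private:
     const int64_t part_upload_size_;  // every part is exactly this many bytes
   };
   ```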





[GitHub] [arrow] pitrou commented on a diff in pull request #35808: GH-34363: [C++] Use equal size parts in S3 upload for R2 compatibility

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35808:
URL: https://github.com/apache/arrow/pull/35808#discussion_r1210552545


##########
cpp/src/arrow/filesystem/s3fs.cc:
##########
@@ -1154,7 +1154,7 @@ class ObjectInputFile final : public io::RandomAccessFile {
 // AWS doc says "5 MB" but it's not clear whether those are MB or MiB,
 // so I chose the safer value.
 // (see https://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadUploadPart.html)
-static constexpr int64_t kMinimumPartUpload = 5 * 1024 * 1024;
+static constexpr int64_t kMinimumPartUpload = 10 * 1024 * 1024;

Review Comment:
   Can you add a comment about the limit, though?


