Posted to issues@arrow.apache.org by "orf (via GitHub)" <gi...@apache.org> on 2024/03/14 17:22:09 UTC

[I] S3Filesystem always initiates multipart uploads, regardless of input size [arrow]

orf opened a new issue, #40557:
URL: https://github.com/apache/arrow/issues/40557

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Running the following snippet shows that `open_output_stream()` initiates a multipart upload immediately, before anything is written.
   
   This is quite unexpected: I would expect the `buffer_size` argument to ensure that a multipart upload is not initiated until at least 1,000 bytes have been written. The problem with the current behaviour is that writing a single byte results in three requests to S3: one to create the multipart upload, one to upload the 1-byte part, and one to complete the multipart upload.
   
   This is very inefficient if you are writing a small file to S3, where a simple put object (without multipart uploading) would suffice. Using `background_writes=False` and `fs.copy_files(...)` with a local, "known-sized" small file also results in a multipart upload.
   
   While this behaviour keeps the implementation simple, it is surprising and I couldn't find [it described in the documentation anywhere](https://arrow.apache.org/docs/python/filesystems.html).
   
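   ```python
   import time
   
   from pyarrow import fs
   
   # Enable debug logging so that each S3 request shows up in the output.
   fs.initialize_s3(fs.S3LogLevel.Debug)
   
   sfs = fs.S3FileSystem()
   # Nothing is written inside the block, yet a multipart upload is initiated on open.
   with sfs.open_output_stream("a_bucket/test", buffer_size=1000):
       time.sleep(10)
   ```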
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] [C++] S3Filesystem always initiates multipart uploads, regardless of input size [arrow]

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on issue #40557:
URL: https://github.com/apache/arrow/issues/40557#issuecomment-2073478715

   @OliLay If you want to try this approach, could you open a PR?




Re: [I] [C++] S3Filesystem always initiates multipart uploads, regardless of input size [arrow]

Posted by "kou (via GitHub)" <gi...@apache.org>.
kou commented on issue #40557:
URL: https://github.com/apache/arrow/issues/40557#issuecomment-2106884684

   Oh, sorry. I missed it.
   I've approved the CI.




Re: [I] S3Filesystem always initiates multipart uploads, regardless of input size [arrow]

Posted by "OliLay (via GitHub)" <gi...@apache.org>.
OliLay commented on issue #40557:
URL: https://github.com/apache/arrow/issues/40557#issuecomment-2071772600

   We observe similar issues using the C++ implementation.
   This behavior adds quite a large constant overhead for writing to S3. You will always have at least 3x RTT to S3, instead of possibly just 1x RTT using a direct `PutObject` request.
   I think the best solution would be not to emit a `CreateMultipartUpload` immediately, but to wait until some writes on the OutputStream have accumulated (e.g. >1MB?); if the OutputStream is closed after writing less than 1MB, just use a single `PutObject` request.
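
The deferred-upload idea proposed above can be sketched in Python roughly as follows. This is a minimal illustration only, not Arrow's implementation: `RecordingS3Client` and `DeferredUploadStream` are hypothetical names, the client is an in-memory stand-in that merely records which requests would be sent, and the 1 MiB default threshold is taken from the suggestion in the comment. A real implementation would also have to respect S3's minimum part size of 5 MiB for every part except the last.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingS3Client:
    """Hypothetical in-memory stand-in for an S3 client; it only logs request names."""

    requests: list = field(default_factory=list)

    def put_object(self, key, body):
        self.requests.append("PutObject")

    def create_multipart_upload(self, key):
        self.requests.append("CreateMultipartUpload")
        return "upload-1"  # fake upload id, for illustration only

    def upload_part(self, key, upload_id, part_number, body):
        self.requests.append("UploadPart")

    def complete_multipart_upload(self, key, upload_id):
        self.requests.append("CompleteMultipartUpload")


class DeferredUploadStream:
    """Buffers writes; starts a multipart upload only once `threshold` bytes accumulate."""

    def __init__(self, client, key, threshold=1024 * 1024):
        self.client = client
        self.key = key
        self.threshold = threshold
        self.buffer = bytearray()
        self.upload_id = None  # stays None until the first part is flushed
        self.part_number = 0

    def write(self, data):
        self.buffer += data
        if len(self.buffer) >= self.threshold:
            self._flush_part()

    def _flush_part(self):
        # Create the multipart upload lazily, right before the first part is sent.
        if self.upload_id is None:
            self.upload_id = self.client.create_multipart_upload(self.key)
        self.part_number += 1
        self.client.upload_part(
            self.key, self.upload_id, self.part_number, bytes(self.buffer)
        )
        self.buffer.clear()

    def close(self):
        if self.upload_id is None:
            # The threshold was never reached: one PutObject request suffices.
            self.client.put_object(self.key, bytes(self.buffer))
            return
        if self.buffer:
            self._flush_part()  # send the remaining bytes as the final part
        self.client.complete_multipart_upload(self.key, self.upload_id)


# Small write: closed below the threshold, so a single request is issued.
small = RecordingS3Client()
stream = DeferredUploadStream(small, "a_bucket/test")
stream.write(b"x")
stream.close()
print(small.requests)

# Large write (tiny threshold for demonstration): falls back to multipart.
large = RecordingS3Client()
stream = DeferredUploadStream(large, "a_bucket/test", threshold=4)
stream.write(b"abcdef")
stream.close()
print(large.requests)
```

With this sketch the one-byte write records a single `PutObject`, while the write that crosses the threshold records the familiar create, upload-part, complete sequence. The trade-off is that up to `threshold` bytes are held in memory before any request is sent.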
   




Re: [I] [C++] S3Filesystem always initiates multipart uploads, regardless of input size [arrow]

Posted by "OliLay (via GitHub)" <gi...@apache.org>.
OliLay commented on issue #40557:
URL: https://github.com/apache/arrow/issues/40557#issuecomment-2106880717

   Hi @kou, I opened a PR over here: #41564. As a first-time contributor: will someone automatically pick this up to run the CI and review it, or do I need to take some further action?

