Posted to github@arrow.apache.org by "tustvold (via GitHub)" <gi...@apache.org> on 2024/03/15 00:15:01 UTC

[I] Use object_store:BufWriter instead of put_multipart [arrow-datafusion]

tustvold opened a new issue, #9614:
URL: https://github.com/apache/arrow-datafusion/issues/9614

   ### Is your feature request related to a problem or challenge?
   
   Currently in many places we use put_multipart for streaming writes. When writing files smaller than 10MiB this is wasteful, as it performs 3 requests when 1 would suffice.
   
   ### Describe the solution you'd like
   
   object_store 0.9.1 added `BufWriter` (https://docs.rs/object_store/latest/object_store/buffered/struct.BufWriter.html), which can automatically switch between Put and PutMultipart based on the amount of data that has been written.
   
   ### Describe alternatives you've considered
   
   We could implement our own adaptive logic in the write path within DF
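
For illustration, here is a minimal sketch of what such adaptive logic might look like. It uses only the standard library and records the requests it would issue instead of calling object_store, so `AdaptiveWriter`, `Request`, and `threshold` are hypothetical names, not the object_store API or BufWriter's actual implementation. Data is buffered until a size threshold is reached; below the threshold a single Put is issued on finish, otherwise the writer switches to a multipart upload.

```rust
/// A request that would be sent to the object store (modelled, not sent).
/// Hypothetical type for illustration; the payload is the byte count.
#[derive(Debug, PartialEq)]
enum Request {
    Put(usize),
    CreateMultipart,
    UploadPart(usize),
    CompleteMultipart,
}

/// Hypothetical adaptive writer: buffers data and only starts a
/// multipart upload once `threshold` bytes have accumulated.
struct AdaptiveWriter {
    threshold: usize,
    buffer: Vec<u8>,
    multipart: bool,
    requests: Vec<Request>,
}

impl AdaptiveWriter {
    fn new(threshold: usize) -> Self {
        Self {
            threshold,
            buffer: Vec::new(),
            multipart: false,
            requests: Vec::new(),
        }
    }

    /// Append data; once the buffer reaches the threshold, switch to
    /// multipart (if not already) and flush the buffer as one part.
    fn write(&mut self, data: &[u8]) {
        self.buffer.extend_from_slice(data);
        if self.buffer.len() >= self.threshold {
            if !self.multipart {
                self.multipart = true;
                self.requests.push(Request::CreateMultipart);
            }
            let part = std::mem::take(&mut self.buffer);
            self.requests.push(Request::UploadPart(part.len()));
        }
    }

    /// Finish the write and return the requests that would have been made.
    fn finish(mut self) -> Vec<Request> {
        if self.multipart {
            // Flush any remaining bytes as a final part, then complete.
            if !self.buffer.is_empty() {
                self.requests.push(Request::UploadPart(self.buffer.len()));
            }
            self.requests.push(Request::CompleteMultipart);
        } else {
            // Never crossed the threshold: a single Put suffices.
            self.requests.push(Request::Put(self.buffer.len()));
        }
        self.requests
    }
}

fn main() {
    // 10 MiB threshold, matching the size mentioned in this issue.
    let threshold = 10 * 1024 * 1024;

    let mut small = AdaptiveWriter::new(threshold);
    small.write(&[0u8; 1024]);
    println!("small file: {:?}", small.finish());

    let mut large = AdaptiveWriter::new(threshold);
    large.write(&vec![0u8; 12 * 1024 * 1024]);
    println!("large file: {:?}", large.finish());
}
```

In this model a small write results in one request, while a write that crosses the threshold pays for the create, part, and complete requests. Using BufWriter avoids maintaining logic like this within DF.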
   
   ### Additional context
   
   A future version of object_store is likely to significantly change put_multipart, and using BufWriter will limit the impact of this - https://github.com/apache/arrow-rs/pull/5500


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Use object_store:BufWriter instead of put_multipart [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #9614:
URL: https://github.com/apache/arrow-datafusion/issues/9614#issuecomment-2001951771

   @yyy1000 good luck -- this ticket will require some API exploration / potential changes so it will likely be a bit tricky.
   
   I think your suggested plan sounds good. 
   
   It will be interesting if you can also capture any experience / improvements that would make using `BufWriter` easier from the context of DataFusion




Re: [I] Use object_store:BufWriter instead of put_multipart [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #9614:
URL: https://github.com/apache/arrow-datafusion/issues/9614#issuecomment-1999651779

   Related ticket for cleaning up parallel parquet writer is https://github.com/apache/arrow-datafusion/issues/9493




Re: [I] Use object_store:BufWriter instead of put_multipart [arrow-datafusion]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #9614:
URL: https://github.com/apache/arrow-datafusion/issues/9614#issuecomment-1998680710

   FYI @devinjdangelo @alamb 




Re: [I] Use object_store:BufWriter instead of put_multipart [arrow-datafusion]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold closed issue #9614: Use object_store:BufWriter instead of put_multipart
URL: https://github.com/apache/arrow-datafusion/issues/9614




Re: [I] Use object_store:BufWriter instead of put_multipart [arrow-datafusion]

Posted by "yyy1000 (via GitHub)" <gi...@apache.org>.
yyy1000 commented on issue #9614:
URL: https://github.com/apache/arrow-datafusion/issues/9614#issuecomment-2001779884

   Can I take this to get familiar with datasource related code?
   Currently my plan is:
   1. Create `BufWriter` using the given `object_store` and path
   2. Remove `put_multipart` method and call `AsyncArrowWriter::try_new` using the new `Writer`
   
   That's my initial plan after investigation, hope to hear your feedback. :)




Re: [I] Use object_store:BufWriter instead of put_multipart [arrow-datafusion]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #9614:
URL: https://github.com/apache/arrow-datafusion/issues/9614#issuecomment-2002055129

   Your plan sounds good and should be relatively straightforward

