Posted to issues@arrow.apache.org by "theogaraj (via GitHub)" <gi...@apache.org> on 2024/03/23 01:45:16 UTC

[I] Why is pyarrow.dataset direct from S3 so much slower than using dataset locally and upload/download separately? [arrow]

theogaraj opened a new issue, #40758:
URL: https://github.com/apache/arrow/issues/40758

   ### Describe the usage question you have. Please include as many useful details as possible.
   
   
   I'm using `pyarrow.dataset.dataset` and `pyarrow.dataset.write_dataset` to convert a newline-delimited JSON (jsonl) file to Parquet, and I'm seeing very different end-to-end processing times for the following three approaches:
   
   1. Let the `dataset` API handle all the filesystem details (223s)
   2. Pass `dataset` an `s3fs.S3FileSystem` object (70s)
   3. Use `smart_open` to handle download/upload from/to S3 and use `dataset` on local filesystem (30s)
   
   More detail with code snippets documented in [this StackOverflow question](https://stackoverflow.com/questions/78207687/pyarrow-dataset-s3-performance-different-with-pyarrow-filesystem-s3fs-indirect).
   
   From previous use of `pyarrow.parquet.ParquetFile`, I know that options like `buffer_size` and `pre_buffer` can affect performance. I thought there might be similar options in the `dataset` API, but I couldn't find anything in the documentation. I would greatly appreciate some insight into this.
   
   ### Component(s)
   
   Python


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] Why is pyarrow.dataset direct from S3 so much slower than using dataset locally and upload/download separately? [arrow]

Posted by "assignUser (via GitHub)" <gi...@apache.org>.

assignUser commented on issue #40758:
URL: https://github.com/apache/arrow/issues/40758#issuecomment-2089069986

   Great to hear, thanks for coming back with an update!



Re: [I] Why is pyarrow.dataset direct from S3 so much slower than using dataset locally and upload/download separately? [arrow]

Posted by "theogaraj (via GitHub)" <gi...@apache.org>.

theogaraj closed issue #40758: Why is pyarrow.dataset direct from S3 so much slower than using dataset locally and upload/download separately?
URL: https://github.com/apache/arrow/issues/40758



Re: [I] Why is pyarrow.dataset direct from S3 so much slower than using dataset locally and upload/download separately? [arrow]

Posted by "theogaraj (via GitHub)" <gi...@apache.org>.

theogaraj commented on issue #40758:
URL: https://github.com/apache/arrow/issues/40758#issuecomment-2089003477

   Closing this, as I was able to figure it out. I improved performance by creating a scanner and tuning its `batch_readahead` and `batch_size` options. More info is posted at https://stackoverflow.com/a/78220987/12309386

