Posted to github@arrow.apache.org by "cboettig (via GitHub)" <gi...@apache.org> on 2023/02/16 04:08:21 UTC

[GitHub] [arrow] cboettig commented on issue #33312: [R] unify_schemas=FALSE does not improve open_dataset() read times

cboettig commented on issue #33312:
URL: https://github.com/apache/arrow/issues/33312#issuecomment-1432485314

@westonpace Thanks! Yes, the timing I see is similar to the time it takes to list the bucket contents recursively with `s3$ls(recursive=TRUE)` (as you noted in https://github.com/apache/arrow/issues/34145), so that listing, rather than the unify_schemas process, probably explains the additional overhead between the above examples. I'll keep an eye on whatever you come up with in https://github.com/apache/arrow/issues/34213.

As you noted there, performance is much better when we can work in the same 'datacenter' (i.e. have our MINIO host on a VM in the same datacenter as the compute), but we want to support our typical end user, who will usually be on a laptop and requesting only a small subset of the partitions. In some cases we can write wrapper functions that call open_dataset() directly on the desired partition rather than on the dataset root. It feels hacky, but maybe that is indeed the best strategy(?) It's fast, but not nearly as ergonomic as letting arrow + dplyr::filter() select those paths from the dataset root.
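
For illustration, here is a minimal sketch of the kind of wrapper meant above. The helper name `open_partition()` is hypothetical, and the example assumes hive-style `key=value` partition directories. It uses a local temporary directory with `mtcars` as a stand-in for the remote bucket, so it only shows the path-building idea and says nothing about S3 timings.

```r
library(arrow)
library(dplyr)

# Hypothetical helper (not part of arrow): build the hive-style
# "key=value" sub-path for the requested partition and open only that
# directory, so file discovery never walks the rest of the dataset.
open_partition <- function(root, ...) {
  keys <- list(...)
  segments <- paste0(names(keys), "=", unlist(keys))
  path <- do.call(file.path, c(list(root), as.list(segments)))
  open_dataset(path)
}

# Stand-in dataset: a local directory partitioned by `cyl`.
root <- tempfile("dataset_")
write_dataset(mtcars, root, partitioning = "cyl")

# Open a single partition instead of the dataset root.
ds <- open_partition(root, cyl = 4)
collect(ds)
```

One trade-off of this approach: the partition value lives in the directory name above the opened path, so the `cyl` column is not expected to appear in the result, and the wrapper would have to add it back if callers need it.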
