Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2020/12/09 16:00:00 UTC
[jira] [Comment Edited] (ARROW-10846) [C++] Add async filesystem operations

    [ https://issues.apache.org/jira/browse/ARROW-10846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17246627#comment-17246627 ] 

Weston Pace edited comment on ARROW-10846 at 12/9/20, 3:59 PM:
---------------------------------------------------------------

A related topic will be how best to pool filesystem access.  Right now, with the synchronous operations, I'm pretty sure every thread's requests go directly to the underlying filesystem.  This is not ideal in most cases.

For example, if a user wants to read a dataset of 20 Parquet files from a local disk of some kind, then running 20 filesystem requests in parallel is probably not the best approach.  In the worst case this might turn sequential reads into random reads and bog down I/O.

It's related to this issue because any future-returning variants will need to set up I/O threads separate from the calling thread (assuming we aren't requiring the filesystem implementation to be asynchronous).  These threads probably should not come from the default compute thread pool, since compute threads shouldn't block on I/O.  Do we set up a new thread for every request (this is kind of similar to the current synchronous approach) or do we do some kind of I/O thread pooling?
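
To make the pooling option concrete, here is a minimal sketch of a fixed-size I/O pool that wraps a blocking call and hands back a future.  It uses only the C++ standard library.  The names IoThreadPool and Submit are hypothetical and are not Arrow APIs, and std::future stands in for whatever Future type the filesystem layer would actually return.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical fixed-size pool dedicated to blocking I/O calls, kept
// separate from any compute pool so compute threads never wait on I/O.
class IoThreadPool {
 public:
  explicit IoThreadPool(std::size_t num_threads) {
    assert(num_threads > 0);
    for (std::size_t i = 0; i < num_threads; ++i) {
      workers_.emplace_back([this] { WorkerLoop(); });
    }
  }

  // Runs all tasks that are still queued, then joins the workers.
  ~IoThreadPool() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopping_ = true;
    }
    cv_.notify_all();
    for (auto& worker : workers_) worker.join();
  }

  IoThreadPool(const IoThreadPool&) = delete;
  IoThreadPool& operator=(const IoThreadPool&) = delete;

  // Queues a blocking callable and returns a future for its result.
  template <typename Fn>
  auto Submit(Fn fn) -> std::future<decltype(fn())> {
    using Result = decltype(fn());
    auto task = std::make_shared<std::packaged_task<Result()>>(std::move(fn));
    std::future<Result> future = task->get_future();
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push([task] { (*task)(); });
    }
    cv_.notify_one();
    return future;
  }

 private:
  void WorkerLoop() {
    for (;;) {
      std::function<void()> task;
      {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
        if (tasks_.empty()) return;  // only reached when stopping and drained
        task = std::move(tasks_.front());
        tasks_.pop();
      }
      task();  // the blocking I/O call runs outside the lock
    }
  }

  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool stopping_ = false;
  std::vector<std::thread> workers_;  // declared last: started after the rest
};
```

With this sketch, a caller would write something like pool.Submit([&] { return blocking_read(path); }), where blocking_read is a placeholder for any synchronous filesystem call, and the calling thread stays free until it asks the future for the result.  The pool size is the knob that bounds how many requests reach the filesystem at once.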

I think this is an interesting topic to start looking into.  Does anyone know if there is any existing work looking at I/O pooling?

EDIT: On the other hand, for something like S3, having many concurrent requests is ideal.  So this may be something that depends on the filesystem used.
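
One way to picture "depends on the filesystem used" is a cap on in-flight requests that each filesystem kind chooses for itself.  The sketch below is only an illustration: FileSystemKind, DefaultMaxInFlight and RequestLimiter are hypothetical names, not Arrow APIs, and the caps of 2 and 32 are placeholder values that were not measured.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical filesystem categories, for illustration only.
enum class FileSystemKind { kLocal, kS3 };

// Placeholder caps (assumptions, not benchmarks): keep local disks close to
// sequential access, allow many concurrent requests against S3.
inline int DefaultMaxInFlight(FileSystemKind kind) {
  switch (kind) {
    case FileSystemKind::kLocal:
      return 2;
    case FileSystemKind::kS3:
      return 32;
  }
  return 1;
}

// Counting limiter: at most max_in_flight requests hold a Slot at a time.
class RequestLimiter {
 public:
  explicit RequestLimiter(int max_in_flight) : available_(max_in_flight) {
    assert(max_in_flight > 0);
  }

  // Blocks until a slot is free, then takes it.
  void Acquire() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return available_ > 0; });
    --available_;
  }

  // Returns a slot and wakes one waiter, if any.
  void Release() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      ++available_;
    }
    cv_.notify_one();
  }

  int available() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return available_;
  }

  // RAII guard: holds one slot for the lifetime of the object.
  class Slot {
   public:
    explicit Slot(RequestLimiter& limiter) : limiter_(limiter) {
      limiter_.Acquire();
    }
    ~Slot() { limiter_.Release(); }
    Slot(const Slot&) = delete;
    Slot& operator=(const Slot&) = delete;

   private:
    RequestLimiter& limiter_;
  };

 private:
  mutable std::mutex mutex_;
  std::condition_variable cv_;
  int available_;
};
```

Each request would construct a RequestLimiter::Slot on the limiter belonging to its filesystem before issuing the read, so a local filesystem and an S3 filesystem can share the same calling code while enforcing very different levels of parallelism.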


> [C++] Add async filesystem operations
> -------------------------------------
>
>                 Key: ARROW-10846
>                 URL: https://issues.apache.org/jira/browse/ARROW-10846
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Antoine Pitrou
>            Priority: Major
>
> It would probably be useful to have Future-returning variants of some filesystem operations (at least {{GetFileInfo}} and {{OpenInput(File|Stream)}}).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)