Posted to issues@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/02/10 04:15:00 UTC
[jira] [Created] (ARROW-11583) [C++] Filesystem aware disk scheduling
Weston Pace created ARROW-11583:
-----------------------------------
Summary: [C++] Filesystem aware disk scheduling
Key: ARROW-11583
URL: https://issues.apache.org/jira/browse/ARROW-11583
Project: Apache Arrow
Issue Type: Improvement
Components: C++
Reporter: Weston Pace
Different filesystems have different ideal access strategies. For example:
AWS: Unlimited parallelism? No penalty for random access?
AWS EBS: Depends
SSD: Bounded parallelism (# of hw contexts), penalty for random access within a context.
HDD: Very limited parallelism (1 usually), penalty for random access
Currently, Arrow does not factor these access strategies into I/O scheduling. For example, when reading a dataset of 100 files, it will start reading X files at once (where X is the parallelism of the thread pool). For AWS this is ideal. For an HDD it is not.
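To make the idea concrete, here is a minimal sketch of how a per-filesystem concurrency limit could be derived from the list above. None of these names (StorageKind, AccessPolicy, PolicyFor) exist in Arrow; they are hypothetical and only illustrate the mapping. EBS is left out because its behavior is listed as "Depends", and the object store numbers carry the same question marks as in the list.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical storage categories, taken from the list in this issue.
enum class StorageKind { kObjectStore, kSsd, kHdd };

// Hypothetical policy: how many reads to keep in flight at once, and
// whether random access is expected to carry a penalty.
struct AccessPolicy {
  int max_concurrent_reads;
  bool random_access_penalty;
};

// Pick a policy for a storage kind.
//   pool_threads: parallelism of the I/O thread pool (the "X" above).
//   ssd_contexts: assumed number of hardware contexts on the SSD.
inline AccessPolicy PolicyFor(StorageKind kind, int pool_threads,
                              int ssd_contexts) {
  assert(pool_threads > 0 && ssd_contexts > 0);
  switch (kind) {
    case StorageKind::kObjectStore:
      // Assumed: parallelism is bounded only by the thread pool.
      return {pool_threads, false};
    case StorageKind::kSsd:
      // Bounded by the number of hardware contexts.
      return {std::min(pool_threads, ssd_contexts), true};
    case StorageKind::kHdd:
      // One read at a time to avoid seeking between files.
      return {1, true};
  }
  return {1, true};  // Conservative fallback for unknown kinds.
}
```

With a pool of 8 threads and an SSD assumed to have 4 hardware contexts, this sketch would keep 8 reads in flight for the object store, 4 for the SSD, and 1 for the HDD.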
The OS does have a scheduler which attempts to mitigate this. However, it does not know the scope of the I/O or the dependencies among the I/O operations (e.g. in the dataset read example above, it is better to read file A quickly and then file B quickly than it is to read A and B slowly at the same time). I've run some experiments (see comment) which show that the OS scheduler fails to achieve ideal performance in fairly typical cases.
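One possible reading of the parenthetical above can be checked with simple arithmetic. The sketch below assumes a deliberately simplified model that is not taken from the experiments mentioned: all files have the same size, the device has a fixed bandwidth, concurrent reads share that bandwidth evenly, and seek penalties are ignored. The function names are hypothetical.

```cpp
#include <cassert>

// Mean completion time when files are read one after another, each at
// full bandwidth. File i (1-based) completes at i * file_size / bandwidth.
inline double MeanCompletionSequential(int num_files, double file_size,
                                       double bandwidth) {
  assert(num_files > 0 && file_size > 0 && bandwidth > 0);
  double clock = 0.0;
  double total = 0.0;
  for (int i = 0; i < num_files; ++i) {
    clock += file_size / bandwidth;  // This file finishes now.
    total += clock;
  }
  return total / num_files;
}

// Mean completion time when all files are read at the same time and
// share the bandwidth evenly. With equal sizes, every file completes
// at the very end.
inline double MeanCompletionConcurrent(int num_files, double file_size,
                                       double bandwidth) {
  assert(num_files > 0 && file_size > 0 && bandwidth > 0);
  return num_files * file_size / bandwidth;
}
```

For two files of 100 MB on a 100 MB/s device, the sequential mean is 1.5 s and the concurrent mean is 2.0 s. Total elapsed time is the same in both cases under this model; the difference is that the first file is available after 1 s instead of 2 s, so downstream work can start earlier. On an HDD, seek penalties would likely make the concurrent case worse than this model suggests.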