
Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/02/10 04:15:00 UTC

[jira] [Created] (ARROW-11583) [C++] Filesystem aware disk scheduling

Weston Pace created ARROW-11583:
-----------------------------------

             Summary: [C++] Filesystem aware disk scheduling
                 Key: ARROW-11583
                 URL: https://issues.apache.org/jira/browse/ARROW-11583
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace


Different filesystems have different ideal access strategies.  For example:

AWS: Unlimited parallelism? No penalty for random access?
AWS EBS: Depends
SSD: Bounded parallelism (number of hardware contexts), penalty for random access within a context
HDD: Very limited parallelism (usually 1), penalty for random access

Currently, Arrow does not factor these access strategies into I/O scheduling.  For example, when reading a dataset of 100 files, it will start reading X files at once (where X is the parallelism of the thread pool).  For AWS this is ideal.  For an HDD it is not.
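
To make the idea concrete, here is a minimal sketch of what a filesystem-aware limit could look like. Everything in it is hypothetical: `DeviceKind`, `MaxConcurrentReads`, `ReadLimiter` and `PeakConcurrentReads` are illustrative names, not existing Arrow APIs, and the SSD limit of 8 is a placeholder rather than a measured value. The thread pool keeps its size, but the number of reads in flight is capped according to the kind of storage behind the filesystem.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical device classes taken from the list above (not Arrow types).
enum class DeviceKind { kObjectStore, kSsd, kHdd };

// Hypothetical policy: how many reads to keep in flight for each device kind.
// The constants are placeholders, not measurements.
std::size_t MaxConcurrentReads(DeviceKind kind, std::size_t pool_size) {
  switch (kind) {
    case DeviceKind::kObjectStore:
      return pool_size;  // no known penalty: use the whole pool
    case DeviceKind::kSsd:
      return std::min<std::size_t>(pool_size, 8);  // assumed number of hw contexts
    case DeviceKind::kHdd:
      return 1;  // one read at a time to avoid seeking between files
  }
  return 1;
}

// Simple counting limiter that also records the peak number of in-flight reads.
class ReadLimiter {
 public:
  explicit ReadLimiter(std::size_t limit) : limit_(limit) { assert(limit > 0); }

  void Acquire() {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return in_flight_ < limit_; });
    ++in_flight_;
    peak_ = std::max(peak_, in_flight_);
  }

  void Release() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      --in_flight_;
    }
    cv_.notify_one();
  }

  std::size_t peak() const {
    std::lock_guard<std::mutex> lock(mu_);
    return peak_;
  }

 private:
  mutable std::mutex mu_;
  std::condition_variable cv_;
  const std::size_t limit_;
  std::size_t in_flight_ = 0;
  std::size_t peak_ = 0;
};

// Runs `num_files` simulated reads on `pool_size` threads and returns the
// largest number of reads that were ever in flight at the same time.
std::size_t PeakConcurrentReads(DeviceKind kind, std::size_t pool_size,
                                std::size_t num_files) {
  ReadLimiter limiter(MaxConcurrentReads(kind, pool_size));
  std::atomic<std::size_t> next{0};
  std::vector<std::thread> pool;
  for (std::size_t i = 0; i < pool_size; ++i) {
    pool.emplace_back([&] {
      while (next.fetch_add(1) < num_files) {
        limiter.Acquire();
        std::this_thread::yield();  // stand-in for the actual file read
        limiter.Release();
      }
    });
  }
  for (auto& t : pool) t.join();
  return limiter.peak();
}
```

With this sketch, `PeakConcurrentReads(DeviceKind::kHdd, 8, 100)` never has more than one read in flight even though eight threads are available, while the object store case is allowed to use the whole pool.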

The OS does have a scheduler which attempts to mitigate this.  However, it does not know the scope of the I/O or the dependencies among the I/O requests (e.g. in the dataset read example above, it's better to read one file quickly and then the next file quickly than it is to read both files slowly at the same time).  I've run some experiments (see comment) which show the OS scheduler fails to achieve ideal performance in fairly typical cases.
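
The "one file quickly, then the next" point can be illustrated with a small calculation. The sketch below assumes an idealized device (my assumption, not something measured here): it shares its bandwidth equally among in-flight reads, all files are the same size, and there is no seek penalty at all. The function names are hypothetical. Even in this best case for interleaving, reading sequentially makes the first file available sooner, so downstream work can start earlier; a real HDD would add a seek penalty on top.

```cpp
#include <cassert>

// Mean time at which a file becomes fully available when `n` equal files,
// each taking `t` seconds alone, are read one after another.
// File k (1-based) finishes at k * t, so the mean is t * (n + 1) / 2.
double MeanCompletionSequential(int n, double t) {
  assert(n > 0);
  return t * (n + 1) / 2.0;
}

// Same files read all at once with bandwidth shared equally and no seek
// penalty: every file finishes at n * t, so the mean is also n * t.
double MeanCompletionInterleaved(int n, double t) {
  assert(n > 0);
  return n * t;
}
```

For two files that each take 1 second alone, the sequential order gives completion times of 1 s and 2 s (mean 1.5 s), while the interleaved order delivers both files at 2 s.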


