Posted to jira@arrow.apache.org by "Weston Pace (Jira)" <ji...@apache.org> on 2021/02/10 04:37:00 UTC

[jira] [Comment Edited] (ARROW-11583) [C++] Filesystem aware disk scheduling

    [ https://issues.apache.org/jira/browse/ARROW-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282218#comment-17282218 ] 

Weston Pace edited comment on ARROW-11583 at 2/10/21, 4:36 AM:
---------------------------------------------------------------

I ran some experiments to see how effective the OS disk scheduling policies were.  My test was run on an Ubuntu 20.04 machine against an HDD and an SSD.

Methodology:
  * Theoretical max performance was measured using the Gnome Disks benchmark; this number was verified with `pv` of a large file.
  * % of sequential reads was measured using blktrace/blkparse
  * Reads were never cached in the OS, thanks to judicious use of posix_fadvise (see the sketch after this list)
  * On this system the only OS I/O scheduler available was mq-deadline, with a default read_expire of 500ms
  * "Read serially" with more than 1 thread used pipeline parallelism, but only ever read one file at a time
  * "NOT read serially" read all files at once; pipeline parallelism was theoretically possible but likely unused
  * 10 files, 10MB per file, 10 iterations averaged together
  * All reads were done using block stream readers with background readahead
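
To make the cache behavior concrete, here is a minimal sketch of the cache-dropping step, assuming a POSIX system (an illustration of the technique, not the exact test code):

{code:cpp}
// Ask the kernel to evict cached pages for each file passed on the
// command line, so the next read actually hits the disk.
#include <fcntl.h>
#include <unistd.h>

#include <cstdio>

bool DropFileCache(const char* path) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return false;
  fsync(fd);  // flush dirty pages first so DONTNEED can evict them
  // offset=0, len=0 means "the whole file"
  const int err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
  close(fd);
  return err == 0;
}

int main(int argc, char** argv) {
  for (int i = 1; i < argc; ++i) {
    if (!DropFileCache(argv[i])) {
      std::fprintf(stderr, "failed to drop cache for %s\n", argv[i]);
    }
  }
  return 0;
}
{code}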

Parameters:
  * use_ssd - Which disk to use, 0 means the HDD
  * serial - If true, read one file at a time
  * io_threads - How many threads to use for I/O
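
For concreteness, a rough sketch of how a harness might wire these parameters together (plain std::thread, hypothetical mount points and file names; the real serial mode also used io_threads for pipeline parallelism within a file, which is elided here):

{code:cpp}
// Rough sketch of the test harness (not the exact benchmark code).
// use_ssd picks the disk; serial picks one-file-at-a-time vs all-at-once.
#include <fstream>
#include <string>
#include <thread>
#include <vector>

constexpr int kNumFiles = 10;
constexpr std::streamsize kBlockSize = 1 << 20;  // 1MB blocks

// Read one file front to back in fixed-size blocks.
void ReadFile(const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  std::vector<char> block(kBlockSize);
  while (in) {
    in.read(block.data(), kBlockSize);
    if (in.gcount() == 0) break;
    // Consume in.gcount() bytes (fed to a decode pipeline in the real test).
  }
}

int main(int argc, char** argv) {
  const bool use_ssd = argc > 1 && std::string(argv[1]) == "1";
  const bool serial = argc > 2 && std::string(argv[2]) == "1";
  // Hypothetical mount points standing in for the two disks under test.
  const std::string dir = use_ssd ? "/mnt/ssd/" : "/mnt/hdd/";

  std::vector<std::string> files;
  for (int i = 0; i < kNumFiles; ++i) {
    files.push_back(dir + "file_" + std::to_string(i) + ".dat");
  }

  if (serial) {
    for (const auto& f : files) ReadFile(f);  // only ever one file open
  } else {
    // All files at once; one thread per file approximates the io_threads
    // worker pool used in the real test.
    std::vector<std::thread> workers;
    for (const auto& f : files) workers.emplace_back(ReadFile, f);
    for (auto& t : workers) t.join();
  }
  return 0;
}
{code}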

Results:
||Disk||Read Mode||Num. I/O Threads||% of Max Performance||
|SSD|Serial|1|89%|
|SSD|Parallel|1|76%|
|SSD|Serial|8|87%|
|SSD|Parallel|8|97%|
|HDD|Serial|1|78%|
|HDD|Parallel|1|55%|
|HDD|Serial|8|99.8%|
|HDD|Parallel|8|83%|

Conclusion:

If Arrow simply tries to read everything in parallel and allocates multiple threads for I/O, then we are probably fine in most cases, but we will underperform by about 15% when running on an HDD.
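
For reference, the "multiple threads for I/O" configuration maps onto an existing knob: Arrow's C++ library exposes the size of its shared I/O thread pool. A minimal sketch of setting it (assuming the arrow::io::SetIOThreadPoolCapacity API from arrow/io/interfaces.h):

{code:cpp}
// Sketch: raise the shared I/O thread pool to 8 workers, roughly the
// "8 I/O threads" configuration measured in the table above.
#include <arrow/io/interfaces.h>
#include <arrow/status.h>

#include <iostream>

int main() {
  arrow::Status st = arrow::io::SetIOThreadPoolCapacity(8);
  if (!st.ok()) {
    std::cerr << st.ToString() << std::endl;
    return 1;
  }
  std::cout << "I/O threads: " << arrow::io::GetIOThreadPoolCapacity()
            << std::endl;
  return 0;
}
{code}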

I need to run my experiments on AWS to see how S3 and EBS perform under this load.  Some EBS volumes are backed by an HDD and may underperform.  For S3 we would likely need even more extensive experiments to see how many files we can truly read at once; the answer may depend on how S3 actually does replication.


> [C++] Filesystem aware disk scheduling
> --------------------------------------
>
>                 Key: ARROW-11583
>                 URL: https://issues.apache.org/jira/browse/ARROW-11583
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>
> Different filesystems have different ideal access strategies.  For example:
> AWS S3: Unlimited parallelism? No penalty for random access?
> AWS EBS: Depends
> SSD: Bounded parallelism (number of hardware contexts), penalty for random access within a context.
> HDD: Very limited parallelism (usually 1), penalty for random access
> Currently, Arrow does not factor these access strategies into I/O scheduling.  For example, when reading a dataset of 100 files, it will start reading X files at once (where X is the parallelism of the thread pool).  For AWS this is ideal.  For an HDD it is not.
> The OS does have a scheduler which attempts to mitigate this, but it does not know the scope of the I/O or the dependencies among the I/O requests (e.g. in the dataset read example above, it's better to read X quickly and then Y quickly than to read X and Y slowly at the same time).  I've run some experiments (see comment) which show the OS scheduler fails to achieve ideal performance in fairly typical cases.
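>
> A hypothetical sketch of what a filesystem-aware read policy could look like (all names invented for illustration; nothing like this exists in Arrow today):
>
> {code:cpp}
> // Hypothetical: each filesystem advertises how many reads it wants in
> // flight and whether random access is cheap; the dataset scheduler
> // would consult this instead of blindly using the thread pool size.
> struct IOPolicy {
>   int max_concurrent_reads;  // HDD: 1, SSD: ~num hardware contexts, S3: many
>   bool random_access_cheap;  // false for HDD, true-ish for SSD/S3
> };
>
> class FileSystemWithPolicy {
>  public:
>   virtual ~FileSystemWithPolicy() = default;
>   virtual IOPolicy io_policy() const { return {8, true}; }  // default guess
> };
>
> class HddFileSystem : public FileSystemWithPolicy {
>  public:
>   // Spinning disk: keep reads serial to avoid seek penalties.
>   IOPolicy io_policy() const override { return {1, false}; }
> };
> {code}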


