Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/08 01:20:42 UTC

[GitHub] [arrow] westonpace edited a comment on pull request #11616: ARROW-14577: [C++] Enable fine grained IO for async IPC reader

westonpace edited a comment on pull request #11616:
URL: https://github.com/apache/arrow/pull/11616#issuecomment-995349558


   I dug into the performance a bit more for the small files case (I'll do S3 soon but I think I want to do real S3 and not minio since the former supports parallelism and the latter, attached to my HDD, does not).
   
   Note: Asynchronous readers in these tests are not being consumed in parallel, so we wait until a batch is returned before reading the next batch.  However, asynchronous readers still issue parallel reads and use threads.  Reading a single batch that needs 8 columns will trigger 8 parallel reads.
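
   To make that access pattern concrete, here is a minimal sketch of the consumption style, assuming the `GetRecordBatchGenerator()` entry point on `RecordBatchFileReader` (available in recent Arrow C++); the function name and error handling are illustrative, not taken from the benchmark code:

   ```cpp
   #include <string>

   #include <arrow/api.h>
   #include <arrow/io/api.h>
   #include <arrow/ipc/api.h>

   // Read every batch from an IPC file, awaiting each batch before requesting
   // the next one.  The read of a single batch can still fan out into one I/O
   // request per selected column, which is where the parallelism mentioned
   // above comes from.
   arrow::Status ConsumeSequentially(const std::string& path) {
     ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::ReadableFile::Open(path));
     ARROW_ASSIGN_OR_RAISE(auto reader,
                           arrow::ipc::RecordBatchFileReader::Open(file));
     ARROW_ASSIGN_OR_RAISE(auto generator, reader->GetRecordBatchGenerator());
     while (true) {
       auto batch_future = generator();
       // Block here instead of pipelining further batch requests.
       ARROW_ASSIGN_OR_RAISE(auto batch, batch_future.result());
       if (batch == nullptr) break;  // a null batch marks the end of the stream
       // ... consume batch ...
     }
     return arrow::Status::OK();
   }
   ```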
   
   Note: Even the synchronous reader will use parallel reads if only a subset of the columns is targeted.  It uses the IoRecordedRandomAccessFile, which in turn uses the read range cache to perform reads in parallel.
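
   For reference, this is roughly what the partial-read path looks like from the caller's side; the column subset is expressed via `IpcReadOptions::included_fields` (a sketch, not the benchmark code itself):

   ```cpp
   #include <string>

   #include <arrow/api.h>
   #include <arrow/io/api.h>
   #include <arrow/ipc/api.h>

   // Synchronous read of a subset of columns.  Only the byte ranges of the
   // selected fields need to be fetched, which is what lets the reader issue
   // coalesced / parallel range reads under the hood.
   arrow::Status ReadTwoColumns(const std::string& path) {
     ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::ReadableFile::Open(path));
     arrow::ipc::IpcReadOptions options;
     options.included_fields = {0, 3};  // top-level field indices to materialize
     ARROW_ASSIGN_OR_RAISE(
         auto reader, arrow::ipc::RecordBatchFileReader::Open(file, options));
     for (int i = 0; i < reader->num_record_batches(); ++i) {
       ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(i));
       // ... consume batch ...
     }
     return arrow::Status::OK();
   }
   ```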
   
   ### Hot In-Memory Memory Mapped (also, arrow::io::BufferReader)
   
   Asynchronous reads should never be used in this case.  A "read" is just pointer arithmetic.  There are no copies.  I didn't benchmark this case.
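
   (For illustration, a "read" against a `BufferReader` is just a zero-copy slice of the wrapped buffer, something like the sketch below, so there is no I/O to overlap.)

   ```cpp
   #include <cassert>

   #include <arrow/api.h>
   #include <arrow/io/api.h>

   // The buffer returned by BufferReader::Read points into the original
   // allocation; no bytes are copied.
   arrow::Status ZeroCopyRead(const std::shared_ptr<arrow::Buffer>& buffer) {
     arrow::io::BufferReader source(buffer);
     ARROW_ASSIGN_OR_RAISE(auto slice, source.Read(/*nbytes=*/64));
     assert(slice->data() == buffer->data());  // same memory, no copy was made
     return arrow::Status::OK();
   }
   ```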
   
   ### Cold On-Disk Memory Mapped
   
   I did not test this.  I'm not sure if it is an interesting case or not.
   
   ### Hot In-Memory Regular File
   
   Cherry-picking some interesting cases (note: the rate here is based on the total buffer size of the selected columns, so selecting fewer columns shouldn't by itself yield a higher rate):
   
   | Sync/Async | # of columns | # of columns selected | Rate (bytes/s) | Note |
   | - | - | - | - | - |
   | Sync | 16 | 16 | 9.79967G/s | Seems a limit on my machine for 1-thread DRAM bandwidth |
   | Sync | 16 | 2 | 12.8979G/s | Parallel reads increase DRAM bandwidth |
   | Sync | 256 | 256 | 8.73684G/s | Starting to hit CPU bottleneck from excess metadata |
   | Sync | 256 | 32 | 7.28792G/s | Since we are throttled on metadata / CPU, perf gets worse |
   | Async | 16 | 16 | 2.58248G/s | Async is quite a bit worse than baseline for full reads |
   | Async | 16 | 2 | 13.9343G/s | Async perf is similar on partial reads |
   | Async | 256 | 256 | 2.4068G/s | |
   | Async | 256 | 32 | 6.8774G/s | |
   | Old-Async | 16 | 16 | 2.84301G/s | Old implementation has slightly lower overhead I think |
   | Old-Async | 16 | 2 | 556.501M/s | Old implementation does not handle partial reads well |
   | Old-Async | 256 | 256 | 2.78802G/s | |
   | Old-Async | 256 | 32 | 459.484M/s | |
   
   Conclusions: This change significantly improves the performance of partial async reads, to the point where partial async reads on "well-formed" files (data >> metadata) are comparable to sync partial reads.
   
   Async full reads are still considerably worse than sync full reads, which is surprising but possibly due to threading overhead.  This is worth investigating in a future PR.
   
   ### Cold On-Disk Regular File
   
   | Sync/Async | # of columns | # of columns selected | Rate (bytes/s) | Note |
   | - | - | - | - | - |
   | Sync | 16 | 16 | 111.044M/s | Baseline, HDD throughput |
   | Sync | 16 | 2 | 25.205M/s | Surprising, more below |
   | Sync | 256 | 256 | 99.8336M/s | |
   | Sync | 256 | 32 | 15.2597M/s | Surprising |
   | Async | 16 | 16 | 98.5425M/s | Similar to sync, within noise but did consistently seem a bit lower |
   | Async | 16 | 2 | 54.1136M/s | |
   | Async | 256 | 256 | 96.5957M/s | |
   | Async | 256 | 32 | 11.911M/s | Within noise of sync result actually, seems to bottom out around a noisy 10-16 |
   | Old-Async | 16 | 16 | 138.266M/s | Not just noise, old async real-file is consistently better than sync |
   | Old-Async | 16 | 2 | 17.4384M/s | |
   | Old-Async | 256 | 256 | 123.721M/s | |
   | Old-Async | 256 | 32 | 16.4605M/s | |
   
   Conclusions: This change does improve the performance of partial async reads.  However, it seems to come at the expense of full async reads.  David's suggestion to fall back to a full-file read should alleviate this.
   
   In all cases the performance of partial reads deteriorates quickly.  This is because we are essentially falling back to either "reading too much" (Old-Async) or random reads.  The random read rates line up with using `fio` to benchmark my disk.  At 16 batches the data blocks are 520KB and with `fio` random reads @ 520KB ~ 45MBps.  At 256 batches the data blocks are 32KB and with `fio` I get ~4MBps (either `fio` is too pessimistic or we are able to take advantage of the pseudo-sequential nature of the reads).  (Each column holds roughly 8MB in these files, so spreading it across more batches shrinks the contiguous block that can be read per column per batch.)
   
   ### Remaining tasks
   
   - [ ] Add fallback to full-file read for async
   - [x] Investigate S3
   - [ ] Investigate multi-threaded local reads (both multiple files and consuming in parallel)
   - [ ] Recommend that users structure record batches so that each column contains at least 4MB of data if they plan to read from disk (see the sketch below).
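
   For the last point, the guideline could be applied at write time with something like the sketch below (a hypothetical helper, not part of this PR; it assumes 8-byte fixed-width columns when turning 4MB into a row count):

   ```cpp
   #include <string>

   #include <arrow/api.h>
   #include <arrow/io/api.h>
   #include <arrow/ipc/api.h>

   // Write a table so that each 8-byte fixed-width column gets at least ~4MB
   // per record batch: 4 * 1024 * 1024 / 8 = 524288 rows per batch.
   arrow::Status WriteWithLargeBatches(const std::shared_ptr<arrow::Table>& table,
                                       const std::string& path) {
     constexpr int64_t kRowsPerBatch = 4 * 1024 * 1024 / sizeof(double);
     ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::FileOutputStream::Open(path));
     ARROW_ASSIGN_OR_RAISE(auto writer,
                           arrow::ipc::MakeFileWriter(sink, table->schema()));
     ARROW_RETURN_NOT_OK(writer->WriteTable(*table, /*max_chunksize=*/kRowsPerBatch));
     return writer->Close();
   }
   ```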
   

