You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@arrow.apache.org by "Ziheng Wang (Jira)" <ji...@apache.org> on 2022/08/19 21:56:00 UTC

[jira] [Created] (ARROW-17481) Major performance improvements to CSV reading from S3

Ziheng Wang created ARROW-17481:
-----------------------------------

Summary: Major performance improvements to CSV reading from S3
Key: ARROW-17481
URL: https://issues.apache.org/jira/browse/ARROW-17481
Project: Apache Arrow
Issue Type: Improvement
Components: C++, Python
Reporter: Ziheng Wang
Assignee: Ziheng Wang

The current dataset reader for CSV is pretty slow on EC2 reading from S3.

EC2 instances have more than 3Gbps network bandwidth which make them on par with SSD. However reading batches from disk is more than 3x faster than reading from network. This should not happen.

The reason why the dataset reader is not fully leveraging the network bandwidth is because reads are currently serial. We should change the reads to be parallel. Then even if the rest of the pipeline is not parallel we should get same read speed as disk.

Note one might think that if you have many fragments fragment-level parallelism will take care of this. This is true to some extent however to_batches() is ordered. This means that if your fragments are big the fragment readahead will stop being effective after a while as the reader tries to deplete the fragments in order. The batch readahead for the CSV reader current is a serial readahead, which really should be a parallel readahead.

After changing the network IO to be parallel, we should also change the parse and decode to be parallel. It's easy to change the parse to be parallel, a bit harder for the decode because of how the decoder operator works, so I will just tackle the parse first.

On my test system (i3.2xlarge on EC2 reading from S3 one large CSV), these changes (parallel ) made reading 60 batches (~10GB) 4x faster.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)