You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "Ziheng Wang (Jira)" <ji...@apache.org> on 2022/08/19 21:56:00 UTC

[jira] [Created] (ARROW-17481) Major performance improvements to CSV reading from S3

Ziheng Wang created ARROW-17481:
-----------------------------------

             Summary: Major performance improvements to CSV reading from S3
                 Key: ARROW-17481
                 URL: https://issues.apache.org/jira/browse/ARROW-17481
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++, Python
            Reporter: Ziheng Wang
            Assignee: Ziheng Wang


The current dataset reader for CSV is pretty slow on EC2 reading from S3.

EC2 instances have more than 3Gbps network bandwidth which make them on par with SSD. However reading batches from disk is more than 3x faster than reading from network. This should not happen.

The reason why the dataset reader is not fully leveraging the network bandwidth is because reads are currently serial. We should change the reads to be parallel. Then even if the rest of the pipeline is not parallel we should get same read speed as disk.

Note one might think that if you have many fragments fragment-level parallelism will take care of this. This is true to some extent however to_batches() is ordered. This means that if your fragments are big the fragment readahead will stop being effective after a while as the reader tries to deplete the fragments in order. The batch readahead for the CSV reader current is a serial readahead, which really should be a parallel readahead.

After changing the network IO to be parallel, we should also change the parse and decode to be parallel. It's easy to change the parse to be parallel, a bit harder for the decode because of how the decoder operator works, so I will just tackle the parse first.

On my test system (i3.2xlarge on EC2 reading from S3 one large CSV), these changes (parallel ) made reading 60 batches (~10GB) 4x faster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)