Posted to jira@arrow.apache.org by "Dror Speiser (Jira)" <ji...@apache.org> on 2020/10/16 13:27:00 UTC

[jira] [Commented] (ARROW-10308) [Python] read_csv from python is slow on some work loads

    [ https://issues.apache.org/jira/browse/ARROW-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215393#comment-17215393 ] 

Dror Speiser commented on ARROW-10308:
--------------------------------------

Thanks for the suggestions :) I am indeed getting the files from a third party, and I'm converting them to Parquet on arrival using Arrow. I'm actually content with 0.5 GiB/s. I'm here because I saw a tweet by Wes McKinney saying that the CSV parser in Arrow is "extremely fast". I tweeted back my results and he suggested that I open an issue.
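
For context, the conversion step is essentially just read_csv followed by write_table. A minimal sketch, with placeholder paths rather than my exact script:

    import pyarrow.csv as pa_csv
    import pyarrow.parquet as pq

    # Read the incoming CSV into an Arrow table (all options left at their defaults).
    table = pa_csv.read_csv("incoming/data.csv")  # placeholder path

    # Write the same table back out as Parquet.
    pq.write_table(table, "converted/data.parquet")  # placeholder path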

I would like to note that the numbers don't quite add up. If the CPU usage is totally accounted for by the operations of parsing and building arrays, then that would mean that a single processor is doing between 0.06 and 0.13 GiB/s, which is very slow.

When I run the benchmark without threads I get 0.3 GiB/s, which is reasonable for a single processor. But it also means that the 48 vCPUs I have are very far from achieving a linear speedup, which is in line with my profiling (though the attached images are for a block size of 1 MB). Do you see a linear speedup on your machine?
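
To be concrete, the single-threaded number comes from a measurement shaped roughly like the sketch below; the path and block size are placeholders, and the full benchmark script is the one attached / in the linked repo:

    import os
    import time
    import pyarrow.csv as pa_csv

    def bench(path, use_threads, block_size=1 << 20):
        # Time one read_csv call with or without Arrow's internal thread pool.
        opts = pa_csv.ReadOptions(use_threads=use_threads, block_size=block_size)
        start = time.time()
        pa_csv.read_csv(path, read_options=opts)
        elapsed = time.time() - start
        gib = os.path.getsize(path) / 2**30
        print(f"use_threads={use_threads}: {gib / elapsed:.2f} GiB/s")

    bench("data.csv", use_threads=True)   # placeholder path
    bench("data.csv", use_threads=False)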

As for processing CSVs being costly in general, I'm not familiar enough with other libraries to say, but I am familiar with the simdjson library, which claims to parse JSON files at over 2 GiB/s on a single core. I'm looking at the code of both projects, hoping I'll be able to contribute something from simdjson to the CSV parser in Arrow.

> [Python] read_csv from python is slow on some work loads
> --------------------------------------------------------
>
>                 Key: ARROW-10308
>                 URL: https://issues.apache.org/jira/browse/ARROW-10308
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 1.0.1
>         Environment: Machine: Azure, 48 vCPUs, 384 GiB RAM
> OS: Ubuntu 18.04
> Dockerfile and script: attached, or here: https://github.com/drorspei/arrow-csv-benchmark
>            Reporter: Dror Speiser
>            Priority: Minor
>              Labels: csv, performance
>         Attachments: Dockerfile, arrow-csv-benchmark-plot.png, arrow-csv-benchmark-times.csv, benchmark-csv.py, profile1.svg, profile2.svg, profile3.svg, profile4.svg
>
>
> Hi!
> I've noticed that `pyarrow.csv.read_csv` can be slow on real workloads, processing data at around 0.5 GiB/s. "Real workloads" means many string, float, and all-null columns, and large file sizes (5-10 GiB), though the file size didn't matter too much.
> Moreover, profiling a little bit with py-spy, it seems that maybe 30-50% of the time is spent on shared pointer lock mechanisms (though I'm not sure if this is to be trusted). I've attached the dumps in SVG format.
> I've also attached a script and a Dockerfile to run a benchmark, which reproduces the speeds I see. Building the Docker image and running it on a large Azure machine, I get speeds around 0.3-1.0 GiB/s, mostly around 0.5 GiB/s.
> This is all also available here: https://github.com/drorspei/arrow-csv-benchmark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)