You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Jonathan Keane (Jira)" <ji...@apache.org> on 2021/02/03 15:48:00 UTC
[jira] [Comment Edited] (ARROW-11433) [R] Unexpectedly slow results
reading csv
[ https://issues.apache.org/jira/browse/ARROW-11433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278089#comment-17278089 ]
Jonathan Keane edited comment on ARROW-11433 at 2/3/21, 3:47 PM:
-----------------------------------------------------------------
I looked at allocators again this morning (building mimalloc and testing against that and re-testing the system allocator).
Mimalloc does not have the issue (and on this particular data set is 2x faster 😱).
The system allocator is *a lot* more variable. The pattern isn't quite the same, but I'm getting iterations that go from 7seconds (on the first run) and then up to 10–14s on all subsequent runs, though it doesn't have the increasing / stairstep pattern that jemalloc does.
Which makes me now suspect jemalloc specifically. I'm going to look into the options from that ticket
was (Author: jonkeane):
I looked at allocators again this morning (building mimalloc and testing against that and re-testing the system allocator).
Mimalloc does not have the issue (and on this particular data set is 2x faster 😱).
The system allocator is *a lot* more variable. The pattern isn't quite the same, but I'm getting iterations that go from 7seconds (on the first run) up to 10–12s on subsequent run, though it doesn't have the increasing / stairstep pattern that jemalloc does.
Which makes me now suspect jemalloc specifically. I'm going to look into the options from that ticket
> [R] Unexpectedly slow results reading csv
> -----------------------------------------
>
> Key: ARROW-11433
> URL: https://issues.apache.org/jira/browse/ARROW-11433
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Jonathan Keane
> Priority: Minor
>
> This came up working on benchmarking Arrow's CSV reading. As far as I can tell this only impacts R, and only when reading the csv into arrow (but not pulling it in to R). It appears that most arrow interactions after the csv is read will result in this behavior not happening.
> What I'm seeing is that on subsequent reads, the time to read gets longer and longer (frequently in a stair step pattern where every other iteration takes longer).
> {code:r}
> > system.time({
> + for (i in 1:10) {
> + print(system.time(tab <- read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
> + tab <- NULL
> + }
> + })
> user system elapsed
> 24.788 19.485 7.216
> user system elapsed
> 24.952 21.786 9.225
> user system elapsed
> 25.150 23.039 10.332
> user system elapsed
> 25.382 31.012 17.995
> user system elapsed
> 25.309 25.140 12.356
> user system elapsed
> 25.302 26.975 13.938
> user system elapsed
> 25.509 34.390 21.134
> user system elapsed
> 25.674 28.195 15.048
> user system elapsed
> 25.031 28.094 16.449
> user system elapsed
> 25.825 37.165 23.379
> # total time:
> user system elapsed
> 256.178 299.671 175.119
> {code}
> Interestingly, doing something as unrelated as {{arrow:::default_memory_pool()}} which is [only getting the default memory pool|https://github.com/apache/arrow/blob/f291cd7b96463a2efd40a976123c64fad5c01058/r/src/memorypool.cpp#L68-L70]. Other interactions totally unrelated to the table also similarly alleviate this behavior (e.g. {{empty_tab <- Table$create(data.frame())}}) or proactively invalidating with {{tab$invalidate()}}
> {code:r}
> > system.time({
> + for (i in 1:10) {
> + print(system.time(tab <- read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
> + pool <- arrow:::default_memory_pool()
> + tab <- NULL
> + }
> + })
> user system elapsed
> 25.257 19.475 6.785
> user system elapsed
> 25.271 19.838 6.821
> user system elapsed
> 25.288 20.103 6.861
> user system elapsed
> 25.188 20.290 7.217
> user system elapsed
> 25.283 20.043 6.832
> user system elapsed
> 25.194 19.947 6.906
> user system elapsed
> 25.278 19.993 6.834
> user system elapsed
> 25.355 20.018 6.833
> user system elapsed
> 24.986 19.869 6.865
> user system elapsed
> 25.130 19.878 6.798
> # total time:
> user system elapsed
> 255.381 210.598 83.109 ​
> >
> {code}
> I've tested this against Arrow 3.0.0, 2.0.0, and 1.0.0 and all experience the same behavior.
> I checked against pyarrow, and do not see the same:
> {code:python}
> from pyarrow import csv
> import time
> for i in range(1, 10):
> start = time.time()
> table = csv.read_csv("r/source_data/nyctaxi_2010-01.csv")
> print(time.time() - start)
> del table
> {code}
> results:
> {code}
> 7.586184978485107
> 7.542470932006836
> 7.92852783203125
> 7.647372007369995
> 7.742412805557251
> 8.101378917694092
> 7.7359960079193115
> 7.843957901000977
> 7.6457719802856445
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)