You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "Jonathan Keane (Jira)" <ji...@apache.org> on 2021/02/03 00:04:00 UTC
[jira] [Commented] (ARROW-11433) [R] Unexpectedly slow results
reading csv
[ https://issues.apache.org/jira/browse/ARROW-11433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277537#comment-17277537 ]
Jonathan Keane commented on ARROW-11433:
----------------------------------------
Ben and I spent some time on this today. Turns out it's not reproducible on Ubuntu as far as we tried. So we suspect something specific with macOS.
As another example of even the most basic cpp <-> R interaction alleviating the behavior:
{code:r}
> system.time({
+ for (i in 1:10) {
+ print(system.time(tab <- read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
+ dev_null <- arrow::CsvParseOptions$create()
+ tab <- NULL
+ }
+ })
user system elapsed
27.894 22.633 12.224
user system elapsed
25.016 19.855 8.576
user system elapsed
25.092 20.625 8.051
user system elapsed
25.263 21.161 8.353
user system elapsed
25.168 20.575 8.745
user system elapsed
25.087 20.207 7.672
user system elapsed
25.311 20.525 7.438
user system elapsed
25.013 20.537 7.962
user system elapsed
25.088 20.671 8.371
user system elapsed
25.409 20.375 7.659
user system elapsed
258.004 224.511 106.238
{code}
> [R] Unexpectedly slow results reading csv
> -----------------------------------------
>
> Key: ARROW-11433
> URL: https://issues.apache.org/jira/browse/ARROW-11433
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Reporter: Jonathan Keane
> Priority: Minor
>
> This came up working on benchmarking Arrow's CSV reading. As far as I can tell this only impacts R, and only when reading the csv into arrow (but not pulling it in to R). It appears that most arrow interactions after the csv is read will result in this behavior not happening.
> What I'm seeing is that on subsequent reads, the time to read gets longer and longer (frequently in a stair step pattern where every other iteration takes longer).
> {code:r}
> > system.time({
> + for (i in 1:10) {
> + print(system.time(tab <- read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
> + tab <- NULL
> + }
> + })
> user system elapsed
> 24.788 19.485 7.216
> user system elapsed
> 24.952 21.786 9.225
> user system elapsed
> 25.150 23.039 10.332
> user system elapsed
> 25.382 31.012 17.995
> user system elapsed
> 25.309 25.140 12.356
> user system elapsed
> 25.302 26.975 13.938
> user system elapsed
> 25.509 34.390 21.134
> user system elapsed
> 25.674 28.195 15.048
> user system elapsed
> 25.031 28.094 16.449
> user system elapsed
> 25.825 37.165 23.379
> # total time:
> user system elapsed
> 256.178 299.671 175.119
> {code}
> Interestingly, doing something as unrelated as {{arrow:::default_memory_pool()}} which is [only getting the default memory pool|https://github.com/apache/arrow/blob/f291cd7b96463a2efd40a976123c64fad5c01058/r/src/memorypool.cpp#L68-L70]. Other interactions totally unrelated to the table also similarly alleviate this behavior (e.g. {{empty_tab <- Table$create(data.frame())}}) or proactively invalidating with {{tab$invalidate()}}
> {code:r}
> > system.time({
> + for (i in 1:10) {
> + print(system.time(tab <- read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
> + pool <- arrow:::default_memory_pool()
> + tab <- NULL
> + }
> + })
> user system elapsed
> 25.257 19.475 6.785
> user system elapsed
> 25.271 19.838 6.821
> user system elapsed
> 25.288 20.103 6.861
> user system elapsed
> 25.188 20.290 7.217
> user system elapsed
> 25.283 20.043 6.832
> user system elapsed
> 25.194 19.947 6.906
> user system elapsed
> 25.278 19.993 6.834
> user system elapsed
> 25.355 20.018 6.833
> user system elapsed
> 24.986 19.869 6.865
> user system elapsed
> 25.130 19.878 6.798
> # total time:
> user system elapsed
> 255.381 210.598 83.109
> >
> {code}
> I've tested this against Arrow 3.0.0, 2.0.0, and 1.0.0 and all experience the same behavior.
> I checked against pyarrow, and do not see the same:
> {code:python}
> from pyarrow import csv
> import time
> for i in range(1, 10):
> start = time.time()
> table = csv.read_csv("r/source_data/nyctaxi_2010-01.csv")
> print(time.time() - start)
> del table
> {code}
> results:
> {code}
> 7.586184978485107
> 7.542470932006836
> 7.92852783203125
> 7.647372007369995
> 7.742412805557251
> 8.101378917694092
> 7.7359960079193115
> 7.843957901000977
> 7.6457719802856445
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)