Posted to jira@arrow.apache.org by "Jonathan Keane (Jira)" <ji...@apache.org> on 2021/01/29 22:15:00 UTC
[jira] [Created] (ARROW-11433) [R] Unexpectedly slow results reading csv
Jonathan Keane created ARROW-11433:
--------------------------------------
Summary: [R] Unexpectedly slow results reading csv
Key: ARROW-11433
URL: https://issues.apache.org/jira/browse/ARROW-11433
Project: Apache Arrow
Issue Type: Bug
Components: R
Reporter: Jonathan Keane
This came up while benchmarking Arrow's CSV reading. As far as I can tell this only impacts R, and only when reading the csv into Arrow (but not when pulling it into an R data frame). It appears that most Arrow interactions after the csv is read prevent this behavior from happening.
What I'm seeing is that subsequent reads take longer and longer, frequently in a stair-step pattern where every other iteration is slower.
{code:r}
> system.time({
+   for (i in 1:10) {
+     print(system.time(tab <- read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
+     tab <- NULL
+   }
+ })
user system elapsed
24.788 19.485 7.216
user system elapsed
24.952 21.786 9.225
user system elapsed
25.150 23.039 10.332
user system elapsed
25.382 31.012 17.995
user system elapsed
25.309 25.140 12.356
user system elapsed
25.302 26.975 13.938
user system elapsed
25.509 34.390 21.134
user system elapsed
25.674 28.195 15.048
user system elapsed
25.031 28.094 16.449
user system elapsed
25.825 37.165 23.379
# total time:
user system elapsed
256.178 299.671 175.119
{code}
Interestingly, doing something as unrelated as calling {{arrow:::default_memory_pool()}}, which is [only getting the default memory pool|https://github.com/apache/arrow/blob/f291cd7b96463a2efd40a976123c64fad5c01058/r/src/memorypool.cpp#L68-L70], alleviates this behavior. Other interactions totally unrelated to the table do the same (e.g. {{empty_tab <- Table$create(data.frame())}}):
{code:r}
> system.time({
+   for (i in 1:10) {
+     print(system.time(tab <- read_csv_arrow("source_data/nyctaxi_2010-01.csv", as_data_frame = FALSE)))
+     pool <- arrow:::default_memory_pool()
+     tab <- NULL
+   }
+ })
user system elapsed
25.257 19.475 6.785
user system elapsed
25.271 19.838 6.821
user system elapsed
25.288 20.103 6.861
user system elapsed
25.188 20.290 7.217
user system elapsed
25.283 20.043 6.832
user system elapsed
25.194 19.947 6.906
user system elapsed
25.278 19.993 6.834
user system elapsed
25.355 20.018 6.833
user system elapsed
24.986 19.869 6.865
user system elapsed
25.130 19.878 6.798
# total time:
user system elapsed
255.381 210.598 83.109
>
{code}
I've tested this against Arrow 3.0.0, 2.0.0, and 1.0.0 and all experience the same behavior.
I checked against pyarrow and do not see the same behavior:
{code:python}
from pyarrow import csv
import time

for i in range(1, 10):
    start = time.time()
    table = csv.read_csv("r/source_data/nyctaxi_2010-01.csv")
    print(time.time() - start)
    del table
{code}
results:
{code}
7.586184978485107
7.542470932006836
7.92852783203125
7.647372007369995
7.742412805557251
8.101378917694092
7.7359960079193115
7.843957901000977
7.6457719802856445
{code}
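For anyone wanting to check mechanically whether a reader exhibits the stair-step slowdown shown above, one approach is to time repeated reads and compare the mean of the later iterations against the earlier ones. Below is a minimal, self-contained sketch of that methodology; it uses Python's stdlib {{csv}} module on a generated throwaway file rather than pyarrow on the taxi data, purely to illustrate the timing harness (the file, row count, and 10-iteration count are all illustrative choices, not from the original benchmark):

{code:python}
import csv
import os
import statistics
import tempfile
import time

# Generate a small throwaway CSV so the sketch is self-contained.
# (The original report reads nyctaxi_2010-01.csv; this stand-in only
# exercises the timing harness, not Arrow itself.)
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["a", "b", "c"])
    for i in range(10_000):
        writer.writerow([i, i * 2, i * 3])

def read_once(p):
    """One full parse of the file, discarding rows as we go."""
    with open(p, newline="") as f:
        return sum(1 for _ in csv.reader(f))

# Mirror the report's loop: read repeatedly, dropping the result
# between iterations, and record each iteration's wall-clock time.
timings = []
for _ in range(10):
    start = time.perf_counter()
    rows = read_once(path)
    timings.append(time.perf_counter() - start)

os.remove(path)

# A healthy reader should not slow down across iterations: the mean
# of the last half of the runs should stay close to the first half.
# In the R transcript above, this ratio would be well above 1.
first, last = timings[:5], timings[5:]
ratio = statistics.mean(last) / statistics.mean(first)
print(f"rows={rows} slowdown_ratio={ratio:.2f}")
{code}

Applying the same split-halves comparison to the elapsed times in the first R transcript gives a ratio of roughly 1.7, versus roughly 1.0 for the run that calls {{arrow:::default_memory_pool()}} between reads.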