Posted to issues@arrow.apache.org by "Karl Dunkle Werner (Jira)" <ji...@apache.org> on 2020/07/25 20:43:00 UTC
[jira] [Created] (ARROW-9557) [R] Iterating over parquet columns is slow in R
Karl Dunkle Werner created ARROW-9557:
-----------------------------------------
Summary: [R] Iterating over parquet columns is slow in R
Key: ARROW-9557
URL: https://issues.apache.org/jira/browse/ARROW-9557
Project: Apache Arrow
Issue Type: Improvement
Components: R
Affects Versions: 1.0.0
Reporter: Karl Dunkle Werner
I've found that reading in a parquet file one column at a time is slow in R, much slower than reading the whole file at once in R or reading one column at a time in Python.
An example is below, though it's certainly possible I've done my benchmarking incorrectly.
Python setup and benchmarking:
{code:python}
import numpy as np
import pyarrow
import pyarrow.parquet as pq
from numpy.random import default_rng
from time import time
# Create a large, random array to save. ~1.5 GB.
rng = default_rng(seed = 1)
n_col = 4000
n_row = 50000
mat = rng.standard_normal((n_col, n_row))
col_names = [str(nm) for nm in range(n_col)]
tab = pyarrow.Table.from_arrays(mat, names=col_names)
pq.write_table(tab, "test_tab.parquet", use_dictionary=False)
# How long does it take to read the whole thing in python?
time_start = time()
_ = pq.read_table("test_tab.parquet")
elapsed = time() - time_start
print(elapsed) # under 1 second on my computer
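# How long does it take to read one column at a time in python?
time_start = time()
f = pq.ParquetFile("test_tab.parquet")
for one_col in col_names:
    # Pass a list: a bare string would be iterated character by character
    _ = f.read(columns=[one_col]).column(0)
elapsed = time() - time_start
print(elapsed) # about 2 seconds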
{code}
R benchmarking, using the same {{test_tab.parquet}} file:
{code:r}
library(arrow)
read_by_column <- function(f) {
  table <- ParquetFileReader$create(f)
  cols <- as.character(0:3999)
  purrr::walk(cols, ~table$ReadTable(.)$column(0))
}
bench::mark(
  read_parquet("test_tab.parquet", as_data_frame=FALSE), # 0.6 s
  read_parquet("test_tab.parquet", as_data_frame=TRUE), # 1 s
  read_by_column("test_tab.parquet"), # 100 s
  check=FALSE
)
{code}