Posted to issues@arrow.apache.org by "Karl Dunkle Werner (Jira)" <ji...@apache.org> on 2020/07/25 20:43:00 UTC

[jira] [Created] (ARROW-9557) [R] Iterating over parquet columns is slow in R

Karl Dunkle Werner created ARROW-9557:
-----------------------------------------

             Summary: [R] Iterating over parquet columns is slow in R
                 Key: ARROW-9557
                 URL: https://issues.apache.org/jira/browse/ARROW-9557
             Project: Apache Arrow
          Issue Type: Improvement
          Components: R
    Affects Versions: 1.0.0
            Reporter: Karl Dunkle Werner


I've found that reading in a parquet file one column at a time is slow in R, much slower than reading the whole file at once in R, or reading one column at a time in Python.

An example is below, though it's certainly possible I've done my benchmarking incorrectly.

 

Python setup and benchmarking:
{code:python}
import numpy as np
import pyarrow
import pyarrow.parquet as pq
from numpy.random import default_rng
from time import time

# Create a large, random array to save. ~1.5 GB.
rng = default_rng(seed = 1)
n_col = 4000
n_row = 50000

mat = rng.standard_normal((n_col, n_row))
col_names = [str(nm) for nm in range(n_col)]
tab = pyarrow.Table.from_arrays(mat, names=col_names)

pq.write_table(tab, "test_tab.parquet", use_dictionary=False)

# How long does it take to read the whole thing in python?
time_start = time()
_ = pq.read_table("test_tab.parquet")
elapsed = time() - time_start
print(elapsed) # under 1 second on my computer


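# How long does it take to read one column at a time?
time_start = time()
f = pq.ParquetFile("test_tab.parquet")
for one_col in col_names:
    # `columns` expects a list of column names, so wrap the single name
    _ = f.read(columns=[one_col]).column(0)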

elapsed = time() - time_start
print(elapsed) # about 2 seconds


{code}
R benchmarking, using the same {{test_tab.parquet}} file:
{code:r}
library(arrow)

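read_by_column <- function(f) {
    reader <- ParquetFileReader$create(f)
    # Column names are "0" through "3999", matching the Python setup
    cols <- as.character(0:3999)
    # Read each column separately through the same reader
    purrr::walk(cols, ~reader$ReadTable(.)$column(0))
}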

bench::mark(
    read_parquet("test_tab.parquet", as_data_frame=FALSE), #   0.6 s
    read_parquet("test_tab.parquet", as_data_frame=TRUE),  #   1 s
    read_by_column("test_tab.parquet"),                    # 100 s
    check=FALSE
)

{code}
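As a possible cross-check on the single-run Python timings, here is a minimal sketch of a repeated-timing helper. The names {{time_repeated}} and {{summarize}} are hypothetical and not part of the benchmark above, and the workload in the example is a stand-in; the sketch uses only the standard library.

```python
from statistics import median
from time import perf_counter


def time_repeated(func, repeats=5):
    """Call func() `repeats` times and return each call's duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = perf_counter()  # monotonic clock, better suited to timing than time()
        func()
        durations.append(perf_counter() - start)
    return durations


def summarize(durations):
    """Return (fastest, median) for a list of durations."""
    return min(durations), median(durations)


if __name__ == "__main__":
    # Stand-in workload (assumption). For the benchmark above this could be
    # something like: lambda: pq.read_table("test_tab.parquet")
    durations = time_repeated(lambda: sum(range(100_000)), repeats=5)
    fastest, typical = summarize(durations)
    print(f"min {fastest:.4f} s, median {typical:.4f} s")
```

Reporting the minimum and median over several runs should reduce the influence of one-off effects such as a cold file cache on the first read, though it would not change the order-of-magnitude gap reported for R.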


