Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2020/09/24 15:41:00 UTC
[jira] [Resolved] (ARROW-9557) [R] Iterating over parquet columns is slow in R
[ https://issues.apache.org/jira/browse/ARROW-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neal Richardson resolved ARROW-9557.
------------------------------------
Fix Version/s: 2.0.0
Resolution: Fixed
Issue resolved by pull request 8122
[https://github.com/apache/arrow/pull/8122]
> [R] Iterating over parquet columns is slow in R
> -----------------------------------------------
>
> Key: ARROW-9557
> URL: https://issues.apache.org/jira/browse/ARROW-9557
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Affects Versions: 1.0.0
> Reporter: Karl Dunkle Werner
> Assignee: Romain Francois
> Priority: Minor
> Labels: performance, pull-request-available
> Fix For: 2.0.0
>
> Attachments: profile_screenshot.png
>
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> I've found that reading in a parquet file one column at a time is slow in R – much slower than reading the whole file at once in R, or reading one column at a time in Python.
> An example is below, though it's certainly possible I've done my benchmarking incorrectly.
>
> Python setup and benchmarking:
> {code:python}
> import numpy as np
> import pyarrow
> import pyarrow.parquet as pq
> from numpy.random import default_rng
> from time import time
> # Create a large, random array to save. ~1.5 GB.
> rng = default_rng(seed = 1)
> n_col = 4000
> n_row = 50000
> mat = rng.standard_normal((n_col, n_row))
> col_names = [str(nm) for nm in range(n_col)]
> tab = pyarrow.Table.from_arrays(mat, names=col_names)
> pq.write_table(tab, "test_tab.parquet", use_dictionary=False)
> # How long does it take to read the whole thing in python?
> time_start = time()
> _ = pq.read_table("test_tab.parquet") # edit: corrected filename
> elapsed = time() - time_start
> print(elapsed) # under 1 second on my computer
> time_start = time()
> f = pq.ParquetFile("test_tab.parquet")
> for one_col in col_names:
>     _ = f.read([one_col]).column(0)
> elapsed = time() - time_start
> print(elapsed) # about 2 seconds
> {code}
> R benchmarking, using the same {{test_tab.parquet}} file
> {code:r}
> library(arrow)
> read_by_column <- function(f) {
>   table <- ParquetFileReader$create(f)
>   cols <- as.character(0:3999)
>   purrr::walk(cols, ~table$ReadTable(.)$column(0))
> }
> bench::mark(
>   read_parquet("test_tab.parquet", as_data_frame = FALSE), # 0.6 s
>   read_parquet("test_tab.parquet", as_data_frame = TRUE),  # 1 s
>   read_by_column("test_tab.parquet"), # 100 s
>   check = FALSE
> )
> {code}
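For readers who want to reproduce the per-column read pattern quickly, here is a minimal self-contained Python sketch. It shrinks the benchmark above to a small table so it runs in well under a second; the file name {{small_tab.parquet}} and the reduced sizes are illustrative, not from the original report.

{code:python}
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Small stand-in for the 4000 x 50000 table in the report.
n_col, n_row = 10, 100
rng = np.random.default_rng(seed=1)
arrays = [pa.array(rng.standard_normal(n_row)) for _ in range(n_col)]
col_names = [str(i) for i in range(n_col)]
tab = pa.Table.from_arrays(arrays, names=col_names)
pq.write_table(tab, "small_tab.parquet", use_dictionary=False)

# Per-column reads through a single ParquetFile handle, as in the benchmark.
f = pq.ParquetFile("small_tab.parquet")
columns = [f.read([name]).column(0) for name in col_names]
assert len(columns) == n_col
assert len(columns[0]) == n_row
{code}

Note that {{ParquetFile.read}} expects a list of column names; passing a bare string would be iterated character by character.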
--
This message was sent by Atlassian Jira
(v8.3.4#803005)