Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2020/09/24 15:41:00 UTC
[jira] [Resolved] (ARROW-9557) [R] Iterating over parquet columns is slow in R
[ https://issues.apache.org/jira/browse/ARROW-9557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neal Richardson resolved ARROW-9557.
------------------------------------
Fix Version/s: 2.0.0
Resolution: Fixed
Issue resolved by pull request 8122
[https://github.com/apache/arrow/pull/8122]
> [R] Iterating over parquet columns is slow in R
> -----------------------------------------------
>
> Key: ARROW-9557
> URL: https://issues.apache.org/jira/browse/ARROW-9557
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Affects Versions: 1.0.0
> Reporter: Karl Dunkle Werner
> Assignee: Romain Francois
> Priority: Minor
> Labels: performance, pull-request-available
> Fix For: 2.0.0
>
> Attachments: profile_screenshot.png
>
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> I've found that reading in a parquet file one column at a time is slow in R – much slower than reading the whole file at once in R, or reading one column at a time in Python.
> An example is below, though it's certainly possible I've done my benchmarking incorrectly.
>
> Python setup and benchmarking:
> {code:python}
> import numpy as np
> import pyarrow
> import pyarrow.parquet as pq
> from numpy.random import default_rng
> from time import time
> # Create a large, random array to save. ~1.5 GB.
> rng = default_rng(seed = 1)
> n_col = 4000
> n_row = 50000
> mat = rng.standard_normal((n_col, n_row))
> col_names = [str(nm) for nm in range(n_col)]
> tab = pyarrow.Table.from_arrays(mat, names=col_names)
> pq.write_table(tab, "test_tab.parquet", use_dictionary=False)
> # How long does it take to read the whole thing in python?
> time_start = time()
> _ = pq.read_table("test_tab.parquet") # edit: corrected filename
> elapsed = time() - time_start
> print(elapsed) # under 1 second on my computer
> time_start = time()
> f = pq.ParquetFile("test_tab.parquet")
> for one_col in col_names:
>     _ = f.read([one_col]).column(0)
> elapsed = time() - time_start
> print(elapsed) # about 2 seconds
> {code}
> R benchmarking, using the same {{test_tab.parquet}} file
> {code:r}
> library(arrow)
> read_by_column <- function(f) {
>   table <- ParquetFileReader$create(f)
>   cols <- as.character(0:3999)
>   purrr::walk(cols, ~table$ReadTable(.)$column(0))
> }
> bench::mark(
>   read_parquet("test_tab.parquet", as_data_frame = FALSE), # 0.6 s
>   read_parquet("test_tab.parquet", as_data_frame = TRUE),  # 1 s
>   read_by_column("test_tab.parquet"), # 100 s
>   check = FALSE
> )
> {code}
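For readers who want to reproduce the per-column read pattern quickly, here is a minimal self-contained Python sketch. It shrinks the benchmark above to a small table so it runs in well under a second; the file name {{small_tab.parquet}} and the reduced sizes are illustrative, not from the original report.

{code:python}
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Small stand-in for the 4000 x 50000 table in the report.
n_col, n_row = 10, 100
rng = np.random.default_rng(seed=1)
arrays = [pa.array(rng.standard_normal(n_row)) for _ in range(n_col)]
col_names = [str(i) for i in range(n_col)]
tab = pa.Table.from_arrays(arrays, names=col_names)
pq.write_table(tab, "small_tab.parquet", use_dictionary=False)

# Per-column reads through a single ParquetFile handle, as in the benchmark.
f = pq.ParquetFile("small_tab.parquet")
columns = [f.read([name]).column(0) for name in col_names]
assert len(columns) == n_col
assert len(columns[0]) == n_row
{code}

Note that {{ParquetFile.read}} expects a list of column names; passing a bare string would be iterated character by character.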
--
This message was sent by Atlassian Jira
(v8.3.4#803005)