Posted to jira@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2022/10/18 12:59:00 UTC
[jira] [Resolved] (ARROW-17559) [R][C++] Regression: big performance hit after removing schema binding
[ https://issues.apache.org/jira/browse/ARROW-17559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neal Richardson resolved ARROW-17559.
-------------------------------------
Assignee: Vibhatha Lakmal Abeykoon
Resolution: Fixed
> [R][C++] Regression: big performance hit after removing schema binding
> ----------------------------------------------------------------------
>
> Key: ARROW-17559
> URL: https://issues.apache.org/jira/browse/ARROW-17559
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, R
> Affects Versions: 9.0.0
> Environment: ubuntu 2020
> Reporter: Vitalie Spinu
> Assignee: Vibhatha Lakmal Abeykoon
> Priority: Major
> Labels: R, compute
> Fix For: 10.0.0
>
>
> After ARROW-15260 I observe large increases in memory use and compute time with basic summarise queries. In some cases my use case shows almost 10x the memory and 10x the computation time.
> Here is a less dramatic reproduction, modeled on my real use case, which gives a 2x time increase:
> {code:R}
> library(arrow)
> library(dplyr)  # bind_cols(), group_by(), summarise(), %>%
> library(glue)   # glue()
> library(purrr)  # map()
> dir.create(dir <- "/tmp/iris", showWarnings = FALSE)
> for (day in seq_len(100)) {
>   dir.create(glue("{dir}/day={day}"), showWarnings = FALSE)
>   for (i in seq_len(10)) {
>     # Build a wide table: 20 copies of iris with suffixed column names
>     dfs <- map(seq_len(20), function(j) {
>       names(iris) <- paste0(names(iris), j)
>       iris
>     })
>     df <- dplyr::bind_cols(!!!dfs)
>     write_parquet(df, glue("{dir}/day={day}/{i}.parquet"))
>   }
> }
> # Time a grouped count over the partitioned dataset
> system.time(
>   open_dataset("/tmp/iris") %>%
>     group_by(day, Species1) %>%
>     summarise(N = n(), .groups = "drop") %>%
>     collect())
> {code}
> Before commit 838687178: 0.33 sec; after: 0.73 sec.
> If I restore the schema binding that was removed [here|https://github.com/apache/arrow/pull/12826/files#diff-0d1ff6f17f571f6a348848af7de9c05ed588d3339f46dd3bcf2808489f7dca92L235], I get the performance back.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)