Posted to issues@arrow.apache.org by "Vitalie Spinu (Jira)" <ji...@apache.org> on 2022/08/29 22:13:00 UTC

[jira] [Created] (ARROW-17559) [R][C++] Regression: big performance hit after removing schema binding

Vitalie Spinu created ARROW-17559:
-------------------------------------

             Summary: [R][C++] Regression: big performance hit after removing schema binding
                 Key: ARROW-17559
                 URL: https://issues.apache.org/jira/browse/ARROW-17559
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, R
    Affects Versions: 9.0.1
         Environment: ubuntu 2020
            Reporter: Vitalie Spinu


After ARROW-15260 I observe big increases in memory use and compute time with basic summarise queries. In some cases my use case shows almost 10x increases in both memory and computation time.

Here is a less dramatic reproduction, modeled on my real use case, which shows a 2x increase in run time:

{code:R}

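  library(arrow)
  library(dplyr)  # mutate(), n(), bind_cols(), and the %>% pipe
  library(purrr)  # map()
  library(glue)   # glue()
  dir.create(dir <- "/tmp/iris", showWarnings = FALSE)
  for (day in seq_len(100)) {
    dir.create(glue("{dir}/day={day}"), showWarnings = FALSE)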
    for (i in seq_len(10)) {
      dfs <- map(seq_len(10), function(j) {
        df <- mutate(iris, A = as.factor(sample(3, n(), replace = TRUE)))
        names(df) <- paste0(names(df), j)
        df
      })
      df <- dplyr::bind_cols(!!!dfs)
      write_parquet(df, glue("{dir}/day={day}/{i}.parquet"))
    }
  }

  library(arrow)
  library(dplyr)  # group_by(), summarise(), collect(), and the %>% pipe
  system.time(
    open_dataset("/tmp/iris") %>%
    group_by(day, Species1) %>%
    summarise(N = n(), .groups = "drop") %>%
    collect())

{code}

Before commit 838687178: 0.2 sec; after: 0.4 sec.
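
A single system.time() call can be noisy, so the same query can be timed several times and the median compared across builds. The following is only a minimal sketch, assuming the dataset from the reproduction above exists in /tmp/iris; the helper name time_runs is hypothetical and not part of arrow.

```r
# Hypothetical helper (not part of arrow): call `f` `reps` times and
# return the elapsed wall-clock seconds of each call.
time_runs <- function(f, reps = 5) {
  vapply(seq_len(reps), function(i) system.time(f())[["elapsed"]], numeric(1))
}

# Re-run the query from the report only if its prerequisites are present.
have_pkgs <- requireNamespace("arrow", quietly = TRUE) &&
  requireNamespace("dplyr", quietly = TRUE)
if (have_pkgs && dir.exists("/tmp/iris")) {
  library(arrow)
  library(dplyr)
  run_query <- function() {
    open_dataset("/tmp/iris") %>%
      group_by(day, Species1) %>%
      summarise(N = n(), .groups = "drop") %>%
      collect()
  }
  elapsed <- time_runs(run_query)
  print(median(elapsed))  # median is less sensitive to a slow first run
}
```

Running this once per arrow build (before and after the commit) gives one median per build to compare.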

If I put back the schema binding which was removed [here|https://github.com/apache/arrow/pull/12826/files#diff-0d1ff6f17f571f6a348848af7de9c05ed588d3339f46dd3bcf2808489f7dca92L235] I get the performance back.


