Posted to jira@arrow.apache.org by "Will Jones (Jira)" <ji...@apache.org> on 2021/11/22 18:35:00 UTC

[jira] [Comment Edited] (ARROW-14727) Excessive memory usage on Windows

    [ https://issues.apache.org/jira/browse/ARROW-14727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446716#comment-17446716 ] 

Will Jones edited comment on ARROW-14727 at 11/22/21, 6:34 PM:
---------------------------------------------------------------

Hi András! I've started working on reproducing this, though I haven't had success yet. You might try using the profmem package as in the script below, or adapt the script to be closer to your data so that it starts reproducing the behavior.

I tested this on Windows 10 with Arrow 6.0.0 and R 4.1.2. If you run both open_dataset() calls in the same R session, you'll notice that whichever one you run first records a larger number of allocations. But I consistently saw the version selecting only i and j allocate less memory in total.

Let me know if there is some tweak to the script that would make it more like your situation, or what results you see with profmem.
{code:r}
library(dplyr)
library(tidyr)
library(arrow)
library(purrr)
library(profmem)

path <- "test_data"

# Create big dataset
rows_per_partition <- 5e6
i_values <- letters[1:4]
j_values <- letters[1:4]

rpartition <- function(n) {
  tibble(x=rnorm(n), y=rnorm(n), z=sample(letters, size=n, replace=TRUE))
}

ds <- expand_grid(i=i_values, j=j_values) %>%
  mutate(data = rerun(n(), rpartition(n=rows_per_partition))) %>%
  unnest(c("data"))

ds %>%
  group_by(i, j) %>%
  arrow::write_dataset(path, format="parquet")


# Try 1 : partition cols only
remove(ds)
gc()

p1 <- profmem({
  ds <- open_dataset(path) %>%
    select(i, j) %>%
    collect()
})
print(p1, expr = FALSE)


# Try 2 : add another column
remove(ds)
gc()


p2 <- profmem({
  ds <- open_dataset(path) %>%
    select(i, j, x) %>%
    collect()
})
print(p2, expr = FALSE)


sum(p1$bytes, na.rm=TRUE)
# 1280025656
sum(p2$bytes, na.rm=TRUE)
# 1934404384

{code}
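
One wrinkle with comparing both runs in the same session is that first-run overhead. As a rough workaround (just a sketch; it assumes the callr package is installed, and profile_query below is a made-up helper, not part of arrow), you could run each query in its own R process so the two measurements don't influence each other. Also keep in mind that profmem only records allocations made through R's allocator; Arrow's C++-side allocations would have to be checked separately, e.g. via arrow::default_memory_pool()$max_memory if your arrow version exposes it, or with an OS-level tool.
{code:r}
# Sketch: profile each query in a fresh R session via callr so neither run
# inherits allocations or caches from the other. profile_query is a
# hypothetical helper; `path` should point at the dataset written above.
library(callr)

profile_query <- function(path, extra_col = FALSE) {
  library(dplyr)
  library(arrow)
  library(profmem)
  p <- profmem({
    q <- open_dataset(path)
    q <- if (extra_col) select(q, i, j, x) else select(q, i, j)
    collect(q)
  })
  # Total bytes allocated through R's allocator during the query
  sum(p$bytes, na.rm = TRUE)
}

# Each call runs in its own R process and returns the byte total
bytes_ij  <- callr::r(profile_query, args = list(path = "test_data"))
bytes_ijx <- callr::r(profile_query, args = list(path = "test_data", extra_col = TRUE))

bytes_ij
bytes_ijx
{code}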



> Excessive memory usage on Windows
> ---------------------------------
>
>                 Key: ARROW-14727
>                 URL: https://issues.apache.org/jira/browse/ARROW-14727
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 6.0.0
>            Reporter: András Svraka
>            Priority: Major
>
> I have the following workflow, which worked with Arrow 5.0 on Windows 10 and R 4.1.2:
> {code:r}
> open_dataset(path) %>%
>   select(i, j) %>%
>   collect()
> {code}
> The dataset in {{path}} is partitioned by {{i}} and {{j}}, with 16 partitions in total and 5 million rows in each partition; each partition also has several other regular columns (i.e. columns present in every partition). The entire dataset can be read into memory on my 16GB machine, resulting in an R data.frame of around 3GB. However, on Arrow 6.0 the same operation fails and R runs out of memory. Interestingly, this still works:
> {code:r}
> open_dataset(path) %>%
>   select(i, j, x) %>%
>   collect()
> {code}
> where {{x}} is a regular column.
> I cannot reproduce the same issue on Linux. Measuring the actual memory consumption with GNU time ({{--format=%Mmax}}), I get very similar figures for the first pipeline on both 5.0 and 6.0. The same is true for the second pipeline, which of course consumes slightly more memory, as expected. On Windows I don’t know of a simple way to measure maximum memory consumption, but eyeballing it from Process Explorer, Arrow 5.0 needs around 0.5GB for the first example. With Arrow 6.0, my 16GB machine becomes unresponsive and starts swapping, and depending on the circumstances, other apps might crash before R crashes with this error:
> {noformat}
> terminate called after throwing an instance of 'std::bad_alloc'
>   what():  std::bad_alloc
> {noformat}
> With the second example, both versions consume roughly the same amount of memory.
> With the new features in Arrow 6.0, the following doesn’t work on Windows either; memory consumption shoots up into the tens of GBs:
> {code:r}
> open_dataset(path) %>%
>   distinct(i, j) %>%
>   collect()
> {code}
> Meanwhile this works, needing under 1GB of memory:
> {code:r}
> open_dataset(path) %>%
>   distinct(i, j, x) %>%
>   collect()
> {code}
> These last two examples work without any issue on Linux and, as expected, they consume significantly less memory than the select-then-collect examples.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)