Posted to github@arrow.apache.org by "kevinpemonon (via GitHub)" <gi...@apache.org> on 2023/06/20 13:26:27 UTC

[GitHub] [arrow] kevinpemonon commented on issue #36161: R - Problem retrieving memory used after gc() using arrow library

kevinpemonon commented on issue #36161:
URL: https://github.com/apache/arrow/issues/36161#issuecomment-1598790170

   Hello paleolimbot,
   
   Thank you for your reply.
   
   Looking at the results I get after adding the functions you suggested, I don't quite understand what exactly default_memory_pool()$bytes_allocated and default_memory_pool()$max_memory measure.
   
   Below is the code I ran, together with its output:
   
   ```
   > gc(verbose = TRUE)
   Garbage collection 2 = 0+0+2 (level 2) ... 
   14.2 Mbytes of cons cells used (41%)
   3.9 Mbytes of vectors used (6%)
            used (Mb) gc trigger (Mb) max used (Mb)
   Ncells 264908 14.2     648748 34.7   401965 21.5
   Vcells 500529  3.9    8388608 64.0  1671274 12.8
   > 
   > # basic memory
   > memory.size(max=F)
   [1] 28.78
   > 
   > library(arrow, warn.conflicts = FALSE)
   > 
   > # Memory after loading the arrow library with memory.size
   > memory.size(max=F)
   [1] 51.01
   > 
   > # bytes_allocated after loading the arrow library
   > default_memory_pool()$bytes_allocated
   [1] 0
   > 
   > # max_memory after loading the arrow library
   > default_memory_pool()$max_memory
   [1] 0
   > 
   > library(dplyr)
   
   Attaching package: ‘dplyr’
   
   The following objects are masked from ‘package:stats’:
   
       filter, lag
   
   The following objects are masked from ‘package:base’:
   
       intersect, setdiff, setequal, union
   
   > 
   > # Memory after loading the dplyr library with memory.size
   > memory.size(max=F)
   [1] 90.74
   > 
   > # bytes_allocated after loading the dplyr library
   > default_memory_pool()$bytes_allocated
   [1] 0
   > 
   > # max_memory after loading the dplyr library
   > default_memory_pool()$max_memory
   [1] 0
   > 
   > df <- data.frame(
   +   col1 = rnorm(1000000),
   +   col2 = rnorm(1000000),
   +   col3 = runif(1000000),
   +   col4 = sample(1:999, size = 1000000, replace = T),
   +   col5 = sample(c("GroupA", "GroupB"), size = 1000000, replace = T),
   +   col6 = sample(c("TypeA", "TypeB"), size = 1000000, replace = T)
   + )
   > 
   > # Memory after df object creation
   > memory.size(max=F)
   [1] 133.23
   > 
   > # bytes_allocated after df object creation
   > default_memory_pool()$bytes_allocated
   [1] 0
   > 
   > # max_memory after df object creation
   > default_memory_pool()$max_memory
   [1] 0
   > 
   > arrow::write_dataset(
   +   df,
   +   paste0(Sys.getenv("USERPROFILE"),"/ExProblemeGc"),
   +   format = "parquet"
   + )
   > 
   > # Memory after writing to disk
   > memory.size(max=F)
   [1] 120.07
   > 
   > # bytes_allocated after writing to disk
   > default_memory_pool()$bytes_allocated
   [1] 19000128
   > 
   > # max_memory after writing to disk
   > default_memory_pool()$max_memory
   [1] 27126592
   > 
   > rm(df)
   > 
   > # Memory after deletion df
   > memory.size(max=F)
   [1] 120.07
   > 
   > # bytes_allocated after deletion df
   > default_memory_pool()$bytes_allocated
   [1] 19000128
   > 
   > # max_memory after deletion df
   > default_memory_pool()$max_memory
   [1] 27126592
   > 
   > gc(verbose = TRUE)
   Garbage collection 15 = 9+2+4 (level 2) ... 
   45.0 Mbytes of cons cells used (61%)
   38.0 Mbytes of vectors used (49%)
             used (Mb) gc trigger (Mb) max used (Mb)
   Ncells  842008   45    1387691 74.2  1387691 74.2
   Vcells 4975717   38   10146329 77.5  8388601 64.0
   > 
   > # Memory after gc(verbose = TRUE)
   > memory.size(max=F)
   [1] 101.29
   > 
   > # bytes_allocated after gc(verbose = TRUE)
   > default_memory_pool()$bytes_allocated
   [1] 0
   > 
   > # max_memory after gc(verbose = TRUE)
   > default_memory_pool()$max_memory
   [1] 27126592
   > 
   > gc(verbose = TRUE)
   Garbage collection 16 = 9+2+5 (level 2) ... 
   45.0 Mbytes of cons cells used (61%)
   11.3 Mbytes of vectors used (15%)
             used (Mb) gc trigger (Mb) max used (Mb)
   Ncells  841895 45.0    1387691 74.2  1387691 74.2
   Vcells 1475542 11.3   10146329 77.5  8388601 64.0
   > 
   > # Memory after gc(verbose = TRUE)
   > memory.size(max=F)
   [1] 74.35
   > 
   > # bytes_allocated after gc(verbose = TRUE)
   > default_memory_pool()$bytes_allocated
   [1] 0
   > 
   > # max_memory after gc(verbose = TRUE)
   > default_memory_pool()$max_memory
   [1] 27126592
   > 
   > ds <- arrow::open_dataset(paste0(Sys.getenv("USERPROFILE"),"/ExProblemeGc"))
   > 
   > # Memory after ds creation
   > memory.size(max=F)
   [1] 79.01
   > 
   > # bytes_allocated after ds creation
   > default_memory_pool()$bytes_allocated
   [1] 0
   > 
   > # max_memory after ds creation
   > default_memory_pool()$max_memory
   [1] 27126592
   > 
   > req <-
   +   ds %>%
   +   collect()
   > 
   > # Memory after req creation
   > memory.size(max=F)
   [1] 84.46
   > 
   > # bytes_allocated after req creation
   > default_memory_pool()$bytes_allocated
   [1] 47504192
   > 
   > # max_memory after req creation
   > default_memory_pool()$max_memory
   [1] 83176320
   > 
   > rm(req)
   > 
   > # Memory after deletion req
   > memory.size(max=F)
   [1] 84.47
   > 
   > # bytes_allocated after deletion req
   > default_memory_pool()$bytes_allocated
   [1] 47504192
   > 
   > # max_memory after deletion req
   > default_memory_pool()$max_memory
   [1] 83176320
   > 
   > gc(verbose = TRUE)
   Garbage collection 17 = 9+2+6 (level 2) ... 
   49.6 Mbytes of cons cells used (52%)
   12.5 Mbytes of vectors used (16%)
             used (Mb) gc trigger (Mb) max used (Mb)
   Ncells  927153 49.6    1792975 95.8  1387691 74.2
   Vcells 1627339 12.5   10146329 77.5  8388601 64.0
   > 
   > # Memory after gc(verbose = TRUE)
   > memory.size(max=F)
   [1] 75.8
   > 
   > # bytes_allocated after gc(verbose = TRUE)
   > default_memory_pool()$bytes_allocated
   [1] 0
   > 
   > # max_memory after gc(verbose = TRUE)
   > default_memory_pool()$max_memory
   [1] 83176320
   > 
   > gc(verbose = TRUE)
   Garbage collection 18 = 9+2+7 (level 2) ... 
   49.6 Mbytes of cons cells used (52%)
   12.5 Mbytes of vectors used (16%)
             used (Mb) gc trigger (Mb) max used (Mb)
   Ncells  927081 49.6    1792975 95.8  1387691 74.2
   Vcells 1627219 12.5   10146329 77.5  8388601 64.0
   > 
   > # bytes_allocated after gc(verbose = TRUE)
   > default_memory_pool()$bytes_allocated
   [1] 0
   > 
   > # max_memory after gc(verbose = TRUE)
   > default_memory_pool()$max_memory
   [1] 83176320
   > 
   > # Memory after gc(verbose = TRUE)
   > memory.size(max=F)
   [1] 75.8
   > 
   > rm(ds)
   > 
   > # Memory after deleting ds
   > memory.size(max=F)
   [1] 75.8
   > 
   > # bytes_allocated after deleting ds
   > default_memory_pool()$bytes_allocated
   [1] 0
   > 
   > # max_memory after deleting ds
   > default_memory_pool()$max_memory
   [1] 83176320
   > 
   > gc(verbose = TRUE)
   Garbage collection 19 = 9+2+8 (level 2) ... 
   49.6 Mbytes of cons cells used (52%)
   12.5 Mbytes of vectors used (16%)
             used (Mb) gc trigger (Mb) max used (Mb)
   Ncells  926997 49.6    1792975 95.8  1387691 74.2
   Vcells 1627193 12.5   10146329 77.5  8388601 64.0
   > 
   > # Memory after gc(verbose = TRUE)
   > memory.size(max=F)
   [1] 75.8
   > 
   > # bytes_allocated after gc(verbose = TRUE)
   > default_memory_pool()$bytes_allocated
   [1] 0
   > 
   > # max_memory after gc(verbose = TRUE)
   > default_memory_pool()$max_memory
   [1] 83176320
   > 
   > gc(verbose = TRUE)
   Garbage collection 20 = 9+2+9 (level 2) ... 
   49.6 Mbytes of cons cells used (52%)
   12.5 Mbytes of vectors used (16%)
             used (Mb) gc trigger (Mb) max used (Mb)
   Ncells  926988 49.6    1792975 95.8  1387691 74.2
   Vcells 1627178 12.5   10146329 77.5  8388601 64.0
   > 
   > # Memory after gc(verbose = TRUE)
   > memory.size(max=F)
   [1] 75.8
   > 
   > # bytes_allocated after gc(verbose = TRUE)
   > default_memory_pool()$bytes_allocated
   [1] 0
   > 
   > # max_memory after gc(verbose = TRUE)
   > default_memory_pool()$max_memory
   [1] 83176320
   ```
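   
   The repeated checks in the transcript above could be wrapped in a small helper. This is only a minimal sketch: `pool_snapshot` is a hypothetical name, not an arrow function, and it assumes the pool object exposes `bytes_allocated` and `max_memory` as used above. It deliberately does not call gc(), so that taking a snapshot does not change what is being measured.
   
   ```r
   # Hypothetical helper (not part of arrow): record the Arrow memory pool
   # counters under a label. The default pool is only looked up when no
   # pool argument is supplied.
   pool_snapshot <- function(label, pool = arrow::default_memory_pool()) {
     data.frame(
       step = label,
       bytes_allocated = pool$bytes_allocated,
       max_memory = pool$max_memory
     )
   }
   
   # Usage, run only if the arrow package is installed:
   if (requireNamespace("arrow", quietly = TRUE)) {
     snapshots <- pool_snapshot("after loading arrow")
     print(snapshots)
   }
   ```
   
   Rows from successive calls can be combined with rbind() to get one table per run instead of many separate console lines.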
   
   1 - After loading all the necessary libraries: 
   - memory.size() = 90.74
   - default_memory_pool()$bytes_allocated = 0
   - default_memory_pool()$max_memory = 0
   
   2 - After using data.frame to create the object df : 
   - memory.size() = 133.23
   - default_memory_pool()$bytes_allocated = 0
   - default_memory_pool()$max_memory = 0
   
   **No arrow function is used here, which I assume is why the values of $bytes_allocated and $max_memory are unchanged?**
   
   3 - After using arrow::write_dataset : 
   - memory.size() = 120.07
   - default_memory_pool()$bytes_allocated = 19000128
   - default_memory_pool()$max_memory = 27126592
   
   **Using an arrow function changes the values of $bytes_allocated and $max_memory.**
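   
   To compare these byte counts with memory.size(), they can be converted to the same unit. The sketch below assumes that memory.size() reports Mb as 1024^2 bytes; `bytes_to_mb` is a hypothetical helper name used only for illustration.
   
   ```r
   # Convert a byte count to Mb, assuming 1 Mb = 1024^2 bytes.
   bytes_to_mb <- function(bytes) round(bytes / 1024^2, 2)
   
   bytes_to_mb(19000128)  # bytes_allocated after write_dataset: 18.12
   bytes_to_mb(27126592)  # max_memory after write_dataset: 25.87
   ```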
   
   4 - After deleting the df object and gc() : 
   - memory.size() = 74.35
   - default_memory_pool()$bytes_allocated = 0
   - default_memory_pool()$max_memory = 27126592
   
   **I don't understand why default_memory_pool()$bytes_allocated = 0 after deleting df and running gc(). It was 0 when df was created and 19000128 after arrow::write_dataset, so shouldn't it still be 19000128?**
   
   5 - After using arrow::open_dataset when creating the ds object : 
   - memory.size() = 79.01
   - default_memory_pool()$bytes_allocated = 0
   - default_memory_pool()$max_memory = 27126592
   
   **This time, calling an arrow function to create ds does not change the values of $bytes_allocated and $max_memory. Why not?**
   
   6 - After piping ds into collect() to create the req object:
   - memory.size() = 84.46
   - default_memory_pool()$bytes_allocated = 47504192
   - default_memory_pool()$max_memory = 83176320
   
   **Using an arrow function again changes the values of $bytes_allocated and $max_memory. Why?**
   
   7 - After deleting the req object and gc() : 
   - memory.size() = 75.8
   - default_memory_pool()$bytes_allocated = 0
   - default_memory_pool()$max_memory = 83176320
   
   **Deleting the req object and running gc() resets $bytes_allocated to 0, while $max_memory stays the same.**
   
   8 - After deleting the ds object and gc() : 
   - memory.size() = 75.8
   - default_memory_pool()$bytes_allocated = 0
   - default_memory_pool()$max_memory = 83176320
   
   I don't quite understand how $bytes_allocated and $max_memory work. Could you please explain?
   
   

