Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/21 16:34:15 UTC

[GitHub] [arrow] ablack3 opened a new issue, #14474: How do I remove a fileSystemDataset object without calling garbage collection from R?

ablack3 opened a new issue, #14474:
URL: https://github.com/apache/arrow/issues/14474

   I would like to create a FileSystemDataset object in a temp folder, process it in batches, and then remove it. This works fine on Linux and Mac, but on Windows a file lock prevents removal of the temp folder. I think the lock is created by arrow and is not released until I manually call the garbage collector. I don't think I should (or am allowed to) call the garbage collector from inside a function that is part of a CRAN-hosted R package. So how do I release the file lock on Windows so that I can delete the FileSystemDataset?
   
   Reproducible example below.
   
   
   ``` r
   library(arrow)
   
   # create a FileSystemDataset object
   
   filename <- here::here("tmp")
   write_dataset(cars, filename, format = "feather")
   ds <- open_dataset(filename, format = "feather")
   ds
   #> FileSystemDataset with 1 Feather file
   #> speed: double
   #> dist: double
   #> 
   #> See $metadata for additional Schema metadata
   
   # process the file in batches
   scanner <- ScannerBuilder$create(ds)$BatchSize(batch_size = 4)$Finish()
   reader <- scanner$ToRecordBatchReader()
   
   batch_num <- 1
   while(!is.null(batch <- reader$read_next_batch())) {
     print(paste("Reading batch", batch_num, "with", nrow(batch), "rows"))
     batch_num <- batch_num + 1
   }
   #> [1] "Reading batch 1 with 4 rows"
   #> [1] "Reading batch 2 with 4 rows"
   #> [1] "Reading batch 3 with 4 rows"
   #> [1] "Reading batch 4 with 4 rows"
   #> [1] "Reading batch 5 with 4 rows"
   #> [1] "Reading batch 6 with 4 rows"
   #> [1] "Reading batch 7 with 4 rows"
   #> [1] "Reading batch 8 with 4 rows"
   #> [1] "Reading batch 9 with 4 rows"
   #> [1] "Reading batch 10 with 4 rows"
   #> [1] "Reading batch 11 with 4 rows"
   #> [1] "Reading batch 12 with 4 rows"
   #> [1] "Reading batch 13 with 2 rows"
   
   rm(reader)
   rm(scanner)
   rm(ds)
   
   # remove the file
   rc <- unlink(filename, recursive = TRUE)
   if(rc == 1) print("removal of file failed")
   #> [1] "removal of file failed"
   
   file.exists(filename)
   #> [1] TRUE
   
   # call gc()
   gc()
   #>           used (Mb) gc trigger  (Mb) max used (Mb)
   #> Ncells 1115105 59.6    2401181 128.3  1234217 66.0
   #> Vcells 1937727 14.8    8388608  64.0  3294370 25.2
   
   # remove the file
   rc <- unlink(filename, recursive = TRUE)
   if(rc == 1) print("removal of file failed")
   
   file.exists(filename)
   #> [1] FALSE
   
   ```
   
   <sup>Created on 2022-10-19 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>
   
   <details style="margin-bottom:10px;">
   <summary>
   Session info
   </summary>
   
   ``` r
   sessioninfo::session_info()
   #> - Session info ---------------------------------------------------------------
   #>  setting  value                       
   #>  version  R version 4.0.5 (2021-03-31)
   #>  os       Windows 10 x64              
   #>  system   x86_64, mingw32             
   #>  ui       RTerm                       
   #>  language (EN)                        
   #>  collate  English_United States.1252  
   #>  ctype    English_United States.1252  
   #>  tz       America/New_York            
   #>  date     2022-10-19                  
   #> 
   #> - Packages -------------------------------------------------------------------
   #>  package     * version date       lib source        
   #>  arrow       * 9.0.0.2 2022-10-02 [1] CRAN (R 4.0.5)
   #>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.5)
   #>  backports     1.4.0   2021-11-23 [1] CRAN (R 4.0.5)
   #>  bit           4.0.4   2020-08-04 [1] CRAN (R 4.0.5)
   #>  bit64         4.0.5   2020-08-30 [1] CRAN (R 4.0.5)
   #>  cli           3.0.1   2021-07-17 [1] CRAN (R 4.0.5)
   #>  crayon        1.5.1   2022-03-26 [1] CRAN (R 4.0.5)
   #>  DBI           1.1.2   2021-12-20 [1] CRAN (R 4.0.5)
   #>  digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.5)
   #>  dplyr         1.0.8   2022-02-08 [1] CRAN (R 4.0.5)
   #>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.0.5)
   #>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.5)
   #>  fansi         0.5.0   2021-05-25 [1] CRAN (R 4.0.5)
   #>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.0.5)
   #>  fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.5)
   #>  generics      0.1.2   2022-01-31 [1] CRAN (R 4.0.5)
   #>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.0.5)
   #>  here          1.0.1   2020-12-13 [1] CRAN (R 4.0.5)
   #>  highr         0.9     2021-04-16 [1] CRAN (R 4.0.5)
   #>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.0.5)
   #>  knitr         1.36    2021-09-29 [1] CRAN (R 4.0.5)
   #>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.0.5)
   #>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.0.5)
   #>  pillar        1.7.0   2022-02-01 [1] CRAN (R 4.0.5)
   #>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.5)
   #>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.0.5)
   #>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.0.5)
   #>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.0.5)
   #>  rlang         1.0.2   2022-03-04 [1] CRAN (R 4.0.5)
   #>  rmarkdown     2.10    2021-08-06 [1] CRAN (R 4.0.5)
   #>  rprojroot     2.0.2   2020-11-15 [1] CRAN (R 4.0.5)
   #>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.0.5)
   #>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.5)
   #>  stringi       1.7.5   2021-10-04 [1] CRAN (R 4.0.5)
   #>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.5)
   #>  styler        1.5.1   2021-07-13 [1] CRAN (R 4.0.5)
   #>  tibble        3.1.2   2021-05-16 [1] CRAN (R 4.0.5)
   #>  tidyselect    1.1.2   2022-02-21 [1] CRAN (R 4.0.5)
   #>  tzdb          0.2.0   2021-10-27 [1] CRAN (R 4.0.5)
   #>  utf8          1.2.1   2021-03-12 [1] CRAN (R 4.0.5)
   #>  vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.0.5)
   #>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.0.5)
   #>  xfun          0.25    2021-08-06 [1] CRAN (R 4.0.5)
   #>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.5)
   #> 
   #> [1] C:/Users/adam.DESKTOP-D3KQQA1/Documents/R/win-library/4.0
   #> [2] C:/Program Files/R/R-4.0.5/library
   ```
   
   </details>
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on issue #14474: How do I remove a fileSystemDataset object without calling garbage collection from R?

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #14474:
URL: https://github.com/apache/arrow/issues/14474#issuecomment-1339762416

   Also, that is "should have a close method" in the sense of "we should do this" and not "it should already exist" (RecordBatchReader does have Close, but I don't think the record batch reader that R uses today has an implemented close method).




[GitHub] [arrow] paleolimbot closed issue #14474: How do I remove a fileSystemDataset object without calling garbage collection from R?

Posted by GitBox <gi...@apache.org>.
paleolimbot closed issue #14474: How do I remove a fileSystemDataset object without calling garbage collection from R?
URL: https://github.com/apache/arrow/issues/14474




[GitHub] [arrow] westonpace commented on issue #14474: How do I remove a fileSystemDataset object without calling garbage collection from R?

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #14474:
URL: https://github.com/apache/arrow/issues/14474#issuecomment-1340037076

   > The R record batch reader does implement Close(), but adding it to the repro doesn't seem to fix the issue.
   
   Hmm...I may take a look.  This may be something the newer scanner fixes.




[GitHub] [arrow] thisisnic commented on issue #14474: How do I remove a fileSystemDataset object without calling garbage collection from R?

Posted by GitBox <gi...@apache.org>.
thisisnic commented on issue #14474:
URL: https://github.com/apache/arrow/issues/14474#issuecomment-1338279910

   I've asked around, but I'm not sure there is a workaround; calling `gc()` might be the best solution for the moment.
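
For package code that has to clean up on Windows in the meantime, one conservative reading of this suggestion is to call `gc()` only as a fallback, after a first `unlink()` has failed. The sketch below uses base R only. `remove_path_with_gc_fallback()` is a hypothetical helper name, not an arrow function, and the idea that a single `gc()` is enough to release the lock is an assumption based on the reprex in the original report.

```r
# Hypothetical helper (not part of arrow): delete a file or directory, and
# only call gc() if the first attempt leaves the path in place.
remove_path_with_gc_fallback <- function(path) {
  unlink(path, recursive = TRUE)
  if (file.exists(path)) {
    # Assumption: running pending finalizers releases the handles that block
    # deletion, as observed in the reprex from the original report.
    gc()
    unlink(path, recursive = TRUE)
  }
  # TRUE when the path is gone, FALSE when it still could not be removed.
  !file.exists(path)
}
```

In the reprex this would stand in for the two separate `unlink()` calls, for example `remove_path_with_gc_fallback(filename)` after the `rm()` calls. On platforms where the first `unlink()` succeeds, `gc()` is never called.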




[GitHub] [arrow] wjones127 commented on issue #14474: How do I remove a fileSystemDataset object without calling garbage collection from R?

Posted by GitBox <gi...@apache.org>.
wjones127 commented on issue #14474:
URL: https://github.com/apache/arrow/issues/14474#issuecomment-1339788192

   > (RecordBatchReader does have Close but I don't think the record batch reader that R uses today has an implemented close method)
   
   The R record batch reader does implement `Close()`, but adding it to the repro doesn't seem to fix the issue.




[GitHub] [arrow] paleolimbot commented on issue #14474: How do I remove a fileSystemDataset object without calling garbage collection from R?

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on issue #14474:
URL: https://github.com/apache/arrow/issues/14474#issuecomment-1380867051

   I'm fairly sure that the linked PR works (or at least helps)...I created a binary package from crossbow that you can install to try it without building Arrow from source...I'd be grateful for testing! See https://github.com/apache/arrow/pull/15278#issuecomment-1380863503 for instructions on how to install the binary fix (and PR comments for the various things I tried to verify that the fix worked).




[GitHub] [arrow] paleolimbot commented on issue #14474: How do I remove a fileSystemDataset object without calling garbage collection from R?

Posted by GitBox <gi...@apache.org>.
paleolimbot commented on issue #14474:
URL: https://github.com/apache/arrow/issues/14474#issuecomment-1376156346

   I'm going to try to rig a solution for this for the upcoming release, since we have a lot of open issues about this one ([ARROW-18313](https://issues.apache.org/jira/browse/ARROW-18313), [ARROW-17208](https://issues.apache.org/jira/browse/ARROW-17208), [ARROW-17002](https://issues.apache.org/jira/browse/ARROW-17002), [ARROW-16421](https://issues.apache.org/jira/browse/ARROW-16421), [ARROW-16452](https://issues.apache.org/jira/browse/ARROW-16452)).
   
   We can discuss on the PR, but basically, we create many temporary R6 objects in the process of creating an ExecPlan. Those R objects keep shared pointers alive until the garbage collector runs. There are some cases where we can clean up some of those references by resetting the shared pointer when the function exits (which is predictable) rather than when the garbage collector runs (which is not). In the case of a `dplyr::collect()` we don't surface *any* R6 objects to the user, so there shouldn't be any need for any lingering shared_ptr references to exist (at least because of R).
   
   I'd propose that we add a `$unsafe_delete()` method to `ArrowObject` - or at least to a few types of objects - and see to what extent cleaning up those temporary references can avoid open files by the time `collect()` returns.
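
The difference between releasing a resource when a function exits and releasing it when the garbage collector runs can be illustrated in base R, without arrow. This is only a sketch of the general mechanism: `new_handle()`, `release()`, `use_and_forget()` and `use_and_release()` are hypothetical names, the `state` environment stands in for an open file, and none of it reflects how arrow's R6 wrappers or the proposed `$unsafe_delete()` are actually implemented.

```r
# Hypothetical stand-in for a wrapper that owns an external resource.
# `state$open` plays the role of an open file handle.
release <- function(handle) {
  handle$state$open <- FALSE
  invisible(handle)
}

new_handle <- function(state) {
  handle <- new.env()
  handle$state <- state
  state$open <- TRUE
  # The finalizer runs only when the garbage collector reclaims `handle`,
  # which happens at an unpredictable time after it becomes unreachable.
  reg.finalizer(handle, release)
  handle
}

# Cleanup is left to the finalizer: the resource stays open until some later gc.
use_and_forget <- function(state) {
  h <- new_handle(state)
  invisible(NULL)
}

# Cleanup happens at function exit, independent of the garbage collector.
use_and_release <- function(state) {
  h <- new_handle(state)
  on.exit(release(h), add = TRUE)
  invisible(NULL)
}
```

With `use_and_release()`, `state$open` is `FALSE` as soon as the call returns. With `use_and_forget()`, it only becomes `FALSE` once a garbage collection has run the finalizer, which is the situation the reprex works around by calling `gc()`.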




[GitHub] [arrow] thisisnic commented on issue #14474: How do I remove a fileSystemDataset object without calling garbage collection from R?

Posted by GitBox <gi...@apache.org>.
thisisnic commented on issue #14474:
URL: https://github.com/apache/arrow/issues/14474#issuecomment-1293545377

   This looks like a similar problem to https://issues.apache.org/jira/browse/ARROW-16421. @wjones127 - that JIRA ticket is currently assigned to you - any ideas on possible workarounds?




[GitHub] [arrow] westonpace commented on issue #14474: How do I remove a fileSystemDataset object without calling garbage collection from R?

Posted by GitBox <gi...@apache.org>.
westonpace commented on issue #14474:
URL: https://github.com/apache/arrow/issues/14474#issuecomment-1339759439

   It might be good to revisit this after 11.0.0.  The reader created from a scanner should have a close method that can be called.  It should abort the plan and wait for the remaining tasks to finish up.  I don't know that this fully removes all objects from memory (R is still doing something weird here) but it would ensure the file is closed and this test case should pass.




[GitHub] [arrow] ablack3 commented on issue #14474: How do I remove a fileSystemDataset object without calling garbage collection from R?

Posted by GitBox <gi...@apache.org>.
ablack3 commented on issue #14474:
URL: https://github.com/apache/arrow/issues/14474#issuecomment-1337819477

   @thisisnic, @wjones127 - Any progress on a workaround for this? 

