You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/10/26 18:24:00 UTC

[GitHub] [arrow] bkietz opened a new pull request #8533: ARROW-10080: [R] Call gc() and try again in MemoryPool

bkietz opened a new pull request #8533:
URL: https://github.com/apache/arrow/pull/8533


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #8533: ARROW-10080: [R] Call gc() and try again in MemoryPool

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #8533:
URL: https://github.com/apache/arrow/pull/8533#discussion_r512242522



##########
File path: r/src/memorypool.cpp
##########
@@ -19,10 +19,45 @@
 #if defined(ARROW_R_WITH_ARROW)
 #include <arrow/memory_pool.h>
 
+class GcMemoryPool : public arrow::MemoryPool {
+ public:
+  GcMemoryPool()
+      : pool_(arrow::default_memory_pool()), gc_(cpp11::package("base")["gc"]) {}
+
+  arrow::Status Allocate(int64_t size, uint8_t** out) override {
+    return GcAndTryAgain([&] { return pool_->Allocate(size, out); });
+  }
+
+  arrow::Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) override {
+    return GcAndTryAgain([&] { return pool_->Reallocate(old_size, new_size, ptr); });
+  }
+
+  void Free(uint8_t* buffer, int64_t size) override { pool_->Free(buffer, size); }
+
+  int64_t bytes_allocated() const override { return pool_->bytes_allocated(); }
+
+  int64_t max_memory() const override { return pool_->max_memory(); }
+
+  std::string backend_name() const override { return pool_->backend_name() + "-gc"; }
+
+ private:
+  template <typename Call>
+  arrow::Status GcAndTryAgain(const Call& call) {
+    if (call().ok()) {

Review comment:
       wilco




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] terencehonles commented on pull request #8533: ARROW-10080: [R] Call gc() and try again in MemoryPool

Posted by GitBox <gi...@apache.org>.
terencehonles commented on pull request #8533:
URL: https://github.com/apache/arrow/pull/8533#issuecomment-722045397


   It looks like the offending line is https://github.com/apache/arrow/blob/master/cpp/src/jni/orc/jni_wrapper.cpp#L229 and I'm updating that to not pass null, but since passing a null pointer was previously supported I'm going to also remove the check in case it's used elsewhere.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] wesm commented on a change in pull request #8533: ARROW-10080: [R] Call gc() and try again in MemoryPool

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #8533:
URL: https://github.com/apache/arrow/pull/8533#discussion_r512231253



##########
File path: r/src/memorypool.cpp
##########
@@ -19,10 +19,45 @@
 #if defined(ARROW_R_WITH_ARROW)
 #include <arrow/memory_pool.h>
 
+class GcMemoryPool : public arrow::MemoryPool {
+ public:
+  GcMemoryPool()
+      : pool_(arrow::default_memory_pool()), gc_(cpp11::package("base")["gc"]) {}
+
+  arrow::Status Allocate(int64_t size, uint8_t** out) override {
+    return GcAndTryAgain([&] { return pool_->Allocate(size, out); });
+  }
+
+  arrow::Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) override {
+    return GcAndTryAgain([&] { return pool_->Reallocate(old_size, new_size, ptr); });
+  }
+
+  void Free(uint8_t* buffer, int64_t size) override { pool_->Free(buffer, size); }
+
+  int64_t bytes_allocated() const override { return pool_->bytes_allocated(); }
+
+  int64_t max_memory() const override { return pool_->max_memory(); }
+
+  std::string backend_name() const override { return pool_->backend_name() + "-gc"; }
+
+ private:
+  template <typename Call>
+  arrow::Status GcAndTryAgain(const Call& call) {
+    if (call().ok()) {

Review comment:
       Can you add a comment explaining the circumstances when this could fail / why (also put this in the PR description)?

##########
File path: r/src/memorypool.cpp
##########
@@ -19,10 +19,45 @@
 #if defined(ARROW_R_WITH_ARROW)
 #include <arrow/memory_pool.h>
 
+class GcMemoryPool : public arrow::MemoryPool {
+ public:
+  GcMemoryPool()
+      : pool_(arrow::default_memory_pool()), gc_(cpp11::package("base")["gc"]) {}
+
+  arrow::Status Allocate(int64_t size, uint8_t** out) override {
+    return GcAndTryAgain([&] { return pool_->Allocate(size, out); });
+  }
+
+  arrow::Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) override {
+    return GcAndTryAgain([&] { return pool_->Reallocate(old_size, new_size, ptr); });
+  }
+
+  void Free(uint8_t* buffer, int64_t size) override { pool_->Free(buffer, size); }
+
+  int64_t bytes_allocated() const override { return pool_->bytes_allocated(); }
+
+  int64_t max_memory() const override { return pool_->max_memory(); }
+
+  std::string backend_name() const override { return pool_->backend_name() + "-gc"; }
+
+ private:
+  template <typename Call>
+  arrow::Status GcAndTryAgain(const Call& call) {
+    if (call().ok()) {
+      return arrow::Status::OK();
+    }
+    // ARROW-10080
+    gc_();

Review comment:
       Are `cpp11::function` objects threadsafe?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] romainfrancois commented on pull request #8533: ARROW-10080: [R] Call gc() and try again in MemoryPool

Posted by GitBox <gi...@apache.org>.
romainfrancois commented on pull request #8533:
URL: https://github.com/apache/arrow/pull/8533#issuecomment-718021313


   I also had, in a branch that builds on top of #8256 ways to prematurely invalidate objects when we know they won't be used anymore. For example, in this function: 
   
   ```r
   collect.arrow_dplyr_query <- function(x, as_data_frame = TRUE, ...) {
     x <- ensure_group_vars(x)
     # Pull only the selected rows and cols into R
     if (query_on_dataset(x)) {
       # See dataset.R for Dataset and Scanner(Builder) classes
       tab <- Scanner$create(x)$ToTable()
     } else {
       # This is a Table/RecordBatch. See record-batch.R for the [ method
       tab <- x$.data[x$filtered_rows, x$selected_columns, keep_na = FALSE]
     }
     if (as_data_frame) {
       df <- as.data.frame(tab)
       tab$invalidate()
       restore_dplyr_features(df, x)
     } else {
       restore_dplyr_features(tab, x)
     }
   }
   ```
   
   inside the `if (as_data_frame)` as soon as `tab` is converted to a `data.frame` we will no longer need or use `tab`, so calling `$invalidate()` on it calls the destructor of the shared pointer held by the external pointer that lives in `tab`. 
   
   Is this still worth having ? And in that case should I push this to #8256 cc @nealrichardson 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] romainfrancois commented on a change in pull request #8533: ARROW-10080: [R] Call gc() and try again in MemoryPool

Posted by GitBox <gi...@apache.org>.
romainfrancois commented on a change in pull request #8533:
URL: https://github.com/apache/arrow/pull/8533#discussion_r513545431



##########
File path: r/src/memorypool.cpp
##########
@@ -18,11 +18,55 @@
 #include "./arrow_types.h"
 #if defined(ARROW_R_WITH_ARROW)
 #include <arrow/memory_pool.h>
+#include <arrow/util/mutex.h>
+
+class GcMemoryPool : public arrow::MemoryPool {
+ public:
+  GcMemoryPool() : pool_(arrow::default_memory_pool()) {}
+
+  arrow::Status Allocate(int64_t size, uint8_t** out) override {
+    return GcAndTryAgain([&] { return pool_->Allocate(size, out); });
+  }
+
+  arrow::Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) override {
+    return GcAndTryAgain([&] { return pool_->Reallocate(old_size, new_size, ptr); });
+  }
+
+  void Free(uint8_t* buffer, int64_t size) override { pool_->Free(buffer, size); }
+
+  int64_t bytes_allocated() const override { return pool_->bytes_allocated(); }
+
+  int64_t max_memory() const override { return pool_->max_memory(); }
+
+  std::string backend_name() const override { return pool_->backend_name() + "-gc"; }
+
+ private:
+  template <typename Call>
+  arrow::Status GcAndTryAgain(const Call& call) {
+    if (call().ok()) {
+      return arrow::Status::OK();
+    } else {
+      auto lock = mutex_.Lock();
+
+      // ARROW-10080: Allocation may fail spuriously since the garbage collector is lazy.
+      // Force it to run then try again in case any reusable allocations have been freed.
+      static cpp11::function gc = cpp11::package("base")["gc"];

Review comment:
       In the first call of each version of the template method `GcAndTryAgain` right ? Probably no big deal anyway. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on pull request #8533: ARROW-10080: [R] Call gc() and try again in MemoryPool

Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #8533:
URL: https://github.com/apache/arrow/pull/8533#issuecomment-717516187


   Not in scope for this PR, may result in further spurious OOMs later:
   - Feather APIs don't accept a MemoryPool
   - `LocalFileSystem::OpenInput{Stream,File}` use the default memory pool if `use_mmap` is not set. This affects dataset scanning transitively since we use those methods to open files for reading
   - Any file opened with `FileSystem$OpenInputStream` is also affected


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] romainfrancois edited a comment on pull request #8533: ARROW-10080: [R] Call gc() and try again in MemoryPool

Posted by GitBox <gi...@apache.org>.
romainfrancois edited a comment on pull request #8533:
URL: https://github.com/apache/arrow/pull/8533#issuecomment-718021313


   I also had, in a branch that builds on top of #8256 ways to prematurely invalidate objects when we know they won't be used anymore. For example, in this function: 
   
   ```r
   collect.arrow_dplyr_query <- function(x, as_data_frame = TRUE, ...) {
     x <- ensure_group_vars(x)
     # Pull only the selected rows and cols into R
     if (query_on_dataset(x)) {
       # See dataset.R for Dataset and Scanner(Builder) classes
       tab <- Scanner$create(x)$ToTable()
     } else {
       # This is a Table/RecordBatch. See record-batch.R for the [ method
       tab <- x$.data[x$filtered_rows, x$selected_columns, keep_na = FALSE]
     }
     if (as_data_frame) {
       df <- as.data.frame(tab)
       tab$invalidate() # HERE <<<<<<------------- 
       restore_dplyr_features(df, x)
     } else {
       restore_dplyr_features(tab, x)
     }
   }
   ```
   
   inside the `if (as_data_frame)` as soon as `tab` is converted to a `data.frame` we will no longer need or use `tab`, so calling `$invalidate()` on it calls the destructor of the shared pointer held by the external pointer that lives in `tab`, so that the memory is free right now instead of later when the gc is called
   
   Is this still worth having ? And in that case should I push this to #8256 cc @nealrichardson 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] terencehonles commented on pull request #8533: ARROW-10080: [R] Call gc() and try again in MemoryPool

Posted by GitBox <gi...@apache.org>.
terencehonles commented on pull request #8533:
URL: https://github.com/apache/arrow/pull/8533#issuecomment-721815501


   > CI failures are unrelated. Merging
   
   @bkietz I've bisected the commit history, and it does look like [this change](https://github.com/apache/arrow/pull/8533/files#diff-2b5392b203328a6b9e7ff692e51eb39691cc1d05329882bef0200f0d9bd51f84R57)
   
   ```diff
   diff --git a/cpp/src/arrow/io/memory.cc b/cpp/src/arrow/io/memory.cc
   index 1cde5e64e..88f006fe0 100644
   --- a/cpp/src/arrow/io/memory.cc
   +++ b/cpp/src/arrow/io/memory.cc
   @@ -54,7 +54,7 @@ BufferOutputStream::BufferOutputStream(const std::shared_ptr<ResizableBuffer>& b
    Result<std::shared_ptr<BufferOutputStream>> BufferOutputStream::Create(
        int64_t initial_capacity, MemoryPool* pool) {
      // ctor is private, so cannot use make_shared
   +  DCHECK_NE(pool, nullptr);
      auto ptr = std::shared_ptr<BufferOutputStream>(new BufferOutputStream);
      RETURN_NOT_OK(ptr->Reset(initial_capacity, pool));
      return ptr;
   
   ```
   
    is "responsible" for breaking the JNI status check (you can also see it was passing on the first commit in this PR https://github.com/apache/arrow/runs/1310578738 and then failing at the end https://github.com/apache/arrow/runs/1329149428).
   
   I am not familiar with the code, and I was hoping for you to possibly help me understand why the check was added and why it might be segfaulting the test java/adapter/orc/src/test/java/org/apache/arrow/adapter/orc/OrcReaderTest.java. I have experimented with removing the check and the test passes as expected, but if the check was added to catch something I'd rather not cover it up. I noticed there were two follow ups ARROW-10420 and ARROW-10421 and I was wondering if this might be related?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on pull request #8533: ARROW-10080: [R] Call gc() and try again in MemoryPool

Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #8533:
URL: https://github.com/apache/arrow/pull/8533#issuecomment-721841024


   @terencehonles Sorry for the breakage! I added this check when debugging injection of the customized memory pool. I didn't see that `PoolBuffer::MakeUnique()` coerces null `MemoryPool*` to `default_memory_pool()`, so I assumed leaving it in would be harmless. It can be removed, though it's still slightly suspicious to me that null would be passed for a memory pool given that the test logs seem to indicate a different memory manager (`allocation manager type not specified, using netty as the default type`) was selected.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz closed pull request #8533: ARROW-10080: [R] Call gc() and try again in MemoryPool

Posted by GitBox <gi...@apache.org>.
bkietz closed pull request #8533:
URL: https://github.com/apache/arrow/pull/8533


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] terencehonles edited a comment on pull request #8533: ARROW-10080: [R] Call gc() and try again in MemoryPool

Posted by GitBox <gi...@apache.org>.
terencehonles edited a comment on pull request #8533:
URL: https://github.com/apache/arrow/pull/8533#issuecomment-721815501


   > CI failures are unrelated. Merging
   
   @bkietz I've bisected the commit history, and it does look like [this change](https://github.com/apache/arrow/pull/8533/files#diff-2b5392b203328a6b9e7ff692e51eb39691cc1d05329882bef0200f0d9bd51f84R57)
   
   ```diff
   diff --git a/cpp/src/arrow/io/memory.cc b/cpp/src/arrow/io/memory.cc
   index 1cde5e64e..88f006fe0 100644
   --- a/cpp/src/arrow/io/memory.cc
   +++ b/cpp/src/arrow/io/memory.cc
   @@ -54,7 +54,7 @@ BufferOutputStream::BufferOutputStream(const std::shared_ptr<ResizableBuffer>& b
    Result<std::shared_ptr<BufferOutputStream>> BufferOutputStream::Create(
        int64_t initial_capacity, MemoryPool* pool) {
      // ctor is private, so cannot use make_shared
   +  DCHECK_NE(pool, nullptr);
      auto ptr = std::shared_ptr<BufferOutputStream>(new BufferOutputStream);
      RETURN_NOT_OK(ptr->Reset(initial_capacity, pool));
      return ptr;
   
   ```
   
    is "responsible" for breaking the JNI status check (you can also see it was passing on the first commit in this PR https://github.com/apache/arrow/runs/1310578738 and then failing at the end https://github.com/apache/arrow/runs/1329149428).
   
   I am not familiar with the code, and I was hoping for you to possibly help me understand why the check was added and why it might be segfaulting the test java/adapter/orc/src/test/java/org/apache/arrow/adapter/orc/OrcReaderTest.java. I have experimented with removing the check and the test passes as expected, but if the check was added to catch something I'd rather not cover it up. I noticed there were two follow ups ARROW-10420 and ARROW-10421 and I was wondering if this might be related?
   
   Edit: since the runs above were not merged and their branch was deleted it looks like GitHub dropped the logs. This is the specific test error I am referring to (grabbed from a run on the `master` branch) https://github.com/apache/arrow/runs/1351819845#step:8:7374


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #8533: ARROW-10080: [R] Call gc() and try again in MemoryPool

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #8533:
URL: https://github.com/apache/arrow/pull/8533#discussion_r512242429



##########
File path: r/src/memorypool.cpp
##########
@@ -19,10 +19,45 @@
 #if defined(ARROW_R_WITH_ARROW)
 #include <arrow/memory_pool.h>
 
+class GcMemoryPool : public arrow::MemoryPool {
+ public:
+  GcMemoryPool()
+      : pool_(arrow::default_memory_pool()), gc_(cpp11::package("base")["gc"]) {}
+
+  arrow::Status Allocate(int64_t size, uint8_t** out) override {
+    return GcAndTryAgain([&] { return pool_->Allocate(size, out); });
+  }
+
+  arrow::Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) override {
+    return GcAndTryAgain([&] { return pool_->Reallocate(old_size, new_size, ptr); });
+  }
+
+  void Free(uint8_t* buffer, int64_t size) override { pool_->Free(buffer, size); }
+
+  int64_t bytes_allocated() const override { return pool_->bytes_allocated(); }
+
+  int64_t max_memory() const override { return pool_->max_memory(); }
+
+  std::string backend_name() const override { return pool_->backend_name() + "-gc"; }
+
+ private:
+  template <typename Call>
+  arrow::Status GcAndTryAgain(const Call& call) {
+    if (call().ok()) {
+      return arrow::Status::OK();
+    }
+    // ARROW-10080
+    gc_();

Review comment:
       I don't think so. I'll guard with a mutex.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] romainfrancois commented on a change in pull request #8533: ARROW-10080: [R] Call gc() and try again in MemoryPool

Posted by GitBox <gi...@apache.org>.
romainfrancois commented on a change in pull request #8533:
URL: https://github.com/apache/arrow/pull/8533#discussion_r513265845



##########
File path: r/src/memorypool.cpp
##########
@@ -18,11 +18,55 @@
 #include "./arrow_types.h"
 #if defined(ARROW_R_WITH_ARROW)
 #include <arrow/memory_pool.h>
+#include <arrow/util/mutex.h>
+
+class GcMemoryPool : public arrow::MemoryPool {
+ public:
+  GcMemoryPool() : pool_(arrow::default_memory_pool()) {}
+
+  arrow::Status Allocate(int64_t size, uint8_t** out) override {
+    return GcAndTryAgain([&] { return pool_->Allocate(size, out); });
+  }
+
+  arrow::Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) override {
+    return GcAndTryAgain([&] { return pool_->Reallocate(old_size, new_size, ptr); });
+  }
+
+  void Free(uint8_t* buffer, int64_t size) override { pool_->Free(buffer, size); }
+
+  int64_t bytes_allocated() const override { return pool_->bytes_allocated(); }
+
+  int64_t max_memory() const override { return pool_->max_memory(); }
+
+  std::string backend_name() const override { return pool_->backend_name() + "-gc"; }
+
+ private:
+  template <typename Call>
+  arrow::Status GcAndTryAgain(const Call& call) {
+    if (call().ok()) {
+      return arrow::Status::OK();
+    } else {
+      auto lock = mutex_.Lock();
+
+      // ARROW-10080: Allocation may fail spuriously since the garbage collector is lazy.
+      // Force it to run then try again in case any reusable allocations have been freed.
+      static cpp11::function gc = cpp11::package("base")["gc"];

Review comment:
       Maybe make `gc()` a member of `GcMemoryPool` instead of repeating it for each `GcAndTryAgain` ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on pull request #8533: ARROW-10080: [R] Call gc() and try again in MemoryPool

Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #8533:
URL: https://github.com/apache/arrow/pull/8533#issuecomment-718981284


   Follow ups:
   https://issues.apache.org/jira/browse/ARROW-10420
   https://issues.apache.org/jira/browse/ARROW-10421


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] nealrichardson commented on a change in pull request #8533: ARROW-10080: [R] Call gc() and try again in MemoryPool

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on a change in pull request #8533:
URL: https://github.com/apache/arrow/pull/8533#discussion_r514535632



##########
File path: r/tests/testthat/test-arrow.R
##########
@@ -60,3 +60,17 @@ test_that("arrow gracefully fails to load objects from other sessions (ARROW-100
 test_that("check for an ArrowObject in functions use std::shared_ptr", {
   expect_error(Array__length(1), "Invalid R object")
 })
+
+test_that("MemoryPool calls gc() to free memory when allocation fails (ARROW-10080)", {
+  env <- new.env()
+  trace(gc, print = FALSE, tracer = function() {
+          env$gc_was_called <- TRUE
+        })
+  expect_error(
+    BufferOutputStream$create(2 ** 60),
+    "Out of memory"
+  )
+  expect_true(env$gc_was_called)
+
+  untrace(gc)

Review comment:
       ```suggestion
     on.exit(untrace(gc))
     # We expect this should fail because we don't have this much memory,
     # but it should gc() and retry (and fail again)
     expect_error(BufferOutputStream$create(2 ** 60))
     expect_true(env$gc_was_called)
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on pull request #8533: ARROW-10080: [R] Call gc() and try again in MemoryPool

Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #8533:
URL: https://github.com/apache/arrow/pull/8533#issuecomment-718032982


   I think it's worthwhile to invalidate individual objects as soon as we know we won't need them since it removes unnecessary burden from the garbage collector; it'll essentially be a performance hint (and therefore removal is also justifiable if you want to keep things simple-as-possible).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8533: ARROW-10080: [R] Call gc() and try again in MemoryPool

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8533:
URL: https://github.com/apache/arrow/pull/8533#issuecomment-716754516


   https://issues.apache.org/jira/browse/ARROW-10080


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] terencehonles commented on pull request #8533: ARROW-10080: [R] Call gc() and try again in MemoryPool

Posted by GitBox <gi...@apache.org>.
terencehonles commented on pull request #8533:
URL: https://github.com/apache/arrow/pull/8533#issuecomment-721835140


   Here's some more context which could be helpful, the first coming from the run logs and the latter being a file that would be generated after the crash.
   
   Test logs:
   
   ```
   [INFO] Running org.apache.arrow.adapter.orc.OrcReaderTest
   16:21:58.149 [main] INFO org.apache.arrow.memory.BaseAllocator - Debug mode enabled.
   16:21:58.172 [main] INFO org.apache.arrow.memory.DefaultAllocationManagerOption - allocation manager type not specified, using netty as the default type
   16:21:58.175 [main] INFO org.apache.arrow.memory.CheckAllocator - Using DefaultAllocationManager at memory-netty/3.0.0-SNAPSHOT/arrow-memory-netty-3.0.0-SNAPSHOT.jar!/org/apache/arrow/memory/DefaultAllocationManagerFactory.class
   ```
   
   Test crash logs:
   
   ```
   # Created at 2020-11-04T16:22:01.347
   WARNING: Logging before InitGoogleLogging() is written to STDERR
   
   # Created at 2020-11-04T16:22:01.347
   F1104 16:22:01.345963  7719 memory.cc:57]  Check failed: (pool) != (nullptr) 
   
   # Created at 2020-11-04T16:22:01.347
   *** Check failure stack trace: ***
   
   # Created at 2020-11-04T16:22:04.748
   Aborted (core dumped)
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] terencehonles commented on pull request #8533: ARROW-10080: [R] Call gc() and try again in MemoryPool

Posted by GitBox <gi...@apache.org>.
terencehonles commented on pull request #8533:
URL: https://github.com/apache/arrow/pull/8533#issuecomment-721857972


   > @terencehonles Sorry for the breakage! I added this check when debugging injection of the customized memory pool. I didn't see that `PoolBuffer::MakeUnique()` coerces null `MemoryPool*` to `default_memory_pool()`, so I assumed leaving it in would be harmless. It can be removed, though it's still slightly suspicious to me that null would be passed for a memory pool given that the test logs seem to indicate a different memory manager (`allocation manager type not specified, using netty as the default type`) was selected.
   
   No prob, I'll try to see if I can figure out what the Java code is doing, but thanks for the input!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on a change in pull request #8533: ARROW-10080: [R] Call gc() and try again in MemoryPool

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #8533:
URL: https://github.com/apache/arrow/pull/8533#discussion_r513490264



##########
File path: r/src/memorypool.cpp
##########
@@ -18,11 +18,55 @@
 #include "./arrow_types.h"
 #if defined(ARROW_R_WITH_ARROW)
 #include <arrow/memory_pool.h>
+#include <arrow/util/mutex.h>
+
+class GcMemoryPool : public arrow::MemoryPool {
+ public:
+  GcMemoryPool() : pool_(arrow::default_memory_pool()) {}
+
+  arrow::Status Allocate(int64_t size, uint8_t** out) override {
+    return GcAndTryAgain([&] { return pool_->Allocate(size, out); });
+  }
+
+  arrow::Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) override {
+    return GcAndTryAgain([&] { return pool_->Reallocate(old_size, new_size, ptr); });
+  }
+
+  void Free(uint8_t* buffer, int64_t size) override { pool_->Free(buffer, size); }
+
+  int64_t bytes_allocated() const override { return pool_->bytes_allocated(); }
+
+  int64_t max_memory() const override { return pool_->max_memory(); }
+
+  std::string backend_name() const override { return pool_->backend_name() + "-gc"; }
+
+ private:
+  template <typename Call>
+  arrow::Status GcAndTryAgain(const Call& call) {
+    if (call().ok()) {
+      return arrow::Status::OK();
+    } else {
+      auto lock = mutex_.Lock();
+
+      // ARROW-10080: Allocation may fail spuriously since the garbage collector is lazy.
+      // Force it to run then try again in case any reusable allocations have been freed.
+      static cpp11::function gc = cpp11::package("base")["gc"];

Review comment:
       The instance is static so it will only be initialized in the *first* call to `GcAndTryAgain()`.
   
   I had `gc` as a member of `GcMemoryPool` initially but it resulted in an instance of `gc` which didn't show up in `trace()`. I assume this is due to a subtlety with the timing of the constructor of `g_pool` but I didn't debug further.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on pull request #8533: ARROW-10080: [R] Call gc() and try again in MemoryPool

Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #8533:
URL: https://github.com/apache/arrow/pull/8533#issuecomment-717239034


   I've replace explicit calls to `arrow::default_memory_pool()` with `gc_memory_pool()` to pick up this change. However there are many Arrow APIs which don't require an explicit `MemoryPool` and will invoke `arrow::default_memory_pool()` if none is provided. The leak will persist if we fail to identify one of these and keep the default memory pool. I've fixed two implicit default pool usages: [`dataset___Dataset__NewScan`](https://github.com/apache/arrow/pull/8533/files#diff-5a95c2439a17adb8dec5e2f4760b6f632da493a24b2a85f2e00368fa982dcca9R34-R39) and [`compute__CallFunction`](https://github.com/apache/arrow/pull/8533/files#diff-a138b87f0da58d824c72293eab62d9f352718dcb7f41a47ea7e1c3c84bbe27ddR179-R186). I'll look for others but please call out any I miss


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] bkietz commented on pull request #8533: ARROW-10080: [R] Call gc() and try again in MemoryPool

Posted by GitBox <gi...@apache.org>.
bkietz commented on pull request #8533:
URL: https://github.com/apache/arrow/pull/8533#issuecomment-719108858


   CI failures are unrelated. Merging


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] romainfrancois commented on pull request #8533: ARROW-10080: [R] Call gc() and try again in MemoryPool

Posted by GitBox <gi...@apache.org>.
romainfrancois commented on pull request #8533:
URL: https://github.com/apache/arrow/pull/8533#issuecomment-718040728


   Agree. So I cherry picked that commit into the #8256 pull request so that we have `$invalidate()` for Table and RecordBatch. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org