You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "pitrou (via GitHub)" <gi...@apache.org> on 2024/03/25 14:41:20 UTC

[PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]

pitrou opened a new pull request, #40774:
URL: https://github.com/apache/arrow/pull/40774

   <!--
   Thanks for opening a pull request!
   If this is your first pull request you can find detailed information on how 
   to contribute here:
     * [New Contributor's Guide](https://arrow.apache.org/docs/dev/developers/guide/step_by_step/pr_lifecycle.html#reviews-and-merge-of-the-pull-request)
     * [Contributing Overview](https://arrow.apache.org/docs/dev/developers/overview.html)
   
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
   
   Opening GitHub issues ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title in the following format?
   
       GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   In the case of PARQUET issues on JIRA the title also supports:
   
       PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   -->
   
   ### Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   ### What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   ### Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   ### Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please uncomment the line below and explain which changes are breaking.
   -->
   <!-- **This PR includes breaking changes to public APIs.** -->
   
   <!--
   Please uncomment the line below (and provide explanation) if the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld). We use this to highlight fixes to issues that may affect users without their knowledge. For this reason, fixing bugs that cause errors don't count, since those are usually obvious.
   -->
   <!-- **This PR contains a "Critical Fix".** -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #40774:
URL: https://github.com/apache/arrow/pull/40774#discussion_r1538489549


##########
cpp/src/arrow/io/file_test.cc:
##########
@@ -461,6 +461,11 @@ class MyMemoryPool : public MemoryPool {
     return Status::OK();
   }
 
+  Status ReallocateNoCopy(int64_t old_size, int64_t new_size, int64_t alignment,

Review Comment:
   Name suggestion: `TryResize`.
   
   Workloads taking advantage of this API are going to "try a resize" with the exact size needed for the operation being performed and then they are going to call `Reallocate` with a more generous size to amortize the cost of copies. Naming it `ReallocateNoCopy` creates the need to explain to users/developers why isn't `ReallocateNoCopy` the implementation of `Reallocate` in the first place. The "Try" prefix will also indicate that caller should be reader to handle failure gracefully. A bad Status is not an exceptional case, it's a very likely result.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #40774:
URL: https://github.com/apache/arrow/pull/40774#issuecomment-2018287913

   Ok, I removed the benchmark numbers because they were based on an incorrect logic for jemalloc.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #40774:
URL: https://github.com/apache/arrow/pull/40774#discussion_r1538490823


##########
cpp/src/arrow/io/file_test.cc:
##########
@@ -461,6 +461,11 @@ class MyMemoryPool : public MemoryPool {
     return Status::OK();
   }
 
+  Status ReallocateNoCopy(int64_t old_size, int64_t new_size, int64_t alignment,
+                          uint8_t** ptr) override {
+    return Reallocate(old_size, new_size, alignment, ptr);

Review Comment:
   My comment above also implies that the implementation should be `return Status::NotImplemented()` so that caller immediately fallback to the `Reallocate()` with a size that is not necessarily the size that was passed to `TryResize`. The size passed to `TryResize` is conservative whereas the size passed to `Reallocate` is often more generous.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #40774:
URL: https://github.com/apache/arrow/pull/40774#issuecomment-2018197907

   > Right. But now I'm curious for the motivating use-cases.
   
   It's explained in the issue description.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #40774:
URL: https://github.com/apache/arrow/pull/40774#issuecomment-2018298226

   Benchmark results are encouraging except for jemalloc which needs more massaging to fully exploit (see comments in the source):
   ```
   -------------------------------------------------------------------------------------------------
   Benchmark                                       Time             CPU   Iterations UserCounters...
   -------------------------------------------------------------------------------------------------
   ReallocateGrowing<SystemAlloc>             267140 ns       267082 ns         2620 items_per_second=239.627k/s percent_in_place=0
   ReallocateGrowingNoCopy<SystemAlloc>         6271 ns         6270 ns       111590 items_per_second=10.2078M/s percent_in_place=75
   ReallocateGrowing<Jemalloc>                  9133 ns         9132 ns        76586 items_per_second=7.00834M/s percent_in_place=93.75
   ReallocateGrowingNoCopy<Jemalloc>           23503 ns        23500 ns        29719 items_per_second=2.72346M/s percent_in_place=26.0635
   ReallocateGrowing<Mimalloc>                 72325 ns        72311 ns         9674 items_per_second=885.071k/s percent_in_place=54.6875
   ReallocateGrowingNoCopy<Mimalloc>           10122 ns        10120 ns        69251 items_per_second=6.32395M/s percent_in_place=78.125
   
   ReallocateShrinking<SystemAlloc>           206661 ns       206608 ns         5258 items_per_second=309.765k/s percent_in_place=0
   ReallocateShrinkingNoCopy<SystemAlloc>       6571 ns         6570 ns       106790 items_per_second=9.74056M/s percent_in_place=73.4375
   ReallocateShrinking<Jemalloc>                9396 ns         9394 ns        74619 items_per_second=6.81305M/s percent_in_place=93.75
   ReallocateShrinkingNoCopy<Jemalloc>         23662 ns        23658 ns        29572 items_per_second=2.70523M/s percent_in_place=26.0422
   ReallocateShrinking<Mimalloc>                5882 ns         5881 ns       119515 items_per_second=10.8828M/s percent_in_place=89.0625
   ReallocateShrinkingNoCopy<Mimalloc>          2064 ns         2064 ns       339045 items_per_second=31.0091M/s percent_in_place=98.4375
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #40774:
URL: https://github.com/apache/arrow/pull/40774#discussion_r1537991608


##########
cpp/src/arrow/memory_pool.cc:
##########
@@ -498,6 +532,46 @@ class BaseMemoryPoolImpl : public MemoryPool {
     return Status::OK();
   }
 
+  Status ReallocateNoCopy(int64_t old_size, int64_t new_size, int64_t alignment,
+                          uint8_t** ptr) override {
+    if (new_size == old_size) {
+      return Status::OK();
+    }
+    if (old_size == 0 || new_size == 0) {
+      return Reallocate(old_size, new_size, alignment, ptr);
+    }
+    if (new_size < 0) {
+      return Status::Invalid("negative realloc size");
+    }
+    if (static_cast<uint64_t>(new_size) >= std::numeric_limits<size_t>::max()) {
+      return Status::OutOfMemory("realloc overflows size_t");
+    }
+    // First try resizing in place
+    if (!Allocator::ResizeInPlace(old_size, new_size, *ptr)) {

Review Comment:
   So this would do allocate best-effort?



##########
cpp/src/arrow/memory_pool.cc:
##########
@@ -498,6 +532,46 @@ class BaseMemoryPoolImpl : public MemoryPool {
     return Status::OK();
   }
 
+  Status ReallocateNoCopy(int64_t old_size, int64_t new_size, int64_t alignment,
+                          uint8_t** ptr) override {
+    if (new_size == old_size) {
+      return Status::OK();
+    }
+    if (old_size == 0 || new_size == 0) {
+      return Reallocate(old_size, new_size, alignment, ptr);
+    }
+    if (new_size < 0) {
+      return Status::Invalid("negative realloc size");
+    }
+    if (static_cast<uint64_t>(new_size) >= std::numeric_limits<size_t>::max()) {
+      return Status::OutOfMemory("realloc overflows size_t");
+    }
+    // First try resizing in place
+    if (!Allocator::ResizeInPlace(old_size, new_size, *ptr)) {

Review Comment:
   So this would do re-allocate local best-effort?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #40774:
URL: https://github.com/apache/arrow/pull/40774#issuecomment-2018161966

   @felipecrv Just so that you understand better what I have in mind.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #40774:
URL: https://github.com/apache/arrow/pull/40774#issuecomment-2019917084

   For the record, this is very much work-in-progress and the real usefulness of this will be judged once it is plugged in existing functionality (for example some Parquet reader or decoder facility).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #40774:
URL: https://github.com/apache/arrow/pull/40774#discussion_r1538495332


##########
cpp/src/arrow/io/file_test.cc:
##########
@@ -461,6 +461,11 @@ class MyMemoryPool : public MemoryPool {
     return Status::OK();
   }
 
+  Status ReallocateNoCopy(int64_t old_size, int64_t new_size, int64_t alignment,

Review Comment:
   You could also change the return type to `[[nodiscard]] bool`. I don't think we really need the richness of `Status` to express what happened. `false` for exceptional/validation error and also impossibility of the resize-in-place will cause the caller to call a different allocation function that can then be more detailed in the `Status` it returns.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #40774:
URL: https://github.com/apache/arrow/pull/40774#issuecomment-2018234002

   For reference, here are the micro-benchmark results on my machine. But bear in mind that evaluating the actual usefulness of this first requires plugging it into existing code (such as a Parquet decoder):
   ```
   -------------------------------------------------------------------------------------------------
   Benchmark                                       Time             CPU   Iterations UserCounters...
   -------------------------------------------------------------------------------------------------
   ReallocateGrowing<SystemAlloc>             259625 ns       259558 ns         2697 items_per_second=246.573k/s percent_in_place=0
   ReallocateGrowingNoCopy<SystemAlloc>         6024 ns         6023 ns       119303 items_per_second=10.6263M/s percent_in_place=75
   ReallocateGrowing<Jemalloc>                  9389 ns         9387 ns        74572 items_per_second=6.81772M/s percent_in_place=93.75
   ReallocateGrowingNoCopy<Jemalloc>            5572 ns         5571 ns       125529 items_per_second=11.4872M/s percent_in_place=93.75
   ReallocateGrowing<Mimalloc>                 71605 ns        71591 ns         9742 items_per_second=893.964k/s percent_in_place=54.6875
   ReallocateGrowingNoCopy<Mimalloc>           10113 ns        10112 ns        69571 items_per_second=6.32909M/s percent_in_place=78.125
   
   ReallocateShrinking<SystemAlloc>           203288 ns       203237 ns         5293 items_per_second=314.904k/s percent_in_place=0
   ReallocateShrinkingNoCopy<SystemAlloc>       6659 ns         6658 ns       105071 items_per_second=9.61302M/s percent_in_place=73.4375
   ReallocateShrinking<Jemalloc>                9691 ns         9689 ns        71743 items_per_second=6.60514M/s percent_in_place=93.75
   ReallocateShrinkingNoCopy<Jemalloc>          6433 ns         6253 ns       114823 items_per_second=10.2355M/s percent_in_place=98.4375
   ReallocateShrinking<Mimalloc>                5895 ns         5894 ns       118059 items_per_second=10.858M/s percent_in_place=89.0625
   ReallocateShrinkingNoCopy<Mimalloc>          2047 ns         2046 ns       341842 items_per_second=31.2752M/s percent_in_place=98.4375
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on PR #40774:
URL: https://github.com/apache/arrow/pull/40774#issuecomment-2018195912

   > @felipecrv Just so that you understand better what I have in mind.
   
   Right. But now I'm curious for the motivating use-cases.
   
   One abstract case I have in mind i dynamically-growing containers: let's say I'm appending a value to a container and size==capacity already, `Reallocate(capacity+1)` is wasteful (better to grow capacity geometrically), but if I can resize to `capacity+1` in-place, that's preferrable.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #40774:
URL: https://github.com/apache/arrow/pull/40774#discussion_r1538489549


##########
cpp/src/arrow/io/file_test.cc:
##########
@@ -461,6 +461,11 @@ class MyMemoryPool : public MemoryPool {
     return Status::OK();
   }
 
+  Status ReallocateNoCopy(int64_t old_size, int64_t new_size, int64_t alignment,

Review Comment:
   Name suggestion: `TryResize`.
   
   Workloads taking advantage of this API are going to "try a resize" with the exact size needed for the operation being performed and then they are going to call `Reallocate` with a more generous size to amortize the cost of copies. Naming it `ReallocateNoCopy` creates the need to explain to users/developers why isn't `ReallocateNoCopy` the implementation of `Reallocate` in the first place. The "Try" prefix will also indicate that caller should be ready to handle failure gracefully. A bad Status is not an exceptional case, it's a very likely result.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org