You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "pitrou (via GitHub)" <gi...@apache.org> on 2024/03/25 14:41:20 UTC
[PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]
pitrou opened a new pull request, #40774:
URL: https://github.com/apache/arrow/pull/40774
<!--
Thanks for opening a pull request!
If this is your first pull request you can find detailed information on how
to contribute here:
* [New Contributor's Guide](https://arrow.apache.org/docs/dev/developers/guide/step_by_step/pr_lifecycle.html#reviews-and-merge-of-the-pull-request)
* [Contributing Overview](https://arrow.apache.org/docs/dev/developers/overview.html)
If this is not a [minor PR](https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
Opening GitHub issues ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
Then could you also rename the pull request title in the following format?
GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
or
MINOR: [${COMPONENT}] ${SUMMARY}
In the case of PARQUET issues on JIRA the title also supports:
PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
-->
### Rationale for this change
<!--
Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
-->
### What changes are included in this PR?
<!--
There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
-->
### Are these changes tested?
<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code
If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
-->
### Are there any user-facing changes?
<!--
If there are user-facing changes then we may require documentation to be updated before approving the PR.
-->
<!--
If there are any breaking changes to public APIs, please uncomment the line below and explain which changes are breaking.
-->
<!-- **This PR includes breaking changes to public APIs.** -->
<!--
Please uncomment the line below (and provide explanation) if the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld). We use this to highlight fixes to issues that may affect users without their knowledge. For this reason, fixing bugs that cause errors don't count, since those are usually obvious.
-->
<!-- **This PR contains a "Critical Fix".** -->
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
Re: [PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]
Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #40774:
URL: https://github.com/apache/arrow/pull/40774#discussion_r1538489549
##########
cpp/src/arrow/io/file_test.cc:
##########
@@ -461,6 +461,11 @@ class MyMemoryPool : public MemoryPool {
return Status::OK();
}
+ Status ReallocateNoCopy(int64_t old_size, int64_t new_size, int64_t alignment,
Review Comment:
Name suggestion: `TryResize`.
Workloads taking advantage of this API are going to "try a resize" with the exact size needed for the operation being performed and then they are going to call `Reallocate` with a more generous size to amortize the cost of copies. Naming it `ReallocateNoCopy` creates the need to explain to users/developers why isn't `ReallocateNoCopy` the implementation of `Reallocate` in the first place. The "Try" prefix will also indicate that caller should be reader to handle failure gracefully. A bad Status is not an exceptional case, it's a very likely result.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
Re: [PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]
Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #40774:
URL: https://github.com/apache/arrow/pull/40774#issuecomment-2018287913
Ok, I removed the benchmark numbers because they were based on an incorrect logic for jemalloc.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
Re: [PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]
Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #40774:
URL: https://github.com/apache/arrow/pull/40774#discussion_r1538490823
##########
cpp/src/arrow/io/file_test.cc:
##########
@@ -461,6 +461,11 @@ class MyMemoryPool : public MemoryPool {
return Status::OK();
}
+ Status ReallocateNoCopy(int64_t old_size, int64_t new_size, int64_t alignment,
+ uint8_t** ptr) override {
+ return Reallocate(old_size, new_size, alignment, ptr);
Review Comment:
My comment above also implies that the implementation should be `return Status::NotImplemented()` so that caller immediately fallback to the `Reallocate()` with a size that is not necessarily the size that was passed to `TryResize`. The size passed to `TryResize` is conservative whereas the size passed to `Reallocate` is often more generous.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
Re: [PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]
Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #40774:
URL: https://github.com/apache/arrow/pull/40774#issuecomment-2018197907
> Right. But now I'm curious for the motivating use-cases.
It's explained in the issue description.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
Re: [PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]
Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #40774:
URL: https://github.com/apache/arrow/pull/40774#issuecomment-2018298226
Benchmark results are encouraging except for jemalloc which needs more massaging to fully exploit (see comments in the source):
```
-------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------------
ReallocateGrowing<SystemAlloc> 267140 ns 267082 ns 2620 items_per_second=239.627k/s percent_in_place=0
ReallocateGrowingNoCopy<SystemAlloc> 6271 ns 6270 ns 111590 items_per_second=10.2078M/s percent_in_place=75
ReallocateGrowing<Jemalloc> 9133 ns 9132 ns 76586 items_per_second=7.00834M/s percent_in_place=93.75
ReallocateGrowingNoCopy<Jemalloc> 23503 ns 23500 ns 29719 items_per_second=2.72346M/s percent_in_place=26.0635
ReallocateGrowing<Mimalloc> 72325 ns 72311 ns 9674 items_per_second=885.071k/s percent_in_place=54.6875
ReallocateGrowingNoCopy<Mimalloc> 10122 ns 10120 ns 69251 items_per_second=6.32395M/s percent_in_place=78.125
ReallocateShrinking<SystemAlloc> 206661 ns 206608 ns 5258 items_per_second=309.765k/s percent_in_place=0
ReallocateShrinkingNoCopy<SystemAlloc> 6571 ns 6570 ns 106790 items_per_second=9.74056M/s percent_in_place=73.4375
ReallocateShrinking<Jemalloc> 9396 ns 9394 ns 74619 items_per_second=6.81305M/s percent_in_place=93.75
ReallocateShrinkingNoCopy<Jemalloc> 23662 ns 23658 ns 29572 items_per_second=2.70523M/s percent_in_place=26.0422
ReallocateShrinking<Mimalloc> 5882 ns 5881 ns 119515 items_per_second=10.8828M/s percent_in_place=89.0625
ReallocateShrinkingNoCopy<Mimalloc> 2064 ns 2064 ns 339045 items_per_second=31.0091M/s percent_in_place=98.4375
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
Re: [PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]
Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #40774:
URL: https://github.com/apache/arrow/pull/40774#discussion_r1537991608
##########
cpp/src/arrow/memory_pool.cc:
##########
@@ -498,6 +532,46 @@ class BaseMemoryPoolImpl : public MemoryPool {
return Status::OK();
}
+ Status ReallocateNoCopy(int64_t old_size, int64_t new_size, int64_t alignment,
+ uint8_t** ptr) override {
+ if (new_size == old_size) {
+ return Status::OK();
+ }
+ if (old_size == 0 || new_size == 0) {
+ return Reallocate(old_size, new_size, alignment, ptr);
+ }
+ if (new_size < 0) {
+ return Status::Invalid("negative realloc size");
+ }
+ if (static_cast<uint64_t>(new_size) >= std::numeric_limits<size_t>::max()) {
+ return Status::OutOfMemory("realloc overflows size_t");
+ }
+ // First try resizing in place
+ if (!Allocator::ResizeInPlace(old_size, new_size, *ptr)) {
Review Comment:
So this would do allocate best-effort?
##########
cpp/src/arrow/memory_pool.cc:
##########
@@ -498,6 +532,46 @@ class BaseMemoryPoolImpl : public MemoryPool {
return Status::OK();
}
+ Status ReallocateNoCopy(int64_t old_size, int64_t new_size, int64_t alignment,
+ uint8_t** ptr) override {
+ if (new_size == old_size) {
+ return Status::OK();
+ }
+ if (old_size == 0 || new_size == 0) {
+ return Reallocate(old_size, new_size, alignment, ptr);
+ }
+ if (new_size < 0) {
+ return Status::Invalid("negative realloc size");
+ }
+ if (static_cast<uint64_t>(new_size) >= std::numeric_limits<size_t>::max()) {
+ return Status::OutOfMemory("realloc overflows size_t");
+ }
+ // First try resizing in place
+ if (!Allocator::ResizeInPlace(old_size, new_size, *ptr)) {
Review Comment:
So this would do re-allocate local best-effort?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
Re: [PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]
Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #40774:
URL: https://github.com/apache/arrow/pull/40774#issuecomment-2018161966
@felipecrv Just so that you understand better what I have in mind.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
Re: [PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]
Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #40774:
URL: https://github.com/apache/arrow/pull/40774#issuecomment-2019917084
For the record, this is very much work-in-progress and the real usefulness of this will be judged once it is plugged in existing functionality (for example some Parquet reader or decoder facility).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
Re: [PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]
Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #40774:
URL: https://github.com/apache/arrow/pull/40774#discussion_r1538495332
##########
cpp/src/arrow/io/file_test.cc:
##########
@@ -461,6 +461,11 @@ class MyMemoryPool : public MemoryPool {
return Status::OK();
}
+ Status ReallocateNoCopy(int64_t old_size, int64_t new_size, int64_t alignment,
Review Comment:
You could also change the return type to `[[nodiscard]] bool`. I don't think we really need the richness of `Status` to express what happened. `false` for exceptional/validation error and also impossibility of the resize-in-place will cause the caller to call a different allocation function that can then be more detailed in the `Status` it returns.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
Re: [PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]
Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #40774:
URL: https://github.com/apache/arrow/pull/40774#issuecomment-2018234002
For reference, here are the micro-benchmark results on my machine. But bear in mind that evaluating the actual usefulness of this first requires plugging it into existing code (such as a Parquet decoder):
```
-------------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------------------
ReallocateGrowing<SystemAlloc> 259625 ns 259558 ns 2697 items_per_second=246.573k/s percent_in_place=0
ReallocateGrowingNoCopy<SystemAlloc> 6024 ns 6023 ns 119303 items_per_second=10.6263M/s percent_in_place=75
ReallocateGrowing<Jemalloc> 9389 ns 9387 ns 74572 items_per_second=6.81772M/s percent_in_place=93.75
ReallocateGrowingNoCopy<Jemalloc> 5572 ns 5571 ns 125529 items_per_second=11.4872M/s percent_in_place=93.75
ReallocateGrowing<Mimalloc> 71605 ns 71591 ns 9742 items_per_second=893.964k/s percent_in_place=54.6875
ReallocateGrowingNoCopy<Mimalloc> 10113 ns 10112 ns 69571 items_per_second=6.32909M/s percent_in_place=78.125
ReallocateShrinking<SystemAlloc> 203288 ns 203237 ns 5293 items_per_second=314.904k/s percent_in_place=0
ReallocateShrinkingNoCopy<SystemAlloc> 6659 ns 6658 ns 105071 items_per_second=9.61302M/s percent_in_place=73.4375
ReallocateShrinking<Jemalloc> 9691 ns 9689 ns 71743 items_per_second=6.60514M/s percent_in_place=93.75
ReallocateShrinkingNoCopy<Jemalloc> 6433 ns 6253 ns 114823 items_per_second=10.2355M/s percent_in_place=98.4375
ReallocateShrinking<Mimalloc> 5895 ns 5894 ns 118059 items_per_second=10.858M/s percent_in_place=89.0625
ReallocateShrinkingNoCopy<Mimalloc> 2047 ns 2046 ns 341842 items_per_second=31.2752M/s percent_in_place=98.4375
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
Re: [PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]
Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on PR #40774:
URL: https://github.com/apache/arrow/pull/40774#issuecomment-2018195912
> @felipecrv Just so that you understand better what I have in mind.
Right. But now I'm curious for the motivating use-cases.
One abstract case I have in mind i dynamically-growing containers: let's say I'm appending a value to a container and size==capacity already, `Reallocate(capacity+1)` is wasteful (better to grow capacity geometrically), but if I can resize to `capacity+1` in-place, that's preferrable.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
Re: [PR] EXPERIMENTAL: GH-40686: [C++] Add MemoryPool API to resize without copying the data [arrow]
Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #40774:
URL: https://github.com/apache/arrow/pull/40774#discussion_r1538489549
##########
cpp/src/arrow/io/file_test.cc:
##########
@@ -461,6 +461,11 @@ class MyMemoryPool : public MemoryPool {
return Status::OK();
}
+ Status ReallocateNoCopy(int64_t old_size, int64_t new_size, int64_t alignment,
Review Comment:
Name suggestion: `TryResize`.
Workloads taking advantage of this API are going to "try a resize" with the exact size needed for the operation being performed and then they are going to call `Reallocate` with a more generous size to amortize the cost of copies. Naming it `ReallocateNoCopy` creates the need to explain to users/developers why isn't `ReallocateNoCopy` the implementation of `Reallocate` in the first place. The "Try" prefix will also indicate that caller should be ready to handle failure gracefully. A bad Status is not an exceptional case, it's a very likely result.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org