You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/04/26 19:48:21 UTC

[GitHub] [arrow] westonpace opened a new pull request, #35347: GH-35270: [C++] Use Buffer instead of raw buffer in hash join internals

westonpace opened a new pull request, #35347:
URL: https://github.com/apache/arrow/pull/35347

   ### Rationale for this change
   
   The current code has two storage buffers in the key map which are allocated with MemoryPool::Allocate which does not use smart pointers.  This could have led to a potential crash in an OOM scenario where the first allocate fails and it also led to some convoluted code keeping track of the previously allocated size in order to properly call Free.
   
   Furthermore, it seems that this key map could have been getting potentially copied in the swiss join code.  While that was probably not happening (since the copy happened before the key map was initialized) it is still an easy recipe for an accidental double-free later on as we maintain the class.
   
   ### What changes are included in this PR?
   
   Those raw buffers are changed to std::shared_ptr<Buffer> to avoid these issues.
   
   ### Are these changes tested?
   
   Somewhat, the existing unit tests should ensure we didn't cause a regression.  I didn't introduce a regression test to introduce this potential bug because it would be very difficult to do so.
   
   ### Are there any user-facing changes?
   
   No
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35347: GH-35270: [C++] Use Buffer instead of raw buffer in hash join internals

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35347:
URL: https://github.com/apache/arrow/pull/35347#discussion_r1178567594


##########
cpp/src/arrow/compute/key_map.h:
##########
@@ -226,12 +228,12 @@ class ARROW_EXPORT SwissTable {
   // ---------------------------------------------------
   // * Empty bucket has value 0x80. Non-empty bucket has highest bit set to 0.
   //
-  uint8_t* blocks_;
+  std::shared_ptr<Buffer> blocks_;
 
   // Array of hashes of values inserted into slots.
   // Undefined if the corresponding slot is empty.
   // There is 64B padding at the end.

Review Comment:
   Should here still mention padding here? Since Buffer already ensure padding?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #35347: GH-35270: [C++] Use Buffer instead of raw buffer in hash join internals

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #35347:
URL: https://github.com/apache/arrow/pull/35347#issuecomment-1541475733

   ['Python', 'R'] benchmarks have high level of regressions.
   [test-mac-arm](https://conbench.ursa.dev/compare/runs/4709e8e0f34041b0ba4af7a64673f17f...056f7485bb694ae986da3d4d7d34ecd0/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35347: GH-35270: [C++] Use Buffer instead of raw buffer in hash join internals

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35347:
URL: https://github.com/apache/arrow/pull/35347#issuecomment-1523957218

   * Closes: #35270


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35347: GH-35270: [C++] Use Buffer instead of raw buffer in hash join internals

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35347:
URL: https://github.com/apache/arrow/pull/35347#issuecomment-1523957264

   :warning: GitHub issue #35270 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou merged pull request #35347: GH-35270: [C++] Use Buffer instead of raw buffer in hash join internals

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou merged PR #35347:
URL: https://github.com/apache/arrow/pull/35347


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] mapleFU commented on a diff in pull request #35347: GH-35270: [C++] Use Buffer instead of raw buffer in hash join internals

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35347:
URL: https://github.com/apache/arrow/pull/35347#discussion_r1186671345


##########
cpp/src/arrow/compute/key_map.h:
##########
@@ -226,12 +228,12 @@ class ARROW_EXPORT SwissTable {
   // ---------------------------------------------------
   // * Empty bucket has value 0x80. Non-empty bucket has highest bit set to 0.
   //
-  uint8_t* blocks_;
+  std::shared_ptr<Buffer> blocks_;
 
   // Array of hashes of values inserted into slots.
   // Undefined if the corresponding slot is empty.
   // There is 64B padding at the end.

Review Comment:
   Got it, padding and aligned are different, I missed it previously. the patch LGTM



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #35347: GH-35270: [C++] Use Buffer instead of raw buffer in hash join internals

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #35347:
URL: https://github.com/apache/arrow/pull/35347#issuecomment-1541472394

   Benchmark runs are scheduled for baseline = b372242b686d0cc6eace5bd912dbf460edfc36e1 and contender = 18c976048bc989cf9d2c31139b67f7cc8e143d66. 18c976048bc989cf9d2c31139b67f7cc8e143d66 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/841df5ec022a44d4bfcd1b50c482db65...0700770e974a4084925d4f37ae4f1d5e/)
   [Finished :arrow_down:8.57% :arrow_up:0.23%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/4709e8e0f34041b0ba4af7a64673f17f...056f7485bb694ae986da3d4d7d34ecd0/)
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/ef1a7b4a94a44f0d8bbf1d80204cec46...c788f6747d4e473e975b7396c4876781/)
   [Finished :arrow_down:1.42% :arrow_up:0.36%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/b8286e170e774ce6aebe4270d1786817...7087cccfd0514c02849bad96ddcfe9af/)
   Buildkite builds:
   [Finished] [`18c97604` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2838)
   [Finished] [`18c97604` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2875)
   [Finished] [`18c97604` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2840)
   [Finished] [`18c97604` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2865)
   [Finished] [`b372242b` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2837)
   [Finished] [`b372242b` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2874)
   [Finished] [`b372242b` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2839)
   [Finished] [`b372242b` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2864)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #35347: GH-35270: [C++] Use Buffer instead of raw buffer in hash join internals

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #35347:
URL: https://github.com/apache/arrow/pull/35347#discussion_r1186547232


##########
cpp/src/arrow/compute/key_map.h:
##########
@@ -226,12 +228,12 @@ class ARROW_EXPORT SwissTable {
   // ---------------------------------------------------
   // * Empty bucket has value 0x80. Non-empty bucket has highest bit set to 0.
   //
-  uint8_t* blocks_;
+  std::shared_ptr<Buffer> blocks_;
 
   // Array of hashes of values inserted into slots.
   // Undefined if the corresponding slot is empty.
   // There is 64B padding at the end.

Review Comment:
   Buffer ensures that allocations are aligned (and I think, at least 64 bytes) but I don't know if it actually guarantees padding.  I think this buffer needs to go 64 bytes past the end regardless of the size which is a slightly different requirement.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org