Posted to commits@arrow.apache.org by ra...@apache.org on 2023/04/17 12:58:37 UTC

[arrow] 03/11: GH-34474: [C++] Detect and raise an error if a join will need too much key data (#35087)

This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch maint-12.0.0
in repository https://gitbox.apache.org/repos/asf/arrow.git

commit a275bc2cd18aef5859e3734896c2a1aa711c48b4
Author: Weston Pace <we...@gmail.com>
AuthorDate: Thu Apr 13 07:25:28 2023 -0700

    GH-34474: [C++] Detect and raise an error if a join will need too much key data (#35087)
    
    ### Rationale for this change
    
    This fixes the test in #34474, though there are likely still other bad scenarios with large joins.  I've fixed this one since the behavior (invalid data) is particularly bad.  Most of the time, if there is too much data, I'm guessing we probably just crash.  Still, I think a test suite of some kind stressing large joins would be good to have.  Perhaps this could be added if someone finds time to work on join spilling.
    
    ### What changes are included in this PR?
    
    If the join will require more than 4GiB of key data it should now return an invalid status instead of invalid data.
    
    ### Are these changes tested?
    
    No.  I created a unit test but it requires over 16GiB of RAM: besides the input data itself (4GiB), by the time you get 4GiB of key data there are various other join state buffers that also grow.  The test also took nearly a minute to run.  I think investigation and creation of a test suite for large joins is probably a standalone effort.
    
    ### Are there any user-facing changes?
    
    No.
    * Closes: #34474
    
    Authored-by: Weston Pace <we...@gmail.com>
    Signed-off-by: Joris Van den Bossche <jo...@gmail.com>
---
 cpp/src/arrow/acero/swiss_join.cc | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/cpp/src/arrow/acero/swiss_join.cc b/cpp/src/arrow/acero/swiss_join.cc
index ed1608e67d..3f11b89af3 100644
--- a/cpp/src/arrow/acero/swiss_join.cc
+++ b/cpp/src/arrow/acero/swiss_join.cc
@@ -473,6 +473,12 @@ Status RowArrayMerge::PrepareForMerge(RowArray* target,
     (*first_target_row_id)[sources.size()] = num_rows;
   }
 
+  if (num_bytes > std::numeric_limits<uint32_t>::max()) {
+    return Status::Invalid(
+        "There are more than 2^32 bytes of key data.  Acero cannot "
+        "process a join of this magnitude");
+  }
+
   // Allocate target memory
   //
   target->rows_.Clean();
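
The guard added in the hunk above can be sketched in isolation, without any Arrow dependency. This is a minimal illustration, not the Acero implementation: `SketchStatus`, `CheckTotalKeyBytes`, and `TruncateToOffset` are hypothetical names, and the premise that key data is addressed through 32-bit offsets is an assumption inferred from the comparison against `std::numeric_limits<uint32_t>::max()`, not something the commit states.

```cpp
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Hypothetical stand-in for arrow::Status; not the real Arrow type.
struct SketchStatus {
  bool ok;
  std::string message;
};

// Sums per-source key byte counts in 64 bits, then rejects any total that
// exceeds the largest uint32_t value (assumption: offsets are 32 bits wide).
SketchStatus CheckTotalKeyBytes(const std::vector<uint64_t>& source_bytes) {
  uint64_t num_bytes = 0;
  for (uint64_t bytes : source_bytes) {
    num_bytes += bytes;
  }
  if (num_bytes > std::numeric_limits<uint32_t>::max()) {
    return {false, "more key data than a 32-bit offset can address"};
  }
  return {true, ""};
}

// What an unchecked narrowing conversion does: the value wraps modulo 2^32,
// so an offset past the 4GiB boundary would silently point at earlier data.
uint32_t TruncateToOffset(uint64_t num_bytes) {
  return static_cast<uint32_t>(num_bytes);
}
```

Accumulating in a 64-bit integer before comparing is what makes the check meaningful: if the total were accumulated in 32 bits, it would already have wrapped by the time it was tested. The wraparound shown by `TruncateToOffset` is one plausible way the "invalid data" described in the commit message could arise, though the commit itself does not spell out the mechanism.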