Posted to issues@arrow.apache.org by "Light-City (via GitHub)" <gi...@apache.org> on 2024/03/27 08:17:51 UTC

[I] Add partition rows check for hashjoin [arrow]

Light-City opened a new issue, #40816:
URL: https://github.com/apache/arrow/issues/40816

   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   In the code below, `1 << 15` equals 32768, but `prtn_ranges` is `uint16_t`, whose maximum value is 65535. Can we raise the limit to that maximum?
   
   ```
     static void Eval(int64_t num_rows, int num_prtns, uint16_t* prtn_ranges,
                      INPUT_PRTN_ID_FN prtn_id_impl, OUTPUT_POS_FN output_pos_impl) {
       ARROW_DCHECK(num_rows > 0 && num_rows <= (1 << 15));
       ARROW_DCHECK(num_prtns >= 1 && num_prtns <= (1 << 15));
   
       memset(prtn_ranges, 0, (num_prtns + 1) * sizeof(uint16_t));
   }
   ```
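
   To make the two bounds concrete, here is a minimal sketch of the arithmetic. The names `kCurrentLimit`, `kUint16Max`, `NarrowToUint16` and `DemoNarrowing` are hypothetical and are not part of Arrow; they only show which values a `uint16_t` slot can hold.

   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <limits>

   // Upper bound used by the assertions quoted above.
   constexpr int64_t kCurrentLimit = int64_t{1} << 15;
   // Largest value a single uint16_t entry can hold.
   constexpr int64_t kUint16Max = std::numeric_limits<uint16_t>::max();

   static_assert(kCurrentLimit == 32768, "1 << 15 is 32768");
   static_assert(kUint16Max == 65535, "uint16_t tops out at 65535");
   static_assert(kUint16Max == (int64_t{1} << 16) - 1, "same as (1 << 16) - 1");

   // Hypothetical helper: narrows a 64-bit count to 16 bits (modulo 2^16).
   inline uint16_t NarrowToUint16(int64_t value) {
     return static_cast<uint16_t>(value);
   }

   // Usage sketch: 65535 survives narrowing, 65536 wraps around to 0.
   inline void DemoNarrowing() {
     assert(NarrowToUint16(65535) == 65535);
     assert(NarrowToUint16(65536) == 0);
   }
   ```

   Under these assumptions, a count of 65536 cannot be represented in a `uint16_t` entry, while every count up to 65535 can.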
   
   **Background:** Today we set a batch size of 65536 (batch hash rows) in a release build and hit a crash. The crash message is:
   ```
   terminate called after throwing an instance of 'std::length_error'
     what():  vector::_M_default_append
   ```
   
   The crash location is the following line of code in SwissTableForJoinBuild::ProcessPartition:
   ```
   prtn_state.key_ids.resize(num_rows_before + num_rows_new);
   ```
   When we logged the values, we found that num_rows_new had overflowed and become negative. Reviewing the code, we then found that PartitionSort::Eval places some restrictions on the number of rows. In the release build, any batch size from 32768 + 1 to 65535 runs, because the following assertion is not enforced there (ARROW_DCHECK is only active in debug builds). This also got us thinking: since every value that fits in uint16_t works, why not set the limit to the maximum value?
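
   As an illustration of how a negative row count can end in `std::length_error`, here is a minimal sketch. It assumes the negative value reaches `resize` and is converted to `size_t` on the way. `ResizeThrowsLengthError` and `DemoResize` are hypothetical names, and the vector is only a stand-in for `prtn_state.key_ids`, not Arrow's actual code path.

   ```cpp
   #include <cassert>
   #include <cstddef>
   #include <cstdint>
   #include <stdexcept>
   #include <vector>

   // Hypothetical stand-in for the resize call in the issue: tries to grow
   // `key_ids` by `num_rows_new` rows and reports whether std::length_error
   // was thrown.
   inline bool ResizeThrowsLengthError(std::vector<uint32_t>& key_ids,
                                       int64_t num_rows_new) {
     const int64_t num_rows_before = static_cast<int64_t>(key_ids.size());
     try {
       // A negative sum converts to a huge size_t, far above max_size().
       key_ids.resize(static_cast<size_t>(num_rows_before + num_rows_new));
     } catch (const std::length_error&) {
       return true;
     }
     return false;
   }

   // Usage sketch: a negative count throws, a small positive count does not.
   inline void DemoResize() {
     std::vector<uint32_t> ids(4);
     assert(ResizeThrowsLengthError(ids, -10));
     assert(!ResizeThrowsLengthError(ids, 6));
   }
   ```

   With libstdc++, the exception raised this way carries the message `vector::_M_default_append`, which matches the crash output above.
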
   My suggestion is that this assertion
   ```
   ARROW_DCHECK(num_rows > 0 && num_rows <= (1 << 15));
   ```
   becomes:
   
   ```
   ARROW_DCHECK(num_rows > 0 && num_rows <= (1 << 16) - 1);
   ```
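
   Since `ARROW_DCHECK` is not evaluated in release builds, a row-count check that should also protect release binaries has to run unconditionally. The following is only a sketch of such a check using the bound proposed above. `IsValidPartitionSortInput` and `DemoValidation` are hypothetical names, not existing Arrow functions, and the partition limit is left at the value from the quoted code.

   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <limits>

   // Hypothetical runtime counterpart of the proposed assertion. Unlike a
   // debug-only check, it is still evaluated in release builds.
   inline bool IsValidPartitionSortInput(int64_t num_rows, int num_prtns) {
     // Proposed row limit: (1 << 16) - 1, the largest uint16_t value.
     constexpr int64_t kMaxRows = std::numeric_limits<uint16_t>::max();
     // Partition limit kept as in the quoted code.
     constexpr int kMaxPrtns = 1 << 15;
     return num_rows > 0 && num_rows <= kMaxRows &&
            num_prtns >= 1 && num_prtns <= kMaxPrtns;
   }

   // Usage sketch: 65535 rows pass the check, 65536 rows are rejected.
   inline void DemoValidation() {
     assert(IsValidPartitionSortInput(65535, 1));
     assert(!IsValidPartitionSortInput(65536, 1));
   }
   ```

   A caller could use the returned flag to reject or split an oversized batch before it reaches the partitioning step; how that should be surfaced is left open here.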
   
   ### Component(s)
   
   C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org