Posted to github@arrow.apache.org by "mapleFU (via GitHub)" <gi...@apache.org> on 2023/05/18 16:35:40 UTC
[GitHub] [arrow] mapleFU commented on a diff in pull request #35565: GH-35498: [C++] Relax EnsureAlignment check in Acero from requiring 64-byte aligned buffers to requiring value-aligned buffers

mapleFU commented on code in PR #35565:
URL: https://github.com/apache/arrow/pull/35565#discussion_r1198037601


##########
cpp/src/arrow/util/align_util.cc:
##########
@@ -30,12 +32,120 @@ bool CheckAlignment(const Buffer& buffer, int64_t alignment) {
   return buffer.address() % alignment == 0;
 }
 
-bool CheckAlignment(const ArrayData& array, int64_t alignment) {
-  for (const auto& buffer : array.buffers) {
-    if (buffer) {
-      if (!CheckAlignment(*buffer, alignment)) return false;
+namespace {
+
+// Some buffers are frequently type-punned.  For example, in an int32 array the
+// values buffer is frequently cast to int32_t*
+//
+// This sort of punning is only valid if the pointer is aligned to a proper width
+// (e.g. 4 bytes in the case of int32).
+//
+// We generally assume that all buffers are at least 8-bit aligned and so we only
+// need to worry about buffers that are commonly cast to wider data types.  Note that
+// this alignment is something that is guaranteed by malloc (e.g. new int32_t[] will
+// return a buffer that is 4 byte aligned) or common libraries (e.g. numpy) but it is
+// not currently guaranteed by flight (GH-32276).
+//
+// By happy coincidence, for every data type, the only buffer that might need wider
+// alignment is the second buffer (at index 1).  This function returns the expected
+// alignment (in bits) of the second buffer for the given array to safely allow this cast.
+//
+// If the array's type doesn't have a second buffer or the second buffer is not expected
+// to be type punned, then we return 8.
+int GetMallocValuesAlignment(const ArrayData& array) {
+  // Make sure to use the storage type id
+  auto type_id = array.type->storage_id();
+  if (type_id == Type::DICTIONARY) {
+    // The values buffer is in a different ArrayData and so we only check the indices
+    // buffer here.  The values array data will be checked by the calling method.
+    type_id = ::arrow::internal::checked_pointer_cast<DictionaryType>(array.type)
+                  ->index_type()
+                  ->id();
+  }
+  switch (type_id) {
+    case Type::NA:                 // No buffers
+    case Type::FIXED_SIZE_LIST:    // No second buffer (values in child array)
+    case Type::FIXED_SIZE_BINARY:  // Fixed size binary could be dangerous but the
+                                   // compute kernels don't type pun this.  E.g. if
+                                   // an extension type is storing some kind of struct
+                                   // here then the user should do their own alignment
+                                   // check before casting to an array of structs
+    case Type::BOOL:               // Always treated as uint8_t*
+    case Type::INT8:               // Always treated as uint8_t*
+    case Type::UINT8:              // Always treated as uint8_t*
+    case Type::DECIMAL128:         // Always treated as uint8_t*
+    case Type::DECIMAL256:         // Always treated as uint8_t*

Review Comment:
   ```c++
   struct DecimalToIntegerMixin {
     template <typename OutValue, typename Arg0Value>
     OutValue ToInteger(KernelContext* ctx, const Arg0Value& val, Status* st) const {
       constexpr auto min_value = std::numeric_limits<OutValue>::min();
       constexpr auto max_value = std::numeric_limits<OutValue>::max();
   
       if (!allow_int_overflow_ && ARROW_PREDICT_FALSE(val < min_value || val > max_value)) {
         *st = Status::Invalid("Integer value out of bounds");
         return OutValue{};  // Zero
       } else {
         return static_cast<OutValue>(val.low_bits());
       }
     }
   ```
   
   I'm not very familiar with Acero, so I wonder why `Decimal128` and `Decimal256` can use 1-byte alignment here. Aren't they 128-bit and 256-bit values, as their names suggest?
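
To make the question concrete, here is a minimal sketch of one possible reading of the "Always treated as uint8_t*" comment: if a kernel only ever views the decimal buffer as bytes and copies each value into a local variable before using it, then 1-byte alignment would be enough. This is an assumption about how the buffer is accessed, not a statement about what the Arrow kernels actually do. `IsAligned`, `Int128Words`, and `LoadWords` are hypothetical names for illustration and are not part of the Arrow API.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not Arrow API): true if `ptr` is a multiple of
// `alignment` bytes, mirroring the `address % alignment == 0` check in the diff.
bool IsAligned(const void* ptr, std::uintptr_t alignment) {
  return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}

// Hypothetical stand-in for a 128-bit decimal value: two 64-bit words.
struct Int128Words {
  uint64_t low;
  uint64_t high;
};

// Copies 16 bytes out of a byte buffer into a local value. Because the source
// is only accessed through a byte pointer, this is valid for any alignment.
// Casting `bytes` to `const Int128Words*` and dereferencing it would instead
// require the pointer to be suitably aligned.
Int128Words LoadWords(const uint8_t* bytes) {
  assert(bytes != nullptr);
  Int128Words out;
  std::memcpy(&out, bytes, sizeof(out));
  return out;
}
```

If the decimal kernels follow this byte-copy pattern, the 1-byte requirement makes sense; if any of them cast the buffer to a wider pointer type, a stricter alignment would be needed.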


