Posted to github@arrow.apache.org by "pitrou (via GitHub)" <gi...@apache.org> on 2023/05/17 14:56:54 UTC

[GitHub] [arrow] pitrou commented on a diff in pull request #35565: GH-35498: [C++] Relax EnsureAlignment check in Acero from requiring 64-byte aligned buffers to requiring value-aligned buffers

pitrou commented on code in PR #35565:
URL: https://github.com/apache/arrow/pull/35565#discussion_r1196654862


##########
cpp/src/arrow/util/align_util.cc:
##########
@@ -30,12 +32,120 @@ bool CheckAlignment(const Buffer& buffer, int64_t alignment) {
   return buffer.address() % alignment == 0;
 }
 
-bool CheckAlignment(const ArrayData& array, int64_t alignment) {
-  for (const auto& buffer : array.buffers) {
-    if (buffer) {
-      if (!CheckAlignment(*buffer, alignment)) return false;
+namespace {
+
+// Some buffers are frequently type-punned.  For example, in an int32 array the
+// values buffer is frequently cast to int32_t*
+//
+// This sort of punning is only valid if the pointer is aligned to a proper width
+// (e.g. 4 bytes in the case of int32).
+//
+// We generally assume that all buffers are at least 8-bit aligned and so we only
+// need to worry about buffers that are commonly cast to wider data types.  Note that
+// this alignment is something that is guaranteed by malloc (e.g. new int32_t[] will
+// return a buffer that is 4 byte aligned) or common libraries (e.g. numpy) but it is
+// not currently guaranteed by flight (GH-32276).
+//
+// By happy coincidence, for every data type, the only buffer that might need wider
+// alignment is the second buffer (at index 1).  This function returns the expected
+// alignment (in bits) of the second buffer for the given array to safely allow this cast.
+//
+// If the array's type doesn't have a second buffer or the second buffer is not expected
+// to be type punned, then we return 8.
+int GetMallocValuesAlignment(const ArrayData& array) {

Review Comment:
   Since an array can have several buffers, I think this is too coarse-grained. Let's have something like:
   ```c++
   int GetRequiredBufferAlignment(const DataType& type, int buffer_index) {
     if (buffer_index == 0) {
       // Either null bitmap or 8-bit union type ids
       return 1;
     }
     switch (type.id()) {
       case Type::INT16:
       case Type::UINT16:
       case Type::HALF_FLOAT:
         return 2;
       case Type::INT32:
       case Type::UINT32:
       case Type::FLOAT:
       case Type::DATE32:
       case Type::TIME32:
       case Type::LIST:         // Offsets may be cast to int32_t*, data is in child array
       case Type::MAP:          // This is a list array
       case Type::DENSE_UNION:  // Has an offsets buffer of int32_t*
       case Type::INTERVAL_MONTHS:  // Stored as int32_t*
         return 4;
       case Type::INT64:
       case Type::UINT64:
       case Type::DOUBLE:
       case Type::LARGE_LIST:    // Offsets may be cast to int64_t*
       case Type::DATE64:
       case Type::TIME64:
       case Type::TIMESTAMP:
       case Type::DURATION:
       case Type::INTERVAL_DAY_TIME:  // Stored as two contiguous 32-bit integers but may be
                                      // cast to struct* containing both integers
         return 8;
       case Type::INTERVAL_MONTH_DAY_NANO:  // Stored as two 32-bit integers and a 64-bit
                                            // integer
         return 16;
       case Type::STRING:
       case Type::BINARY:  // Offsets may be cast to int32_t*, data is only uint8_t*
         return (buffer_index == 1) ? 4 : 1;
       case Type::LARGE_STRING:
       case Type::LARGE_BINARY:  // Offsets may be cast to int64_t*
         return (buffer_index == 1) ? 8 : 1;
       default:
         // Everything else doesn't have buffers with non-trivial alignment requirements
         return 1;
     }
   }
   ```
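
   A minimal sketch of how a per-buffer function like the one above might be consumed, assuming nothing about the final Arrow API. `FakeBuffer`, `IsAligned` and `CheckBuffersAlignment` are hypothetical names used only for illustration; the callable stands in for `GetRequiredBufferAlignment(type, buffer_index)` and returns an alignment in bytes:
   ```c++
   #include <cstddef>
   #include <cstdint>
   #include <vector>

   // Hypothetical stand-in for arrow::Buffer: only the start address matters here.
   struct FakeBuffer {
     const void* data;
   };

   // True if `ptr` is a multiple of `alignment` bytes (alignment must be > 0).
   inline bool IsAligned(const void* ptr, int alignment) {
     return reinterpret_cast<std::uintptr_t>(ptr) %
                static_cast<std::uintptr_t>(alignment) == 0;
   }

   // Checks every present buffer against the alignment required for its index.
   // `required_alignment` is any callable of the form int(int buffer_index),
   // e.g. a lambda wrapping a per-type, per-buffer lookup (an assumption here).
   template <typename RequiredAlignmentFn>
   bool CheckBuffersAlignment(const std::vector<FakeBuffer>& buffers,
                              RequiredAlignmentFn&& required_alignment) {
     for (std::size_t i = 0; i < buffers.size(); ++i) {
       if (buffers[i].data == nullptr) continue;  // absent buffers are skipped
       if (!IsAligned(buffers[i].data, required_alignment(static_cast<int>(i)))) {
         return false;
       }
     }
     return true;
   }
   ```
   With this shape, each buffer is held only to the alignment its own index requires, so a string array's offsets buffer can require 4 bytes while its data buffer only requires 1.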


