You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/05/23 21:03:12 UTC

[GitHub] [arrow] westonpace opened a new pull request, #13218: ARROW-16590: [C++] Consolidate files dealing with row-major storage

westonpace opened a new pull request, #13218:
URL: https://github.com/apache/arrow/pull/13218

   The primary goal of this refactor of old code was to improve the readability and clarity of the code base. I did not make any functional changes to the code and if any functional changes are suggested which modify existing code I will happily discuss them here but defer the changes themselves to follow-up PRs. I would very much appreciate any feedback on naming, making sure we have sufficient test coverage, and overall layout of the code.
   
   * KeyRowArray -> RowTableImpl KeyEncoder -> RowTableEncoder: The old name made sense because this data is currently represented physically as an array of rows. However, the data is conceptually tabular. We are storing rows & columns. In particular, I found it confusing that `KeyColumnArray` was a 1D data structure while `KeyRowArray` was a 2D table structure.
   * KeyEncoder::Context -> LightContext: There's nothing particular to the key encoder here and I worry keeping it there may lead to fracturing into many different "context" objects. 
   * Overall structure: I created a new folder arrow/compute/row and put all row-based utilities in here. Most of the files are now marked as _internal and the content in these files is not used outside of arrow/compute/row. The grouper had previously been alongside the kernel code and it didn't really belong there as it relies very heavily on the internal structure of the row encoding.
   * Row structure: I documented the file arrow/compute/row/row_internal.h


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #13218: ARROW-16590: [C++] Consolidate files dealing with row-major storage

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #13218:
URL: https://github.com/apache/arrow/pull/13218#discussion_r888571835


##########
cpp/src/arrow/compute/light_array.h:
##########
@@ -31,6 +33,18 @@
 namespace arrow {
 namespace compute {
 
+/// \brief Context needed by various execution engine operations
+///
+/// In the execution engine this context is provided by either the node or the
+/// plan and the context exists for the lifetime of the plan.  Defining this here
+/// allows us to take advantage of these resources without coupling the logic with
+/// the execution engine.
+struct LightContext {
+  bool has_avx2() const { return (hardware_flags & arrow::internal::CpuInfo::AVX2) > 0; }

Review Comment:
   Leaving this alone for now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #13218: ARROW-16590: [C++] Consolidate files dealing with row-major storage

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13218:
URL: https://github.com/apache/arrow/pull/13218#issuecomment-1135135704

   <!--
     Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.
   -->
   
   Thanks for opening a pull request!
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/master/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW
   
   Opening JIRAs ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #13218: ARROW-16590: [C++] Consolidate files dealing with row-major storage

Posted by GitBox <gi...@apache.org>.
bkietz commented on code in PR #13218:
URL: https://github.com/apache/arrow/pull/13218#discussion_r885909089


##########
cpp/src/arrow/compute/row/row_internal.h:
##########
@@ -0,0 +1,244 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+#include "arrow/buffer.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+namespace compute {
+
+/// Description of the data stored in a RowTable
+struct ARROW_EXPORT RowTableMetadata {
+  /// \brief True if there are no variable length columns in the table
+  bool is_fixed_length;
+
+  /// For a fixed-length binary row, common size of rows in bytes,
+  /// rounded up to the multiple of alignment.
+  ///
+  /// For a varying-length binary, size of all encoded fixed-length key columns,
+  /// including lengths of varying-length columns, rounded up to the multiple of string
+  /// alignment.
+  uint32_t fixed_length;
+
+  /// Offset within a row to the array of 32-bit offsets within a row of
+  /// ends of varbinary fields.
+  /// Used only when the row is not fixed-length, zero for fixed-length row.
+  /// There are N elements for N varbinary fields.
+  /// Each element is the offset within a row of the first byte after
+  /// the corresponding varbinary field bytes in that row.
+  /// If varbinary fields begin at aligned addresses, than the end of the previous
+  /// varbinary field needs to be rounded up according to the specified alignment
+  /// to obtain the beginning of the next varbinary field.
+  /// The first varbinary field starts at offset specified by fixed_length,
+  /// which should already be aligned.
+  uint32_t varbinary_end_array_offset;
+
+  /// Fixed number of bytes per row that are used to encode null masks.
+  /// Null masks indicate for a single row which of its columns are null.
+  /// Nth bit in the sequence of bytes assigned to a row represents null
+  /// information for Nth field according to the order in which they are encoded.
+  int null_masks_bytes_per_row;
+
+  /// Power of 2. Every row will start at an offset aligned to that number of bytes.
+  int row_alignment;
+
+  /// Power of 2. Must be no greater than row alignment.
+  /// Every non-power-of-2 binary field and every varbinary field bytes
+  /// will start aligned to that number of bytes.
+  int string_alignment;
+
+  /// Metadata of encoded columns in their original order.
+  std::vector<KeyColumnMetadata> column_metadatas;
+
+  /// Order in which fields are encoded.
+  std::vector<uint32_t> column_order;
+
+  /// Offsets within a row to fields in their encoding order.
+  std::vector<uint32_t> column_offsets;
+
+  /// Rounding up offset to the nearest multiple of alignment value.
+  /// Alignment must be a power of 2.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int required_alignment) {
+    ARROW_DCHECK(ARROW_POPCOUNT64(required_alignment) == 1);
+    return static_cast<uint32_t>((-static_cast<int32_t>(offset)) &
+                                 (required_alignment - 1));
+  }
+
+  /// Rounding up offset to the beginning of next column,
+  /// choosing required alignment based on the data type of that column.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int string_alignment,
+                                               const KeyColumnMetadata& col_metadata) {
+    if (!col_metadata.is_fixed_length ||
+        ARROW_POPCOUNT64(col_metadata.fixed_length) <= 1) {
+      return 0;
+    } else {
+      return padding_for_alignment(offset, string_alignment);
+    }
+  }
+
+  /// Returns an array of offsets within a row of ends of varbinary fields.
+  inline const uint32_t* varbinary_end_array(const uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<const uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// \brief An array of mutable offsets within a row of ends of varbinary fields.
+  inline uint32_t* varbinary_end_array(uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// Returns the offset within the row and length of the first varbinary field.
+  inline void first_varbinary_offset_and_length(const uint8_t* row, uint32_t* offset,
+                                                uint32_t* length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    *offset = fixed_length;
+    *length = varbinary_end_array(row)[0] - fixed_length;
+  }
+
+  /// Returns the offset within the row and length of the second and further varbinary
+  /// fields.
+  inline void nth_varbinary_offset_and_length(const uint8_t* row, int varbinary_id,
+                                              uint32_t* out_offset,
+                                              uint32_t* out_length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    ARROW_DCHECK(varbinary_id > 0);
+    const uint32_t* varbinary_end = varbinary_end_array(row);
+    uint32_t offset = varbinary_end[varbinary_id - 1];
+    offset += padding_for_alignment(offset, string_alignment);
+    *out_offset = offset;
+    *out_length = varbinary_end[varbinary_id] - offset;
+  }
+
+  uint32_t encoded_field_order(uint32_t icol) const { return column_order[icol]; }
+
+  uint32_t encoded_field_offset(uint32_t icol) const { return column_offsets[icol]; }
+
+  uint32_t num_cols() const { return static_cast<uint32_t>(column_metadatas.size()); }
+
+  uint32_t num_varbinary_cols() const;
+
+  /// \brief Populate this instance to describe `cols` with the given alignment
+  void FromColumnMetadataVector(const std::vector<KeyColumnMetadata>& cols,
+                                int in_row_alignment, int in_string_alignment);
+
+  /// \brief True if `other` has the same number of columns
+  ///   and each column has the same width (two variable length
+  ///   columns are considered to have the same width)
+  bool is_compatible(const RowTableMetadata& other) const;
+};
+
+/// \brief A table of data stored in row-major order
+///
+/// Can only store non-nested data types
+///
+/// Can store both fixed-size data types and variable-length data types
+class ARROW_EXPORT RowTableImpl {
+ public:
+  RowTableImpl();
+  /// \brief Initialize a row array for use
+  ///
+  /// This must be called before any other method
+  Status Init(MemoryPool* pool, const RowTableMetadata& metadata);
+  /// \brief Clear all rows from the table
+  ///
+  /// Does not shrink buffers
+  void Clean();
+  /// \brief Add empty rows
+  /// \param num_rows_to_append The number of empty rows to append
+  /// \param num_extra_bytes_to_append For tables storing variable-length data this
+  ///     should be a guess of how many data bytes will be needed to populate the
+  ///     data.  This is ignored if there are no variable-length columns
+  Status AppendEmpty(uint32_t num_rows_to_append, uint32_t num_extra_bytes_to_append);
+  /// \brief Append rows from a source table
+  /// \param from The table to append from
+  /// \param num_rows_to_append The number of rows to append
+  /// \param source_row_ids Indices (into `from`) of the desired rows
+  Status AppendSelectionFrom(const RowTableImpl& from, uint32_t num_rows_to_append,
+                             const uint16_t* source_row_ids);
+  /// \brief Metadata describing the data stored in this table
+  const RowTableMetadata& metadata() const { return metadata_; }
+  /// \brief The number of rows stored in the table
+  int64_t length() const { return num_rows_; }
+  // Accessors into the table's buffers
+  const uint8_t* data(int i) const {
+    ARROW_DCHECK(i >= 0 && i <= max_buffers_);

Review Comment:
   `<=` is indeed incorrect here. This is a pretty trivial correction so why not include it



##########
cpp/src/arrow/compute/light_array.h:
##########
@@ -31,6 +33,18 @@
 namespace arrow {
 namespace compute {
 
+/// \brief Context needed by various execution engine operations
+///
+/// In the execution engine this context is provided by either the node or the
+/// plan and the context exists for the lifetime of the plan.  Defining this here
+/// allows us to take advantage of these resources without coupling the logic with
+/// the execution engine.
+struct LightContext {
+  bool has_avx2() const { return (hardware_flags & arrow::internal::CpuInfo::AVX2) > 0; }

Review Comment:
   IIRC, the concept here was to be able to attach hardware flags to a specific context rather than needing to disable or enable for the whole library using `CpuInfo::EnableFeature()`. It and many other things are certainly candidates for follow up refactoring



##########
cpp/src/arrow/compute/api_aggregate.h:
##########
@@ -482,6 +404,21 @@ struct ARROW_EXPORT Aggregate {
   const FunctionOptions* options;
 };
 
+Result<std::vector<const HashAggregateKernel*>> GetKernels(

Review Comment:
   IIRC these are only used in grouped aggregation and in tests, so api_aggregate_internal.h would be appropriate to house anything which is in `namespace internal` here



##########
cpp/src/arrow/compute/row/row_internal.h:
##########
@@ -0,0 +1,244 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+#include "arrow/buffer.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+namespace compute {
+
+/// Description of the data stored in a RowTable
+struct ARROW_EXPORT RowTableMetadata {
+  /// \brief True if there are no variable length columns in the table
+  bool is_fixed_length;
+
+  /// For a fixed-length binary row, common size of rows in bytes,
+  /// rounded up to the multiple of alignment.
+  ///
+  /// For a varying-length binary, size of all encoded fixed-length key columns,
+  /// including lengths of varying-length columns, rounded up to the multiple of string
+  /// alignment.
+  uint32_t fixed_length;
+
+  /// Offset within a row to the array of 32-bit offsets within a row of
+  /// ends of varbinary fields.
+  /// Used only when the row is not fixed-length, zero for fixed-length row.
+  /// There are N elements for N varbinary fields.
+  /// Each element is the offset within a row of the first byte after
+  /// the corresponding varbinary field bytes in that row.
+  /// If varbinary fields begin at aligned addresses, than the end of the previous
+  /// varbinary field needs to be rounded up according to the specified alignment
+  /// to obtain the beginning of the next varbinary field.
+  /// The first varbinary field starts at offset specified by fixed_length,
+  /// which should already be aligned.
+  uint32_t varbinary_end_array_offset;
+
+  /// Fixed number of bytes per row that are used to encode null masks.
+  /// Null masks indicate for a single row which of its columns are null.
+  /// Nth bit in the sequence of bytes assigned to a row represents null
+  /// information for Nth field according to the order in which they are encoded.
+  int null_masks_bytes_per_row;
+
+  /// Power of 2. Every row will start at an offset aligned to that number of bytes.
+  int row_alignment;
+
+  /// Power of 2. Must be no greater than row alignment.
+  /// Every non-power-of-2 binary field and every varbinary field bytes
+  /// will start aligned to that number of bytes.
+  int string_alignment;
+
+  /// Metadata of encoded columns in their original order.
+  std::vector<KeyColumnMetadata> column_metadatas;
+
+  /// Order in which fields are encoded.
+  std::vector<uint32_t> column_order;
+
+  /// Offsets within a row to fields in their encoding order.
+  std::vector<uint32_t> column_offsets;
+
+  /// Rounding up offset to the nearest multiple of alignment value.
+  /// Alignment must be a power of 2.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int required_alignment) {
+    ARROW_DCHECK(ARROW_POPCOUNT64(required_alignment) == 1);
+    return static_cast<uint32_t>((-static_cast<int32_t>(offset)) &
+                                 (required_alignment - 1));
+  }
+
+  /// Rounding up offset to the beginning of next column,
+  /// choosing required alignment based on the data type of that column.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int string_alignment,
+                                               const KeyColumnMetadata& col_metadata) {
+    if (!col_metadata.is_fixed_length ||
+        ARROW_POPCOUNT64(col_metadata.fixed_length) <= 1) {
+      return 0;
+    } else {
+      return padding_for_alignment(offset, string_alignment);
+    }
+  }
+
+  /// Returns an array of offsets within a row of ends of varbinary fields.
+  inline const uint32_t* varbinary_end_array(const uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<const uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// \brief An array of mutable offsets within a row of ends of varbinary fields.
+  inline uint32_t* varbinary_end_array(uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// Returns the offset within the row and length of the first varbinary field.
+  inline void first_varbinary_offset_and_length(const uint8_t* row, uint32_t* offset,
+                                                uint32_t* length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    *offset = fixed_length;
+    *length = varbinary_end_array(row)[0] - fixed_length;
+  }
+
+  /// Returns the offset within the row and length of the second and further varbinary
+  /// fields.
+  inline void nth_varbinary_offset_and_length(const uint8_t* row, int varbinary_id,
+                                              uint32_t* out_offset,
+                                              uint32_t* out_length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    ARROW_DCHECK(varbinary_id > 0);
+    const uint32_t* varbinary_end = varbinary_end_array(row);
+    uint32_t offset = varbinary_end[varbinary_id - 1];
+    offset += padding_for_alignment(offset, string_alignment);
+    *out_offset = offset;
+    *out_length = varbinary_end[varbinary_id] - offset;
+  }
+
+  uint32_t encoded_field_order(uint32_t icol) const { return column_order[icol]; }
+
+  uint32_t encoded_field_offset(uint32_t icol) const { return column_offsets[icol]; }
+
+  uint32_t num_cols() const { return static_cast<uint32_t>(column_metadatas.size()); }
+
+  uint32_t num_varbinary_cols() const;
+
+  /// \brief Populate this instance to describe `cols` with the given alignment
+  void FromColumnMetadataVector(const std::vector<KeyColumnMetadata>& cols,
+                                int in_row_alignment, int in_string_alignment);
+
+  /// \brief True if `other` has the same number of columns
+  ///   and each column has the same width (two variable length
+  ///   columns are considered to have the same width)
+  bool is_compatible(const RowTableMetadata& other) const;
+};
+
+/// \brief A table of data stored in row-major order
+///
+/// Can only store non-nested data types
+///
+/// Can store both fixed-size data types and variable-length data types
+class ARROW_EXPORT RowTableImpl {
+ public:
+  RowTableImpl();
+  /// \brief Initialize a row array for use
+  ///
+  /// This must be called before any other method
+  Status Init(MemoryPool* pool, const RowTableMetadata& metadata);
+  /// \brief Clear all rows from the table
+  ///
+  /// Does not shrink buffers
+  void Clean();
+  /// \brief Add empty rows
+  /// \param num_rows_to_append The number of empty rows to append
+  /// \param num_extra_bytes_to_append For tables storing variable-length data this
+  ///     should be a guess of how many data bytes will be needed to populate the
+  ///     data.  This is ignored if there are no variable-length columns
+  Status AppendEmpty(uint32_t num_rows_to_append, uint32_t num_extra_bytes_to_append);
+  /// \brief Append rows from a source table
+  /// \param from The table to append from
+  /// \param num_rows_to_append The number of rows to append
+  /// \param source_row_ids Indices (into `from`) of the desired rows
+  Status AppendSelectionFrom(const RowTableImpl& from, uint32_t num_rows_to_append,
+                             const uint16_t* source_row_ids);
+  /// \brief Metadata describing the data stored in this table
+  const RowTableMetadata& metadata() const { return metadata_; }
+  /// \brief The number of rows stored in the table
+  int64_t length() const { return num_rows_; }
+  // Accessors into the table's buffers
+  const uint8_t* data(int i) const {
+    ARROW_DCHECK(i >= 0 && i <= max_buffers_);
+    return buffers_[i];
+  }
+  uint8_t* mutable_data(int i) {
+    ARROW_DCHECK(i >= 0 && i <= max_buffers_);
+    return mutable_buffers_[i];
+  }
+  const uint32_t* offsets() const { return reinterpret_cast<const uint32_t*>(data(1)); }
+  uint32_t* mutable_offsets() { return reinterpret_cast<uint32_t*>(mutable_data(1)); }
+  const uint8_t* null_masks() const { return null_masks_->data(); }
+  uint8_t* null_masks() { return null_masks_->mutable_data(); }
+
+  /// \brief True if there is a null value anywhere in the table
+  ///
+  /// This calculation is memoized based on the number of rows and assumes
+  /// that values are only appended (and not modified in place) between
+  /// successive calls
+  bool has_any_nulls(const LightContext* ctx) const;
+
+ private:
+  Status ResizeFixedLengthBuffers(int64_t num_extra_rows);
+  Status ResizeOptionalVaryingLengthBuffer(int64_t num_extra_bytes);
+
+  // Helper functions to determine the number of bytes needed for each
+  // buffer given a number of rows.
+  int64_t size_null_masks(int64_t num_rows);
+  int64_t size_offsets(int64_t num_rows);
+  int64_t size_rows_fixed_length(int64_t num_rows);
+  int64_t size_rows_varying_length(int64_t num_bytes);
+  // Called after resize to fix pointers
+  void update_buffer_pointers();
+
+  static constexpr int64_t padding_for_vectors = 64;

Review Comment:
   I agree that this change should be made but I'd recommend doing so in follow up; I'd prefer to keep this refactor move-only since it's large as it is



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #13218: ARROW-16590: [C++] Consolidate files dealing with row-major storage

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #13218:
URL: https://github.com/apache/arrow/pull/13218#issuecomment-1135203213

   CI failures appear unrelated.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #13218: ARROW-16590: [C++] Consolidate files dealing with row-major storage

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #13218:
URL: https://github.com/apache/arrow/pull/13218#discussion_r888562981


##########
cpp/src/arrow/compute/exec/key_hash.h:
##########
@@ -45,8 +45,8 @@ class ARROW_EXPORT Hashing32 {
   friend void TestBloomSmall(BloomFilterBuildStrategy, int64_t, int, bool, bool);
 
  public:
-  static void HashMultiColumn(const std::vector<KeyColumnArray>& cols,
-                              KeyEncoder::KeyEncoderContext* ctx, uint32_t* out_hash);
+  static void HashMultiColumn(const std::vector<KeyColumnArray>& cols, LightContext* ctx,
+                              uint32_t* out_hash);

Review Comment:
   Yes, that is what it is.  It is essentially a namespace to distinguish between 32bit and 64bit implementations.  `Hashing32::HashBatch` will hash rows into `uint32_t` while `Hashing64::HashBatch` will hash rows into `uint64_t`.  Would a namespace be an option? (e.g. `arrow::compute::hash32::HashBatch`)
   
   Alternatively, I suppose we could rename all the functions (e.g. `arrow::compute::HashBatch32` and `arrow::compute::HashBatch64`).
   
   Or we could template all the functions (e.g. `arrow::compute::HashBatch<uint32_t>` and `arrow::compute::HashBatch<uint64_t>`)
   
   Do we have a strong style preference here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #13218: ARROW-16590: [C++] Consolidate files dealing with row-major storage

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #13218:
URL: https://github.com/apache/arrow/pull/13218#discussion_r888574094


##########
cpp/src/arrow/compute/row/row_internal.h:
##########
@@ -0,0 +1,244 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+#include "arrow/buffer.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+namespace compute {
+
+/// Description of the data stored in a RowTable
+struct ARROW_EXPORT RowTableMetadata {
+  /// \brief True if there are no variable length columns in the table
+  bool is_fixed_length;
+
+  /// For a fixed-length binary row, common size of rows in bytes,
+  /// rounded up to the multiple of alignment.
+  ///
+  /// For a varying-length binary, size of all encoded fixed-length key columns,
+  /// including lengths of varying-length columns, rounded up to the multiple of string
+  /// alignment.
+  uint32_t fixed_length;
+
+  /// Offset within a row to the array of 32-bit offsets within a row of
+  /// ends of varbinary fields.
+  /// Used only when the row is not fixed-length, zero for fixed-length row.
+  /// There are N elements for N varbinary fields.
+  /// Each element is the offset within a row of the first byte after
+  /// the corresponding varbinary field bytes in that row.
+  /// If varbinary fields begin at aligned addresses, than the end of the previous
+  /// varbinary field needs to be rounded up according to the specified alignment
+  /// to obtain the beginning of the next varbinary field.
+  /// The first varbinary field starts at offset specified by fixed_length,
+  /// which should already be aligned.
+  uint32_t varbinary_end_array_offset;
+
+  /// Fixed number of bytes per row that are used to encode null masks.
+  /// Null masks indicate for a single row which of its columns are null.
+  /// Nth bit in the sequence of bytes assigned to a row represents null
+  /// information for Nth field according to the order in which they are encoded.
+  int null_masks_bytes_per_row;
+
+  /// Power of 2. Every row will start at an offset aligned to that number of bytes.
+  int row_alignment;
+
+  /// Power of 2. Must be no greater than row alignment.
+  /// Every non-power-of-2 binary field and every varbinary field bytes
+  /// will start aligned to that number of bytes.
+  int string_alignment;
+
+  /// Metadata of encoded columns in their original order.
+  std::vector<KeyColumnMetadata> column_metadatas;
+
+  /// Order in which fields are encoded.
+  std::vector<uint32_t> column_order;
+
+  /// Offsets within a row to fields in their encoding order.
+  std::vector<uint32_t> column_offsets;
+
+  /// Rounding up offset to the nearest multiple of alignment value.
+  /// Alignment must be a power of 2.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int required_alignment) {
+    ARROW_DCHECK(ARROW_POPCOUNT64(required_alignment) == 1);
+    return static_cast<uint32_t>((-static_cast<int32_t>(offset)) &
+                                 (required_alignment - 1));
+  }
+
+  /// Rounding up offset to the beginning of next column,
+  /// choosing required alignment based on the data type of that column.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int string_alignment,
+                                               const KeyColumnMetadata& col_metadata) {
+    if (!col_metadata.is_fixed_length ||
+        ARROW_POPCOUNT64(col_metadata.fixed_length) <= 1) {
+      return 0;
+    } else {
+      return padding_for_alignment(offset, string_alignment);
+    }
+  }
+
+  /// Returns an array of offsets within a row of ends of varbinary fields.
+  inline const uint32_t* varbinary_end_array(const uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<const uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// \brief An array of mutable offsets within a row of ends of varbinary fields.
+  inline uint32_t* varbinary_end_array(uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// Returns the offset within the row and length of the first varbinary field.
+  inline void first_varbinary_offset_and_length(const uint8_t* row, uint32_t* offset,
+                                                uint32_t* length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    *offset = fixed_length;
+    *length = varbinary_end_array(row)[0] - fixed_length;
+  }
+
+  /// Returns the offset within the row and length of the second and further varbinary
+  /// fields.
+  inline void nth_varbinary_offset_and_length(const uint8_t* row, int varbinary_id,
+                                              uint32_t* out_offset,
+                                              uint32_t* out_length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    ARROW_DCHECK(varbinary_id > 0);
+    const uint32_t* varbinary_end = varbinary_end_array(row);
+    uint32_t offset = varbinary_end[varbinary_id - 1];
+    offset += padding_for_alignment(offset, string_alignment);
+    *out_offset = offset;
+    *out_length = varbinary_end[varbinary_id] - offset;
+  }
+
+  uint32_t encoded_field_order(uint32_t icol) const { return column_order[icol]; }
+
+  uint32_t encoded_field_offset(uint32_t icol) const { return column_offsets[icol]; }
+
+  uint32_t num_cols() const { return static_cast<uint32_t>(column_metadatas.size()); }
+
+  uint32_t num_varbinary_cols() const;
+
+  /// \brief Populate this instance to describe `cols` with the given alignment
+  void FromColumnMetadataVector(const std::vector<KeyColumnMetadata>& cols,
+                                int in_row_alignment, int in_string_alignment);
+
+  /// \brief True if `other` has the same number of columns
+  ///   and each column has the same width (two variable length
+  ///   columns are considered to have the same width)
+  bool is_compatible(const RowTableMetadata& other) const;
+};
+
+/// \brief A table of data stored in row-major order
+///
+/// Can only store non-nested data types
+///
+/// Can store both fixed-size data types and variable-length data types
+class ARROW_EXPORT RowTableImpl {
+ public:
+  RowTableImpl();
+  /// \brief Initialize a row array for use
+  ///
+  /// This must be called before any other method
+  Status Init(MemoryPool* pool, const RowTableMetadata& metadata);
+  /// \brief Clear all rows from the table
+  ///
+  /// Does not shrink buffers
+  void Clean();
+  /// \brief Add empty rows
+  /// \param num_rows_to_append The number of empty rows to append
+  /// \param num_extra_bytes_to_append For tables storing variable-length data this
+  ///     should be a guess of how many data bytes will be needed to populate the
+  ///     data.  This is ignored if there are no variable-length columns
+  Status AppendEmpty(uint32_t num_rows_to_append, uint32_t num_extra_bytes_to_append);
+  /// \brief Append rows from a source table
+  /// \param from The table to append from
+  /// \param num_rows_to_append The number of rows to append
+  /// \param source_row_ids Indices (into `from`) of the desired rows
+  Status AppendSelectionFrom(const RowTableImpl& from, uint32_t num_rows_to_append,
+                             const uint16_t* source_row_ids);
+  /// \brief Metadata describing the data stored in this table
+  const RowTableMetadata& metadata() const { return metadata_; }
+  /// \brief The number of rows stored in the table
+  int64_t length() const { return num_rows_; }
+  // Accessors into the table's buffers
+  const uint8_t* data(int i) const {
+    ARROW_DCHECK(i >= 0 && i <= max_buffers_);
+    return buffers_[i];
+  }
+  uint8_t* mutable_data(int i) {
+    ARROW_DCHECK(i >= 0 && i <= max_buffers_);
+    return mutable_buffers_[i];
+  }
+  const uint32_t* offsets() const { return reinterpret_cast<const uint32_t*>(data(1)); }
+  uint32_t* mutable_offsets() { return reinterpret_cast<uint32_t*>(mutable_data(1)); }
+  const uint8_t* null_masks() const { return null_masks_->data(); }
+  uint8_t* null_masks() { return null_masks_->mutable_data(); }
+
+  /// \brief True if there is a null value anywhere in the table
+  ///
+  /// This calculation is memoized based on the number of rows and assumes
+  /// that values are only appended (and not modified in place) between
+  /// successive calls
+  bool has_any_nulls(const LightContext* ctx) const;
+
+ private:
+  Status ResizeFixedLengthBuffers(int64_t num_extra_rows);
+  Status ResizeOptionalVaryingLengthBuffer(int64_t num_extra_bytes);
+
+  // Helper functions to determine the number of bytes needed for each
+  // buffer given a number of rows.
+  int64_t size_null_masks(int64_t num_rows);
+  int64_t size_offsets(int64_t num_rows);
+  int64_t size_rows_fixed_length(int64_t num_rows);
+  int64_t size_rows_varying_length(int64_t num_bytes);
+  // Called after resize to fix pointers
+  void update_buffer_pointers();
+
+  static constexpr int64_t padding_for_vectors = 64;

Review Comment:
   I went ahead and did the rename.  It's a private constant so the scope should be pretty minimal.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wesm commented on a diff in pull request #13218: ARROW-16590: [C++] Consolidate files dealing with row-major storage

Posted by GitBox <gi...@apache.org>.
wesm commented on code in PR #13218:
URL: https://github.com/apache/arrow/pull/13218#discussion_r896084320


##########
cpp/src/arrow/compute/row/row_internal.h:
##########
@@ -0,0 +1,250 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+#include "arrow/buffer.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+namespace compute {
+
+/// Description of the data stored in a RowTable
+struct ARROW_EXPORT RowTableMetadata {
+  /// \brief True if there are no variable length columns in the table
+  bool is_fixed_length;
+
+  /// For a fixed-length binary row, common size of rows in bytes,
+  /// rounded up to the multiple of alignment.
+  ///
+  /// For a varying-length binary, size of all encoded fixed-length key columns,
+  /// including lengths of varying-length columns, rounded up to the multiple of string
+  /// alignment.
+  uint32_t fixed_length;
+
+  /// Offset within a row to the array of 32-bit offsets within a row of
+  /// ends of varbinary fields.
+  /// Used only when the row is not fixed-length, zero for fixed-length row.
+  /// There are N elements for N varbinary fields.
+  /// Each element is the offset within a row of the first byte after
+  /// the corresponding varbinary field bytes in that row.
+  /// If varbinary fields begin at aligned addresses, than the end of the previous
+  /// varbinary field needs to be rounded up according to the specified alignment
+  /// to obtain the beginning of the next varbinary field.
+  /// The first varbinary field starts at offset specified by fixed_length,
+  /// which should already be aligned.
+  uint32_t varbinary_end_array_offset;
+
+  /// Fixed number of bytes per row that are used to encode null masks.
+  /// Null masks indicate for a single row which of its columns are null.
+  /// Nth bit in the sequence of bytes assigned to a row represents null
+  /// information for Nth field according to the order in which they are encoded.
+  int null_masks_bytes_per_row;
+
+  /// Power of 2. Every row will start at an offset aligned to that number of bytes.
+  int row_alignment;
+
+  /// Power of 2. Must be no greater than row alignment.
+  /// Every non-power-of-2 binary field and every varbinary field bytes
+  /// will start aligned to that number of bytes.
+  int string_alignment;
+
+  /// Metadata of encoded columns in their original order.
+  std::vector<KeyColumnMetadata> column_metadatas;
+
+  /// Order in which fields are encoded.
+  std::vector<uint32_t> column_order;
+
+  /// Offsets within a row to fields in their encoding order.
+  std::vector<uint32_t> column_offsets;
+
+  /// Rounding up offset to the nearest multiple of alignment value.
+  /// Alignment must be a power of 2.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int required_alignment) {
+    ARROW_DCHECK(ARROW_POPCOUNT64(required_alignment) == 1);
+    return static_cast<uint32_t>((-static_cast<int32_t>(offset)) &
+                                 (required_alignment - 1));
+  }
+
+  /// Rounding up offset to the beginning of next column,
+  /// choosing required alignment based on the data type of that column.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int string_alignment,
+                                               const KeyColumnMetadata& col_metadata) {
+    if (!col_metadata.is_fixed_length ||
+        ARROW_POPCOUNT64(col_metadata.fixed_length) <= 1) {
+      return 0;
+    } else {
+      return padding_for_alignment(offset, string_alignment);
+    }
+  }
+
+  /// Returns an array of offsets within a row of ends of varbinary fields.
+  inline const uint32_t* varbinary_end_array(const uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<const uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// \brief An array of mutable offsets within a row of ends of varbinary fields.
+  inline uint32_t* varbinary_end_array(uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// Returns the offset within the row and length of the first varbinary field.
+  inline void first_varbinary_offset_and_length(const uint8_t* row, uint32_t* offset,
+                                                uint32_t* length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    *offset = fixed_length;
+    *length = varbinary_end_array(row)[0] - fixed_length;
+  }
+
+  /// Returns the offset within the row and length of the second and further varbinary
+  /// fields.
+  inline void nth_varbinary_offset_and_length(const uint8_t* row, int varbinary_id,
+                                              uint32_t* out_offset,
+                                              uint32_t* out_length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    ARROW_DCHECK(varbinary_id > 0);
+    const uint32_t* varbinary_end = varbinary_end_array(row);
+    uint32_t offset = varbinary_end[varbinary_id - 1];
+    offset += padding_for_alignment(offset, string_alignment);
+    *out_offset = offset;
+    *out_length = varbinary_end[varbinary_id] - offset;
+  }
+
+  uint32_t encoded_field_order(uint32_t icol) const { return column_order[icol]; }
+
+  uint32_t encoded_field_offset(uint32_t icol) const { return column_offsets[icol]; }
+
+  uint32_t num_cols() const { return static_cast<uint32_t>(column_metadatas.size()); }
+
+  uint32_t num_varbinary_cols() const;
+
+  /// \brief Populate this instance to describe `cols` with the given alignment
+  void FromColumnMetadataVector(const std::vector<KeyColumnMetadata>& cols,
+                                int in_row_alignment, int in_string_alignment);
+
+  /// \brief True if `other` has the same number of columns
+  ///   and each column has the same width (two variable length
+  ///   columns are considered to have the same width)
+  bool is_compatible(const RowTableMetadata& other) const;
+};
+
+/// \brief A table of data stored in row-major order
+///
+/// Can only store non-nested data types
+///
+/// Can store both fixed-size data types and variable-length data types
+///
+/// The row table is not safe
+class ARROW_EXPORT RowTableImpl {
+ public:
+  RowTableImpl();
+  /// \brief Initialize a row array for use
+  ///
+  /// This must be called before any other method
+  Status Init(MemoryPool* pool, const RowTableMetadata& metadata);
+  /// \brief Clear all rows from the table
+  ///
+  /// Does not shrink buffers
+  void Clean();
+  /// \brief Add empty rows
+  /// \param num_rows_to_append The number of empty rows to append
+  /// \param num_extra_bytes_to_append For tables storing variable-length data this
+  ///     should be a guess of how many data bytes will be needed to populate the
+  ///     data.  This is ignored if there are no variable-length columns
+  Status AppendEmpty(uint32_t num_rows_to_append, uint32_t num_extra_bytes_to_append);
+  /// \brief Append rows from a source table
+  /// \param from The table to append from
+  /// \param num_rows_to_append The number of rows to append
+  /// \param source_row_ids Indices (into `from`) of the desired rows
+  Status AppendSelectionFrom(const RowTableImpl& from, uint32_t num_rows_to_append,
+                             const uint16_t* source_row_ids);
+  /// \brief Metadata describing the data stored in this table
+  const RowTableMetadata& metadata() const { return metadata_; }
+  /// \brief The number of rows stored in the table
+  int64_t length() const { return num_rows_; }
+  // Accessors into the table's buffers
+  const uint8_t* data(int i) const {
+    ARROW_DCHECK(i >= 0 && i < kMaxBuffers);
+    return buffers_[i];
+  }
+  uint8_t* mutable_data(int i) {
+    ARROW_DCHECK(i >= 0 && i < kMaxBuffers);
+    return mutable_buffers_[i];
+  }
+  const uint32_t* offsets() const { return reinterpret_cast<const uint32_t*>(data(1)); }
+  uint32_t* mutable_offsets() { return reinterpret_cast<uint32_t*>(mutable_data(1)); }
+  const uint8_t* null_masks() const { return null_masks_->data(); }
+  uint8_t* null_masks() { return null_masks_->mutable_data(); }
+
+  /// \brief True if there is a null value anywhere in the table
+  ///
+  /// This calculation is memoized based on the number of rows and assumes
+  /// that values are only appended (and not modified in place) between
+  /// successive calls
+  bool has_any_nulls(const LightContext* ctx) const;
+
+ private:
+  Status ResizeFixedLengthBuffers(int64_t num_extra_rows);
+  Status ResizeOptionalVaryingLengthBuffer(int64_t num_extra_bytes);
+
+  // Helper functions to determine the number of bytes needed for each
+  // buffer given a number of rows.
+  int64_t size_null_masks(int64_t num_rows);
+  int64_t size_offsets(int64_t num_rows);
+  int64_t size_rows_fixed_length(int64_t num_rows);
+  int64_t size_rows_varying_length(int64_t num_bytes);

Review Comment:
   yes, fixing



##########
cpp/src/arrow/compute/row/row_internal.h:
##########
@@ -0,0 +1,250 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+#include "arrow/buffer.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+namespace compute {
+
+/// Description of the data stored in a RowTable
+struct ARROW_EXPORT RowTableMetadata {
+  /// \brief True if there are no variable length columns in the table
+  bool is_fixed_length;
+
+  /// For a fixed-length binary row, common size of rows in bytes,
+  /// rounded up to the multiple of alignment.
+  ///
+  /// For a varying-length binary, size of all encoded fixed-length key columns,
+  /// including lengths of varying-length columns, rounded up to the multiple of string
+  /// alignment.
+  uint32_t fixed_length;
+
+  /// Offset within a row to the array of 32-bit offsets within a row of
+  /// ends of varbinary fields.
+  /// Used only when the row is not fixed-length, zero for fixed-length row.
+  /// There are N elements for N varbinary fields.
+  /// Each element is the offset within a row of the first byte after
+  /// the corresponding varbinary field bytes in that row.
+  /// If varbinary fields begin at aligned addresses, than the end of the previous
+  /// varbinary field needs to be rounded up according to the specified alignment
+  /// to obtain the beginning of the next varbinary field.
+  /// The first varbinary field starts at offset specified by fixed_length,
+  /// which should already be aligned.
+  uint32_t varbinary_end_array_offset;
+
+  /// Fixed number of bytes per row that are used to encode null masks.
+  /// Null masks indicate for a single row which of its columns are null.
+  /// Nth bit in the sequence of bytes assigned to a row represents null
+  /// information for Nth field according to the order in which they are encoded.
+  int null_masks_bytes_per_row;
+
+  /// Power of 2. Every row will start at an offset aligned to that number of bytes.
+  int row_alignment;
+
+  /// Power of 2. Must be no greater than row alignment.
+  /// Every non-power-of-2 binary field and every varbinary field bytes
+  /// will start aligned to that number of bytes.
+  int string_alignment;
+
+  /// Metadata of encoded columns in their original order.
+  std::vector<KeyColumnMetadata> column_metadatas;
+
+  /// Order in which fields are encoded.
+  std::vector<uint32_t> column_order;
+
+  /// Offsets within a row to fields in their encoding order.
+  std::vector<uint32_t> column_offsets;
+
+  /// Rounding up offset to the nearest multiple of alignment value.
+  /// Alignment must be a power of 2.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int required_alignment) {
+    ARROW_DCHECK(ARROW_POPCOUNT64(required_alignment) == 1);
+    return static_cast<uint32_t>((-static_cast<int32_t>(offset)) &
+                                 (required_alignment - 1));
+  }
+
+  /// Rounding up offset to the beginning of next column,
+  /// choosing required alignment based on the data type of that column.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int string_alignment,
+                                               const KeyColumnMetadata& col_metadata) {
+    if (!col_metadata.is_fixed_length ||
+        ARROW_POPCOUNT64(col_metadata.fixed_length) <= 1) {
+      return 0;
+    } else {
+      return padding_for_alignment(offset, string_alignment);
+    }
+  }
+
+  /// Returns an array of offsets within a row of ends of varbinary fields.
+  inline const uint32_t* varbinary_end_array(const uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<const uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// \brief An array of mutable offsets within a row of ends of varbinary fields.
+  inline uint32_t* varbinary_end_array(uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// Returns the offset within the row and length of the first varbinary field.
+  inline void first_varbinary_offset_and_length(const uint8_t* row, uint32_t* offset,
+                                                uint32_t* length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    *offset = fixed_length;
+    *length = varbinary_end_array(row)[0] - fixed_length;
+  }
+
+  /// Returns the offset within the row and length of the second and further varbinary
+  /// fields.
+  inline void nth_varbinary_offset_and_length(const uint8_t* row, int varbinary_id,
+                                              uint32_t* out_offset,
+                                              uint32_t* out_length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    ARROW_DCHECK(varbinary_id > 0);
+    const uint32_t* varbinary_end = varbinary_end_array(row);
+    uint32_t offset = varbinary_end[varbinary_id - 1];
+    offset += padding_for_alignment(offset, string_alignment);
+    *out_offset = offset;
+    *out_length = varbinary_end[varbinary_id] - offset;
+  }
+
+  uint32_t encoded_field_order(uint32_t icol) const { return column_order[icol]; }
+
+  uint32_t encoded_field_offset(uint32_t icol) const { return column_offsets[icol]; }
+
+  uint32_t num_cols() const { return static_cast<uint32_t>(column_metadatas.size()); }
+
+  uint32_t num_varbinary_cols() const;
+
+  /// \brief Populate this instance to describe `cols` with the given alignment
+  void FromColumnMetadataVector(const std::vector<KeyColumnMetadata>& cols,
+                                int in_row_alignment, int in_string_alignment);
+
+  /// \brief True if `other` has the same number of columns
+  ///   and each column has the same width (two variable length
+  ///   columns are considered to have the same width)
+  bool is_compatible(const RowTableMetadata& other) const;
+};
+
+/// \brief A table of data stored in row-major order
+///
+/// Can only store non-nested data types
+///
+/// Can store both fixed-size data types and variable-length data types
+///
+/// The row table is not safe
+class ARROW_EXPORT RowTableImpl {
+ public:
+  RowTableImpl();
+  /// \brief Initialize a row array for use
+  ///
+  /// This must be called before any other method
+  Status Init(MemoryPool* pool, const RowTableMetadata& metadata);
+  /// \brief Clear all rows from the table
+  ///
+  /// Does not shrink buffers
+  void Clean();
+  /// \brief Add empty rows
+  /// \param num_rows_to_append The number of empty rows to append
+  /// \param num_extra_bytes_to_append For tables storing variable-length data this
+  ///     should be a guess of how many data bytes will be needed to populate the
+  ///     data.  This is ignored if there are no variable-length columns
+  Status AppendEmpty(uint32_t num_rows_to_append, uint32_t num_extra_bytes_to_append);
+  /// \brief Append rows from a source table
+  /// \param from The table to append from
+  /// \param num_rows_to_append The number of rows to append
+  /// \param source_row_ids Indices (into `from`) of the desired rows
+  Status AppendSelectionFrom(const RowTableImpl& from, uint32_t num_rows_to_append,
+                             const uint16_t* source_row_ids);
+  /// \brief Metadata describing the data stored in this table
+  const RowTableMetadata& metadata() const { return metadata_; }
+  /// \brief The number of rows stored in the table
+  int64_t length() const { return num_rows_; }
+  // Accessors into the table's buffers
+  const uint8_t* data(int i) const {
+    ARROW_DCHECK(i >= 0 && i < kMaxBuffers);
+    return buffers_[i];
+  }
+  uint8_t* mutable_data(int i) {
+    ARROW_DCHECK(i >= 0 && i < kMaxBuffers);
+    return mutable_buffers_[i];
+  }
+  const uint32_t* offsets() const { return reinterpret_cast<const uint32_t*>(data(1)); }
+  uint32_t* mutable_offsets() { return reinterpret_cast<uint32_t*>(mutable_data(1)); }
+  const uint8_t* null_masks() const { return null_masks_->data(); }
+  uint8_t* null_masks() { return null_masks_->mutable_data(); }
+
+  /// \brief True if there is a null value anywhere in the table
+  ///
+  /// This calculation is memoized based on the number of rows and assumes
+  /// that values are only appended (and not modified in place) between
+  /// successive calls
+  bool has_any_nulls(const LightContext* ctx) const;
+
+ private:
+  Status ResizeFixedLengthBuffers(int64_t num_extra_rows);
+  Status ResizeOptionalVaryingLengthBuffer(int64_t num_extra_bytes);
+
+  // Helper functions to determine the number of bytes needed for each
+  // buffer given a number of rows.
+  int64_t size_null_masks(int64_t num_rows);
+  int64_t size_offsets(int64_t num_rows);
+  int64_t size_rows_fixed_length(int64_t num_rows);
+  int64_t size_rows_varying_length(int64_t num_bytes);
+  // Called after resize to fix pointers
+  void update_buffer_pointers();

Review Comment:
   yes



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #13218: ARROW-16590: [C++] Consolidate files dealing with row-major storage

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #13218:
URL: https://github.com/apache/arrow/pull/13218#discussion_r888563914


##########
cpp/src/arrow/compute/row/encode_internal.h:
##########
@@ -0,0 +1,323 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <cstdint>
+#include <memory>
+#include <vector>
+
+#include "arrow/array/data.h"
+#include "arrow/compute/exec.h"
+#include "arrow/compute/exec/util.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/compute/row/row_internal.h"
+#include "arrow/memory_pool.h"
+#include "arrow/result.h"
+#include "arrow/status.h"
+#include "arrow/util/bit_util.h"
+
+namespace arrow {
+namespace compute {
+
+/// Converts between Arrow's typical column representation to a row-based representation
+///
+/// Data is stored as a single array of rows.  Each row combines data from all columns.
+/// The conversion is reversible.
+///
+/// Row-oriented storage is beneficial when there is a need for random access
+/// of individual rows and at the same time all included columns are likely to
+/// be accessed together, as in the case of hash table key.
+///
+/// Does not support nested types
+class RowTableEncoder {
+ public:
+  void Init(const std::vector<KeyColumnMetadata>& cols, LightContext* ctx,
+            int row_alignment, int string_alignment);
+
+  const RowTableMetadata& row_metadata() { return row_metadata_; }
+  // GrouperFastImpl right now needs somewhat intrusive visibility into RowTableEncoder
+  // This could be cleaned up at some point
+  const std::vector<KeyColumnArray>& batch_all_cols() { return batch_all_cols_; }
+
+  /// \brief Prepare to encode a collection of columns
+  /// \param start_row The starting row to encode
+  /// \param num_rows The number of rows to encode
+  /// \param cols The columns to encode.  The order of the columns should
+  ///             be consistent with the order used to create the RowTableMetadata
+  void PrepareEncodeSelected(int64_t start_row, int64_t num_rows,
+                             const std::vector<KeyColumnArray>& cols);
+  /// \brief Encode selection of prepared rows into a row table
+  /// \param rows The output row table
+  /// \param num_selected The number of rows to encode
+  /// \param selection indices of the rows to encode
+  Status EncodeSelected(RowTableImpl* rows, uint32_t num_selected,
+                        const uint16_t* selection);
+
+  /// \brief Decode a window of row oriented data into a corresponding
+  ///        window of column oriented storage.
+  /// \param start_row_input The starting row to decode
+  /// \param start_row_output An offset into the output array to write to
+  /// \param num_rows The number of rows to decode
+  /// \param rows The row table to decode from
+  /// \param cols The columns to decode into, should be sized appropriately
+  ///
+  /// The output buffers need to be correctly allocated and sized before
+  /// calling each method.  For that reason decoding is split into two functions.
+  /// DecodeFixedLengthBuffers processes everything except for varying length
+  /// buffers.
+  /// The output can be used to find out required varying length buffers sizes
+  /// for the call to DecodeVaryingLengthBuffers
+  void DecodeFixedLengthBuffers(int64_t start_row_input, int64_t start_row_output,
+                                int64_t num_rows, const RowTableImpl& rows,
+                                std::vector<KeyColumnArray>* cols);
+
+  /// \brief Decode the varlength columns of a row table into column storage
+  /// \param start_row_input The starting row to decode
+  /// \param start_row_output An offset into the output arrays
+  /// \param num_rows The number of rows to decode
+  /// \param rows The row table to decode from
+  /// \param cols The column arrays to decode into
+  void DecodeVaryingLengthBuffers(int64_t start_row_input, int64_t start_row_output,
+                                  int64_t num_rows, const RowTableImpl& rows,
+                                  std::vector<KeyColumnArray>* cols);
+
+ private:
+  /// Prepare column array vectors.
+  /// Output column arrays represent a range of input column arrays
+  /// specified by starting row and number of rows.
+  /// Three vectors are generated:
+  /// - all columns
+  /// - fixed-length columns only
+  /// - varying-length columns only
+  void PrepareKeyColumnArrays(int64_t start_row, int64_t num_rows,
+                              const std::vector<KeyColumnArray>& cols_in);
+
+  LightContext* ctx_;
+
+  // Data initialized once, based on data types of key columns
+  RowTableMetadata row_metadata_;
+
+  // Data initialized for each input batch.
+  // All elements are ordered according to the order of encoded fields in a row.
+  std::vector<KeyColumnArray> batch_all_cols_;
+  std::vector<KeyColumnArray> batch_varbinary_cols_;
+  std::vector<uint32_t> batch_varbinary_cols_base_offsets_;
+};
+
+class EncoderInteger {

Review Comment:
   Some don't.  Any of the encoders that have an `AVX2` implemented method do I think.  So if I was going to need an internal header anyways it seemed more consistent to just throw them all in.  However, I can prune this down to just the encoders needed if that would be better.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #13218: ARROW-16590: [C++] Consolidate files dealing with row-major storage

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #13218:
URL: https://github.com/apache/arrow/pull/13218#discussion_r888564544


##########
cpp/src/arrow/compute/row/row_internal.h:
##########
@@ -0,0 +1,244 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+#include "arrow/buffer.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+namespace compute {
+
+/// Description of the data stored in a RowTable
+struct ARROW_EXPORT RowTableMetadata {
+  /// \brief True if there are no variable length columns in the table
+  bool is_fixed_length;
+
+  /// For a fixed-length binary row, common size of rows in bytes,
+  /// rounded up to the multiple of alignment.
+  ///
+  /// For a varying-length binary, size of all encoded fixed-length key columns,
+  /// including lengths of varying-length columns, rounded up to the multiple of string
+  /// alignment.
+  uint32_t fixed_length;

Review Comment:
   I'm not sure if there is a particular reason.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #13218: ARROW-16590: [C++] Consolidate files dealing with row-major storage

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #13218:
URL: https://github.com/apache/arrow/pull/13218#discussion_r888574179


##########
cpp/src/arrow/compute/row/row_internal.h:
##########
@@ -0,0 +1,244 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+#include "arrow/buffer.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+namespace compute {
+
+/// Description of the data stored in a RowTable
+struct ARROW_EXPORT RowTableMetadata {
+  /// \brief True if there are no variable length columns in the table
+  bool is_fixed_length;
+
+  /// For a fixed-length binary row, common size of rows in bytes,
+  /// rounded up to the multiple of alignment.
+  ///
+  /// For a varying-length binary, size of all encoded fixed-length key columns,
+  /// including lengths of varying-length columns, rounded up to the multiple of string
+  /// alignment.
+  uint32_t fixed_length;
+
+  /// Offset within a row to the array of 32-bit offsets within a row of
+  /// ends of varbinary fields.
+  /// Used only when the row is not fixed-length, zero for fixed-length row.
+  /// There are N elements for N varbinary fields.
+  /// Each element is the offset within a row of the first byte after
+  /// the corresponding varbinary field bytes in that row.
+  /// If varbinary fields begin at aligned addresses, than the end of the previous
+  /// varbinary field needs to be rounded up according to the specified alignment
+  /// to obtain the beginning of the next varbinary field.
+  /// The first varbinary field starts at offset specified by fixed_length,
+  /// which should already be aligned.
+  uint32_t varbinary_end_array_offset;
+
+  /// Fixed number of bytes per row that are used to encode null masks.
+  /// Null masks indicate for a single row which of its columns are null.
+  /// Nth bit in the sequence of bytes assigned to a row represents null
+  /// information for Nth field according to the order in which they are encoded.
+  int null_masks_bytes_per_row;
+
+  /// Power of 2. Every row will start at an offset aligned to that number of bytes.
+  int row_alignment;
+
+  /// Power of 2. Must be no greater than row alignment.
+  /// Every non-power-of-2 binary field and every varbinary field bytes
+  /// will start aligned to that number of bytes.
+  int string_alignment;
+
+  /// Metadata of encoded columns in their original order.
+  std::vector<KeyColumnMetadata> column_metadatas;
+
+  /// Order in which fields are encoded.
+  std::vector<uint32_t> column_order;
+
+  /// Offsets within a row to fields in their encoding order.
+  std::vector<uint32_t> column_offsets;
+
+  /// Rounding up offset to the nearest multiple of alignment value.
+  /// Alignment must be a power of 2.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int required_alignment) {
+    ARROW_DCHECK(ARROW_POPCOUNT64(required_alignment) == 1);
+    return static_cast<uint32_t>((-static_cast<int32_t>(offset)) &
+                                 (required_alignment - 1));
+  }
+
+  /// Rounding up offset to the beginning of next column,
+  /// choosing required alignment based on the data type of that column.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int string_alignment,
+                                               const KeyColumnMetadata& col_metadata) {
+    if (!col_metadata.is_fixed_length ||
+        ARROW_POPCOUNT64(col_metadata.fixed_length) <= 1) {
+      return 0;
+    } else {
+      return padding_for_alignment(offset, string_alignment);
+    }
+  }
+
+  /// Returns an array of offsets within a row of ends of varbinary fields.
+  inline const uint32_t* varbinary_end_array(const uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<const uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// \brief An array of mutable offsets within a row of ends of varbinary fields.
+  inline uint32_t* varbinary_end_array(uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// Returns the offset within the row and length of the first varbinary field.
+  inline void first_varbinary_offset_and_length(const uint8_t* row, uint32_t* offset,
+                                                uint32_t* length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    *offset = fixed_length;
+    *length = varbinary_end_array(row)[0] - fixed_length;
+  }
+
+  /// Returns the offset within the row and length of the second and further varbinary
+  /// fields.
+  inline void nth_varbinary_offset_and_length(const uint8_t* row, int varbinary_id,
+                                              uint32_t* out_offset,
+                                              uint32_t* out_length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    ARROW_DCHECK(varbinary_id > 0);
+    const uint32_t* varbinary_end = varbinary_end_array(row);
+    uint32_t offset = varbinary_end[varbinary_id - 1];
+    offset += padding_for_alignment(offset, string_alignment);
+    *out_offset = offset;
+    *out_length = varbinary_end[varbinary_id] - offset;
+  }
+
+  uint32_t encoded_field_order(uint32_t icol) const { return column_order[icol]; }
+
+  uint32_t encoded_field_offset(uint32_t icol) const { return column_offsets[icol]; }
+
+  uint32_t num_cols() const { return static_cast<uint32_t>(column_metadatas.size()); }
+
+  uint32_t num_varbinary_cols() const;
+
+  /// \brief Populate this instance to describe `cols` with the given alignment
+  void FromColumnMetadataVector(const std::vector<KeyColumnMetadata>& cols,
+                                int in_row_alignment, int in_string_alignment);
+
+  /// \brief True if `other` has the same number of columns
+  ///   and each column has the same width (two variable length
+  ///   columns are considered to have the same width)
+  bool is_compatible(const RowTableMetadata& other) const;
+};
+
+/// \brief A table of data stored in row-major order
+///
+/// Can only store non-nested data types
+///
+/// Can store both fixed-size data types and variable-length data types
+class ARROW_EXPORT RowTableImpl {
+ public:
+  RowTableImpl();
+  /// \brief Initialize a row array for use
+  ///
+  /// This must be called before any other method
+  Status Init(MemoryPool* pool, const RowTableMetadata& metadata);
+  /// \brief Clear all rows from the table
+  ///
+  /// Does not shrink buffers
+  void Clean();
+  /// \brief Add empty rows
+  /// \param num_rows_to_append The number of empty rows to append
+  /// \param num_extra_bytes_to_append For tables storing variable-length data this
+  ///     should be a guess of how many data bytes will be needed to populate the
+  ///     data.  This is ignored if there are no variable-length columns
+  Status AppendEmpty(uint32_t num_rows_to_append, uint32_t num_extra_bytes_to_append);
+  /// \brief Append rows from a source table
+  /// \param from The table to append from
+  /// \param num_rows_to_append The number of rows to append
+  /// \param source_row_ids Indices (into `from`) of the desired rows
+  Status AppendSelectionFrom(const RowTableImpl& from, uint32_t num_rows_to_append,
+                             const uint16_t* source_row_ids);
+  /// \brief Metadata describing the data stored in this table
+  const RowTableMetadata& metadata() const { return metadata_; }
+  /// \brief The number of rows stored in the table
+  int64_t length() const { return num_rows_; }
+  // Accessors into the table's buffers
+  const uint8_t* data(int i) const {
+    ARROW_DCHECK(i >= 0 && i <= max_buffers_);
+    return buffers_[i];
+  }
+  uint8_t* mutable_data(int i) {
+    ARROW_DCHECK(i >= 0 && i <= max_buffers_);
+    return mutable_buffers_[i];
+  }
+  const uint32_t* offsets() const { return reinterpret_cast<const uint32_t*>(data(1)); }
+  uint32_t* mutable_offsets() { return reinterpret_cast<uint32_t*>(mutable_data(1)); }
+  const uint8_t* null_masks() const { return null_masks_->data(); }
+  uint8_t* null_masks() { return null_masks_->mutable_data(); }
+
+  /// \brief True if there is a null value anywhere in the table
+  ///
+  /// This calculation is memoized based on the number of rows and assumes
+  /// that values are only appended (and not modified in place) between
+  /// successive calls
+  bool has_any_nulls(const LightContext* ctx) const;
+
+ private:
+  Status ResizeFixedLengthBuffers(int64_t num_extra_rows);
+  Status ResizeOptionalVaryingLengthBuffer(int64_t num_extra_bytes);
+
+  // Helper functions to determine the number of bytes needed for each
+  // buffer given a number of rows.
+  int64_t size_null_masks(int64_t num_rows);
+  int64_t size_offsets(int64_t num_rows);
+  int64_t size_rows_fixed_length(int64_t num_rows);
+  int64_t size_rows_varying_length(int64_t num_bytes);
+  // Called after resize to fix pointers
+  void update_buffer_pointers();
+
+  static constexpr int64_t padding_for_vectors = 64;
+  MemoryPool* pool_;
+  RowTableMetadata metadata_;
+  // Buffers can only expand during lifetime and never shrink.
+  std::unique_ptr<ResizableBuffer> null_masks_;
+  // Only used if the table has variable-length columns
+  // Stores the offsets into the binary data
+  std::unique_ptr<ResizableBuffer> offsets_;
+  // Stores the fixed-length parts of the rows
+  std::unique_ptr<ResizableBuffer> rows_;
+  static constexpr int max_buffers_ = 3;
+  const uint8_t* buffers_[max_buffers_];
+  uint8_t* mutable_buffers_[max_buffers_];
+  // The number of rows in the table
+  int64_t num_rows_;
+  // The number of rows that can be stored in the table without resizing
+  int64_t rows_capacity_;
+  // The number of bytes that can be stored in the table without resizing
+  int64_t bytes_capacity_;
+
+  // Mutable to allow lazy evaluation

Review Comment:
   The row table is not thread safe.  I updated the class comment to mention this fact.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wesm commented on a diff in pull request #13218: ARROW-16590: [C++] Consolidate files dealing with row-major storage

Posted by GitBox <gi...@apache.org>.
wesm commented on code in PR #13218:
URL: https://github.com/apache/arrow/pull/13218#discussion_r896088311


##########
cpp/src/arrow/compute/row/row_internal.h:
##########
@@ -0,0 +1,250 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+#include "arrow/buffer.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+namespace compute {
+
+/// Description of the data stored in a RowTable
+struct ARROW_EXPORT RowTableMetadata {
+  /// \brief True if there are no variable length columns in the table
+  bool is_fixed_length;
+
+  /// For a fixed-length binary row, common size of rows in bytes,
+  /// rounded up to the multiple of alignment.
+  ///
+  /// For a varying-length binary, size of all encoded fixed-length key columns,
+  /// including lengths of varying-length columns, rounded up to the multiple of string
+  /// alignment.
+  uint32_t fixed_length;
+
+  /// Offset within a row to the array of 32-bit offsets within a row of
+  /// ends of varbinary fields.
+  /// Used only when the row is not fixed-length, zero for fixed-length row.
+  /// There are N elements for N varbinary fields.
+  /// Each element is the offset within a row of the first byte after
+  /// the corresponding varbinary field bytes in that row.
+  /// If varbinary fields begin at aligned addresses, than the end of the previous
+  /// varbinary field needs to be rounded up according to the specified alignment
+  /// to obtain the beginning of the next varbinary field.
+  /// The first varbinary field starts at offset specified by fixed_length,
+  /// which should already be aligned.
+  uint32_t varbinary_end_array_offset;
+
+  /// Fixed number of bytes per row that are used to encode null masks.
+  /// Null masks indicate for a single row which of its columns are null.
+  /// Nth bit in the sequence of bytes assigned to a row represents null
+  /// information for Nth field according to the order in which they are encoded.
+  int null_masks_bytes_per_row;
+
+  /// Power of 2. Every row will start at an offset aligned to that number of bytes.
+  int row_alignment;
+
+  /// Power of 2. Must be no greater than row alignment.
+  /// Every non-power-of-2 binary field and every varbinary field bytes
+  /// will start aligned to that number of bytes.
+  int string_alignment;
+
+  /// Metadata of encoded columns in their original order.
+  std::vector<KeyColumnMetadata> column_metadatas;
+
+  /// Order in which fields are encoded.
+  std::vector<uint32_t> column_order;
+
+  /// Offsets within a row to fields in their encoding order.
+  std::vector<uint32_t> column_offsets;
+
+  /// Rounding up offset to the nearest multiple of alignment value.
+  /// Alignment must be a power of 2.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int required_alignment) {
+    ARROW_DCHECK(ARROW_POPCOUNT64(required_alignment) == 1);
+    return static_cast<uint32_t>((-static_cast<int32_t>(offset)) &
+                                 (required_alignment - 1));
+  }
+
+  /// Rounding up offset to the beginning of next column,
+  /// choosing required alignment based on the data type of that column.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int string_alignment,
+                                               const KeyColumnMetadata& col_metadata) {
+    if (!col_metadata.is_fixed_length ||
+        ARROW_POPCOUNT64(col_metadata.fixed_length) <= 1) {
+      return 0;
+    } else {
+      return padding_for_alignment(offset, string_alignment);
+    }
+  }
+
+  /// Returns an array of offsets within a row of ends of varbinary fields.
+  inline const uint32_t* varbinary_end_array(const uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<const uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// \brief An array of mutable offsets within a row of ends of varbinary fields.
+  inline uint32_t* varbinary_end_array(uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// Returns the offset within the row and length of the first varbinary field.
+  inline void first_varbinary_offset_and_length(const uint8_t* row, uint32_t* offset,
+                                                uint32_t* length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    *offset = fixed_length;
+    *length = varbinary_end_array(row)[0] - fixed_length;
+  }
+
+  /// Returns the offset within the row and length of the second and further varbinary
+  /// fields.
+  inline void nth_varbinary_offset_and_length(const uint8_t* row, int varbinary_id,
+                                              uint32_t* out_offset,
+                                              uint32_t* out_length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    ARROW_DCHECK(varbinary_id > 0);
+    const uint32_t* varbinary_end = varbinary_end_array(row);
+    uint32_t offset = varbinary_end[varbinary_id - 1];
+    offset += padding_for_alignment(offset, string_alignment);
+    *out_offset = offset;
+    *out_length = varbinary_end[varbinary_id] - offset;
+  }
+
+  uint32_t encoded_field_order(uint32_t icol) const { return column_order[icol]; }
+
+  uint32_t encoded_field_offset(uint32_t icol) const { return column_offsets[icol]; }
+
+  uint32_t num_cols() const { return static_cast<uint32_t>(column_metadatas.size()); }
+
+  uint32_t num_varbinary_cols() const;
+
+  /// \brief Populate this instance to describe `cols` with the given alignment
+  void FromColumnMetadataVector(const std::vector<KeyColumnMetadata>& cols,
+                                int in_row_alignment, int in_string_alignment);
+
+  /// \brief True if `other` has the same number of columns
+  ///   and each column has the same width (two variable length
+  ///   columns are considered to have the same width)
+  bool is_compatible(const RowTableMetadata& other) const;
+};
+
+/// \brief A table of data stored in row-major order
+///
+/// Can only store non-nested data types
+///
+/// Can store both fixed-size data types and variable-length data types
+///
+/// The row table is not safe
+class ARROW_EXPORT RowTableImpl {
+ public:
+  RowTableImpl();
+  /// \brief Initialize a row array for use
+  ///
+  /// This must be called before any other method
+  Status Init(MemoryPool* pool, const RowTableMetadata& metadata);
+  /// \brief Clear all rows from the table
+  ///
+  /// Does not shrink buffers
+  void Clean();
+  /// \brief Add empty rows
+  /// \param num_rows_to_append The number of empty rows to append
+  /// \param num_extra_bytes_to_append For tables storing variable-length data this
+  ///     should be a guess of how many data bytes will be needed to populate the
+  ///     data.  This is ignored if there are no variable-length columns
+  Status AppendEmpty(uint32_t num_rows_to_append, uint32_t num_extra_bytes_to_append);
+  /// \brief Append rows from a source table
+  /// \param from The table to append from
+  /// \param num_rows_to_append The number of rows to append
+  /// \param source_row_ids Indices (into `from`) of the desired rows
+  Status AppendSelectionFrom(const RowTableImpl& from, uint32_t num_rows_to_append,
+                             const uint16_t* source_row_ids);
+  /// \brief Metadata describing the data stored in this table
+  const RowTableMetadata& metadata() const { return metadata_; }
+  /// \brief The number of rows stored in the table
+  int64_t length() const { return num_rows_; }
+  // Accessors into the table's buffers
+  const uint8_t* data(int i) const {
+    ARROW_DCHECK(i >= 0 && i < kMaxBuffers);
+    return buffers_[i];
+  }
+  uint8_t* mutable_data(int i) {
+    ARROW_DCHECK(i >= 0 && i < kMaxBuffers);
+    return mutable_buffers_[i];
+  }
+  const uint32_t* offsets() const { return reinterpret_cast<const uint32_t*>(data(1)); }
+  uint32_t* mutable_offsets() { return reinterpret_cast<uint32_t*>(mutable_data(1)); }
+  const uint8_t* null_masks() const { return null_masks_->data(); }
+  uint8_t* null_masks() { return null_masks_->mutable_data(); }
+
+  /// \brief True if there is a null value anywhere in the table
+  ///
+  /// This calculation is memoized based on the number of rows and assumes
+  /// that values are only appended (and not modified in place) between
+  /// successive calls
+  bool has_any_nulls(const LightContext* ctx) const;
+
+ private:
+  Status ResizeFixedLengthBuffers(int64_t num_extra_rows);
+  Status ResizeOptionalVaryingLengthBuffer(int64_t num_extra_bytes);
+
+  // Helper functions to determine the number of bytes needed for each
+  // buffer given a number of rows.
+  int64_t size_null_masks(int64_t num_rows);
+  int64_t size_offsets(int64_t num_rows);
+  int64_t size_rows_fixed_length(int64_t num_rows);
+  int64_t size_rows_varying_length(int64_t num_bytes);
+  // Called after resize to fix pointers
+  void update_buffer_pointers();
+
+  // The arrays in `buffers_` need to be padded so that
+  // vectorized operations can operate in blocks without
+  // worrying about tails
+  static constexpr int64_t kPaddingForVectors = 64;
+  MemoryPool* pool_;
+  RowTableMetadata metadata_;
+  // Buffers can only expand during lifetime and never shrink.
+  std::unique_ptr<ResizableBuffer> null_masks_;
+  // Only used if the table has variable-length columns
+  // Stores the offsets into the binary data (which is stored
+  // after all the fixed-sized fields)
+  std::unique_ptr<ResizableBuffer> offsets_;
+  // Stores the fixed-length parts of the rows
+  std::unique_ptr<ResizableBuffer> rows_;
+  static constexpr int kMaxBuffers = 3;
+  const uint8_t* buffers_[kMaxBuffers];
+  uint8_t* mutable_buffers_[kMaxBuffers];

Review Comment:
   Nope. I'm removing the const version to make things simpler



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #13218: ARROW-16590: [C++] Consolidate files dealing with row-major storage

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #13218:
URL: https://github.com/apache/arrow/pull/13218#issuecomment-1154658694

   https://issues.apache.org/jira/browse/ARROW-16590


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #13218: ARROW-16590: [C++] Consolidate files dealing with row-major storage

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #13218:
URL: https://github.com/apache/arrow/pull/13218#discussion_r888571708


##########
cpp/src/arrow/compute/api_aggregate.h:
##########
@@ -482,6 +404,21 @@ struct ARROW_EXPORT Aggregate {
   const FunctionOptions* options;
 };
 
+Result<std::vector<const HashAggregateKernel*>> GetKernels(

Review Comment:
   Yes but `api_..._internal` feels a bit awkward.  I created `arrow/compute/exec/aggregate.h`.  This follows the same convention as things like `arrow/compute/exec/hash_join.h` which contains logic specific to the operators but unaware of the fact its being used in an exec plan.  I think it makes sense for the aggregate tests to use this type.  It's still using the internal namespace but that's because we need it in the hash kernels tests and at least this keeps the `kernels` folder cleaner.
   
   Maybe a longer term fix would be to modify the hash aggregate tests to use the exec plan and an aggregate node?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #13218: ARROW-16590: [C++] Consolidate files dealing with row-major storage

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #13218:
URL: https://github.com/apache/arrow/pull/13218#discussion_r888573870


##########
cpp/src/arrow/compute/row/row_internal.h:
##########
@@ -0,0 +1,244 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+#include "arrow/buffer.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+namespace compute {
+
+/// Description of the data stored in a RowTable
+struct ARROW_EXPORT RowTableMetadata {
+  /// \brief True if there are no variable length columns in the table
+  bool is_fixed_length;
+
+  /// For a fixed-length binary row, common size of rows in bytes,
+  /// rounded up to the multiple of alignment.
+  ///
+  /// For a varying-length binary, size of all encoded fixed-length key columns,
+  /// including lengths of varying-length columns, rounded up to the multiple of string
+  /// alignment.
+  uint32_t fixed_length;
+
+  /// Offset within a row to the array of 32-bit offsets within a row of
+  /// ends of varbinary fields.
+  /// Used only when the row is not fixed-length, zero for fixed-length row.
+  /// There are N elements for N varbinary fields.
+  /// Each element is the offset within a row of the first byte after
+  /// the corresponding varbinary field bytes in that row.
+  /// If varbinary fields begin at aligned addresses, than the end of the previous
+  /// varbinary field needs to be rounded up according to the specified alignment
+  /// to obtain the beginning of the next varbinary field.
+  /// The first varbinary field starts at offset specified by fixed_length,
+  /// which should already be aligned.
+  uint32_t varbinary_end_array_offset;
+
+  /// Fixed number of bytes per row that are used to encode null masks.
+  /// Null masks indicate for a single row which of its columns are null.
+  /// Nth bit in the sequence of bytes assigned to a row represents null
+  /// information for Nth field according to the order in which they are encoded.
+  int null_masks_bytes_per_row;
+
+  /// Power of 2. Every row will start at an offset aligned to that number of bytes.
+  int row_alignment;
+
+  /// Power of 2. Must be no greater than row alignment.
+  /// Every non-power-of-2 binary field and every varbinary field bytes
+  /// will start aligned to that number of bytes.
+  int string_alignment;
+
+  /// Metadata of encoded columns in their original order.
+  std::vector<KeyColumnMetadata> column_metadatas;
+
+  /// Order in which fields are encoded.
+  std::vector<uint32_t> column_order;
+
+  /// Offsets within a row to fields in their encoding order.
+  std::vector<uint32_t> column_offsets;
+
+  /// Rounding up offset to the nearest multiple of alignment value.
+  /// Alignment must be a power of 2.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int required_alignment) {
+    ARROW_DCHECK(ARROW_POPCOUNT64(required_alignment) == 1);
+    return static_cast<uint32_t>((-static_cast<int32_t>(offset)) &
+                                 (required_alignment - 1));
+  }
+
+  /// Rounding up offset to the beginning of next column,
+  /// choosing required alignment based on the data type of that column.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int string_alignment,
+                                               const KeyColumnMetadata& col_metadata) {
+    if (!col_metadata.is_fixed_length ||
+        ARROW_POPCOUNT64(col_metadata.fixed_length) <= 1) {
+      return 0;
+    } else {
+      return padding_for_alignment(offset, string_alignment);
+    }
+  }
+
+  /// Returns an array of offsets within a row of ends of varbinary fields.
+  inline const uint32_t* varbinary_end_array(const uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<const uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// \brief An array of mutable offsets within a row of ends of varbinary fields.
+  inline uint32_t* varbinary_end_array(uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// Returns the offset within the row and length of the first varbinary field.
+  inline void first_varbinary_offset_and_length(const uint8_t* row, uint32_t* offset,
+                                                uint32_t* length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    *offset = fixed_length;
+    *length = varbinary_end_array(row)[0] - fixed_length;
+  }
+
+  /// Returns the offset within the row and length of the second and further varbinary
+  /// fields.
+  inline void nth_varbinary_offset_and_length(const uint8_t* row, int varbinary_id,
+                                              uint32_t* out_offset,
+                                              uint32_t* out_length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    ARROW_DCHECK(varbinary_id > 0);
+    const uint32_t* varbinary_end = varbinary_end_array(row);
+    uint32_t offset = varbinary_end[varbinary_id - 1];
+    offset += padding_for_alignment(offset, string_alignment);
+    *out_offset = offset;
+    *out_length = varbinary_end[varbinary_id] - offset;
+  }
+
+  uint32_t encoded_field_order(uint32_t icol) const { return column_order[icol]; }
+
+  uint32_t encoded_field_offset(uint32_t icol) const { return column_offsets[icol]; }
+
+  uint32_t num_cols() const { return static_cast<uint32_t>(column_metadatas.size()); }
+
+  uint32_t num_varbinary_cols() const;
+
+  /// \brief Populate this instance to describe `cols` with the given alignment
+  void FromColumnMetadataVector(const std::vector<KeyColumnMetadata>& cols,
+                                int in_row_alignment, int in_string_alignment);
+
+  /// \brief True if `other` has the same number of columns
+  ///   and each column has the same width (two variable length
+  ///   columns are considered to have the same width)
+  bool is_compatible(const RowTableMetadata& other) const;
+};
+
+/// \brief A table of data stored in row-major order
+///
+/// Can only store non-nested data types
+///
+/// Can store both fixed-size data types and variable-length data types
+class ARROW_EXPORT RowTableImpl {
+ public:
+  RowTableImpl();
+  /// \brief Initialize a row array for use
+  ///
+  /// This must be called before any other method
+  Status Init(MemoryPool* pool, const RowTableMetadata& metadata);
+  /// \brief Clear all rows from the table
+  ///
+  /// Does not shrink buffers
+  void Clean();
+  /// \brief Add empty rows
+  /// \param num_rows_to_append The number of empty rows to append
+  /// \param num_extra_bytes_to_append For tables storing variable-length data this
+  ///     should be a guess of how many data bytes will be needed to populate the
+  ///     data.  This is ignored if there are no variable-length columns
+  Status AppendEmpty(uint32_t num_rows_to_append, uint32_t num_extra_bytes_to_append);
+  /// \brief Append rows from a source table
+  /// \param from The table to append from
+  /// \param num_rows_to_append The number of rows to append
+  /// \param source_row_ids Indices (into `from`) of the desired rows
+  Status AppendSelectionFrom(const RowTableImpl& from, uint32_t num_rows_to_append,
+                             const uint16_t* source_row_ids);
+  /// \brief Metadata describing the data stored in this table
+  const RowTableMetadata& metadata() const { return metadata_; }
+  /// \brief The number of rows stored in the table
+  int64_t length() const { return num_rows_; }
+  // Accessors into the table's buffers
+  const uint8_t* data(int i) const {
+    ARROW_DCHECK(i >= 0 && i <= max_buffers_);
+    return buffers_[i];
+  }
+  uint8_t* mutable_data(int i) {
+    ARROW_DCHECK(i >= 0 && i <= max_buffers_);
+    return mutable_buffers_[i];
+  }
+  const uint32_t* offsets() const { return reinterpret_cast<const uint32_t*>(data(1)); }
+  uint32_t* mutable_offsets() { return reinterpret_cast<uint32_t*>(mutable_data(1)); }
+  const uint8_t* null_masks() const { return null_masks_->data(); }
+  uint8_t* null_masks() { return null_masks_->mutable_data(); }
+
+  /// \brief True if there is a null value anywhere in the table
+  ///
+  /// This calculation is memoized based on the number of rows and assumes
+  /// that values are only appended (and not modified in place) between
+  /// successive calls
+  bool has_any_nulls(const LightContext* ctx) const;
+
+ private:
+  Status ResizeFixedLengthBuffers(int64_t num_extra_rows);
+  Status ResizeOptionalVaryingLengthBuffer(int64_t num_extra_bytes);
+
+  // Helper functions to determine the number of bytes needed for each
+  // buffer given a number of rows.
+  int64_t size_null_masks(int64_t num_rows);
+  int64_t size_offsets(int64_t num_rows);
+  int64_t size_rows_fixed_length(int64_t num_rows);
+  int64_t size_rows_varying_length(int64_t num_bytes);
+  // Called after resize to fix pointers
+  void update_buffer_pointers();
+
+  static constexpr int64_t padding_for_vectors = 64;
+  MemoryPool* pool_;
+  RowTableMetadata metadata_;
+  // Buffers can only expand during lifetime and never shrink.
+  std::unique_ptr<ResizableBuffer> null_masks_;
+  // Only used if the table has variable-length columns
+  // Stores the offsets into the binary data

Review Comment:
   I added a comment but it's stored after the fixed-size fields.  So, for example, if you had 2 int32 fields and a string field and 3 rows you might have something like...
   
   i1 | i2 | s1
   --- | --- | ---
   1 | 3 | abc
   2 | 4 | xy
   
   ```
   // buffers_[1]
   0x00000001 0x00000002 0x61 0x62 0x63 0x00000003 0x00000004 0x78 0x79
   // offsets_
   2, 5, 7, 9
   ```
   
   I'm probably off on a few details in that example but that is the rough idea.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #13218: ARROW-16590: [C++] Consolidate files dealing with row-major storage

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #13218:
URL: https://github.com/apache/arrow/pull/13218#discussion_r891242262


##########
cpp/src/arrow/compute/row/row_internal.h:
##########
@@ -0,0 +1,250 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+#include "arrow/buffer.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+namespace compute {
+
+/// Description of the data stored in a RowTable
+struct ARROW_EXPORT RowTableMetadata {
+  /// \brief True if there are no variable length columns in the table
+  bool is_fixed_length;
+
+  /// For a fixed-length binary row, common size of rows in bytes,
+  /// rounded up to the multiple of alignment.
+  ///
+  /// For a varying-length binary, size of all encoded fixed-length key columns,
+  /// including lengths of varying-length columns, rounded up to the multiple of string
+  /// alignment.
+  uint32_t fixed_length;
+
+  /// Offset within a row to the array of 32-bit offsets within a row of
+  /// ends of varbinary fields.
+  /// Used only when the row is not fixed-length, zero for fixed-length row.
+  /// There are N elements for N varbinary fields.
+  /// Each element is the offset within a row of the first byte after
+  /// the corresponding varbinary field bytes in that row.
+  /// If varbinary fields begin at aligned addresses, than the end of the previous
+  /// varbinary field needs to be rounded up according to the specified alignment
+  /// to obtain the beginning of the next varbinary field.
+  /// The first varbinary field starts at offset specified by fixed_length,
+  /// which should already be aligned.
+  uint32_t varbinary_end_array_offset;
+
+  /// Fixed number of bytes per row that are used to encode null masks.
+  /// Null masks indicate for a single row which of its columns are null.
+  /// Nth bit in the sequence of bytes assigned to a row represents null
+  /// information for Nth field according to the order in which they are encoded.
+  int null_masks_bytes_per_row;
+
+  /// Power of 2. Every row will start at an offset aligned to that number of bytes.
+  int row_alignment;
+
+  /// Power of 2. Must be no greater than row alignment.
+  /// Every non-power-of-2 binary field and every varbinary field bytes
+  /// will start aligned to that number of bytes.
+  int string_alignment;
+
+  /// Metadata of encoded columns in their original order.
+  std::vector<KeyColumnMetadata> column_metadatas;
+
+  /// Order in which fields are encoded.
+  std::vector<uint32_t> column_order;
+
+  /// Offsets within a row to fields in their encoding order.
+  std::vector<uint32_t> column_offsets;
+
+  /// Rounding up offset to the nearest multiple of alignment value.
+  /// Alignment must be a power of 2.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int required_alignment) {
+    ARROW_DCHECK(ARROW_POPCOUNT64(required_alignment) == 1);
+    return static_cast<uint32_t>((-static_cast<int32_t>(offset)) &
+                                 (required_alignment - 1));
+  }
+
+  /// Rounding up offset to the beginning of next column,
+  /// choosing required alignment based on the data type of that column.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int string_alignment,
+                                               const KeyColumnMetadata& col_metadata) {
+    if (!col_metadata.is_fixed_length ||
+        ARROW_POPCOUNT64(col_metadata.fixed_length) <= 1) {
+      return 0;
+    } else {
+      return padding_for_alignment(offset, string_alignment);
+    }
+  }
+
+  /// Returns an array of offsets within a row of ends of varbinary fields.
+  inline const uint32_t* varbinary_end_array(const uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<const uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// \brief An array of mutable offsets within a row of ends of varbinary fields.
+  inline uint32_t* varbinary_end_array(uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// Returns the offset within the row and length of the first varbinary field.
+  inline void first_varbinary_offset_and_length(const uint8_t* row, uint32_t* offset,
+                                                uint32_t* length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    *offset = fixed_length;
+    *length = varbinary_end_array(row)[0] - fixed_length;
+  }
+
+  /// Returns the offset within the row and length of the second and further varbinary
+  /// fields.
+  inline void nth_varbinary_offset_and_length(const uint8_t* row, int varbinary_id,
+                                              uint32_t* out_offset,
+                                              uint32_t* out_length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    ARROW_DCHECK(varbinary_id > 0);
+    const uint32_t* varbinary_end = varbinary_end_array(row);
+    uint32_t offset = varbinary_end[varbinary_id - 1];
+    offset += padding_for_alignment(offset, string_alignment);
+    *out_offset = offset;
+    *out_length = varbinary_end[varbinary_id] - offset;
+  }
+
+  uint32_t encoded_field_order(uint32_t icol) const { return column_order[icol]; }
+
+  uint32_t encoded_field_offset(uint32_t icol) const { return column_offsets[icol]; }
+
+  uint32_t num_cols() const { return static_cast<uint32_t>(column_metadatas.size()); }
+
+  uint32_t num_varbinary_cols() const;
+
+  /// \brief Populate this instance to describe `cols` with the given alignment
+  void FromColumnMetadataVector(const std::vector<KeyColumnMetadata>& cols,
+                                int in_row_alignment, int in_string_alignment);
+
+  /// \brief True if `other` has the same number of columns
+  ///   and each column has the same width (two variable length
+  ///   columns are considered to have the same width)
+  bool is_compatible(const RowTableMetadata& other) const;
+};
+
+/// \brief A table of data stored in row-major order
+///
+/// Can only store non-nested data types
+///
+/// Can store both fixed-size data types and variable-length data types
+///
+/// The row table is not safe
+class ARROW_EXPORT RowTableImpl {
+ public:
+  RowTableImpl();
+  /// \brief Initialize a row array for use
+  ///
+  /// This must be called before any other method
+  Status Init(MemoryPool* pool, const RowTableMetadata& metadata);
+  /// \brief Clear all rows from the table
+  ///
+  /// Does not shrink buffers
+  void Clean();
+  /// \brief Add empty rows
+  /// \param num_rows_to_append The number of empty rows to append
+  /// \param num_extra_bytes_to_append For tables storing variable-length data this
+  ///     should be a guess of how many data bytes will be needed to populate the
+  ///     data.  This is ignored if there are no variable-length columns
+  Status AppendEmpty(uint32_t num_rows_to_append, uint32_t num_extra_bytes_to_append);

Review Comment:
   Once I've called `AppendEmpty`, what am I supposed to do?



##########
cpp/src/arrow/compute/row/row_internal.h:
##########
@@ -0,0 +1,250 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+#include "arrow/buffer.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+namespace compute {
+
+/// Description of the data stored in a RowTable
+struct ARROW_EXPORT RowTableMetadata {
+  /// \brief True if there are no variable length columns in the table
+  bool is_fixed_length;
+
+  /// For a fixed-length binary row, common size of rows in bytes,
+  /// rounded up to the multiple of alignment.
+  ///
+  /// For a varying-length binary, size of all encoded fixed-length key columns,
+  /// including lengths of varying-length columns, rounded up to the multiple of string
+  /// alignment.
+  uint32_t fixed_length;
+
+  /// Offset within a row to the array of 32-bit offsets within a row of
+  /// ends of varbinary fields.
+  /// Used only when the row is not fixed-length, zero for fixed-length row.
+  /// There are N elements for N varbinary fields.
+  /// Each element is the offset within a row of the first byte after
+  /// the corresponding varbinary field bytes in that row.
+  /// If varbinary fields begin at aligned addresses, than the end of the previous
+  /// varbinary field needs to be rounded up according to the specified alignment
+  /// to obtain the beginning of the next varbinary field.
+  /// The first varbinary field starts at offset specified by fixed_length,
+  /// which should already be aligned.
+  uint32_t varbinary_end_array_offset;
+
+  /// Fixed number of bytes per row that are used to encode null masks.
+  /// Null masks indicate for a single row which of its columns are null.
+  /// Nth bit in the sequence of bytes assigned to a row represents null
+  /// information for Nth field according to the order in which they are encoded.
+  int null_masks_bytes_per_row;
+
+  /// Power of 2. Every row will start at an offset aligned to that number of bytes.
+  int row_alignment;
+
+  /// Power of 2. Must be no greater than row alignment.
+  /// Every non-power-of-2 binary field and every varbinary field bytes
+  /// will start aligned to that number of bytes.
+  int string_alignment;
+
+  /// Metadata of encoded columns in their original order.
+  std::vector<KeyColumnMetadata> column_metadatas;
+
+  /// Order in which fields are encoded.
+  std::vector<uint32_t> column_order;
+
+  /// Offsets within a row to fields in their encoding order.
+  std::vector<uint32_t> column_offsets;
+
+  /// Rounding up offset to the nearest multiple of alignment value.
+  /// Alignment must be a power of 2.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int required_alignment) {
+    ARROW_DCHECK(ARROW_POPCOUNT64(required_alignment) == 1);
+    return static_cast<uint32_t>((-static_cast<int32_t>(offset)) &
+                                 (required_alignment - 1));
+  }
+
+  /// Rounding up offset to the beginning of next column,
+  /// choosing required alignment based on the data type of that column.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int string_alignment,
+                                               const KeyColumnMetadata& col_metadata) {
+    if (!col_metadata.is_fixed_length ||
+        ARROW_POPCOUNT64(col_metadata.fixed_length) <= 1) {
+      return 0;
+    } else {
+      return padding_for_alignment(offset, string_alignment);
+    }
+  }
+
+  /// Returns an array of offsets within a row of ends of varbinary fields.
+  inline const uint32_t* varbinary_end_array(const uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<const uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// \brief An array of mutable offsets within a row of ends of varbinary fields.
+  inline uint32_t* varbinary_end_array(uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// Returns the offset within the row and length of the first varbinary field.
+  inline void first_varbinary_offset_and_length(const uint8_t* row, uint32_t* offset,
+                                                uint32_t* length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    *offset = fixed_length;
+    *length = varbinary_end_array(row)[0] - fixed_length;
+  }
+
+  /// Returns the offset within the row and length of the second and further varbinary
+  /// fields.
+  inline void nth_varbinary_offset_and_length(const uint8_t* row, int varbinary_id,
+                                              uint32_t* out_offset,
+                                              uint32_t* out_length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    ARROW_DCHECK(varbinary_id > 0);
+    const uint32_t* varbinary_end = varbinary_end_array(row);
+    uint32_t offset = varbinary_end[varbinary_id - 1];
+    offset += padding_for_alignment(offset, string_alignment);
+    *out_offset = offset;
+    *out_length = varbinary_end[varbinary_id] - offset;
+  }
+
+  uint32_t encoded_field_order(uint32_t icol) const { return column_order[icol]; }
+
+  uint32_t encoded_field_offset(uint32_t icol) const { return column_offsets[icol]; }
+
+  uint32_t num_cols() const { return static_cast<uint32_t>(column_metadatas.size()); }
+
+  uint32_t num_varbinary_cols() const;
+
+  /// \brief Populate this instance to describe `cols` with the given alignment
+  void FromColumnMetadataVector(const std::vector<KeyColumnMetadata>& cols,
+                                int in_row_alignment, int in_string_alignment);
+
+  /// \brief True if `other` has the same number of columns
+  ///   and each column has the same width (two variable length
+  ///   columns are considered to have the same width)
+  bool is_compatible(const RowTableMetadata& other) const;
+};
+
+/// \brief A table of data stored in row-major order
+///
+/// Can only store non-nested data types
+///
+/// Can store both fixed-size data types and variable-length data types
+///
+/// The row table is not safe
+class ARROW_EXPORT RowTableImpl {
+ public:
+  RowTableImpl();
+  /// \brief Initialize a row array for use
+  ///
+  /// This must be called before any other method
+  Status Init(MemoryPool* pool, const RowTableMetadata& metadata);
+  /// \brief Clear all rows from the table
+  ///
+  /// Does not shrink buffers
+  void Clean();
+  /// \brief Add empty rows
+  /// \param num_rows_to_append The number of empty rows to append
+  /// \param num_extra_bytes_to_append For tables storing variable-length data this
+  ///     should be a guess of how many data bytes will be needed to populate the
+  ///     data.  This is ignored if there are no variable-length columns
+  Status AppendEmpty(uint32_t num_rows_to_append, uint32_t num_extra_bytes_to_append);
+  /// \brief Append rows from a source table
+  /// \param from The table to append from
+  /// \param num_rows_to_append The number of rows to append
+  /// \param source_row_ids Indices (into `from`) of the desired rows
+  Status AppendSelectionFrom(const RowTableImpl& from, uint32_t num_rows_to_append,
+                             const uint16_t* source_row_ids);
+  /// \brief Metadata describing the data stored in this table
+  const RowTableMetadata& metadata() const { return metadata_; }
+  /// \brief The number of rows stored in the table
+  int64_t length() const { return num_rows_; }
+  // Accessors into the table's buffers
+  const uint8_t* data(int i) const {
+    ARROW_DCHECK(i >= 0 && i < kMaxBuffers);
+    return buffers_[i];
+  }
+  uint8_t* mutable_data(int i) {
+    ARROW_DCHECK(i >= 0 && i < kMaxBuffers);
+    return mutable_buffers_[i];
+  }
+  const uint32_t* offsets() const { return reinterpret_cast<const uint32_t*>(data(1)); }
+  uint32_t* mutable_offsets() { return reinterpret_cast<uint32_t*>(mutable_data(1)); }
+  const uint8_t* null_masks() const { return null_masks_->data(); }
+  uint8_t* null_masks() { return null_masks_->mutable_data(); }
+
+  /// \brief True if there is a null value anywhere in the table
+  ///
+  /// This calculation is memoized based on the number of rows and assumes
+  /// that values are only appended (and not modified in place) between
+  /// successive calls
+  bool has_any_nulls(const LightContext* ctx) const;
+
+ private:
+  Status ResizeFixedLengthBuffers(int64_t num_extra_rows);
+  Status ResizeOptionalVaryingLengthBuffer(int64_t num_extra_bytes);
+
+  // Helper functions to determine the number of bytes needed for each
+  // buffer given a number of rows.
+  int64_t size_null_masks(int64_t num_rows);
+  int64_t size_offsets(int64_t num_rows);
+  int64_t size_rows_fixed_length(int64_t num_rows);
+  int64_t size_rows_varying_length(int64_t num_bytes);
+  // Called after resize to fix pointers
+  void update_buffer_pointers();
+
+  // The arrays in `buffers_` need to be padded so that
+  // vectorized operations can operate in blocks without
+  // worrying about tails
+  static constexpr int64_t kPaddingForVectors = 64;
+  MemoryPool* pool_;
+  RowTableMetadata metadata_;
+  // Buffers can only expand during lifetime and never shrink.
+  std::unique_ptr<ResizableBuffer> null_masks_;
+  // Only used if the table has variable-length columns
+  // Stores the offsets into the binary data (which is stored
+  // after all the fixed-sized fields)
+  std::unique_ptr<ResizableBuffer> offsets_;
+  // Stores the fixed-length parts of the rows
+  std::unique_ptr<ResizableBuffer> rows_;
+  static constexpr int kMaxBuffers = 3;
+  const uint8_t* buffers_[kMaxBuffers];

Review Comment:
   Are these a different thing than `{null_masks_, offsets_, rows_)`?



##########
cpp/src/arrow/compute/row/row_internal.h:
##########
@@ -0,0 +1,250 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+#include "arrow/buffer.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+namespace compute {
+
+/// Description of the data stored in a RowTable
+struct ARROW_EXPORT RowTableMetadata {
+  /// \brief True if there are no variable length columns in the table
+  bool is_fixed_length;
+
+  /// For a fixed-length binary row, common size of rows in bytes,
+  /// rounded up to the multiple of alignment.
+  ///
+  /// For a varying-length binary, size of all encoded fixed-length key columns,
+  /// including lengths of varying-length columns, rounded up to the multiple of string
+  /// alignment.
+  uint32_t fixed_length;
+
+  /// Offset within a row to the array of 32-bit offsets within a row of
+  /// ends of varbinary fields.
+  /// Used only when the row is not fixed-length, zero for fixed-length row.
+  /// There are N elements for N varbinary fields.
+  /// Each element is the offset within a row of the first byte after
+  /// the corresponding varbinary field bytes in that row.
+  /// If varbinary fields begin at aligned addresses, than the end of the previous
+  /// varbinary field needs to be rounded up according to the specified alignment
+  /// to obtain the beginning of the next varbinary field.
+  /// The first varbinary field starts at offset specified by fixed_length,
+  /// which should already be aligned.
+  uint32_t varbinary_end_array_offset;
+
+  /// Fixed number of bytes per row that are used to encode null masks.
+  /// Null masks indicate for a single row which of its columns are null.
+  /// Nth bit in the sequence of bytes assigned to a row represents null
+  /// information for Nth field according to the order in which they are encoded.
+  int null_masks_bytes_per_row;
+
+  /// Power of 2. Every row will start at an offset aligned to that number of bytes.
+  int row_alignment;
+
+  /// Power of 2. Must be no greater than row alignment.
+  /// Every non-power-of-2 binary field and every varbinary field bytes
+  /// will start aligned to that number of bytes.
+  int string_alignment;
+
+  /// Metadata of encoded columns in their original order.
+  std::vector<KeyColumnMetadata> column_metadatas;
+
+  /// Order in which fields are encoded.
+  std::vector<uint32_t> column_order;
+
+  /// Offsets within a row to fields in their encoding order.
+  std::vector<uint32_t> column_offsets;
+
+  /// Rounding up offset to the nearest multiple of alignment value.
+  /// Alignment must be a power of 2.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int required_alignment) {
+    ARROW_DCHECK(ARROW_POPCOUNT64(required_alignment) == 1);
+    return static_cast<uint32_t>((-static_cast<int32_t>(offset)) &
+                                 (required_alignment - 1));
+  }
+
+  /// Rounding up offset to the beginning of next column,
+  /// choosing required alignment based on the data type of that column.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int string_alignment,
+                                               const KeyColumnMetadata& col_metadata) {
+    if (!col_metadata.is_fixed_length ||
+        ARROW_POPCOUNT64(col_metadata.fixed_length) <= 1) {
+      return 0;
+    } else {
+      return padding_for_alignment(offset, string_alignment);
+    }
+  }
+
+  /// Returns an array of offsets within a row of ends of varbinary fields.
+  inline const uint32_t* varbinary_end_array(const uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<const uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// \brief An array of mutable offsets within a row of ends of varbinary fields.
+  inline uint32_t* varbinary_end_array(uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// Returns the offset within the row and length of the first varbinary field.
+  inline void first_varbinary_offset_and_length(const uint8_t* row, uint32_t* offset,
+                                                uint32_t* length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    *offset = fixed_length;
+    *length = varbinary_end_array(row)[0] - fixed_length;
+  }
+
+  /// Returns the offset within the row and length of the second and further varbinary
+  /// fields.
+  inline void nth_varbinary_offset_and_length(const uint8_t* row, int varbinary_id,
+                                              uint32_t* out_offset,
+                                              uint32_t* out_length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    ARROW_DCHECK(varbinary_id > 0);
+    const uint32_t* varbinary_end = varbinary_end_array(row);
+    uint32_t offset = varbinary_end[varbinary_id - 1];
+    offset += padding_for_alignment(offset, string_alignment);
+    *out_offset = offset;
+    *out_length = varbinary_end[varbinary_id] - offset;
+  }
+
+  uint32_t encoded_field_order(uint32_t icol) const { return column_order[icol]; }
+
+  uint32_t encoded_field_offset(uint32_t icol) const { return column_offsets[icol]; }
+
+  uint32_t num_cols() const { return static_cast<uint32_t>(column_metadatas.size()); }
+
+  uint32_t num_varbinary_cols() const;
+
+  /// \brief Populate this instance to describe `cols` with the given alignment
+  void FromColumnMetadataVector(const std::vector<KeyColumnMetadata>& cols,
+                                int in_row_alignment, int in_string_alignment);
+
+  /// \brief True if `other` has the same number of columns
+  ///   and each column has the same width (two variable length
+  ///   columns are considered to have the same width)
+  bool is_compatible(const RowTableMetadata& other) const;
+};
+
+/// \brief A table of data stored in row-major order
+///
+/// Can only store non-nested data types
+///
+/// Can store both fixed-size data types and variable-length data types
+///
+/// The row table is not safe
+class ARROW_EXPORT RowTableImpl {
+ public:
+  RowTableImpl();
+  /// \brief Initialize a row array for use
+  ///
+  /// This must be called before any other method
+  Status Init(MemoryPool* pool, const RowTableMetadata& metadata);
+  /// \brief Clear all rows from the table
+  ///
+  /// Does not shrink buffers
+  void Clean();
+  /// \brief Add empty rows
+  /// \param num_rows_to_append The number of empty rows to append
+  /// \param num_extra_bytes_to_append For tables storing variable-length data this
+  ///     should be a guess of how many data bytes will be needed to populate the
+  ///     data.  This is ignored if there are no variable-length columns
+  Status AppendEmpty(uint32_t num_rows_to_append, uint32_t num_extra_bytes_to_append);
+  /// \brief Append rows from a source table
+  /// \param from The table to append from
+  /// \param num_rows_to_append The number of rows to append
+  /// \param source_row_ids Indices (into `from`) of the desired rows
+  Status AppendSelectionFrom(const RowTableImpl& from, uint32_t num_rows_to_append,
+                             const uint16_t* source_row_ids);
+  /// \brief Metadata describing the data stored in this table
+  const RowTableMetadata& metadata() const { return metadata_; }
+  /// \brief The number of rows stored in the table
+  int64_t length() const { return num_rows_; }
+  // Accessors into the table's buffers
+  const uint8_t* data(int i) const {
+    ARROW_DCHECK(i >= 0 && i < kMaxBuffers);
+    return buffers_[i];
+  }
+  uint8_t* mutable_data(int i) {
+    ARROW_DCHECK(i >= 0 && i < kMaxBuffers);
+    return mutable_buffers_[i];
+  }
+  const uint32_t* offsets() const { return reinterpret_cast<const uint32_t*>(data(1)); }
+  uint32_t* mutable_offsets() { return reinterpret_cast<uint32_t*>(mutable_data(1)); }
+  const uint8_t* null_masks() const { return null_masks_->data(); }
+  uint8_t* null_masks() { return null_masks_->mutable_data(); }
+
+  /// \brief True if there is a null value anywhere in the table
+  ///
+  /// This calculation is memoized based on the number of rows and assumes
+  /// that values are only appended (and not modified in place) between
+  /// successive calls
+  bool has_any_nulls(const LightContext* ctx) const;
+
+ private:
+  Status ResizeFixedLengthBuffers(int64_t num_extra_rows);
+  Status ResizeOptionalVaryingLengthBuffer(int64_t num_extra_bytes);
+
+  // Helper functions to determine the number of bytes needed for each
+  // buffer given a number of rows.
+  int64_t size_null_masks(int64_t num_rows);
+  int64_t size_offsets(int64_t num_rows);
+  int64_t size_rows_fixed_length(int64_t num_rows);
+  int64_t size_rows_varying_length(int64_t num_bytes);
+  // Called after resize to fix pointers
+  void update_buffer_pointers();

Review Comment:
   `UpdateBufferPointers` perhaps?



##########
cpp/src/arrow/compute/row/row_internal.h:
##########
@@ -0,0 +1,250 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+#include "arrow/buffer.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+namespace compute {
+
+/// Description of the data stored in a RowTable
+struct ARROW_EXPORT RowTableMetadata {
+  /// \brief True if there are no variable length columns in the table
+  bool is_fixed_length;
+
+  /// For a fixed-length binary row, common size of rows in bytes,
+  /// rounded up to the multiple of alignment.
+  ///
+  /// For a varying-length binary, size of all encoded fixed-length key columns,
+  /// including lengths of varying-length columns, rounded up to the multiple of string
+  /// alignment.
+  uint32_t fixed_length;
+
+  /// Offset within a row to the array of 32-bit offsets within a row of
+  /// ends of varbinary fields.
+  /// Used only when the row is not fixed-length, zero for fixed-length row.
+  /// There are N elements for N varbinary fields.
+  /// Each element is the offset within a row of the first byte after
+  /// the corresponding varbinary field bytes in that row.
+  /// If varbinary fields begin at aligned addresses, than the end of the previous
+  /// varbinary field needs to be rounded up according to the specified alignment
+  /// to obtain the beginning of the next varbinary field.
+  /// The first varbinary field starts at offset specified by fixed_length,
+  /// which should already be aligned.
+  uint32_t varbinary_end_array_offset;
+
+  /// Fixed number of bytes per row that are used to encode null masks.
+  /// Null masks indicate for a single row which of its columns are null.
+  /// Nth bit in the sequence of bytes assigned to a row represents null
+  /// information for Nth field according to the order in which they are encoded.
+  int null_masks_bytes_per_row;
+
+  /// Power of 2. Every row will start at an offset aligned to that number of bytes.
+  int row_alignment;
+
+  /// Power of 2. Must be no greater than row alignment.
+  /// Every non-power-of-2 binary field and every varbinary field bytes
+  /// will start aligned to that number of bytes.
+  int string_alignment;
+
+  /// Metadata of encoded columns in their original order.
+  std::vector<KeyColumnMetadata> column_metadatas;
+
+  /// Order in which fields are encoded.
+  std::vector<uint32_t> column_order;
+
+  /// Offsets within a row to fields in their encoding order.
+  std::vector<uint32_t> column_offsets;
+
+  /// Rounding up offset to the nearest multiple of alignment value.
+  /// Alignment must be a power of 2.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int required_alignment) {
+    ARROW_DCHECK(ARROW_POPCOUNT64(required_alignment) == 1);
+    return static_cast<uint32_t>((-static_cast<int32_t>(offset)) &
+                                 (required_alignment - 1));
+  }
+
+  /// Rounding up offset to the beginning of next column,
+  /// choosing required alignment based on the data type of that column.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int string_alignment,
+                                               const KeyColumnMetadata& col_metadata) {
+    if (!col_metadata.is_fixed_length ||
+        ARROW_POPCOUNT64(col_metadata.fixed_length) <= 1) {
+      return 0;
+    } else {
+      return padding_for_alignment(offset, string_alignment);
+    }
+  }
+
+  /// Returns an array of offsets within a row of ends of varbinary fields.
+  inline const uint32_t* varbinary_end_array(const uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<const uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// \brief An array of mutable offsets within a row of ends of varbinary fields.
+  inline uint32_t* varbinary_end_array(uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// Returns the offset within the row and length of the first varbinary field.
+  inline void first_varbinary_offset_and_length(const uint8_t* row, uint32_t* offset,
+                                                uint32_t* length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    *offset = fixed_length;
+    *length = varbinary_end_array(row)[0] - fixed_length;
+  }
+
+  /// Returns the offset within the row and length of the second and further varbinary
+  /// fields.
+  inline void nth_varbinary_offset_and_length(const uint8_t* row, int varbinary_id,
+                                              uint32_t* out_offset,
+                                              uint32_t* out_length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    ARROW_DCHECK(varbinary_id > 0);
+    const uint32_t* varbinary_end = varbinary_end_array(row);
+    uint32_t offset = varbinary_end[varbinary_id - 1];
+    offset += padding_for_alignment(offset, string_alignment);
+    *out_offset = offset;
+    *out_length = varbinary_end[varbinary_id] - offset;
+  }
+
+  uint32_t encoded_field_order(uint32_t icol) const { return column_order[icol]; }
+
+  uint32_t encoded_field_offset(uint32_t icol) const { return column_offsets[icol]; }
+
+  uint32_t num_cols() const { return static_cast<uint32_t>(column_metadatas.size()); }
+
+  uint32_t num_varbinary_cols() const;
+
+  /// \brief Populate this instance to describe `cols` with the given alignment
+  void FromColumnMetadataVector(const std::vector<KeyColumnMetadata>& cols,
+                                int in_row_alignment, int in_string_alignment);
+
+  /// \brief True if `other` has the same number of columns
+  ///   and each column has the same width (two variable length
+  ///   columns are considered to have the same width)
+  bool is_compatible(const RowTableMetadata& other) const;
+};
+
+/// \brief A table of data stored in row-major order
+///
+/// Can only store non-nested data types
+///
+/// Can store both fixed-size data types and variable-length data types
+///
+/// The row table is not safe
+class ARROW_EXPORT RowTableImpl {
+ public:
+  RowTableImpl();
+  /// \brief Initialize a row array for use
+  ///
+  /// This must be called before any other method
+  Status Init(MemoryPool* pool, const RowTableMetadata& metadata);
+  /// \brief Clear all rows from the table
+  ///
+  /// Does not shrink buffers
+  void Clean();
+  /// \brief Add empty rows
+  /// \param num_rows_to_append The number of empty rows to append
+  /// \param num_extra_bytes_to_append For tables storing variable-length data this
+  ///     should be a guess of how many data bytes will be needed to populate the
+  ///     data.  This is ignored if there are no variable-length columns
+  Status AppendEmpty(uint32_t num_rows_to_append, uint32_t num_extra_bytes_to_append);
+  /// \brief Append rows from a source table
+  /// \param from The table to append from
+  /// \param num_rows_to_append The number of rows to append
+  /// \param source_row_ids Indices (into `from`) of the desired rows
+  Status AppendSelectionFrom(const RowTableImpl& from, uint32_t num_rows_to_append,
+                             const uint16_t* source_row_ids);
+  /// \brief Metadata describing the data stored in this table
+  const RowTableMetadata& metadata() const { return metadata_; }
+  /// \brief The number of rows stored in the table
+  int64_t length() const { return num_rows_; }
+  // Accessors into the table's buffers
+  const uint8_t* data(int i) const {
+    ARROW_DCHECK(i >= 0 && i < kMaxBuffers);
+    return buffers_[i];
+  }
+  uint8_t* mutable_data(int i) {
+    ARROW_DCHECK(i >= 0 && i < kMaxBuffers);
+    return mutable_buffers_[i];
+  }
+  const uint32_t* offsets() const { return reinterpret_cast<const uint32_t*>(data(1)); }
+  uint32_t* mutable_offsets() { return reinterpret_cast<uint32_t*>(mutable_data(1)); }
+  const uint8_t* null_masks() const { return null_masks_->data(); }
+  uint8_t* null_masks() { return null_masks_->mutable_data(); }
+
+  /// \brief True if there is a null value anywhere in the table
+  ///
+  /// This calculation is memoized based on the number of rows and assumes
+  /// that values are only appended (and not modified in place) between
+  /// successive calls
+  bool has_any_nulls(const LightContext* ctx) const;
+
+ private:
+  Status ResizeFixedLengthBuffers(int64_t num_extra_rows);
+  Status ResizeOptionalVaryingLengthBuffer(int64_t num_extra_bytes);
+
+  // Helper functions to determine the number of bytes needed for each
+  // buffer given a number of rows.
+  int64_t size_null_masks(int64_t num_rows);
+  int64_t size_offsets(int64_t num_rows);
+  int64_t size_rows_fixed_length(int64_t num_rows);
+  int64_t size_rows_varying_length(int64_t num_bytes);
+  // Called after resize to fix pointers
+  void update_buffer_pointers();
+
+  // The arrays in `buffers_` need to be padded so that
+  // vectorized operations can operate in blocks without
+  // worrying about tails
+  static constexpr int64_t kPaddingForVectors = 64;
+  MemoryPool* pool_;
+  RowTableMetadata metadata_;
+  // Buffers can only expand during lifetime and never shrink.
+  std::unique_ptr<ResizableBuffer> null_masks_;
+  // Only used if the table has variable-length columns
+  // Stores the offsets into the binary data (which is stored
+  // after all the fixed-sized fields)
+  std::unique_ptr<ResizableBuffer> offsets_;
+  // Stores the fixed-length parts of the rows
+  std::unique_ptr<ResizableBuffer> rows_;
+  static constexpr int kMaxBuffers = 3;
+  const uint8_t* buffers_[kMaxBuffers];
+  uint8_t* mutable_buffers_[kMaxBuffers];

Review Comment:
   Are these different except for the different const-ness?



##########
cpp/src/arrow/compute/row/row_internal.h:
##########
@@ -0,0 +1,250 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+#include "arrow/buffer.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+namespace compute {
+
+/// Description of the data stored in a RowTable
+struct ARROW_EXPORT RowTableMetadata {
+  /// \brief True if there are no variable length columns in the table
+  bool is_fixed_length;
+
+  /// For a fixed-length binary row, common size of rows in bytes,
+  /// rounded up to the multiple of alignment.
+  ///
+  /// For a varying-length binary, size of all encoded fixed-length key columns,
+  /// including lengths of varying-length columns, rounded up to the multiple of string
+  /// alignment.
+  uint32_t fixed_length;
+
+  /// Offset within a row to the array of 32-bit offsets within a row of
+  /// ends of varbinary fields.
+  /// Used only when the row is not fixed-length, zero for fixed-length row.
+  /// There are N elements for N varbinary fields.
+  /// Each element is the offset within a row of the first byte after
+  /// the corresponding varbinary field bytes in that row.
+  /// If varbinary fields begin at aligned addresses, than the end of the previous
+  /// varbinary field needs to be rounded up according to the specified alignment
+  /// to obtain the beginning of the next varbinary field.
+  /// The first varbinary field starts at offset specified by fixed_length,
+  /// which should already be aligned.
+  uint32_t varbinary_end_array_offset;
+
+  /// Fixed number of bytes per row that are used to encode null masks.
+  /// Null masks indicate for a single row which of its columns are null.
+  /// Nth bit in the sequence of bytes assigned to a row represents null
+  /// information for Nth field according to the order in which they are encoded.
+  int null_masks_bytes_per_row;
+
+  /// Power of 2. Every row will start at an offset aligned to that number of bytes.
+  int row_alignment;
+
+  /// Power of 2. Must be no greater than row alignment.
+  /// Every non-power-of-2 binary field and every varbinary field bytes
+  /// will start aligned to that number of bytes.
+  int string_alignment;
+
+  /// Metadata of encoded columns in their original order.
+  std::vector<KeyColumnMetadata> column_metadatas;
+
+  /// Order in which fields are encoded.
+  std::vector<uint32_t> column_order;
+
+  /// Offsets within a row to fields in their encoding order.
+  std::vector<uint32_t> column_offsets;
+
+  /// Rounding up offset to the nearest multiple of alignment value.
+  /// Alignment must be a power of 2.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int required_alignment) {
+    ARROW_DCHECK(ARROW_POPCOUNT64(required_alignment) == 1);
+    return static_cast<uint32_t>((-static_cast<int32_t>(offset)) &
+                                 (required_alignment - 1));
+  }
+
+  /// Rounding up offset to the beginning of next column,
+  /// choosing required alignment based on the data type of that column.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int string_alignment,
+                                               const KeyColumnMetadata& col_metadata) {
+    if (!col_metadata.is_fixed_length ||
+        ARROW_POPCOUNT64(col_metadata.fixed_length) <= 1) {
+      return 0;
+    } else {
+      return padding_for_alignment(offset, string_alignment);
+    }
+  }
+
+  /// Returns an array of offsets within a row of ends of varbinary fields.
+  inline const uint32_t* varbinary_end_array(const uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<const uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// \brief An array of mutable offsets within a row of ends of varbinary fields.
+  inline uint32_t* varbinary_end_array(uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// Returns the offset within the row and length of the first varbinary field.
+  inline void first_varbinary_offset_and_length(const uint8_t* row, uint32_t* offset,
+                                                uint32_t* length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    *offset = fixed_length;
+    *length = varbinary_end_array(row)[0] - fixed_length;
+  }
+
+  /// Returns the offset within the row and length of the second and further varbinary
+  /// fields.
+  inline void nth_varbinary_offset_and_length(const uint8_t* row, int varbinary_id,
+                                              uint32_t* out_offset,
+                                              uint32_t* out_length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    ARROW_DCHECK(varbinary_id > 0);
+    const uint32_t* varbinary_end = varbinary_end_array(row);
+    uint32_t offset = varbinary_end[varbinary_id - 1];
+    offset += padding_for_alignment(offset, string_alignment);
+    *out_offset = offset;
+    *out_length = varbinary_end[varbinary_id] - offset;
+  }
+
+  uint32_t encoded_field_order(uint32_t icol) const { return column_order[icol]; }
+
+  uint32_t encoded_field_offset(uint32_t icol) const { return column_offsets[icol]; }
+
+  uint32_t num_cols() const { return static_cast<uint32_t>(column_metadatas.size()); }
+
+  uint32_t num_varbinary_cols() const;
+
+  /// \brief Populate this instance to describe `cols` with the given alignment
+  void FromColumnMetadataVector(const std::vector<KeyColumnMetadata>& cols,
+                                int in_row_alignment, int in_string_alignment);
+
+  /// \brief True if `other` has the same number of columns
+  ///   and each column has the same width (two variable length
+  ///   columns are considered to have the same width)
+  bool is_compatible(const RowTableMetadata& other) const;
+};
+
+/// \brief A table of data stored in row-major order
+///
+/// Can only store non-nested data types
+///
+/// Can store both fixed-size data types and variable-length data types
+///
+/// The row table is not safe
+class ARROW_EXPORT RowTableImpl {
+ public:
+  RowTableImpl();
+  /// \brief Initialize a row array for use
+  ///
+  /// This must be called before any other method
+  Status Init(MemoryPool* pool, const RowTableMetadata& metadata);
+  /// \brief Clear all rows from the table
+  ///
+  /// Does not shrink buffers
+  void Clean();
+  /// \brief Add empty rows
+  /// \param num_rows_to_append The number of empty rows to append
+  /// \param num_extra_bytes_to_append For tables storing variable-length data this
+  ///     should be a guess of how many data bytes will be needed to populate the
+  ///     data.  This is ignored if there are no variable-length columns
+  Status AppendEmpty(uint32_t num_rows_to_append, uint32_t num_extra_bytes_to_append);
+  /// \brief Append rows from a source table
+  /// \param from The table to append from
+  /// \param num_rows_to_append The number of rows to append
+  /// \param source_row_ids Indices (into `from`) of the desired rows
+  Status AppendSelectionFrom(const RowTableImpl& from, uint32_t num_rows_to_append,
+                             const uint16_t* source_row_ids);
+  /// \brief Metadata describing the data stored in this table
+  const RowTableMetadata& metadata() const { return metadata_; }
+  /// \brief The number of rows stored in the table
+  int64_t length() const { return num_rows_; }
+  // Accessors into the table's buffers
+  const uint8_t* data(int i) const {
+    ARROW_DCHECK(i >= 0 && i < kMaxBuffers);
+    return buffers_[i];
+  }
+  uint8_t* mutable_data(int i) {
+    ARROW_DCHECK(i >= 0 && i < kMaxBuffers);
+    return mutable_buffers_[i];
+  }
+  const uint32_t* offsets() const { return reinterpret_cast<const uint32_t*>(data(1)); }
+  uint32_t* mutable_offsets() { return reinterpret_cast<uint32_t*>(mutable_data(1)); }
+  const uint8_t* null_masks() const { return null_masks_->data(); }
+  uint8_t* null_masks() { return null_masks_->mutable_data(); }
+
+  /// \brief True if there is a null value anywhere in the table
+  ///
+  /// This calculation is memoized based on the number of rows and assumes
+  /// that values are only appended (and not modified in place) between
+  /// successive calls
+  bool has_any_nulls(const LightContext* ctx) const;
+
+ private:
+  Status ResizeFixedLengthBuffers(int64_t num_extra_rows);
+  Status ResizeOptionalVaryingLengthBuffer(int64_t num_extra_bytes);
+
+  // Helper functions to determine the number of bytes needed for each
+  // buffer given a number of rows.
+  int64_t size_null_masks(int64_t num_rows);
+  int64_t size_offsets(int64_t num_rows);
+  int64_t size_rows_fixed_length(int64_t num_rows);
+  int64_t size_rows_varying_length(int64_t num_bytes);

Review Comment:
   Should these methods be `const`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #13218: ARROW-16590: [C++] Consolidate files dealing with row-major storage

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #13218:
URL: https://github.com/apache/arrow/pull/13218#discussion_r885723070


##########
cpp/src/arrow/compute/exec/key_hash.h:
##########
@@ -45,8 +45,8 @@ class ARROW_EXPORT Hashing32 {
   friend void TestBloomSmall(BloomFilterBuildStrategy, int64_t, int, bool, bool);
 
  public:
-  static void HashMultiColumn(const std::vector<KeyColumnArray>& cols,
-                              KeyEncoder::KeyEncoderContext* ctx, uint32_t* out_hash);
+  static void HashMultiColumn(const std::vector<KeyColumnArray>& cols, LightContext* ctx,
+                              uint32_t* out_hash);

Review Comment:
   For the record, is this a class with only static methods/attributes? This seems like an anti-pattern.



##########
cpp/src/arrow/compute/api_aggregate.h:
##########
@@ -482,6 +404,21 @@ struct ARROW_EXPORT Aggregate {
   const FunctionOptions* options;
 };
 
+Result<std::vector<const HashAggregateKernel*>> GetKernels(

Review Comment:
   Do we need to expose these APIs here, or can there be a separate header file for internal hash-aggregation APIs?



##########
cpp/src/arrow/compute/row/row_internal.h:
##########
@@ -0,0 +1,244 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+#include "arrow/buffer.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+namespace compute {
+
+/// Description of the data stored in a RowTable
+struct ARROW_EXPORT RowTableMetadata {
+  /// \brief True if there are no variable length columns in the table
+  bool is_fixed_length;
+
+  /// For a fixed-length binary row, common size of rows in bytes,
+  /// rounded up to the multiple of alignment.
+  ///
+  /// For a varying-length binary, size of all encoded fixed-length key columns,
+  /// including lengths of varying-length columns, rounded up to the multiple of string
+  /// alignment.
+  uint32_t fixed_length;
+
+  /// Offset within a row to the array of 32-bit offsets within a row of
+  /// ends of varbinary fields.
+  /// Used only when the row is not fixed-length, zero for fixed-length row.
+  /// There are N elements for N varbinary fields.
+  /// Each element is the offset within a row of the first byte after
+  /// the corresponding varbinary field bytes in that row.
+  /// If varbinary fields begin at aligned addresses, than the end of the previous
+  /// varbinary field needs to be rounded up according to the specified alignment
+  /// to obtain the beginning of the next varbinary field.
+  /// The first varbinary field starts at offset specified by fixed_length,
+  /// which should already be aligned.
+  uint32_t varbinary_end_array_offset;
+
+  /// Fixed number of bytes per row that are used to encode null masks.
+  /// Null masks indicate for a single row which of its columns are null.
+  /// Nth bit in the sequence of bytes assigned to a row represents null
+  /// information for Nth field according to the order in which they are encoded.
+  int null_masks_bytes_per_row;
+
+  /// Power of 2. Every row will start at an offset aligned to that number of bytes.
+  int row_alignment;
+
+  /// Power of 2. Must be no greater than row alignment.
+  /// Every non-power-of-2 binary field and every varbinary field bytes
+  /// will start aligned to that number of bytes.
+  int string_alignment;
+
+  /// Metadata of encoded columns in their original order.
+  std::vector<KeyColumnMetadata> column_metadatas;
+
+  /// Order in which fields are encoded.
+  std::vector<uint32_t> column_order;
+
+  /// Offsets within a row to fields in their encoding order.
+  std::vector<uint32_t> column_offsets;
+
+  /// Rounding up offset to the nearest multiple of alignment value.
+  /// Alignment must be a power of 2.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int required_alignment) {
+    ARROW_DCHECK(ARROW_POPCOUNT64(required_alignment) == 1);
+    return static_cast<uint32_t>((-static_cast<int32_t>(offset)) &
+                                 (required_alignment - 1));
+  }
+
+  /// Rounding up offset to the beginning of next column,
+  /// choosing required alignment based on the data type of that column.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int string_alignment,
+                                               const KeyColumnMetadata& col_metadata) {
+    if (!col_metadata.is_fixed_length ||
+        ARROW_POPCOUNT64(col_metadata.fixed_length) <= 1) {
+      return 0;
+    } else {
+      return padding_for_alignment(offset, string_alignment);
+    }
+  }
+
+  /// Returns an array of offsets within a row of ends of varbinary fields.
+  inline const uint32_t* varbinary_end_array(const uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<const uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// \brief An array of mutable offsets within a row of ends of varbinary fields.
+  inline uint32_t* varbinary_end_array(uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// Returns the offset within the row and length of the first varbinary field.
+  inline void first_varbinary_offset_and_length(const uint8_t* row, uint32_t* offset,
+                                                uint32_t* length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    *offset = fixed_length;
+    *length = varbinary_end_array(row)[0] - fixed_length;
+  }
+
+  /// Returns the offset within the row and length of the second and further varbinary
+  /// fields.
+  inline void nth_varbinary_offset_and_length(const uint8_t* row, int varbinary_id,
+                                              uint32_t* out_offset,
+                                              uint32_t* out_length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    ARROW_DCHECK(varbinary_id > 0);
+    const uint32_t* varbinary_end = varbinary_end_array(row);
+    uint32_t offset = varbinary_end[varbinary_id - 1];
+    offset += padding_for_alignment(offset, string_alignment);
+    *out_offset = offset;
+    *out_length = varbinary_end[varbinary_id] - offset;
+  }
+
+  uint32_t encoded_field_order(uint32_t icol) const { return column_order[icol]; }
+
+  uint32_t encoded_field_offset(uint32_t icol) const { return column_offsets[icol]; }
+
+  uint32_t num_cols() const { return static_cast<uint32_t>(column_metadatas.size()); }
+
+  uint32_t num_varbinary_cols() const;
+
+  /// \brief Populate this instance to describe `cols` with the given alignment
+  void FromColumnMetadataVector(const std::vector<KeyColumnMetadata>& cols,
+                                int in_row_alignment, int in_string_alignment);
+
+  /// \brief True if `other` has the same number of columns
+  ///   and each column has the same width (two variable length
+  ///   columns are considered to have the same width)
+  bool is_compatible(const RowTableMetadata& other) const;
+};
+
+/// \brief A table of data stored in row-major order
+///
+/// Can only store non-nested data types
+///
+/// Can store both fixed-size data types and variable-length data types
+class ARROW_EXPORT RowTableImpl {
+ public:
+  RowTableImpl();
+  /// \brief Initialize a row array for use
+  ///
+  /// This must be called before any other method
+  Status Init(MemoryPool* pool, const RowTableMetadata& metadata);
+  /// \brief Clear all rows from the table
+  ///
+  /// Does not shrink buffers
+  void Clean();
+  /// \brief Add empty rows
+  /// \param num_rows_to_append The number of empty rows to append
+  /// \param num_extra_bytes_to_append For tables storing variable-length data this
+  ///     should be a guess of how many data bytes will be needed to populate the
+  ///     data.  This is ignored if there are no variable-length columns
+  Status AppendEmpty(uint32_t num_rows_to_append, uint32_t num_extra_bytes_to_append);
+  /// \brief Append rows from a source table
+  /// \param from The table to append from
+  /// \param num_rows_to_append The number of rows to append
+  /// \param source_row_ids Indices (into `from`) of the desired rows
+  Status AppendSelectionFrom(const RowTableImpl& from, uint32_t num_rows_to_append,
+                             const uint16_t* source_row_ids);
+  /// \brief Metadata describing the data stored in this table
+  const RowTableMetadata& metadata() const { return metadata_; }
+  /// \brief The number of rows stored in the table
+  int64_t length() const { return num_rows_; }
+  // Accessors into the table's buffers
+  const uint8_t* data(int i) const {
+    ARROW_DCHECK(i >= 0 && i <= max_buffers_);

Review Comment:
   Shouldn't this be
   ```suggestion
       ARROW_DCHECK(i >= 0 && i < max_buffers_);
   ```



##########
cpp/src/arrow/compute/light_array.h:
##########
@@ -31,6 +33,18 @@
 namespace arrow {
 namespace compute {
 
+/// \brief Context needed by various execution engine operations
+///
+/// In the execution engine this context is provided by either the node or the
+/// plan and the context exists for the lifetime of the plan.  Defining this here
+/// allows us to take advantage of these resources without coupling the logic with
+/// the execution engine.
+struct LightContext {
+  bool has_avx2() const { return (hardware_flags & arrow::internal::CpuInfo::AVX2) > 0; }

Review Comment:
   Why is this no using `CpuInfo::IsSupported(CpuInfo::AVX2)`?



##########
cpp/src/arrow/compute/row/row_internal.h:
##########
@@ -0,0 +1,244 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+#include "arrow/buffer.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+namespace compute {
+
+/// Description of the data stored in a RowTable
+struct ARROW_EXPORT RowTableMetadata {
+  /// \brief True if there are no variable length columns in the table
+  bool is_fixed_length;
+
+  /// For a fixed-length binary row, common size of rows in bytes,
+  /// rounded up to the multiple of alignment.
+  ///
+  /// For a varying-length binary, size of all encoded fixed-length key columns,
+  /// including lengths of varying-length columns, rounded up to the multiple of string
+  /// alignment.
+  uint32_t fixed_length;
+
+  /// Offset within a row to the array of 32-bit offsets within a row of
+  /// ends of varbinary fields.
+  /// Used only when the row is not fixed-length, zero for fixed-length row.
+  /// There are N elements for N varbinary fields.
+  /// Each element is the offset within a row of the first byte after
+  /// the corresponding varbinary field bytes in that row.
+  /// If varbinary fields begin at aligned addresses, than the end of the previous
+  /// varbinary field needs to be rounded up according to the specified alignment
+  /// to obtain the beginning of the next varbinary field.
+  /// The first varbinary field starts at offset specified by fixed_length,
+  /// which should already be aligned.
+  uint32_t varbinary_end_array_offset;
+
+  /// Fixed number of bytes per row that are used to encode null masks.
+  /// Null masks indicate for a single row which of its columns are null.
+  /// Nth bit in the sequence of bytes assigned to a row represents null
+  /// information for Nth field according to the order in which they are encoded.
+  int null_masks_bytes_per_row;
+
+  /// Power of 2. Every row will start at an offset aligned to that number of bytes.
+  int row_alignment;
+
+  /// Power of 2. Must be no greater than row alignment.
+  /// Every non-power-of-2 binary field and every varbinary field bytes
+  /// will start aligned to that number of bytes.
+  int string_alignment;
+
+  /// Metadata of encoded columns in their original order.
+  std::vector<KeyColumnMetadata> column_metadatas;
+
+  /// Order in which fields are encoded.
+  std::vector<uint32_t> column_order;
+
+  /// Offsets within a row to fields in their encoding order.
+  std::vector<uint32_t> column_offsets;
+
+  /// Rounding up offset to the nearest multiple of alignment value.
+  /// Alignment must be a power of 2.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int required_alignment) {
+    ARROW_DCHECK(ARROW_POPCOUNT64(required_alignment) == 1);
+    return static_cast<uint32_t>((-static_cast<int32_t>(offset)) &
+                                 (required_alignment - 1));
+  }
+
+  /// Rounding up offset to the beginning of next column,
+  /// choosing required alignment based on the data type of that column.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int string_alignment,
+                                               const KeyColumnMetadata& col_metadata) {
+    if (!col_metadata.is_fixed_length ||
+        ARROW_POPCOUNT64(col_metadata.fixed_length) <= 1) {
+      return 0;
+    } else {
+      return padding_for_alignment(offset, string_alignment);
+    }
+  }
+
+  /// Returns an array of offsets within a row of ends of varbinary fields.
+  inline const uint32_t* varbinary_end_array(const uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<const uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// \brief An array of mutable offsets within a row of ends of varbinary fields.
+  inline uint32_t* varbinary_end_array(uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// Returns the offset within the row and length of the first varbinary field.
+  inline void first_varbinary_offset_and_length(const uint8_t* row, uint32_t* offset,
+                                                uint32_t* length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    *offset = fixed_length;
+    *length = varbinary_end_array(row)[0] - fixed_length;
+  }
+
+  /// Returns the offset within the row and length of the second and further varbinary
+  /// fields.
+  inline void nth_varbinary_offset_and_length(const uint8_t* row, int varbinary_id,
+                                              uint32_t* out_offset,
+                                              uint32_t* out_length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    ARROW_DCHECK(varbinary_id > 0);
+    const uint32_t* varbinary_end = varbinary_end_array(row);
+    uint32_t offset = varbinary_end[varbinary_id - 1];
+    offset += padding_for_alignment(offset, string_alignment);
+    *out_offset = offset;
+    *out_length = varbinary_end[varbinary_id] - offset;
+  }
+
+  uint32_t encoded_field_order(uint32_t icol) const { return column_order[icol]; }
+
+  uint32_t encoded_field_offset(uint32_t icol) const { return column_offsets[icol]; }
+
+  uint32_t num_cols() const { return static_cast<uint32_t>(column_metadatas.size()); }
+
+  uint32_t num_varbinary_cols() const;
+
+  /// \brief Populate this instance to describe `cols` with the given alignment
+  void FromColumnMetadataVector(const std::vector<KeyColumnMetadata>& cols,
+                                int in_row_alignment, int in_string_alignment);
+
+  /// \brief True if `other` has the same number of columns
+  ///   and each column has the same width (two variable length
+  ///   columns are considered to have the same width)
+  bool is_compatible(const RowTableMetadata& other) const;
+};
+
+/// \brief A table of data stored in row-major order
+///
+/// Can only store non-nested data types
+///
+/// Can store both fixed-size data types and variable-length data types
+class ARROW_EXPORT RowTableImpl {
+ public:
+  RowTableImpl();
+  /// \brief Initialize a row array for use
+  ///
+  /// This must be called before any other method
+  Status Init(MemoryPool* pool, const RowTableMetadata& metadata);
+  /// \brief Clear all rows from the table
+  ///
+  /// Does not shrink buffers
+  void Clean();
+  /// \brief Add empty rows
+  /// \param num_rows_to_append The number of empty rows to append
+  /// \param num_extra_bytes_to_append For tables storing variable-length data this
+  ///     should be a guess of how many data bytes will be needed to populate the
+  ///     data.  This is ignored if there are no variable-length columns
+  Status AppendEmpty(uint32_t num_rows_to_append, uint32_t num_extra_bytes_to_append);
+  /// \brief Append rows from a source table
+  /// \param from The table to append from
+  /// \param num_rows_to_append The number of rows to append
+  /// \param source_row_ids Indices (into `from`) of the desired rows
+  Status AppendSelectionFrom(const RowTableImpl& from, uint32_t num_rows_to_append,
+                             const uint16_t* source_row_ids);
+  /// \brief Metadata describing the data stored in this table
+  const RowTableMetadata& metadata() const { return metadata_; }
+  /// \brief The number of rows stored in the table
+  int64_t length() const { return num_rows_; }
+  // Accessors into the table's buffers
+  const uint8_t* data(int i) const {
+    ARROW_DCHECK(i >= 0 && i <= max_buffers_);
+    return buffers_[i];
+  }
+  uint8_t* mutable_data(int i) {
+    ARROW_DCHECK(i >= 0 && i <= max_buffers_);

Review Comment:
   Same here.



##########
cpp/src/arrow/compute/row/row_internal.h:
##########
@@ -0,0 +1,244 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+#include "arrow/buffer.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+namespace compute {
+
+/// Description of the data stored in a RowTable
+struct ARROW_EXPORT RowTableMetadata {
+  /// \brief True if there are no variable length columns in the table
+  bool is_fixed_length;
+
+  /// For a fixed-length binary row, common size of rows in bytes,
+  /// rounded up to the multiple of alignment.
+  ///
+  /// For a varying-length binary, size of all encoded fixed-length key columns,
+  /// including lengths of varying-length columns, rounded up to the multiple of string
+  /// alignment.
+  uint32_t fixed_length;

Review Comment:
   Why are some sizes or quantities unsigned and other signed?



##########
cpp/src/arrow/compute/row/encode_internal.h:
##########
@@ -0,0 +1,323 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#pragma once
+
+#include <cstdint>
+#include <memory>
+#include <vector>
+
+#include "arrow/array/data.h"
+#include "arrow/compute/exec.h"
+#include "arrow/compute/exec/util.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/compute/row/row_internal.h"
+#include "arrow/memory_pool.h"
+#include "arrow/result.h"
+#include "arrow/status.h"
+#include "arrow/util/bit_util.h"
+
+namespace arrow {
+namespace compute {
+
+/// Converts between Arrow's typical column representation to a row-based representation
+///
+/// Data is stored as a single array of rows.  Each row combines data from all columns.
+/// The conversion is reversible.
+///
+/// Row-oriented storage is beneficial when there is a need for random access
+/// of individual rows and at the same time all included columns are likely to
+/// be accessed together, as in the case of hash table key.
+///
+/// Does not support nested types
+class RowTableEncoder {
+ public:
+  void Init(const std::vector<KeyColumnMetadata>& cols, LightContext* ctx,
+            int row_alignment, int string_alignment);
+
+  const RowTableMetadata& row_metadata() { return row_metadata_; }
+  // GrouperFastImpl right now needs somewhat intrusive visibility into RowTableEncoder
+  // This could be cleaned up at some point
+  const std::vector<KeyColumnArray>& batch_all_cols() { return batch_all_cols_; }
+
+  /// \brief Prepare to encode a collection of columns
+  /// \param start_row The starting row to encode
+  /// \param num_rows The number of rows to encode
+  /// \param cols The columns to encode.  The order of the columns should
+  ///             be consistent with the order used to create the RowTableMetadata
+  void PrepareEncodeSelected(int64_t start_row, int64_t num_rows,
+                             const std::vector<KeyColumnArray>& cols);
+  /// \brief Encode selection of prepared rows into a row table
+  /// \param rows The output row table
+  /// \param num_selected The number of rows to encode
+  /// \param selection indices of the rows to encode
+  Status EncodeSelected(RowTableImpl* rows, uint32_t num_selected,
+                        const uint16_t* selection);
+
+  /// \brief Decode a window of row oriented data into a corresponding
+  ///        window of column oriented storage.
+  /// \param start_row_input The starting row to decode
+  /// \param start_row_output An offset into the output array to write to
+  /// \param num_rows The number of rows to decode
+  /// \param rows The row table to decode from
+  /// \param cols The columns to decode into, should be sized appropriately
+  ///
+  /// The output buffers need to be correctly allocated and sized before
+  /// calling each method.  For that reason decoding is split into two functions.
+  /// DecodeFixedLengthBuffers processes everything except for varying length
+  /// buffers.
+  /// The output can be used to find out required varying length buffers sizes
+  /// for the call to DecodeVaryingLengthBuffers
+  void DecodeFixedLengthBuffers(int64_t start_row_input, int64_t start_row_output,
+                                int64_t num_rows, const RowTableImpl& rows,
+                                std::vector<KeyColumnArray>* cols);
+
+  /// \brief Decode the varlength columns of a row table into column storage
+  /// \param start_row_input The starting row to decode
+  /// \param start_row_output An offset into the output arrays
+  /// \param num_rows The number of rows to decode
+  /// \param rows The row table to decode from
+  /// \param cols The column arrays to decode into
+  void DecodeVaryingLengthBuffers(int64_t start_row_input, int64_t start_row_output,
+                                  int64_t num_rows, const RowTableImpl& rows,
+                                  std::vector<KeyColumnArray>* cols);
+
+ private:
+  /// Prepare column array vectors.
+  /// Output column arrays represent a range of input column arrays
+  /// specified by starting row and number of rows.
+  /// Three vectors are generated:
+  /// - all columns
+  /// - fixed-length columns only
+  /// - varying-length columns only
+  void PrepareKeyColumnArrays(int64_t start_row, int64_t num_rows,
+                              const std::vector<KeyColumnArray>& cols_in);
+
+  LightContext* ctx_;
+
+  // Data initialized once, based on data types of key columns
+  RowTableMetadata row_metadata_;
+
+  // Data initialized for each input batch.
+  // All elements are ordered according to the order of encoded fields in a row.
+  std::vector<KeyColumnArray> batch_all_cols_;
+  std::vector<KeyColumnArray> batch_varbinary_cols_;
+  std::vector<uint32_t> batch_varbinary_cols_base_offsets_;
+};
+
+class EncoderInteger {

Review Comment:
   Do these all have to be exposed in a `.h`?



##########
cpp/src/arrow/compute/row/row_internal.h:
##########
@@ -0,0 +1,244 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+#include "arrow/buffer.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+namespace compute {
+
+/// Description of the data stored in a RowTable
+struct ARROW_EXPORT RowTableMetadata {
+  /// \brief True if there are no variable length columns in the table
+  bool is_fixed_length;
+
+  /// For a fixed-length binary row, common size of rows in bytes,
+  /// rounded up to the multiple of alignment.
+  ///
+  /// For a varying-length binary, size of all encoded fixed-length key columns,
+  /// including lengths of varying-length columns, rounded up to the multiple of string
+  /// alignment.
+  uint32_t fixed_length;
+
+  /// Offset within a row to the array of 32-bit offsets within a row of
+  /// ends of varbinary fields.
+  /// Used only when the row is not fixed-length, zero for fixed-length row.
+  /// There are N elements for N varbinary fields.
+  /// Each element is the offset within a row of the first byte after
+  /// the corresponding varbinary field bytes in that row.
+  /// If varbinary fields begin at aligned addresses, than the end of the previous
+  /// varbinary field needs to be rounded up according to the specified alignment
+  /// to obtain the beginning of the next varbinary field.
+  /// The first varbinary field starts at offset specified by fixed_length,
+  /// which should already be aligned.
+  uint32_t varbinary_end_array_offset;
+
+  /// Fixed number of bytes per row that are used to encode null masks.
+  /// Null masks indicate for a single row which of its columns are null.
+  /// Nth bit in the sequence of bytes assigned to a row represents null
+  /// information for Nth field according to the order in which they are encoded.
+  int null_masks_bytes_per_row;
+
+  /// Power of 2. Every row will start at an offset aligned to that number of bytes.
+  int row_alignment;
+
+  /// Power of 2. Must be no greater than row alignment.
+  /// Every non-power-of-2 binary field and every varbinary field bytes
+  /// will start aligned to that number of bytes.
+  int string_alignment;
+
+  /// Metadata of encoded columns in their original order.
+  std::vector<KeyColumnMetadata> column_metadatas;
+
+  /// Order in which fields are encoded.
+  std::vector<uint32_t> column_order;
+
+  /// Offsets within a row to fields in their encoding order.
+  std::vector<uint32_t> column_offsets;
+
+  /// Rounding up offset to the nearest multiple of alignment value.
+  /// Alignment must be a power of 2.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int required_alignment) {
+    ARROW_DCHECK(ARROW_POPCOUNT64(required_alignment) == 1);
+    return static_cast<uint32_t>((-static_cast<int32_t>(offset)) &
+                                 (required_alignment - 1));
+  }
+
+  /// Rounding up offset to the beginning of next column,
+  /// choosing required alignment based on the data type of that column.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int string_alignment,
+                                               const KeyColumnMetadata& col_metadata) {
+    if (!col_metadata.is_fixed_length ||
+        ARROW_POPCOUNT64(col_metadata.fixed_length) <= 1) {
+      return 0;
+    } else {
+      return padding_for_alignment(offset, string_alignment);
+    }
+  }
+
+  /// Returns an array of offsets within a row of ends of varbinary fields.
+  inline const uint32_t* varbinary_end_array(const uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<const uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// \brief An array of mutable offsets within a row of ends of varbinary fields.
+  inline uint32_t* varbinary_end_array(uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// Returns the offset within the row and length of the first varbinary field.
+  inline void first_varbinary_offset_and_length(const uint8_t* row, uint32_t* offset,
+                                                uint32_t* length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    *offset = fixed_length;
+    *length = varbinary_end_array(row)[0] - fixed_length;
+  }
+
+  /// Returns the offset within the row and length of the second and further varbinary
+  /// fields.
+  inline void nth_varbinary_offset_and_length(const uint8_t* row, int varbinary_id,
+                                              uint32_t* out_offset,
+                                              uint32_t* out_length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    ARROW_DCHECK(varbinary_id > 0);
+    const uint32_t* varbinary_end = varbinary_end_array(row);
+    uint32_t offset = varbinary_end[varbinary_id - 1];
+    offset += padding_for_alignment(offset, string_alignment);
+    *out_offset = offset;
+    *out_length = varbinary_end[varbinary_id] - offset;
+  }
+
+  uint32_t encoded_field_order(uint32_t icol) const { return column_order[icol]; }
+
+  uint32_t encoded_field_offset(uint32_t icol) const { return column_offsets[icol]; }
+
+  uint32_t num_cols() const { return static_cast<uint32_t>(column_metadatas.size()); }
+
+  uint32_t num_varbinary_cols() const;
+
+  /// \brief Populate this instance to describe `cols` with the given alignment
+  void FromColumnMetadataVector(const std::vector<KeyColumnMetadata>& cols,
+                                int in_row_alignment, int in_string_alignment);
+
+  /// \brief True if `other` has the same number of columns
+  ///   and each column has the same width (two variable length
+  ///   columns are considered to have the same width)
+  bool is_compatible(const RowTableMetadata& other) const;
+};
+
+/// \brief A table of data stored in row-major order
+///
+/// Can only store non-nested data types
+///
+/// Can store both fixed-size data types and variable-length data types
+class ARROW_EXPORT RowTableImpl {
+ public:
+  RowTableImpl();
+  /// \brief Initialize a row array for use
+  ///
+  /// This must be called before any other method
+  Status Init(MemoryPool* pool, const RowTableMetadata& metadata);
+  /// \brief Clear all rows from the table
+  ///
+  /// Does not shrink buffers
+  void Clean();
+  /// \brief Add empty rows
+  /// \param num_rows_to_append The number of empty rows to append
+  /// \param num_extra_bytes_to_append For tables storing variable-length data this
+  ///     should be a guess of how many data bytes will be needed to populate the
+  ///     data.  This is ignored if there are no variable-length columns
+  Status AppendEmpty(uint32_t num_rows_to_append, uint32_t num_extra_bytes_to_append);
+  /// \brief Append rows from a source table
+  /// \param from The table to append from
+  /// \param num_rows_to_append The number of rows to append
+  /// \param source_row_ids Indices (into `from`) of the desired rows
+  Status AppendSelectionFrom(const RowTableImpl& from, uint32_t num_rows_to_append,
+                             const uint16_t* source_row_ids);
+  /// \brief Metadata describing the data stored in this table
+  const RowTableMetadata& metadata() const { return metadata_; }
+  /// \brief The number of rows stored in the table
+  int64_t length() const { return num_rows_; }
+  // Accessors into the table's buffers
+  const uint8_t* data(int i) const {
+    ARROW_DCHECK(i >= 0 && i <= max_buffers_);
+    return buffers_[i];
+  }
+  uint8_t* mutable_data(int i) {
+    ARROW_DCHECK(i >= 0 && i <= max_buffers_);
+    return mutable_buffers_[i];
+  }
+  const uint32_t* offsets() const { return reinterpret_cast<const uint32_t*>(data(1)); }
+  uint32_t* mutable_offsets() { return reinterpret_cast<uint32_t*>(mutable_data(1)); }
+  const uint8_t* null_masks() const { return null_masks_->data(); }
+  uint8_t* null_masks() { return null_masks_->mutable_data(); }
+
+  /// \brief True if there is a null value anywhere in the table
+  ///
+  /// This calculation is memoized based on the number of rows and assumes
+  /// that values are only appended (and not modified in place) between
+  /// successive calls
+  bool has_any_nulls(const LightContext* ctx) const;
+
+ private:
+  Status ResizeFixedLengthBuffers(int64_t num_extra_rows);
+  Status ResizeOptionalVaryingLengthBuffer(int64_t num_extra_bytes);
+
+  // Helper functions to determine the number of bytes needed for each
+  // buffer given a number of rows.
+  int64_t size_null_masks(int64_t num_rows);
+  int64_t size_offsets(int64_t num_rows);
+  int64_t size_rows_fixed_length(int64_t num_rows);
+  int64_t size_rows_varying_length(int64_t num_bytes);
+  // Called after resize to fix pointers
+  void update_buffer_pointers();
+
+  static constexpr int64_t padding_for_vectors = 64;

Review Comment:
   ```suggestion
     static constexpr int64_t kPaddingForVectors = 64;
   ```



##########
cpp/src/arrow/compute/row/row_internal.h:
##########
@@ -0,0 +1,244 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+#include "arrow/buffer.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+namespace compute {
+
+/// Description of the data stored in a RowTable
+struct ARROW_EXPORT RowTableMetadata {
+  /// \brief True if there are no variable length columns in the table
+  bool is_fixed_length;
+
+  /// For a fixed-length binary row, common size of rows in bytes,
+  /// rounded up to the multiple of alignment.
+  ///
+  /// For a varying-length binary, size of all encoded fixed-length key columns,
+  /// including lengths of varying-length columns, rounded up to the multiple of string
+  /// alignment.
+  uint32_t fixed_length;
+
+  /// Offset within a row to the array of 32-bit offsets within a row of
+  /// ends of varbinary fields.
+  /// Used only when the row is not fixed-length, zero for fixed-length row.
+  /// There are N elements for N varbinary fields.
+  /// Each element is the offset within a row of the first byte after
+  /// the corresponding varbinary field bytes in that row.
+  /// If varbinary fields begin at aligned addresses, than the end of the previous
+  /// varbinary field needs to be rounded up according to the specified alignment
+  /// to obtain the beginning of the next varbinary field.
+  /// The first varbinary field starts at offset specified by fixed_length,
+  /// which should already be aligned.
+  uint32_t varbinary_end_array_offset;
+
+  /// Fixed number of bytes per row that are used to encode null masks.
+  /// Null masks indicate for a single row which of its columns are null.
+  /// Nth bit in the sequence of bytes assigned to a row represents null
+  /// information for Nth field according to the order in which they are encoded.
+  int null_masks_bytes_per_row;
+
+  /// Power of 2. Every row will start at an offset aligned to that number of bytes.
+  int row_alignment;
+
+  /// Power of 2. Must be no greater than row alignment.
+  /// Every non-power-of-2 binary field and every varbinary field bytes
+  /// will start aligned to that number of bytes.
+  int string_alignment;
+
+  /// Metadata of encoded columns in their original order.
+  std::vector<KeyColumnMetadata> column_metadatas;
+
+  /// Order in which fields are encoded.
+  std::vector<uint32_t> column_order;
+
+  /// Offsets within a row to fields in their encoding order.
+  std::vector<uint32_t> column_offsets;
+
+  /// Rounding up offset to the nearest multiple of alignment value.
+  /// Alignment must be a power of 2.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int required_alignment) {
+    ARROW_DCHECK(ARROW_POPCOUNT64(required_alignment) == 1);
+    return static_cast<uint32_t>((-static_cast<int32_t>(offset)) &
+                                 (required_alignment - 1));
+  }
+
+  /// Rounding up offset to the beginning of next column,
+  /// choosing required alignment based on the data type of that column.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int string_alignment,
+                                               const KeyColumnMetadata& col_metadata) {
+    if (!col_metadata.is_fixed_length ||
+        ARROW_POPCOUNT64(col_metadata.fixed_length) <= 1) {
+      return 0;
+    } else {
+      return padding_for_alignment(offset, string_alignment);
+    }
+  }
+
+  /// Returns an array of offsets within a row of ends of varbinary fields.
+  inline const uint32_t* varbinary_end_array(const uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<const uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// \brief An array of mutable offsets within a row of ends of varbinary fields.
+  inline uint32_t* varbinary_end_array(uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// Returns the offset within the row and length of the first varbinary field.
+  inline void first_varbinary_offset_and_length(const uint8_t* row, uint32_t* offset,
+                                                uint32_t* length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    *offset = fixed_length;
+    *length = varbinary_end_array(row)[0] - fixed_length;
+  }
+
+  /// Returns the offset within the row and length of the second and further varbinary
+  /// fields.
+  inline void nth_varbinary_offset_and_length(const uint8_t* row, int varbinary_id,
+                                              uint32_t* out_offset,
+                                              uint32_t* out_length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    ARROW_DCHECK(varbinary_id > 0);
+    const uint32_t* varbinary_end = varbinary_end_array(row);
+    uint32_t offset = varbinary_end[varbinary_id - 1];
+    offset += padding_for_alignment(offset, string_alignment);
+    *out_offset = offset;
+    *out_length = varbinary_end[varbinary_id] - offset;
+  }
+
+  uint32_t encoded_field_order(uint32_t icol) const { return column_order[icol]; }
+
+  uint32_t encoded_field_offset(uint32_t icol) const { return column_offsets[icol]; }
+
+  uint32_t num_cols() const { return static_cast<uint32_t>(column_metadatas.size()); }
+
+  uint32_t num_varbinary_cols() const;
+
+  /// \brief Populate this instance to describe `cols` with the given alignment
+  void FromColumnMetadataVector(const std::vector<KeyColumnMetadata>& cols,
+                                int in_row_alignment, int in_string_alignment);
+
+  /// \brief True if `other` has the same number of columns
+  ///   and each column has the same width (two variable length
+  ///   columns are considered to have the same width)
+  bool is_compatible(const RowTableMetadata& other) const;
+};
+
+/// \brief A table of data stored in row-major order
+///
+/// Can only store non-nested data types
+///
+/// Can store both fixed-size data types and variable-length data types
+class ARROW_EXPORT RowTableImpl {
+ public:
+  RowTableImpl();
+  /// \brief Initialize a row array for use
+  ///
+  /// This must be called before any other method
+  Status Init(MemoryPool* pool, const RowTableMetadata& metadata);
+  /// \brief Clear all rows from the table
+  ///
+  /// Does not shrink buffers
+  void Clean();
+  /// \brief Add empty rows
+  /// \param num_rows_to_append The number of empty rows to append
+  /// \param num_extra_bytes_to_append For tables storing variable-length data this
+  ///     should be a guess of how many data bytes will be needed to populate the
+  ///     data.  This is ignored if there are no variable-length columns
+  Status AppendEmpty(uint32_t num_rows_to_append, uint32_t num_extra_bytes_to_append);
+  /// \brief Append rows from a source table
+  /// \param from The table to append from
+  /// \param num_rows_to_append The number of rows to append
+  /// \param source_row_ids Indices (into `from`) of the desired rows
+  Status AppendSelectionFrom(const RowTableImpl& from, uint32_t num_rows_to_append,
+                             const uint16_t* source_row_ids);
+  /// \brief Metadata describing the data stored in this table
+  const RowTableMetadata& metadata() const { return metadata_; }
+  /// \brief The number of rows stored in the table
+  int64_t length() const { return num_rows_; }
+  // Accessors into the table's buffers
+  const uint8_t* data(int i) const {
+    ARROW_DCHECK(i >= 0 && i <= max_buffers_);
+    return buffers_[i];
+  }
+  uint8_t* mutable_data(int i) {
+    ARROW_DCHECK(i >= 0 && i <= max_buffers_);
+    return mutable_buffers_[i];
+  }
+  const uint32_t* offsets() const { return reinterpret_cast<const uint32_t*>(data(1)); }
+  uint32_t* mutable_offsets() { return reinterpret_cast<uint32_t*>(mutable_data(1)); }
+  const uint8_t* null_masks() const { return null_masks_->data(); }
+  uint8_t* null_masks() { return null_masks_->mutable_data(); }
+
+  /// \brief True if there is a null value anywhere in the table
+  ///
+  /// This calculation is memoized based on the number of rows and assumes
+  /// that values are only appended (and not modified in place) between
+  /// successive calls
+  bool has_any_nulls(const LightContext* ctx) const;
+
+ private:
+  Status ResizeFixedLengthBuffers(int64_t num_extra_rows);
+  Status ResizeOptionalVaryingLengthBuffer(int64_t num_extra_bytes);
+
+  // Helper functions to determine the number of bytes needed for each
+  // buffer given a number of rows.
+  int64_t size_null_masks(int64_t num_rows);
+  int64_t size_offsets(int64_t num_rows);
+  int64_t size_rows_fixed_length(int64_t num_rows);
+  int64_t size_rows_varying_length(int64_t num_bytes);
+  // Called after resize to fix pointers
+  void update_buffer_pointers();
+
+  static constexpr int64_t padding_for_vectors = 64;
+  MemoryPool* pool_;
+  RowTableMetadata metadata_;
+  // Buffers can only expand during lifetime and never shrink.
+  std::unique_ptr<ResizableBuffer> null_masks_;
+  // Only used if the table has variable-length columns
+  // Stores the offsets into the binary data
+  std::unique_ptr<ResizableBuffer> offsets_;
+  // Stores the fixed-length parts of the rows
+  std::unique_ptr<ResizableBuffer> rows_;
+  static constexpr int max_buffers_ = 3;

Review Comment:
   ```suggestion
     static constexpr int kMaxBuffers = 3;
   ```



##########
cpp/src/arrow/compute/row/row_internal.h:
##########
@@ -0,0 +1,244 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+#include "arrow/buffer.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+namespace compute {
+
+/// Description of the data stored in a RowTable
+struct ARROW_EXPORT RowTableMetadata {
+  /// \brief True if there are no variable length columns in the table
+  bool is_fixed_length;
+
+  /// For a fixed-length binary row, common size of rows in bytes,
+  /// rounded up to the multiple of alignment.
+  ///
+  /// For a varying-length binary, size of all encoded fixed-length key columns,
+  /// including lengths of varying-length columns, rounded up to the multiple of string
+  /// alignment.
+  uint32_t fixed_length;
+
+  /// Offset within a row to the array of 32-bit offsets within a row of
+  /// ends of varbinary fields.
+  /// Used only when the row is not fixed-length, zero for fixed-length row.
+  /// There are N elements for N varbinary fields.
+  /// Each element is the offset within a row of the first byte after
+  /// the corresponding varbinary field bytes in that row.
+  /// If varbinary fields begin at aligned addresses, than the end of the previous
+  /// varbinary field needs to be rounded up according to the specified alignment
+  /// to obtain the beginning of the next varbinary field.
+  /// The first varbinary field starts at offset specified by fixed_length,
+  /// which should already be aligned.
+  uint32_t varbinary_end_array_offset;
+
+  /// Fixed number of bytes per row that are used to encode null masks.
+  /// Null masks indicate for a single row which of its columns are null.
+  /// Nth bit in the sequence of bytes assigned to a row represents null
+  /// information for Nth field according to the order in which they are encoded.
+  int null_masks_bytes_per_row;
+
+  /// Power of 2. Every row will start at an offset aligned to that number of bytes.
+  int row_alignment;
+
+  /// Power of 2. Must be no greater than row alignment.
+  /// Every non-power-of-2 binary field and every varbinary field bytes
+  /// will start aligned to that number of bytes.
+  int string_alignment;
+
+  /// Metadata of encoded columns in their original order.
+  std::vector<KeyColumnMetadata> column_metadatas;
+
+  /// Order in which fields are encoded.
+  std::vector<uint32_t> column_order;
+
+  /// Offsets within a row to fields in their encoding order.
+  std::vector<uint32_t> column_offsets;
+
+  /// Rounding up offset to the nearest multiple of alignment value.
+  /// Alignment must be a power of 2.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int required_alignment) {
+    ARROW_DCHECK(ARROW_POPCOUNT64(required_alignment) == 1);
+    return static_cast<uint32_t>((-static_cast<int32_t>(offset)) &
+                                 (required_alignment - 1));
+  }
+
+  /// Rounding up offset to the beginning of next column,
+  /// choosing required alignment based on the data type of that column.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int string_alignment,
+                                               const KeyColumnMetadata& col_metadata) {
+    if (!col_metadata.is_fixed_length ||
+        ARROW_POPCOUNT64(col_metadata.fixed_length) <= 1) {
+      return 0;
+    } else {
+      return padding_for_alignment(offset, string_alignment);
+    }
+  }
+
+  /// Returns an array of offsets within a row of ends of varbinary fields.
+  inline const uint32_t* varbinary_end_array(const uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<const uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// \brief An array of mutable offsets within a row of ends of varbinary fields.
+  inline uint32_t* varbinary_end_array(uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// Returns the offset within the row and length of the first varbinary field.
+  inline void first_varbinary_offset_and_length(const uint8_t* row, uint32_t* offset,
+                                                uint32_t* length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    *offset = fixed_length;
+    *length = varbinary_end_array(row)[0] - fixed_length;
+  }
+
+  /// Returns the offset within the row and length of the second and further varbinary
+  /// fields.
+  inline void nth_varbinary_offset_and_length(const uint8_t* row, int varbinary_id,
+                                              uint32_t* out_offset,
+                                              uint32_t* out_length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    ARROW_DCHECK(varbinary_id > 0);
+    const uint32_t* varbinary_end = varbinary_end_array(row);
+    uint32_t offset = varbinary_end[varbinary_id - 1];
+    offset += padding_for_alignment(offset, string_alignment);
+    *out_offset = offset;
+    *out_length = varbinary_end[varbinary_id] - offset;
+  }
+
+  uint32_t encoded_field_order(uint32_t icol) const { return column_order[icol]; }
+
+  uint32_t encoded_field_offset(uint32_t icol) const { return column_offsets[icol]; }
+
+  uint32_t num_cols() const { return static_cast<uint32_t>(column_metadatas.size()); }
+
+  uint32_t num_varbinary_cols() const;
+
+  /// \brief Populate this instance to describe `cols` with the given alignment
+  void FromColumnMetadataVector(const std::vector<KeyColumnMetadata>& cols,
+                                int in_row_alignment, int in_string_alignment);
+
+  /// \brief True if `other` has the same number of columns
+  ///   and each column has the same width (two variable length
+  ///   columns are considered to have the same width)
+  bool is_compatible(const RowTableMetadata& other) const;
+};
+
+/// \brief A table of data stored in row-major order
+///
+/// Can only store non-nested data types
+///
+/// Can store both fixed-size data types and variable-length data types
+class ARROW_EXPORT RowTableImpl {
+ public:
+  RowTableImpl();
+  /// \brief Initialize a row array for use
+  ///
+  /// This must be called before any other method
+  Status Init(MemoryPool* pool, const RowTableMetadata& metadata);
+  /// \brief Clear all rows from the table
+  ///
+  /// Does not shrink buffers
+  void Clean();
+  /// \brief Add empty rows
+  /// \param num_rows_to_append The number of empty rows to append
+  /// \param num_extra_bytes_to_append For tables storing variable-length data this
+  ///     should be a guess of how many data bytes will be needed to populate the
+  ///     data.  This is ignored if there are no variable-length columns
+  Status AppendEmpty(uint32_t num_rows_to_append, uint32_t num_extra_bytes_to_append);
+  /// \brief Append rows from a source table
+  /// \param from The table to append from
+  /// \param num_rows_to_append The number of rows to append
+  /// \param source_row_ids Indices (into `from`) of the desired rows
+  Status AppendSelectionFrom(const RowTableImpl& from, uint32_t num_rows_to_append,
+                             const uint16_t* source_row_ids);
+  /// \brief Metadata describing the data stored in this table
+  const RowTableMetadata& metadata() const { return metadata_; }
+  /// \brief The number of rows stored in the table
+  int64_t length() const { return num_rows_; }
+  // Accessors into the table's buffers
+  const uint8_t* data(int i) const {
+    ARROW_DCHECK(i >= 0 && i <= max_buffers_);
+    return buffers_[i];
+  }
+  uint8_t* mutable_data(int i) {
+    ARROW_DCHECK(i >= 0 && i <= max_buffers_);
+    return mutable_buffers_[i];
+  }
+  const uint32_t* offsets() const { return reinterpret_cast<const uint32_t*>(data(1)); }
+  uint32_t* mutable_offsets() { return reinterpret_cast<uint32_t*>(mutable_data(1)); }
+  const uint8_t* null_masks() const { return null_masks_->data(); }
+  uint8_t* null_masks() { return null_masks_->mutable_data(); }
+
+  /// \brief True if there is a null value anywhere in the table
+  ///
+  /// This calculation is memoized based on the number of rows and assumes
+  /// that values are only appended (and not modified in place) between
+  /// successive calls
+  bool has_any_nulls(const LightContext* ctx) const;
+
+ private:
+  Status ResizeFixedLengthBuffers(int64_t num_extra_rows);
+  Status ResizeOptionalVaryingLengthBuffer(int64_t num_extra_bytes);
+
+  // Helper functions to determine the number of bytes needed for each
+  // buffer given a number of rows.
+  int64_t size_null_masks(int64_t num_rows);
+  int64_t size_offsets(int64_t num_rows);
+  int64_t size_rows_fixed_length(int64_t num_rows);
+  int64_t size_rows_varying_length(int64_t num_bytes);
+  // Called after resize to fix pointers
+  void update_buffer_pointers();
+
+  static constexpr int64_t padding_for_vectors = 64;
+  MemoryPool* pool_;
+  RowTableMetadata metadata_;
+  // Buffers can only expand during lifetime and never shrink.
+  std::unique_ptr<ResizableBuffer> null_masks_;
+  // Only used if the table has variable-length columns
+  // Stores the offsets into the binary data

Review Comment:
   Where is the binary data stored?



##########
cpp/src/arrow/compute/row/row_internal.h:
##########
@@ -0,0 +1,244 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+#include "arrow/buffer.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+namespace compute {
+
+/// Description of the data stored in a RowTable
+struct ARROW_EXPORT RowTableMetadata {
+  /// \brief True if there are no variable length columns in the table
+  bool is_fixed_length;
+
+  /// For a fixed-length binary row, common size of rows in bytes,
+  /// rounded up to the multiple of alignment.
+  ///
+  /// For a varying-length binary, size of all encoded fixed-length key columns,
+  /// including lengths of varying-length columns, rounded up to the multiple of string
+  /// alignment.
+  uint32_t fixed_length;
+
+  /// Offset within a row to the array of 32-bit offsets within a row of
+  /// ends of varbinary fields.
+  /// Used only when the row is not fixed-length, zero for fixed-length row.
+  /// There are N elements for N varbinary fields.
+  /// Each element is the offset within a row of the first byte after
+  /// the corresponding varbinary field bytes in that row.
+  /// If varbinary fields begin at aligned addresses, than the end of the previous
+  /// varbinary field needs to be rounded up according to the specified alignment
+  /// to obtain the beginning of the next varbinary field.
+  /// The first varbinary field starts at offset specified by fixed_length,
+  /// which should already be aligned.
+  uint32_t varbinary_end_array_offset;
+
+  /// Fixed number of bytes per row that are used to encode null masks.
+  /// Null masks indicate for a single row which of its columns are null.
+  /// Nth bit in the sequence of bytes assigned to a row represents null
+  /// information for Nth field according to the order in which they are encoded.
+  int null_masks_bytes_per_row;
+
+  /// Power of 2. Every row will start at an offset aligned to that number of bytes.
+  int row_alignment;
+
+  /// Power of 2. Must be no greater than row alignment.
+  /// Every non-power-of-2 binary field and every varbinary field bytes
+  /// will start aligned to that number of bytes.
+  int string_alignment;
+
+  /// Metadata of encoded columns in their original order.
+  std::vector<KeyColumnMetadata> column_metadatas;
+
+  /// Order in which fields are encoded.
+  std::vector<uint32_t> column_order;
+
+  /// Offsets within a row to fields in their encoding order.
+  std::vector<uint32_t> column_offsets;
+
+  /// Rounding up offset to the nearest multiple of alignment value.
+  /// Alignment must be a power of 2.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int required_alignment) {
+    ARROW_DCHECK(ARROW_POPCOUNT64(required_alignment) == 1);
+    return static_cast<uint32_t>((-static_cast<int32_t>(offset)) &
+                                 (required_alignment - 1));
+  }
+
+  /// Rounding up offset to the beginning of next column,
+  /// choosing required alignment based on the data type of that column.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int string_alignment,
+                                               const KeyColumnMetadata& col_metadata) {
+    if (!col_metadata.is_fixed_length ||
+        ARROW_POPCOUNT64(col_metadata.fixed_length) <= 1) {
+      return 0;
+    } else {
+      return padding_for_alignment(offset, string_alignment);
+    }
+  }
+
+  /// Returns an array of offsets within a row of ends of varbinary fields.
+  inline const uint32_t* varbinary_end_array(const uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<const uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// \brief An array of mutable offsets within a row of ends of varbinary fields.
+  inline uint32_t* varbinary_end_array(uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// Returns the offset within the row and length of the first varbinary field.
+  inline void first_varbinary_offset_and_length(const uint8_t* row, uint32_t* offset,
+                                                uint32_t* length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    *offset = fixed_length;
+    *length = varbinary_end_array(row)[0] - fixed_length;
+  }
+
+  /// Returns the offset within the row and length of the second and further varbinary
+  /// fields.
+  inline void nth_varbinary_offset_and_length(const uint8_t* row, int varbinary_id,
+                                              uint32_t* out_offset,
+                                              uint32_t* out_length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    ARROW_DCHECK(varbinary_id > 0);
+    const uint32_t* varbinary_end = varbinary_end_array(row);
+    uint32_t offset = varbinary_end[varbinary_id - 1];
+    offset += padding_for_alignment(offset, string_alignment);
+    *out_offset = offset;
+    *out_length = varbinary_end[varbinary_id] - offset;
+  }
+
+  uint32_t encoded_field_order(uint32_t icol) const { return column_order[icol]; }
+
+  uint32_t encoded_field_offset(uint32_t icol) const { return column_offsets[icol]; }
+
+  uint32_t num_cols() const { return static_cast<uint32_t>(column_metadatas.size()); }
+
+  uint32_t num_varbinary_cols() const;
+
+  /// \brief Populate this instance to describe `cols` with the given alignment
+  void FromColumnMetadataVector(const std::vector<KeyColumnMetadata>& cols,
+                                int in_row_alignment, int in_string_alignment);
+
+  /// \brief True if `other` has the same number of columns
+  ///   and each column has the same width (two variable length
+  ///   columns are considered to have the same width)
+  bool is_compatible(const RowTableMetadata& other) const;
+};
+
+/// \brief A table of data stored in row-major order
+///
+/// Can only store non-nested data types
+///
+/// Can store both fixed-size data types and variable-length data types
+class ARROW_EXPORT RowTableImpl {
+ public:
+  RowTableImpl();
+  /// \brief Initialize a row array for use
+  ///
+  /// This must be called before any other method
+  Status Init(MemoryPool* pool, const RowTableMetadata& metadata);
+  /// \brief Clear all rows from the table
+  ///
+  /// Does not shrink buffers
+  void Clean();
+  /// \brief Add empty rows
+  /// \param num_rows_to_append The number of empty rows to append
+  /// \param num_extra_bytes_to_append For tables storing variable-length data this
+  ///     should be a guess of how many data bytes will be needed to populate the
+  ///     data.  This is ignored if there are no variable-length columns
+  Status AppendEmpty(uint32_t num_rows_to_append, uint32_t num_extra_bytes_to_append);
+  /// \brief Append rows from a source table
+  /// \param from The table to append from
+  /// \param num_rows_to_append The number of rows to append
+  /// \param source_row_ids Indices (into `from`) of the desired rows
+  Status AppendSelectionFrom(const RowTableImpl& from, uint32_t num_rows_to_append,
+                             const uint16_t* source_row_ids);
+  /// \brief Metadata describing the data stored in this table
+  const RowTableMetadata& metadata() const { return metadata_; }
+  /// \brief The number of rows stored in the table
+  int64_t length() const { return num_rows_; }
+  // Accessors into the table's buffers
+  const uint8_t* data(int i) const {
+    ARROW_DCHECK(i >= 0 && i <= max_buffers_);
+    return buffers_[i];
+  }
+  uint8_t* mutable_data(int i) {
+    ARROW_DCHECK(i >= 0 && i <= max_buffers_);
+    return mutable_buffers_[i];
+  }
+  const uint32_t* offsets() const { return reinterpret_cast<const uint32_t*>(data(1)); }
+  uint32_t* mutable_offsets() { return reinterpret_cast<uint32_t*>(mutable_data(1)); }
+  const uint8_t* null_masks() const { return null_masks_->data(); }
+  uint8_t* null_masks() { return null_masks_->mutable_data(); }
+
+  /// \brief True if there is a null value anywhere in the table
+  ///
+  /// This calculation is memoized based on the number of rows and assumes
+  /// that values are only appended (and not modified in place) between
+  /// successive calls
+  bool has_any_nulls(const LightContext* ctx) const;
+
+ private:
+  Status ResizeFixedLengthBuffers(int64_t num_extra_rows);
+  Status ResizeOptionalVaryingLengthBuffer(int64_t num_extra_bytes);
+
+  // Helper functions to determine the number of bytes needed for each
+  // buffer given a number of rows.
+  int64_t size_null_masks(int64_t num_rows);
+  int64_t size_offsets(int64_t num_rows);
+  int64_t size_rows_fixed_length(int64_t num_rows);
+  int64_t size_rows_varying_length(int64_t num_bytes);
+  // Called after resize to fix pointers
+  void update_buffer_pointers();
+
+  static constexpr int64_t padding_for_vectors = 64;
+  MemoryPool* pool_;
+  RowTableMetadata metadata_;
+  // Buffers can only expand during lifetime and never shrink.
+  std::unique_ptr<ResizableBuffer> null_masks_;
+  // Only used if the table has variable-length columns
+  // Stores the offsets into the binary data
+  std::unique_ptr<ResizableBuffer> offsets_;
+  // Stores the fixed-length parts of the rows
+  std::unique_ptr<ResizableBuffer> rows_;
+  static constexpr int max_buffers_ = 3;
+  const uint8_t* buffers_[max_buffers_];
+  uint8_t* mutable_buffers_[max_buffers_];
+  // The number of rows in the table
+  int64_t num_rows_;
+  // The number of rows that can be stored in the table without resizing
+  int64_t rows_capacity_;
+  // The number of bytes that can be stored in the table without resizing
+  int64_t bytes_capacity_;
+
+  // Mutable to allow lazy evaluation

Review Comment:
   Should these be atomic or is the row table not thread safe?



##########
cpp/src/arrow/compute/row/row_internal.h:
##########
@@ -0,0 +1,244 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+#include "arrow/buffer.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+namespace compute {
+
+/// Description of the data stored in a RowTable
+struct ARROW_EXPORT RowTableMetadata {
+  /// \brief True if there are no variable length columns in the table
+  bool is_fixed_length;
+
+  /// For a fixed-length binary row, common size of rows in bytes,
+  /// rounded up to the multiple of alignment.
+  ///
+  /// For a varying-length binary, size of all encoded fixed-length key columns,
+  /// including lengths of varying-length columns, rounded up to the multiple of string
+  /// alignment.
+  uint32_t fixed_length;
+
+  /// Offset within a row to the array of 32-bit offsets within a row of
+  /// ends of varbinary fields.
+  /// Used only when the row is not fixed-length, zero for fixed-length row.
+  /// There are N elements for N varbinary fields.
+  /// Each element is the offset within a row of the first byte after
+  /// the corresponding varbinary field bytes in that row.
+  /// If varbinary fields begin at aligned addresses, than the end of the previous
+  /// varbinary field needs to be rounded up according to the specified alignment
+  /// to obtain the beginning of the next varbinary field.
+  /// The first varbinary field starts at offset specified by fixed_length,
+  /// which should already be aligned.
+  uint32_t varbinary_end_array_offset;
+
+  /// Fixed number of bytes per row that are used to encode null masks.
+  /// Null masks indicate for a single row which of its columns are null.
+  /// Nth bit in the sequence of bytes assigned to a row represents null
+  /// information for Nth field according to the order in which they are encoded.
+  int null_masks_bytes_per_row;
+
+  /// Power of 2. Every row will start at an offset aligned to that number of bytes.
+  int row_alignment;
+
+  /// Power of 2. Must be no greater than row alignment.
+  /// Every non-power-of-2 binary field and every varbinary field bytes
+  /// will start aligned to that number of bytes.
+  int string_alignment;
+
+  /// Metadata of encoded columns in their original order.
+  std::vector<KeyColumnMetadata> column_metadatas;
+
+  /// Order in which fields are encoded.
+  std::vector<uint32_t> column_order;
+
+  /// Offsets within a row to fields in their encoding order.
+  std::vector<uint32_t> column_offsets;
+
+  /// Rounding up offset to the nearest multiple of alignment value.
+  /// Alignment must be a power of 2.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int required_alignment) {
+    ARROW_DCHECK(ARROW_POPCOUNT64(required_alignment) == 1);
+    return static_cast<uint32_t>((-static_cast<int32_t>(offset)) &
+                                 (required_alignment - 1));
+  }
+
+  /// Rounding up offset to the beginning of next column,
+  /// choosing required alignment based on the data type of that column.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int string_alignment,
+                                               const KeyColumnMetadata& col_metadata) {
+    if (!col_metadata.is_fixed_length ||
+        ARROW_POPCOUNT64(col_metadata.fixed_length) <= 1) {
+      return 0;
+    } else {
+      return padding_for_alignment(offset, string_alignment);
+    }
+  }
+
+  /// Returns an array of offsets within a row of ends of varbinary fields.
+  inline const uint32_t* varbinary_end_array(const uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<const uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// \brief An array of mutable offsets within a row of ends of varbinary fields.
+  inline uint32_t* varbinary_end_array(uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// Returns the offset within the row and length of the first varbinary field.
+  inline void first_varbinary_offset_and_length(const uint8_t* row, uint32_t* offset,
+                                                uint32_t* length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    *offset = fixed_length;
+    *length = varbinary_end_array(row)[0] - fixed_length;
+  }
+
+  /// Returns the offset within the row and length of the second and further varbinary
+  /// fields.
+  inline void nth_varbinary_offset_and_length(const uint8_t* row, int varbinary_id,
+                                              uint32_t* out_offset,
+                                              uint32_t* out_length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    ARROW_DCHECK(varbinary_id > 0);
+    const uint32_t* varbinary_end = varbinary_end_array(row);
+    uint32_t offset = varbinary_end[varbinary_id - 1];
+    offset += padding_for_alignment(offset, string_alignment);
+    *out_offset = offset;
+    *out_length = varbinary_end[varbinary_id] - offset;
+  }
+
+  uint32_t encoded_field_order(uint32_t icol) const { return column_order[icol]; }
+
+  uint32_t encoded_field_offset(uint32_t icol) const { return column_offsets[icol]; }
+
+  uint32_t num_cols() const { return static_cast<uint32_t>(column_metadatas.size()); }
+
+  uint32_t num_varbinary_cols() const;
+
+  /// \brief Populate this instance to describe `cols` with the given alignment
+  void FromColumnMetadataVector(const std::vector<KeyColumnMetadata>& cols,
+                                int in_row_alignment, int in_string_alignment);
+
+  /// \brief True if `other` has the same number of columns
+  ///   and each column has the same width (two variable length
+  ///   columns are considered to have the same width)
+  bool is_compatible(const RowTableMetadata& other) const;
+};
+
+/// \brief A table of data stored in row-major order
+///
+/// Can only store non-nested data types
+///
+/// Can store both fixed-size data types and variable-length data types
+class ARROW_EXPORT RowTableImpl {
+ public:
+  RowTableImpl();
+  /// \brief Initialize a row array for use
+  ///
+  /// This must be called before any other method
+  Status Init(MemoryPool* pool, const RowTableMetadata& metadata);
+  /// \brief Clear all rows from the table
+  ///
+  /// Does not shrink buffers
+  void Clean();
+  /// \brief Add empty rows
+  /// \param num_rows_to_append The number of empty rows to append
+  /// \param num_extra_bytes_to_append For tables storing variable-length data this
+  ///     should be a guess of how many data bytes will be needed to populate the
+  ///     data.  This is ignored if there are no variable-length columns
+  Status AppendEmpty(uint32_t num_rows_to_append, uint32_t num_extra_bytes_to_append);
+  /// \brief Append rows from a source table
+  /// \param from The table to append from
+  /// \param num_rows_to_append The number of rows to append
+  /// \param source_row_ids Indices (into `from`) of the desired rows
+  Status AppendSelectionFrom(const RowTableImpl& from, uint32_t num_rows_to_append,
+                             const uint16_t* source_row_ids);
+  /// \brief Metadata describing the data stored in this table
+  const RowTableMetadata& metadata() const { return metadata_; }
+  /// \brief The number of rows stored in the table
+  int64_t length() const { return num_rows_; }
+  // Accessors into the table's buffers
+  const uint8_t* data(int i) const {
+    ARROW_DCHECK(i >= 0 && i <= max_buffers_);
+    return buffers_[i];
+  }
+  uint8_t* mutable_data(int i) {
+    ARROW_DCHECK(i >= 0 && i <= max_buffers_);
+    return mutable_buffers_[i];
+  }
+  const uint32_t* offsets() const { return reinterpret_cast<const uint32_t*>(data(1)); }
+  uint32_t* mutable_offsets() { return reinterpret_cast<uint32_t*>(mutable_data(1)); }
+  const uint8_t* null_masks() const { return null_masks_->data(); }
+  uint8_t* null_masks() { return null_masks_->mutable_data(); }
+
+  /// \brief True if there is a null value anywhere in the table
+  ///
+  /// This calculation is memoized based on the number of rows and assumes
+  /// that values are only appended (and not modified in place) between
+  /// successive calls
+  bool has_any_nulls(const LightContext* ctx) const;
+
+ private:
+  Status ResizeFixedLengthBuffers(int64_t num_extra_rows);
+  Status ResizeOptionalVaryingLengthBuffer(int64_t num_extra_bytes);
+
+  // Helper functions to determine the number of bytes needed for each
+  // buffer given a number of rows.
+  int64_t size_null_masks(int64_t num_rows);
+  int64_t size_offsets(int64_t num_rows);
+  int64_t size_rows_fixed_length(int64_t num_rows);
+  int64_t size_rows_varying_length(int64_t num_bytes);
+  // Called after resize to fix pointers
+  void update_buffer_pointers();
+
+  static constexpr int64_t padding_for_vectors = 64;

Review Comment:
   Also add a comment explaining what this is?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #13218: ARROW-16590: [C++] Consolidate files dealing with row-major storage

Posted by GitBox <gi...@apache.org>.
westonpace commented on code in PR #13218:
URL: https://github.com/apache/arrow/pull/13218#discussion_r888571930


##########
cpp/src/arrow/compute/row/row_internal.h:
##########
@@ -0,0 +1,244 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+#include "arrow/buffer.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+namespace compute {
+
+/// Description of the data stored in a RowTable
+struct ARROW_EXPORT RowTableMetadata {
+  /// \brief True if there are no variable length columns in the table
+  bool is_fixed_length;
+
+  /// For a fixed-length binary row, common size of rows in bytes,
+  /// rounded up to the multiple of alignment.
+  ///
+  /// For a varying-length binary, size of all encoded fixed-length key columns,
+  /// including lengths of varying-length columns, rounded up to the multiple of string
+  /// alignment.
+  uint32_t fixed_length;
+
+  /// Offset within a row to the array of 32-bit offsets within a row of
+  /// ends of varbinary fields.
+  /// Used only when the row is not fixed-length, zero for fixed-length row.
+  /// There are N elements for N varbinary fields.
+  /// Each element is the offset within a row of the first byte after
+  /// the corresponding varbinary field bytes in that row.
+  /// If varbinary fields begin at aligned addresses, than the end of the previous
+  /// varbinary field needs to be rounded up according to the specified alignment
+  /// to obtain the beginning of the next varbinary field.
+  /// The first varbinary field starts at offset specified by fixed_length,
+  /// which should already be aligned.
+  uint32_t varbinary_end_array_offset;
+
+  /// Fixed number of bytes per row that are used to encode null masks.
+  /// Null masks indicate for a single row which of its columns are null.
+  /// Nth bit in the sequence of bytes assigned to a row represents null
+  /// information for Nth field according to the order in which they are encoded.
+  int null_masks_bytes_per_row;
+
+  /// Power of 2. Every row will start at an offset aligned to that number of bytes.
+  int row_alignment;
+
+  /// Power of 2. Must be no greater than row alignment.
+  /// Every non-power-of-2 binary field and every varbinary field bytes
+  /// will start aligned to that number of bytes.
+  int string_alignment;
+
+  /// Metadata of encoded columns in their original order.
+  std::vector<KeyColumnMetadata> column_metadatas;
+
+  /// Order in which fields are encoded.
+  std::vector<uint32_t> column_order;
+
+  /// Offsets within a row to fields in their encoding order.
+  std::vector<uint32_t> column_offsets;
+
+  /// Rounding up offset to the nearest multiple of alignment value.
+  /// Alignment must be a power of 2.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int required_alignment) {
+    ARROW_DCHECK(ARROW_POPCOUNT64(required_alignment) == 1);
+    return static_cast<uint32_t>((-static_cast<int32_t>(offset)) &
+                                 (required_alignment - 1));
+  }
+
+  /// Rounding up offset to the beginning of next column,
+  /// choosing required alignment based on the data type of that column.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int string_alignment,
+                                               const KeyColumnMetadata& col_metadata) {
+    if (!col_metadata.is_fixed_length ||
+        ARROW_POPCOUNT64(col_metadata.fixed_length) <= 1) {
+      return 0;
+    } else {
+      return padding_for_alignment(offset, string_alignment);
+    }
+  }
+
+  /// Returns an array of offsets within a row of ends of varbinary fields.
+  inline const uint32_t* varbinary_end_array(const uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<const uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// \brief An array of mutable offsets within a row of ends of varbinary fields.
+  inline uint32_t* varbinary_end_array(uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// Returns the offset within the row and length of the first varbinary field.
+  inline void first_varbinary_offset_and_length(const uint8_t* row, uint32_t* offset,
+                                                uint32_t* length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    *offset = fixed_length;
+    *length = varbinary_end_array(row)[0] - fixed_length;
+  }
+
+  /// Returns the offset within the row and length of the second and further varbinary
+  /// fields.
+  inline void nth_varbinary_offset_and_length(const uint8_t* row, int varbinary_id,
+                                              uint32_t* out_offset,
+                                              uint32_t* out_length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    ARROW_DCHECK(varbinary_id > 0);
+    const uint32_t* varbinary_end = varbinary_end_array(row);
+    uint32_t offset = varbinary_end[varbinary_id - 1];
+    offset += padding_for_alignment(offset, string_alignment);
+    *out_offset = offset;
+    *out_length = varbinary_end[varbinary_id] - offset;
+  }
+
+  uint32_t encoded_field_order(uint32_t icol) const { return column_order[icol]; }
+
+  uint32_t encoded_field_offset(uint32_t icol) const { return column_offsets[icol]; }
+
+  uint32_t num_cols() const { return static_cast<uint32_t>(column_metadatas.size()); }
+
+  uint32_t num_varbinary_cols() const;
+
+  /// \brief Populate this instance to describe `cols` with the given alignment
+  void FromColumnMetadataVector(const std::vector<KeyColumnMetadata>& cols,
+                                int in_row_alignment, int in_string_alignment);
+
+  /// \brief True if `other` has the same number of columns
+  ///   and each column has the same width (two variable length
+  ///   columns are considered to have the same width)
+  bool is_compatible(const RowTableMetadata& other) const;
+};
+
+/// \brief A table of data stored in row-major order
+///
+/// Can only store non-nested data types
+///
+/// Can store both fixed-size data types and variable-length data types
+class ARROW_EXPORT RowTableImpl {
+ public:
+  RowTableImpl();
+  /// \brief Initialize a row array for use
+  ///
+  /// This must be called before any other method
+  Status Init(MemoryPool* pool, const RowTableMetadata& metadata);
+  /// \brief Clear all rows from the table
+  ///
+  /// Does not shrink buffers
+  void Clean();
+  /// \brief Add empty rows
+  /// \param num_rows_to_append The number of empty rows to append
+  /// \param num_extra_bytes_to_append For tables storing variable-length data this
+  ///     should be a guess of how many data bytes will be needed to populate the
+  ///     data.  This is ignored if there are no variable-length columns
+  Status AppendEmpty(uint32_t num_rows_to_append, uint32_t num_extra_bytes_to_append);
+  /// \brief Append rows from a source table
+  /// \param from The table to append from
+  /// \param num_rows_to_append The number of rows to append
+  /// \param source_row_ids Indices (into `from`) of the desired rows
+  Status AppendSelectionFrom(const RowTableImpl& from, uint32_t num_rows_to_append,
+                             const uint16_t* source_row_ids);
+  /// \brief Metadata describing the data stored in this table
+  const RowTableMetadata& metadata() const { return metadata_; }
+  /// \brief The number of rows stored in the table
+  int64_t length() const { return num_rows_; }
+  // Accessors into the table's buffers
+  const uint8_t* data(int i) const {
+    ARROW_DCHECK(i >= 0 && i <= max_buffers_);

Review Comment:
   I went ahead and included this fix.



##########
cpp/src/arrow/compute/row/row_internal.h:
##########
@@ -0,0 +1,244 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+#pragma once
+
+#include <cstdint>
+#include <vector>
+
+#include "arrow/buffer.h"
+#include "arrow/compute/light_array.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+namespace compute {
+
+/// Description of the data stored in a RowTable
+struct ARROW_EXPORT RowTableMetadata {
+  /// \brief True if there are no variable length columns in the table
+  bool is_fixed_length;
+
+  /// For a fixed-length binary row, common size of rows in bytes,
+  /// rounded up to the multiple of alignment.
+  ///
+  /// For a varying-length binary, size of all encoded fixed-length key columns,
+  /// including lengths of varying-length columns, rounded up to the multiple of string
+  /// alignment.
+  uint32_t fixed_length;
+
+  /// Offset within a row to the array of 32-bit offsets within a row of
+  /// ends of varbinary fields.
+  /// Used only when the row is not fixed-length, zero for fixed-length row.
+  /// There are N elements for N varbinary fields.
+  /// Each element is the offset within a row of the first byte after
+  /// the corresponding varbinary field bytes in that row.
+  /// If varbinary fields begin at aligned addresses, than the end of the previous
+  /// varbinary field needs to be rounded up according to the specified alignment
+  /// to obtain the beginning of the next varbinary field.
+  /// The first varbinary field starts at offset specified by fixed_length,
+  /// which should already be aligned.
+  uint32_t varbinary_end_array_offset;
+
+  /// Fixed number of bytes per row that are used to encode null masks.
+  /// Null masks indicate for a single row which of its columns are null.
+  /// Nth bit in the sequence of bytes assigned to a row represents null
+  /// information for Nth field according to the order in which they are encoded.
+  int null_masks_bytes_per_row;
+
+  /// Power of 2. Every row will start at an offset aligned to that number of bytes.
+  int row_alignment;
+
+  /// Power of 2. Must be no greater than row alignment.
+  /// Every non-power-of-2 binary field and every varbinary field bytes
+  /// will start aligned to that number of bytes.
+  int string_alignment;
+
+  /// Metadata of encoded columns in their original order.
+  std::vector<KeyColumnMetadata> column_metadatas;
+
+  /// Order in which fields are encoded.
+  std::vector<uint32_t> column_order;
+
+  /// Offsets within a row to fields in their encoding order.
+  std::vector<uint32_t> column_offsets;
+
+  /// Rounding up offset to the nearest multiple of alignment value.
+  /// Alignment must be a power of 2.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int required_alignment) {
+    ARROW_DCHECK(ARROW_POPCOUNT64(required_alignment) == 1);
+    return static_cast<uint32_t>((-static_cast<int32_t>(offset)) &
+                                 (required_alignment - 1));
+  }
+
+  /// Rounding up offset to the beginning of next column,
+  /// choosing required alignment based on the data type of that column.
+  static inline uint32_t padding_for_alignment(uint32_t offset, int string_alignment,
+                                               const KeyColumnMetadata& col_metadata) {
+    if (!col_metadata.is_fixed_length ||
+        ARROW_POPCOUNT64(col_metadata.fixed_length) <= 1) {
+      return 0;
+    } else {
+      return padding_for_alignment(offset, string_alignment);
+    }
+  }
+
+  /// Returns an array of offsets within a row of ends of varbinary fields.
+  inline const uint32_t* varbinary_end_array(const uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<const uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// \brief An array of mutable offsets within a row of ends of varbinary fields.
+  inline uint32_t* varbinary_end_array(uint8_t* row) const {
+    ARROW_DCHECK(!is_fixed_length);
+    return reinterpret_cast<uint32_t*>(row + varbinary_end_array_offset);
+  }
+
+  /// Returns the offset within the row and length of the first varbinary field.
+  inline void first_varbinary_offset_and_length(const uint8_t* row, uint32_t* offset,
+                                                uint32_t* length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    *offset = fixed_length;
+    *length = varbinary_end_array(row)[0] - fixed_length;
+  }
+
+  /// Returns the offset within the row and length of the second and further varbinary
+  /// fields.
+  inline void nth_varbinary_offset_and_length(const uint8_t* row, int varbinary_id,
+                                              uint32_t* out_offset,
+                                              uint32_t* out_length) const {
+    ARROW_DCHECK(!is_fixed_length);
+    ARROW_DCHECK(varbinary_id > 0);
+    const uint32_t* varbinary_end = varbinary_end_array(row);
+    uint32_t offset = varbinary_end[varbinary_id - 1];
+    offset += padding_for_alignment(offset, string_alignment);
+    *out_offset = offset;
+    *out_length = varbinary_end[varbinary_id] - offset;
+  }
+
+  uint32_t encoded_field_order(uint32_t icol) const { return column_order[icol]; }
+
+  uint32_t encoded_field_offset(uint32_t icol) const { return column_offsets[icol]; }
+
+  uint32_t num_cols() const { return static_cast<uint32_t>(column_metadatas.size()); }
+
+  uint32_t num_varbinary_cols() const;
+
+  /// \brief Populate this instance to describe `cols` with the given alignment
+  void FromColumnMetadataVector(const std::vector<KeyColumnMetadata>& cols,
+                                int in_row_alignment, int in_string_alignment);
+
+  /// \brief True if `other` has the same number of columns
+  ///   and each column has the same width (two variable length
+  ///   columns are considered to have the same width)
+  bool is_compatible(const RowTableMetadata& other) const;
+};
+
+/// \brief A table of data stored in row-major order
+///
+/// Can only store non-nested data types
+///
+/// Can store both fixed-size data types and variable-length data types
+class ARROW_EXPORT RowTableImpl {
+ public:
+  RowTableImpl();
+  /// \brief Initialize a row array for use
+  ///
+  /// This must be called before any other method
+  Status Init(MemoryPool* pool, const RowTableMetadata& metadata);
+  /// \brief Clear all rows from the table
+  ///
+  /// Does not shrink buffers
+  void Clean();
+  /// \brief Add empty rows
+  /// \param num_rows_to_append The number of empty rows to append
+  /// \param num_extra_bytes_to_append For tables storing variable-length data this
+  ///     should be a guess of how many data bytes will be needed to populate the
+  ///     data.  This is ignored if there are no variable-length columns
+  Status AppendEmpty(uint32_t num_rows_to_append, uint32_t num_extra_bytes_to_append);
+  /// \brief Append rows from a source table
+  /// \param from The table to append from
+  /// \param num_rows_to_append The number of rows to append
+  /// \param source_row_ids Indices (into `from`) of the desired rows
+  Status AppendSelectionFrom(const RowTableImpl& from, uint32_t num_rows_to_append,
+                             const uint16_t* source_row_ids);
+  /// \brief Metadata describing the data stored in this table
+  const RowTableMetadata& metadata() const { return metadata_; }
+  /// \brief The number of rows stored in the table
+  int64_t length() const { return num_rows_; }
+  // Accessors into the table's buffers
+  const uint8_t* data(int i) const {
+    ARROW_DCHECK(i >= 0 && i <= max_buffers_);
+    return buffers_[i];
+  }
+  uint8_t* mutable_data(int i) {
+    ARROW_DCHECK(i >= 0 && i <= max_buffers_);
+    return mutable_buffers_[i];
+  }
+  const uint32_t* offsets() const { return reinterpret_cast<const uint32_t*>(data(1)); }
+  uint32_t* mutable_offsets() { return reinterpret_cast<uint32_t*>(mutable_data(1)); }
+  const uint8_t* null_masks() const { return null_masks_->data(); }
+  uint8_t* null_masks() { return null_masks_->mutable_data(); }
+
+  /// \brief True if there is a null value anywhere in the table
+  ///
+  /// This calculation is memoized based on the number of rows and assumes
+  /// that values are only appended (and not modified in place) between
+  /// successive calls
+  bool has_any_nulls(const LightContext* ctx) const;
+
+ private:
+  Status ResizeFixedLengthBuffers(int64_t num_extra_rows);
+  Status ResizeOptionalVaryingLengthBuffer(int64_t num_extra_bytes);
+
+  // Helper functions to determine the number of bytes needed for each
+  // buffer given a number of rows.
+  int64_t size_null_masks(int64_t num_rows);
+  int64_t size_offsets(int64_t num_rows);
+  int64_t size_rows_fixed_length(int64_t num_rows);
+  int64_t size_rows_varying_length(int64_t num_bytes);
+  // Called after resize to fix pointers
+  void update_buffer_pointers();
+
+  static constexpr int64_t padding_for_vectors = 64;
+  MemoryPool* pool_;
+  RowTableMetadata metadata_;
+  // Buffers can only expand during lifetime and never shrink.
+  std::unique_ptr<ResizableBuffer> null_masks_;
+  // Only used if the table has variable-length columns
+  // Stores the offsets into the binary data
+  std::unique_ptr<ResizableBuffer> offsets_;
+  // Stores the fixed-length parts of the rows
+  std::unique_ptr<ResizableBuffer> rows_;
+  static constexpr int max_buffers_ = 3;

Review Comment:
   Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #13218: ARROW-16590: [C++] Consolidate files dealing with row-major storage

Posted by GitBox <gi...@apache.org>.
pitrou commented on code in PR #13218:
URL: https://github.com/apache/arrow/pull/13218#discussion_r891233109


##########
cpp/src/arrow/compute/exec/key_hash.h:
##########
@@ -45,8 +45,8 @@ class ARROW_EXPORT Hashing32 {
   friend void TestBloomSmall(BloomFilterBuildStrategy, int64_t, int, bool, bool);
 
  public:
-  static void HashMultiColumn(const std::vector<KeyColumnArray>& cols,
-                              KeyEncoder::KeyEncoderContext* ctx, uint32_t* out_hash);
+  static void HashMultiColumn(const std::vector<KeyColumnArray>& cols, LightContext* ctx,
+                              uint32_t* out_hash);

Review Comment:
   > Do we have a strong style preference here?
   
   Hmm, I don't think so. If it's used for templating then I suppose the class is necessary.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] wesm merged pull request #13218: ARROW-16590: [C++] Consolidate files dealing with row-major storage

Posted by GitBox <gi...@apache.org>.
wesm merged PR #13218:
URL: https://github.com/apache/arrow/pull/13218


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org