You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "felipecrv (via GitHub)" <gi...@apache.org> on 2023/03/01 22:44:02 UTC

[GitHub] [arrow] felipecrv opened a new pull request, #34408: GH-34361: [C++] work-in-progress: Add skeleton of the new APIs for handling null checks correctly for all types

felipecrv opened a new pull request, #34408:
URL: https://github.com/apache/arrow/pull/34408

   Bonus: add `ArrayData::IsValid()` to make it consistent with `Array` and `ArraySpan`.
   
   ### Rationale for this change
   
   This is the proposed fix to #34361 plus the addition of more APIs dealing with validity/nullity.
   
   ### What changes are included in this PR?
   
   This PR changes the behavior of `IsNull` and `IsValid` in `Array`, `ArrayData`, and `ArraySpan`.
   
   It preserves the behavior of `MayHaveNulls`, `GetNullCount` and introduces new member functions to `Array`, `ArrayData`, and `ArraySpan`:
   
    - `bool HasValidityBitmap() const`
    - `bool MayHaveLogicalNulls() const`
    - `int64_t ComputeLogicalNullCount() const`
   
   ### Are these changes tested?
   
   Yes.
   
   ### Are there any user-facing changes?
   
   Yes. See above.
   
   Breakage with these changes can only happen if users rely on `IsNull(i)` always returning `true` for union types, but we have users reporting that the current behavior or broken #34315. This is why the behavior of `IsNull` and `IsValid` is changing.,
   
   **This PR contains a "Critical Fix".**


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #34408:
URL: https://github.com/apache/arrow/pull/34408#issuecomment-1458882799

   Is this ready for another review?  Please check "Re-request review" if so.  Although I notice it has the `awaiting change review` label.  I wonder if that is automatic anytime a commit is pushed.
   
   ![image](https://user-images.githubusercontent.com/1696093/223554092-73b20540-3878-4813-b8ff-31933808edbd.png)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on PR #34408:
URL: https://github.com/apache/arrow/pull/34408#issuecomment-1494394134

   @ursabot please benchmark lang=C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #34408: GH-34361: [C++] work-in-progress: Add skeleton of the new APIs for handling null checks correctly for all types

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #34408:
URL: https://github.com/apache/arrow/pull/34408#issuecomment-1450966057

   * Closes: #34361


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #34408:
URL: https://github.com/apache/arrow/pull/34408#issuecomment-1468961783

   @ursabot please benchmark


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on PR #34408:
URL: https://github.com/apache/arrow/pull/34408#issuecomment-1492332219

   @ursabot please benchmark lang=C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on PR #34408:
URL: https://github.com/apache/arrow/pull/34408#issuecomment-1492513734

   @ursabot please benchmark lang=C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34408:
URL: https://github.com/apache/arrow/pull/34408#discussion_r1164392003


##########
cpp/src/arrow/util/union_util.cc:
##########
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/util/union_util.h"
+
+#include <cstdint>
+
+#include "arrow/array/data.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/util/logging.h"
+
+namespace arrow::union_util {
+
+int64_t LogicalSparseUnionNullCount(const ArraySpan& span) {
+  const auto* sparse_union_type =
+      internal::checked_cast<const SparseUnionType*>(span.type);
+  DCHECK_LE(span.child_data.size(), 128);
+
+  const int8_t* types = span.GetValues<int8_t>(1);  // NOLINT
+  int64_t null_count = 0;
+  for (int64_t i = 0; i < span.length; i++) {
+    const int8_t child_id = sparse_union_type->child_ids()[types[span.offset + i]];
+
+    null_count += span.child_data[child_id].IsNull(i);

Review Comment:
   There is a high chance the existing tests didn't cover sliced union arrays and an even higher chance I messed up here because I had just read the Arrow spec on Unions to fix this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34408:
URL: https://github.com/apache/arrow/pull/34408#discussion_r1136408291


##########
cpp/src/arrow/array/array_test.cc:
##########
@@ -78,20 +79,58 @@ class TestArray : public ::testing::Test {
   MemoryPool* pool_;
 };
 
+template <class ScalarType>
+std::shared_ptr<ScalarType> MakeScalar(typename ScalarType::ValueType value) {
+  return std::make_shared<ScalarType>(value);
+}
+
 TEST_F(TestArray, TestNullCount) {
   // These are placeholders
   auto data = std::make_shared<Buffer>(nullptr, 0);
   auto null_bitmap = std::make_shared<Buffer>(nullptr, 0);
 
-  std::unique_ptr<Int32Array> arr(new Int32Array(100, data, null_bitmap, 10));
+  std::shared_ptr<Int32Array> arr(new Int32Array(100, data, null_bitmap, 10));
+  ASSERT_EQ(10, arr->ComputeLogicalNullCount());
   ASSERT_EQ(10, arr->null_count());
+  ASSERT_TRUE(arr->data()->MayHaveNulls());
+  ASSERT_TRUE(arr->data()->MayHaveLogicalNulls());
 
-  std::unique_ptr<Int32Array> arr_no_nulls(new Int32Array(100, data));
+  std::shared_ptr<Int32Array> arr_no_nulls(new Int32Array(100, data));
+  ASSERT_EQ(0, arr_no_nulls->ComputeLogicalNullCount());
   ASSERT_EQ(0, arr_no_nulls->null_count());
+  ASSERT_FALSE(arr_no_nulls->data()->MayHaveNulls());
+  ASSERT_FALSE(arr_no_nulls->data()->MayHaveLogicalNulls());
 
-  std::unique_ptr<Int32Array> arr_default_null_count(
+  std::shared_ptr<Int32Array> arr_default_null_count(
       new Int32Array(100, data, null_bitmap));
   ASSERT_EQ(kUnknownNullCount, arr_default_null_count->data()->null_count);
+  ASSERT_TRUE(arr_default_null_count->data()->MayHaveNulls());
+  ASSERT_TRUE(arr_default_null_count->data()->MayHaveLogicalNulls());
+
+  RunEndEncodedBuilder ree_builder(pool_, std::make_shared<Int32Builder>(pool_),

Review Comment:
   I mentioned you in PR comments pointing to code that tests `IsNull/Valid` for unions.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34408:
URL: https://github.com/apache/arrow/pull/34408#discussion_r1136403318


##########
cpp/src/arrow/array/array_test.cc:
##########
@@ -377,20 +416,23 @@ TEST_F(TestArray, TestMakeArrayOfNull) {
       ASSERT_EQ(array->length(), length);
       if (is_union(type->id())) {
         ASSERT_EQ(array->null_count(), 0);
+        ASSERT_EQ(array->ComputeLogicalNullCount(), length);
         const auto& union_array = checked_cast<const UnionArray&>(*array);
         for (int i = 0; i < union_array.num_fields(); ++i) {
           ASSERT_EQ(union_array.field(i)->null_count(), union_array.field(i)->length());
         }
       } else if (type->id() == Type::RUN_END_ENCODED) {
         ASSERT_EQ(array->null_count(), 0);
+        ASSERT_EQ(array->ComputeLogicalNullCount(), length);
         const auto& ree_array = checked_cast<const RunEndEncodedArray&>(*array);
         ASSERT_EQ(ree_array.values()->null_count(), ree_array.values()->length());
       } else {
         ASSERT_EQ(array->null_count(), length);
-        for (int64_t i = 0; i < length; ++i) {
-          ASSERT_TRUE(array->IsNull(i));
-          ASSERT_FALSE(array->IsValid(i));
-        }
+        ASSERT_EQ(array->ComputeLogicalNullCount(), length);
+      }
+      for (int64_t i = 0; i < length; ++i) {
+        ASSERT_TRUE(array->IsNull(i));
+        ASSERT_FALSE(array->IsValid(i));
       }

Review Comment:
   @westonpace testing unions here.



##########
cpp/src/arrow/array/array_test.cc:
##########
@@ -377,20 +416,23 @@ TEST_F(TestArray, TestMakeArrayOfNull) {
       ASSERT_EQ(array->length(), length);
       if (is_union(type->id())) {
         ASSERT_EQ(array->null_count(), 0);
+        ASSERT_EQ(array->ComputeLogicalNullCount(), length);

Review Comment:
   @westonpace 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34408:
URL: https://github.com/apache/arrow/pull/34408#discussion_r1126517948


##########
cpp/src/arrow/array/array_test.cc:
##########
@@ -482,35 +524,45 @@ void AssertAppendScalar(MemoryPool* pool, const std::shared_ptr<Scalar>& scalar)
   std::unique_ptr<arrow::ArrayBuilder> builder;
   auto null_scalar = MakeNullScalar(scalar->type);
   ASSERT_OK(MakeBuilderExactIndex(pool, scalar->type, &builder));
-  ASSERT_OK(builder->AppendScalar(*scalar));
-  ASSERT_OK(builder->AppendScalar(*scalar));
-  ASSERT_OK(builder->AppendScalar(*null_scalar));
-  ASSERT_OK(builder->AppendScalars({scalar, null_scalar}));
-  ASSERT_OK(builder->AppendScalar(*scalar, /*n_repeats=*/2));
-  ASSERT_OK(builder->AppendScalar(*null_scalar, /*n_repeats=*/2));
+  ASSERT_OK(builder->AppendScalar(*scalar));                 // [0] = scalar
+  ASSERT_OK(builder->AppendScalar(*scalar));                 // [1] = scalar
+  ASSERT_OK(builder->AppendScalar(*null_scalar));            // [2] = NULL
+  ASSERT_OK(builder->AppendScalars({scalar, null_scalar}));  // [3, 4] = {scalar, NULL}
+  ASSERT_OK(
+      builder->AppendScalar(*scalar, /*n_repeats=*/2));  // [5, 6] = {scalar, scalar}
+  ASSERT_OK(
+      builder->AppendScalar(*null_scalar, /*n_repeats=*/2));  // [7, 8] = {NULL, NULL}
 
   std::shared_ptr<Array> out;
   FinishAndCheckPadding(builder.get(), &out);
   ASSERT_OK(out->ValidateFull());
   AssertTypeEqual(scalar->type, out->type());
   ASSERT_EQ(out->length(), 9);
 
-  const bool can_check_nulls = internal::HasValidityBitmap(out->type()->id());
+  auto out_type_id = out->type()->id();
+  const bool has_validity = internal::HasValidityBitmap(out_type_id);
   // For a dictionary builder, the output dictionary won't necessarily be the same
-  const bool can_check_values = !is_dictionary(out->type()->id());
+  const bool can_check_values = !is_dictionary(out_type_id);
 
-  if (can_check_nulls) {
+  if (has_validity) {
     ASSERT_EQ(out->null_count(), 4);
+  } else {
+    ASSERT_EQ(out->null_count(), 0);
+  }
+  if (scalar->is_valid) {
+    ASSERT_EQ(out->ComputeLogicalNullCount(), 4);
+  } else {
+    ASSERT_EQ(out->ComputeLogicalNullCount(), 9);
   }
 
   for (const auto index : {0, 1, 3, 5, 6}) {
-    ASSERT_FALSE(out->IsNull(index));
+    ASSERT_FALSE(out->IsNull(index) && scalar->is_valid);

Review Comment:
   Changed locally. Pushing soon.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34408:
URL: https://github.com/apache/arrow/pull/34408#discussion_r1131836891


##########
cpp/src/arrow/array/data.h:
##########
@@ -363,14 +475,73 @@ struct ARROW_EXPORT ArraySpan {
     }
   }
 
-  /// \brief Return null count, or compute and set it if it's not known
+  /// \brief Return physical null count, or compute and set it if it's not known
   int64_t GetNullCount() const;
 
+  /// \brief Return true if the array has a validity bitmap and the physical null
+  /// count is known to be non-zero or not yet known
+  ///
+  /// Note that this is not the same as MayHaveLogicalNulls, which also checks
+  /// for the presence of nulls in child data for types like unions and run-end
+  /// encoded types.
+  ///
+  /// \see HasValidityBitmap
+  /// \see MayHaveLogicalNulls
   bool MayHaveNulls() const {
     // If an ArrayData is slightly malformed it may have kUnknownNullCount set
     // but no buffer
     return null_count != 0 && buffers[0].data != NULLPTR;
   }
+
+  /// \brief Return true if the array has a validity bitmap
+  bool HasValidityBitmap() const { return buffers[0].data != NULLPTR; }
+
+  /// \brief Return true if the validity bitmap may have 0's in it, or if the
+  /// child arrays (in the case of types without a validity bitmap) may have
+  /// nulls
+  ///
+  /// \see ArrayData::MayHaveLogicalNulls
+  bool MayHaveLogicalNulls() const {
+    if (buffers[0].data != NULLPTR) {
+      return null_count != 0;
+    }
+    const auto t = type->id();
+    if (t == Type::SPARSE_UNION || t == Type::DENSE_UNION) {
+      return UnionMayHaveLogicalNulls();
+    }
+    if (t == Type::RUN_END_ENCODED) {
+      return RunEndEncodedMayHaveLogicalNulls();
+    }
+    return false;

Review Comment:
   To be clear: I've implemented this and pushed again.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #34408:
URL: https://github.com/apache/arrow/pull/34408#issuecomment-1469183962

   ['Python', 'R'] benchmarks have high level of regressions.
   [test-mac-arm](https://conbench.ursa.dev/compare/runs/70a51126d35740449494eb37418701a9...75036fcbcf8042b2b5aa5f49692988a2/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #34408:
URL: https://github.com/apache/arrow/pull/34408#issuecomment-1498118733

   Benchmark runs are scheduled for baseline = c21986302e35f48feca5e1dbd92a7130337b20e3 and contender = 4d21386d994526591fe7e174ddb32cc400a81517. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Only ['Python'] langs are supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/e7e3b0c830a34fb1ad655e66ffd3b2e7...493c46a7c32e4220a65d2a5462ae8c13/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/110f01cd5de24423b264e1e64eff4239...b6df9ee5c4894c12955f63dda9486137/)
   [Skipped :warning: Only ['JavaScript', 'Python', 'R'] langs are supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/a174611153194c988a6be9e06f944d2c...964e08912073403db999118f3309ccff/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/fa01686d701f4256b9b17b3463894a61...d0d544131cfe416388fd4a7899e9ab9e/)
   Buildkite builds:
   [Scheduled] [`4d21386d` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2672)
   [Scheduled] [`4d21386d` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2663)
   [Finished] [`c2198630` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2639)
   [Scheduled] [`c2198630` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2671)
   [Scheduled] [`c2198630` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2633)
   [Scheduled] [`c2198630` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2662)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34408:
URL: https://github.com/apache/arrow/pull/34408#discussion_r1136374002


##########
cpp/src/arrow/array/data.h:
##########
@@ -33,6 +33,19 @@
 namespace arrow {
 
 class Array;
+struct ArrayData;
+
+namespace internal {
+// ----------------------------------------------------------------------
+// Null handling for types without a validity bitmap
+
+ARROW_EXPORT bool IsNullSparseUnion(const ArrayData& data, int64_t i);
+ARROW_EXPORT bool IsNullDenseUnion(const ArrayData& data, int64_t i);
+ARROW_EXPORT bool IsNullRunEndEncoded(const ArrayData& data, int64_t i);
+
+ARROW_EXPORT bool UnionMayHaveLogicalNulls(const ArrayData& data);
+ARROW_EXPORT bool RunEndEncodedMayHaveLogicalNulls(const ArrayData& data);

Review Comment:
   They are called from `IsNull` and `IsValid` which are inline and having the switch on the type ID inlined allows the compiler to inline the highly predictable branches into the loos bodies from which these functions are called.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on PR #34408:
URL: https://github.com/apache/arrow/pull/34408#issuecomment-1466299612

   @westonpace I'm a bit lost on how these labels change and what they actully mean, but the PR is ready for review again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #34408:
URL: https://github.com/apache/arrow/pull/34408#discussion_r1140726096


##########
cpp/src/arrow/array/data.h:
##########
@@ -33,6 +33,19 @@
 namespace arrow {
 
 class Array;
+struct ArrayData;
+
+namespace internal {
+// ----------------------------------------------------------------------
+// Null handling for types without a validity bitmap
+
+ARROW_EXPORT bool IsNullSparseUnion(const ArrayData& data, int64_t i);
+ARROW_EXPORT bool IsNullDenseUnion(const ArrayData& data, int64_t i);
+ARROW_EXPORT bool IsNullRunEndEncoded(const ArrayData& data, int64_t i);
+
+ARROW_EXPORT bool UnionMayHaveLogicalNulls(const ArrayData& data);
+ARROW_EXPORT bool RunEndEncodedMayHaveLogicalNulls(const ArrayData& data);

Review Comment:
   I think it's ok how it is.  I just don't ever have a good sense for when it's justified for leaving something in the header file for performance reasons.  Typically I think it would be nice for there to be some benchmark to point to.  Otherwise "it seems like this is important for performance" becomes too subjective of a criteria to apply.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on PR #34408:
URL: https://github.com/apache/arrow/pull/34408#issuecomment-1492143971

   @ursabot please benchmark


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on PR #34408:
URL: https://github.com/apache/arrow/pull/34408#issuecomment-1499137907

   @jorisvandenbossche I'm all for merging and I suspect the regression in `test-mac-arm` was a fluke.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34408:
URL: https://github.com/apache/arrow/pull/34408#discussion_r1165678156


##########
cpp/src/arrow/util/union_util.cc:
##########
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/util/union_util.h"
+
+#include <cstdint>
+
+#include "arrow/array/data.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/util/logging.h"
+
+namespace arrow::union_util {
+
+int64_t LogicalSparseUnionNullCount(const ArraySpan& span) {
+  const auto* sparse_union_type =
+      internal::checked_cast<const SparseUnionType*>(span.type);
+  DCHECK_LE(span.child_data.size(), 128);
+
+  const int8_t* types = span.GetValues<int8_t>(1);  // NOLINT
+  int64_t null_count = 0;
+  for (int64_t i = 0; i < span.length; i++) {
+    const int8_t child_id = sparse_union_type->child_ids()[types[span.offset + i]];
+
+    null_count += span.child_data[child_id].IsNull(i);

Review Comment:
   `GetValues` applies slices. Direct access doesn't. I think that makes sense, but it's hard to keep in mind at all times.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #34408:
URL: https://github.com/apache/arrow/pull/34408#discussion_r1128595095


##########
cpp/src/arrow/array/data.h:
##########
@@ -229,15 +256,88 @@ struct ARROW_EXPORT ArrayData {
 
   void SetNullCount(int64_t v) { null_count.store(v); }
 
-  /// \brief Return null count, or compute and set it if it's not known
+  /// \brief Return physical null count, or compute and set it if it's not known
   int64_t GetNullCount() const;
 
+  /// \brief Return true if the data has a validity bitmap and the physical null
+  /// count is known to be non-zero or not yet known.
+  ///
+  /// Note that this is not the same as MayHaveLogicalNulls, which also checks
+  /// for the presence of nulls in child data for types like unions and run-end
+  /// encoded types.
+  ///
+  /// \see HasValidityBitmap
+  /// \see MayHaveLogicalNulls
   bool MayHaveNulls() const {
     // If an ArrayData is slightly malformed it may have kUnknownNullCount set
     // but no buffer
     return null_count.load() != 0 && buffers[0] != NULLPTR;
   }
 
+  /// \brief Return true if the data has a validity bitmap
+  bool HasValidityBitmap() const { return buffers[0] != NULLPTR; }
+
+  /// \brief Return true if the validity bitmap may have 0's in it, or if the
+  /// child arrays (in the case of types without a validity bitmap) may have
+  /// nulls
+  ///
+  /// This is not a drop-in replacement for MayHaveNulls, as historically
+  /// MayHaveNulls() has been used to check for the presence of a validity
+  /// bitmap that needs to be checked.
+  ///
+  /// Code that previously used MayHaveNulls() and then dealt with the validity
+  /// bitmap directly can be fixed to handle all types correctly without
+  /// performance degradation when handling most types by adopting
+  /// HasValidityBitmap and MayHaveLogicalNulls.
+  ///
+  /// Before:
+  ///
+  ///     uint8_t* validity = array.MayHaveNulls() ? array.buffers[0].data : NULLPTR;
+  ///     for (int64_t i = 0; i < array.length; ++i) {
+  ///       if (validity && !bit_util::GetBit(validity, i)) {
+  ///         continue;  // skip a NULL
+  ///       }
+  ///       ...
+  ///     }
+  ///
+  /// After:
+  ///
+  ///     bool all_valid = !array.MayHaveLogicalNulls();
+  ///     uint8_t* validity = array.HasValidityBitmap() ? array.buffers[0].data : NULLPTR;
+  ///     for (int64_t i = 0; i < array.length; ++i) {
+  ///       bool is_valid = all_valid ||
+  ///                       (validity && bit_util::GetBit(validity, i)) ||

Review Comment:
   Why explicitly call this case out? I think `array.IsValid(i)` will check this case right?  Is this purely for performance sake?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on pull request #34408: GH-34361: [C++] work-in-progress: Add skeleton of the new APIs for handling null checks correctly for all types

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on PR #34408:
URL: https://github.com/apache/arrow/pull/34408#issuecomment-1452543029

   @jorisvandenbossche @westonpace @zeroshade this now has the implementation of all the new functions, changes in behavior, and tests. Ready for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #34408:
URL: https://github.com/apache/arrow/pull/34408#discussion_r1136298871


##########
cpp/src/arrow/array/array_test.cc:
##########
@@ -78,20 +79,58 @@ class TestArray : public ::testing::Test {
   MemoryPool* pool_;
 };
 
+template <class ScalarType>
+std::shared_ptr<ScalarType> MakeScalar(typename ScalarType::ValueType value) {
+  return std::make_shared<ScalarType>(value);
+}

Review Comment:
   How is this different than `MakeScalar` in `scalar.h`?  If it is different, can we add a comment because I know I'll ask the question again :laughing: 



##########
cpp/src/arrow/array/data.h:
##########
@@ -33,6 +33,19 @@
 namespace arrow {
 
 class Array;
+struct ArrayData;
+
+namespace internal {
+// ----------------------------------------------------------------------
+// Null handling for types without a validity bitmap
+
+ARROW_EXPORT bool IsNullSparseUnion(const ArrayData& data, int64_t i);
+ARROW_EXPORT bool IsNullDenseUnion(const ArrayData& data, int64_t i);
+ARROW_EXPORT bool IsNullRunEndEncoded(const ArrayData& data, int64_t i);
+
+ARROW_EXPORT bool UnionMayHaveLogicalNulls(const ArrayData& data);
+ARROW_EXPORT bool RunEndEncodedMayHaveLogicalNulls(const ArrayData& data);

Review Comment:
   What prevents these functions from being `.cc` only function in an anonymous namespace?  It seems like the only need is because we have functions later in `data.h` that could be in a `.cc` file.



##########
cpp/src/arrow/array/array_test.cc:
##########
@@ -78,20 +79,58 @@ class TestArray : public ::testing::Test {
   MemoryPool* pool_;
 };
 
+template <class ScalarType>
+std::shared_ptr<ScalarType> MakeScalar(typename ScalarType::ValueType value) {
+  return std::make_shared<ScalarType>(value);
+}
+
 TEST_F(TestArray, TestNullCount) {
   // These are placeholders
   auto data = std::make_shared<Buffer>(nullptr, 0);
   auto null_bitmap = std::make_shared<Buffer>(nullptr, 0);
 
-  std::unique_ptr<Int32Array> arr(new Int32Array(100, data, null_bitmap, 10));
+  std::shared_ptr<Int32Array> arr(new Int32Array(100, data, null_bitmap, 10));
+  ASSERT_EQ(10, arr->ComputeLogicalNullCount());
   ASSERT_EQ(10, arr->null_count());
+  ASSERT_TRUE(arr->data()->MayHaveNulls());
+  ASSERT_TRUE(arr->data()->MayHaveLogicalNulls());
 
-  std::unique_ptr<Int32Array> arr_no_nulls(new Int32Array(100, data));
+  std::shared_ptr<Int32Array> arr_no_nulls(new Int32Array(100, data));
+  ASSERT_EQ(0, arr_no_nulls->ComputeLogicalNullCount());
   ASSERT_EQ(0, arr_no_nulls->null_count());
+  ASSERT_FALSE(arr_no_nulls->data()->MayHaveNulls());
+  ASSERT_FALSE(arr_no_nulls->data()->MayHaveLogicalNulls());
 
-  std::unique_ptr<Int32Array> arr_default_null_count(
+  std::shared_ptr<Int32Array> arr_default_null_count(
       new Int32Array(100, data, null_bitmap));
   ASSERT_EQ(kUnknownNullCount, arr_default_null_count->data()->null_count);
+  ASSERT_TRUE(arr_default_null_count->data()->MayHaveNulls());
+  ASSERT_TRUE(arr_default_null_count->data()->MayHaveLogicalNulls());
+
+  RunEndEncodedBuilder ree_builder(pool_, std::make_shared<Int32Builder>(pool_),

Review Comment:
   Do we have existing tests like this for sparse and dense union? (given we now have distinct paths for those)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34408: GH-34361: [C++] work-in-progress: Add skeleton of the new APIs for handling null checks correctly for all types

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34408:
URL: https://github.com/apache/arrow/pull/34408#discussion_r1123232448


##########
cpp/src/arrow/array/data.h:
##########
@@ -229,15 +256,88 @@ struct ARROW_EXPORT ArrayData {
 
   void SetNullCount(int64_t v) { null_count.store(v); }
 
-  /// \brief Return null count, or compute and set it if it's not known
+  /// \brief Return physical null count, or compute and set it if it's not known
   int64_t GetNullCount() const;
 
+  /// \brief Return true if the data has a validity bitmap and the physical null
+  /// count is known to be non-zero or not yet known.
+  ///
+  /// Note that this is not the same as MayHaveLogicalNulls, which also checks
+  /// for the presence of nulls in child data for types like unions and run-end
+  /// encoded types.
+  ///
+  /// \see HasValidityBitmap
+  /// \see MayHaveLogicalNulls
   bool MayHaveNulls() const {
     // If an ArrayData is slightly malformed it may have kUnknownNullCount set
     // but no buffer
     return null_count.load() != 0 && buffers[0] != NULLPTR;
   }
 
+  /// \brief Return true if the data has a validity bitmap
+  bool HasValidityBitmap() const { return buffers[0] != NULLPTR; }
+
+  /// \brief Return true if the validity bitmap may have 0's in it, or if the
+  /// child arrays (in the case of types without a validity bitmap) may have
+  /// nulls
+  ///
+  /// This is not a drop-in replacement for MayHaveNulls, as historically
+  /// MayHaveNulls() has been used to check for the presence of a validity
+  /// bitmap that needs to be checked.
+  ///
+  /// Code that previously used MayHaveNulls() and then dealt with the validity
+  /// bitmap directly can be fixed to handle all types correctly without
+  /// performance degradation when handling most types by adopting
+  /// HasValidityBitmap and MayHaveLogicalNulls.
+  ///
+  /// BEFORE:
+  ///
+  ///   uint8_t* validity = array.MayHaveNulls() ? array.buffers[0].data : NULLPTR;
+  ///   for (int64_t i = 0; i < array.length; ++i) {
+  ///     if (validity && !bit_util::GetBit(validity, i)) {
+  ///       continue;  // skip a NULL
+  ///     }
+  ///     ...
+  ///   }
+  ///
+  /// AFTER:
+  ///
+  ///   bool all_valid = !array.MayHaveLogicalNulls();
+  ///   uint8_t* validity = array.HasValidityBitmap() ? array.buffers[0].data : NULLPTR;
+  ///   for (int64_t i = 0; i < array.length; ++i) {
+  ///     bool is_valid = all_valid ||
+  ///                     (validity && bit_util::GetBit(validity, i)) ||
+  ///                     array.IsValid(i);
+  ///     if (!is_valid) {
+  ///       continue;  // skip a NULL
+  ///     }
+  ///     ...
+  ///   }
+  bool MayHaveLogicalNulls() const {
+    if (buffers[0] != NULLPTR) {
+      return null_count.load() != 0;
+    }
+    const auto t = type->id();
+    if (t == Type::SPARSE_UNION || t == Type::DENSE_UNION) {
+      return internal::UnionMayHaveLogicalNulls(*this);
+    }
+    if (t == Type::RUN_END_ENCODED) {
+      return internal::RunEndEncodedMayHaveLogicalNulls(*this);
+    }
+    return false;
+  }
+
+  /// \brief Computes the logical null count for arrays of all types including
+  /// those that do not have a validity bitmap like union and run-end encoded
+  /// arrays
+  ///
+  /// If the array has a validity bitmap, this function behaves the same as
+  /// GetNullCount. For types that have no validity bitmap, this function will
+  /// recompute the null count every time it is called.

Review Comment:
   It would require adding a field to all these classes and updating it correctly when slicing arrays (back to kUnknownCount). Adds more to think about in these central classes. I think it's premature to think about caching this value, it can be done later if we see it becoming an issue in practice.
   
   For most types this re-computation is just returning the cached `null_count_` and depending on how the other types implement the calculation, they will benefit from the `null_count_` cache in child arrays.
   
   In the case of REEs, if we scan the validity bitmap of the values arrays and get a null_count=0, we won't have to do any processing when `ComputeLogicalCount` is called again. And if REE compression is being effective, the values arrays is much smaller than the logical length making the re-computation cheap.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34408:
URL: https://github.com/apache/arrow/pull/34408#discussion_r1126496560


##########
cpp/src/arrow/array/data.h:
##########
@@ -363,14 +475,73 @@ struct ARROW_EXPORT ArraySpan {
     }
   }
 
-  /// \brief Return null count, or compute and set it if it's not known
+  /// \brief Return physical null count, or compute and set it if it's not known
   int64_t GetNullCount() const;
 
+  /// \brief Return true if the array has a validity bitmap and the physical null
+  /// count is known to be non-zero or not yet known
+  ///
+  /// Note that this is not the same as MayHaveLogicalNulls, which also checks
+  /// for the presence of nulls in child data for types like unions and run-end
+  /// encoded types.
+  ///
+  /// \see HasValidityBitmap
+  /// \see MayHaveLogicalNulls
   bool MayHaveNulls() const {
     // If an ArrayData is slightly malformed it may have kUnknownNullCount set
     // but no buffer
     return null_count != 0 && buffers[0].data != NULLPTR;
   }
+
+  /// \brief Return true if the array has a validity bitmap
+  bool HasValidityBitmap() const { return buffers[0].data != NULLPTR; }
+
+  /// \brief Return true if the validity bitmap may have 0's in it, or if the
+  /// child arrays (in the case of types without a validity bitmap) may have
+  /// nulls
+  ///
+  /// \see ArrayData::MayHaveLogicalNulls
+  bool MayHaveLogicalNulls() const {
+    if (buffers[0].data != NULLPTR) {
+      return null_count != 0;
+    }
+    const auto t = type->id();
+    if (t == Type::SPARSE_UNION || t == Type::DENSE_UNION) {
+      return UnionMayHaveLogicalNulls();
+    }
+    if (t == Type::RUN_END_ENCODED) {
+      return RunEndEncodedMayHaveLogicalNulls();
+    }
+    return false;

Review Comment:
   Good catch! But the fix is `null_count_ != 0` because when `length` is `0` we also want to return false here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #34408:
URL: https://github.com/apache/arrow/pull/34408#discussion_r1165530038


##########
cpp/src/arrow/util/union_util.cc:
##########
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/util/union_util.h"
+
+#include <cstdint>
+
+#include "arrow/array/data.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/util/logging.h"
+
+namespace arrow::union_util {
+
+int64_t LogicalSparseUnionNullCount(const ArraySpan& span) {
+  const auto* sparse_union_type =
+      internal::checked_cast<const SparseUnionType*>(span.type);
+  DCHECK_LE(span.child_data.size(), 128);
+
+  const int8_t* types = span.GetValues<int8_t>(1);  // NOLINT
+  int64_t null_count = 0;
+  for (int64_t i = 0; i < span.length; i++) {
+    const int8_t child_id = sparse_union_type->child_ids()[types[span.offset + i]];
+
+    null_count += span.child_data[child_id].IsNull(i);

Review Comment:
   It's not so much about the Arrow spec on Unions, but rather about the small details of our APIs (Span::GetValues returning sliced values or not, Span::child_data returning sliced data or not?)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on PR #34408:
URL: https://github.com/apache/arrow/pull/34408#issuecomment-1498118642

   @ursabot please benchmark lang=C++


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34408:
URL: https://github.com/apache/arrow/pull/34408#discussion_r1140733785


##########
cpp/src/arrow/array/data.h:
##########
@@ -33,6 +33,19 @@
 namespace arrow {
 
 class Array;
+struct ArrayData;
+
+namespace internal {
+// ----------------------------------------------------------------------
+// Null handling for types without a validity bitmap
+
+ARROW_EXPORT bool IsNullSparseUnion(const ArrayData& data, int64_t i);
+ARROW_EXPORT bool IsNullDenseUnion(const ArrayData& data, int64_t i);
+ARROW_EXPORT bool IsNullRunEndEncoded(const ArrayData& data, int64_t i);
+
+ARROW_EXPORT bool UnionMayHaveLogicalNulls(const ArrayData& data);
+ARROW_EXPORT bool RunEndEncodedMayHaveLogicalNulls(const ArrayData& data);

Review Comment:
   The existing benchmarks indicated some regression (maybe due to union arrays being part of `IsNull` benchmarks) so I will investigate more carefully.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #34408:
URL: https://github.com/apache/arrow/pull/34408#issuecomment-1500155077

   Benchmark runs are scheduled for baseline = 84e54308f1399df094b785083e6faf52275d10e9 and contender = fde31ed1cb1de1a0d2b9ade257e66104160d31bb. fde31ed1cb1de1a0d2b9ade257e66104160d31bb is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/d0124fa45a3f47aeae9ba88ba2a01ef8...f8f166d291a9494dad4e45e45c4db045/)
   [Failed :arrow_down:0.56% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/7c06e65eb3e14a0dad4e4f907b3d42e0...19f119196c1e474dbeaa47ad84cb598d/)
   [Finished :arrow_down:0.26% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/0bbe7111b7f84064b5c35756f45c4993...b056ac09036045aa9809992e7e30d090/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/65793c01f964465bb21ba57530ec21c5...5d8e69ba51f04a04b3d364f687083c61/)
   Buildkite builds:
   [Finished] [`fde31ed1` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2646)
   [Failed] [`fde31ed1` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2679)
   [Finished] [`fde31ed1` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2644)
   [Failed] [`fde31ed1` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2670)
   [Finished] [`84e54308` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2645)
   [Failed] [`84e54308` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2678)
   [Finished] [`84e54308` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2643)
   [Failed] [`84e54308` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2669)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amol- closed pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "amol- (via GitHub)" <gi...@apache.org>.
amol- closed pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded
URL: https://github.com/apache/arrow/pull/34408


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34408:
URL: https://github.com/apache/arrow/pull/34408#discussion_r1126435732


##########
cpp/src/arrow/util/union_util.cc:
##########
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/util/union_util.h"
+
+#include <cstdint>
+
+#include "arrow/array/data.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/util/logging.h"
+
+namespace arrow::union_util {
+
+int64_t LogicalSparseUnionNullCount(const ArraySpan& span) {
+  const auto* sparse_union_type =
+      internal::checked_cast<const SparseUnionType*>(span.type);
+  DCHECK_LE(span.child_data.size(), 128);
+
+  const int8_t* types = span.GetValues<int8_t>(1);  // NOLINT

Review Comment:
   That I should use `auto` instead of `int8_t`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #34408:
URL: https://github.com/apache/arrow/pull/34408#discussion_r1125123670


##########
cpp/src/arrow/array/data.h:
##########
@@ -363,14 +475,73 @@ struct ARROW_EXPORT ArraySpan {
     }
   }
 
-  /// \brief Return null count, or compute and set it if it's not known
+  /// \brief Return physical null count, or compute and set it if it's not known
   int64_t GetNullCount() const;
 
+  /// \brief Return true if the array has a validity bitmap and the physical null
+  /// count is known to be non-zero or not yet known
+  ///
+  /// Note that this is not the same as MayHaveLogicalNulls, which also checks
+  /// for the presence of nulls in child data for types like unions and run-end
+  /// encoded types.
+  ///
+  /// \see HasValidityBitmap
+  /// \see MayHaveLogicalNulls
   bool MayHaveNulls() const {
     // If an ArrayData is slightly malformed it may have kUnknownNullCount set
     // but no buffer
     return null_count != 0 && buffers[0].data != NULLPTR;
   }
+
+  /// \brief Return true if the array has a validity bitmap
+  bool HasValidityBitmap() const { return buffers[0].data != NULLPTR; }
+
+  /// \brief Return true if the validity bitmap may have 0's in it, or if the
+  /// child arrays (in the case of types without a validity bitmap) may have
+  /// nulls
+  ///
+  /// \see ArrayData::MayHaveLogicalNulls
+  bool MayHaveLogicalNulls() const {
+    if (buffers[0].data != NULLPTR) {
+      return null_count != 0;
+    }
+    const auto t = type->id();
+    if (t == Type::SPARSE_UNION || t == Type::DENSE_UNION) {
+      return UnionMayHaveLogicalNulls();
+    }
+    if (t == Type::RUN_END_ENCODED) {
+      return RunEndEncodedMayHaveLogicalNulls();
+    }
+    return false;

Review Comment:
   Do we need to look at `null_count` here?  If there is no validity bitmap (and the type would normally have one) then I think we either are "all null" or "not all null" depending on `null_count == this->length` right?



##########
cpp/src/arrow/array/array_test.cc:
##########
@@ -482,35 +524,45 @@ void AssertAppendScalar(MemoryPool* pool, const std::shared_ptr<Scalar>& scalar)
   std::unique_ptr<arrow::ArrayBuilder> builder;
   auto null_scalar = MakeNullScalar(scalar->type);
   ASSERT_OK(MakeBuilderExactIndex(pool, scalar->type, &builder));
-  ASSERT_OK(builder->AppendScalar(*scalar));
-  ASSERT_OK(builder->AppendScalar(*scalar));
-  ASSERT_OK(builder->AppendScalar(*null_scalar));
-  ASSERT_OK(builder->AppendScalars({scalar, null_scalar}));
-  ASSERT_OK(builder->AppendScalar(*scalar, /*n_repeats=*/2));
-  ASSERT_OK(builder->AppendScalar(*null_scalar, /*n_repeats=*/2));
+  ASSERT_OK(builder->AppendScalar(*scalar));                 // [0] = scalar
+  ASSERT_OK(builder->AppendScalar(*scalar));                 // [1] = scalar
+  ASSERT_OK(builder->AppendScalar(*null_scalar));            // [2] = NULL
+  ASSERT_OK(builder->AppendScalars({scalar, null_scalar}));  // [3, 4] = {scalar, NULL}
+  ASSERT_OK(
+      builder->AppendScalar(*scalar, /*n_repeats=*/2));  // [5, 6] = {scalar, scalar}
+  ASSERT_OK(
+      builder->AppendScalar(*null_scalar, /*n_repeats=*/2));  // [7, 8] = {NULL, NULL}
 
   std::shared_ptr<Array> out;
   FinishAndCheckPadding(builder.get(), &out);
   ASSERT_OK(out->ValidateFull());
   AssertTypeEqual(scalar->type, out->type());
   ASSERT_EQ(out->length(), 9);
 
-  const bool can_check_nulls = internal::HasValidityBitmap(out->type()->id());
+  auto out_type_id = out->type()->id();
+  const bool has_validity = internal::HasValidityBitmap(out_type_id);
   // For a dictionary builder, the output dictionary won't necessarily be the same
-  const bool can_check_values = !is_dictionary(out->type()->id());
+  const bool can_check_values = !is_dictionary(out_type_id);
 
-  if (can_check_nulls) {
+  if (has_validity) {
     ASSERT_EQ(out->null_count(), 4);
+  } else {
+    ASSERT_EQ(out->null_count(), 0);
+  }
+  if (scalar->is_valid) {
+    ASSERT_EQ(out->ComputeLogicalNullCount(), 4);
+  } else {
+    ASSERT_EQ(out->ComputeLogicalNullCount(), 9);
   }
 
   for (const auto index : {0, 1, 3, 5, 6}) {
-    ASSERT_FALSE(out->IsNull(index));
+    ASSERT_FALSE(out->IsNull(index) && scalar->is_valid);

Review Comment:
   ```suggestion
       ASSERT_NE(out->IsNull(index), scalar->is_valid);
   ```
   
   Minor readability nit



##########
cpp/src/arrow/util/union_util.cc:
##########
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/util/union_util.h"
+
+#include <cstdint>
+
+#include "arrow/array/data.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/util/logging.h"
+
+namespace arrow::union_util {
+
+int64_t LogicalSparseUnionNullCount(const ArraySpan& span) {
+  const auto* sparse_union_type =
+      internal::checked_cast<const SparseUnionType*>(span.type);
+  DCHECK_LE(span.child_data.size(), 128);
+
+  const int8_t* types = span.GetValues<int8_t>(1);  // NOLINT

Review Comment:
   What lint is being disabled here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #34408:
URL: https://github.com/apache/arrow/pull/34408#issuecomment-1474420767

   @felipecrv it appears there are some significant regressions in `arrow-bit-block-counter-benchmark` and `arrow-compute-vector-sort-benchmark`.  Do you mind investigating?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34408: GH-34361: [C++] work-in-progress: Add skeleton of the new APIs for handling null checks correctly for all types

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34408:
URL: https://github.com/apache/arrow/pull/34408#discussion_r1123232448


##########
cpp/src/arrow/array/data.h:
##########
@@ -229,15 +256,88 @@ struct ARROW_EXPORT ArrayData {
 
   void SetNullCount(int64_t v) { null_count.store(v); }
 
-  /// \brief Return null count, or compute and set it if it's not known
+  /// \brief Return physical null count, or compute and set it if it's not known
   int64_t GetNullCount() const;
 
+  /// \brief Return true if the data has a validity bitmap and the physical null
+  /// count is known to be non-zero or not yet known.
+  ///
+  /// Note that this is not the same as MayHaveLogicalNulls, which also checks
+  /// for the presence of nulls in child data for types like unions and run-end
+  /// encoded types.
+  ///
+  /// \see HasValidityBitmap
+  /// \see MayHaveLogicalNulls
   bool MayHaveNulls() const {
     // If an ArrayData is slightly malformed it may have kUnknownNullCount set
     // but no buffer
     return null_count.load() != 0 && buffers[0] != NULLPTR;
   }
 
+  /// \brief Return true if the data has a validity bitmap
+  bool HasValidityBitmap() const { return buffers[0] != NULLPTR; }
+
+  /// \brief Return true if the validity bitmap may have 0's in it, or if the
+  /// child arrays (in the case of types without a validity bitmap) may have
+  /// nulls
+  ///
+  /// This is not a drop-in replacement for MayHaveNulls, as historically
+  /// MayHaveNulls() has been used to check for the presence of a validity
+  /// bitmap that needs to be checked.
+  ///
+  /// Code that previously used MayHaveNulls() and then dealt with the validity
+  /// bitmap directly can be fixed to handle all types correctly without
+  /// performance degradation when handling most types by adopting
+  /// HasValidityBitmap and MayHaveLogicalNulls.
+  ///
+  /// BEFORE:
+  ///
+  ///   uint8_t* validity = array.MayHaveNulls() ? array.buffers[0].data : NULLPTR;
+  ///   for (int64_t i = 0; i < array.length; ++i) {
+  ///     if (validity && !bit_util::GetBit(validity, i)) {
+  ///       continue;  // skip a NULL
+  ///     }
+  ///     ...
+  ///   }
+  ///
+  /// AFTER:
+  ///
+  ///   bool all_valid = !array.MayHaveLogicalNulls();
+  ///   uint8_t* validity = array.HasValidityBitmap() ? array.buffers[0].data : NULLPTR;
+  ///   for (int64_t i = 0; i < array.length; ++i) {
+  ///     bool is_valid = all_valid ||
+  ///                     (validity && bit_util::GetBit(validity, i)) ||
+  ///                     array.IsValid(i);
+  ///     if (!is_valid) {
+  ///       continue;  // skip a NULL
+  ///     }
+  ///     ...
+  ///   }
+  bool MayHaveLogicalNulls() const {
+    if (buffers[0] != NULLPTR) {
+      return null_count.load() != 0;
+    }
+    const auto t = type->id();
+    if (t == Type::SPARSE_UNION || t == Type::DENSE_UNION) {
+      return internal::UnionMayHaveLogicalNulls(*this);
+    }
+    if (t == Type::RUN_END_ENCODED) {
+      return internal::RunEndEncodedMayHaveLogicalNulls(*this);
+    }
+    return false;
+  }
+
+  /// \brief Computes the logical null count for arrays of all types including
+  /// those that do not have a validity bitmap like union and run-end encoded
+  /// arrays
+  ///
+  /// If the array has a validity bitmap, this function behaves the same as
+  /// GetNullCount. For types that have no validity bitmap, this function will
+  /// recompute the null count every time it is called.

Review Comment:
   It would require adding a field to all these classes and updating it correctly when slicing arrays (back to kUnknownCount). Adds more to think about in these central classes. I think it's premature to think about caching this value, it can be done later if we see it becoming an issue in practice.
   
   For most types this re-computation is just returning the cached `null_count_` and depending on how the other types implement the calculation, they will benefit from the `null_count_` cache in child arrays.
   
   In the case of REEs, if we scan the validity bitmap of the values arrays and get a null_count=0, we won't have to do any processing when `ComputeLogicalCount` is called again. And if REE compression is being effective, the values arrays is much smaller than the logical length, making the re-computation cheap.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #34408: GH-34361: [C++] work-in-progress: Add skeleton of the new APIs for handling null checks correctly for all types

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #34408:
URL: https://github.com/apache/arrow/pull/34408#discussion_r1122928924


##########
cpp/src/arrow/array/data.h:
##########
@@ -229,15 +256,88 @@ struct ARROW_EXPORT ArrayData {
 
   void SetNullCount(int64_t v) { null_count.store(v); }
 
-  /// \brief Return null count, or compute and set it if it's not known
+  /// \brief Return physical null count, or compute and set it if it's not known
   int64_t GetNullCount() const;
 
+  /// \brief Return true if the data has a validity bitmap and the physical null
+  /// count is known to be non-zero or not yet known.
+  ///
+  /// Note that this is not the same as MayHaveLogicalNulls, which also checks
+  /// for the presence of nulls in child data for types like unions and run-end
+  /// encoded types.
+  ///
+  /// \see HasValidityBitmap
+  /// \see MayHaveLogicalNulls
   bool MayHaveNulls() const {
     // If an ArrayData is slightly malformed it may have kUnknownNullCount set
     // but no buffer
     return null_count.load() != 0 && buffers[0] != NULLPTR;
   }
 
+  /// \brief Return true if the data has a validity bitmap
+  bool HasValidityBitmap() const { return buffers[0] != NULLPTR; }
+
+  /// \brief Return true if the validity bitmap may have 0's in it, or if the
+  /// child arrays (in the case of types without a validity bitmap) may have
+  /// nulls
+  ///
+  /// This is not a drop-in replacement for MayHaveNulls, as historically
+  /// MayHaveNulls() has been used to check for the presence of a validity
+  /// bitmap that needs to be checked.
+  ///
+  /// Code that previously used MayHaveNulls() and then dealt with the validity
+  /// bitmap directly can be fixed to handle all types correctly without
+  /// performance degradation when handling most types by adopting
+  /// HasValidityBitmap and MayHaveLogicalNulls.
+  ///
+  /// BEFORE:
+  ///
+  ///   uint8_t* validity = array.MayHaveNulls() ? array.buffers[0].data : NULLPTR;
+  ///   for (int64_t i = 0; i < array.length; ++i) {
+  ///     if (validity && !bit_util::GetBit(validity, i)) {
+  ///       continue;  // skip a NULL
+  ///     }
+  ///     ...
+  ///   }
+  ///
+  /// AFTER:
+  ///
+  ///   bool all_valid = !array.MayHaveLogicalNulls();
+  ///   uint8_t* validity = array.HasValidityBitmap() ? array.buffers[0].data : NULLPTR;
+  ///   for (int64_t i = 0; i < array.length; ++i) {
+  ///     bool is_valid = all_valid ||
+  ///                     (validity && bit_util::GetBit(validity, i)) ||
+  ///                     array.IsValid(i);
+  ///     if (!is_valid) {
+  ///       continue;  // skip a NULL
+  ///     }
+  ///     ...
+  ///   }
+  bool MayHaveLogicalNulls() const {
+    if (buffers[0] != NULLPTR) {
+      return null_count.load() != 0;
+    }
+    const auto t = type->id();
+    if (t == Type::SPARSE_UNION || t == Type::DENSE_UNION) {
+      return internal::UnionMayHaveLogicalNulls(*this);
+    }
+    if (t == Type::RUN_END_ENCODED) {
+      return internal::RunEndEncodedMayHaveLogicalNulls(*this);
+    }
+    return false;
+  }
+
+  /// \brief Computes the logical null count for arrays of all types including
+  /// those that do not have a validity bitmap like union and run-end encoded
+  /// arrays
+  ///
+  /// If the array has a validity bitmap, this function behaves the same as
+  /// GetNullCount. For types that have no validity bitmap, this function will
+  /// recompute the null count every time it is called.

Review Comment:
   It's not possible to cache this like it is done for `null_count_` ? (can child data be changed that would invalidate this?)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34408:
URL: https://github.com/apache/arrow/pull/34408#discussion_r1128724311


##########
cpp/src/arrow/array/data.h:
##########
@@ -229,15 +256,88 @@ struct ARROW_EXPORT ArrayData {
 
   void SetNullCount(int64_t v) { null_count.store(v); }
 
-  /// \brief Return null count, or compute and set it if it's not known
+  /// \brief Return physical null count, or compute and set it if it's not known
   int64_t GetNullCount() const;
 
+  /// \brief Return true if the data has a validity bitmap and the physical null
+  /// count is known to be non-zero or not yet known.
+  ///
+  /// Note that this is not the same as MayHaveLogicalNulls, which also checks
+  /// for the presence of nulls in child data for types like unions and run-end
+  /// encoded types.
+  ///
+  /// \see HasValidityBitmap
+  /// \see MayHaveLogicalNulls
   bool MayHaveNulls() const {
     // If an ArrayData is slightly malformed it may have kUnknownNullCount set
     // but no buffer
     return null_count.load() != 0 && buffers[0] != NULLPTR;
   }
 
+  /// \brief Return true if the data has a validity bitmap
+  bool HasValidityBitmap() const { return buffers[0] != NULLPTR; }
+
+  /// \brief Return true if the validity bitmap may have 0's in it, or if the
+  /// child arrays (in the case of types without a validity bitmap) may have
+  /// nulls
+  ///
+  /// This is not a drop-in replacement for MayHaveNulls, as historically
+  /// MayHaveNulls() has been used to check for the presence of a validity
+  /// bitmap that needs to be checked.
+  ///
+  /// Code that previously used MayHaveNulls() and then dealt with the validity
+  /// bitmap directly can be fixed to handle all types correctly without
+  /// performance degradation when handling most types by adopting
+  /// HasValidityBitmap and MayHaveLogicalNulls.
+  ///
+  /// Before:
+  ///
+  ///     uint8_t* validity = array.MayHaveNulls() ? array.buffers[0].data : NULLPTR;
+  ///     for (int64_t i = 0; i < array.length; ++i) {
+  ///       if (validity && !bit_util::GetBit(validity, i)) {
+  ///         continue;  // skip a NULL
+  ///       }
+  ///       ...
+  ///     }
+  ///
+  /// After:
+  ///
+  ///     bool all_valid = !array.MayHaveLogicalNulls();
+  ///     uint8_t* validity = array.HasValidityBitmap() ? array.buffers[0].data : NULLPTR;
+  ///     for (int64_t i = 0; i < array.length; ++i) {
+  ///       bool is_valid = all_valid ||
+  ///                       (validity && bit_util::GetBit(validity, i)) ||

Review Comment:
   Yes, performance. Avoid all the branches inside `IsValid` and allows the compiler to create specializations of the loop.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #34408:
URL: https://github.com/apache/arrow/pull/34408#issuecomment-1492332277

   Commit a691cb9673d0dd3471bbeaac9529ec1c65195d43 already has scheduled benchmark runs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34408:
URL: https://github.com/apache/arrow/pull/34408#discussion_r1136397402


##########
cpp/src/arrow/array/array_test.cc:
##########
@@ -78,20 +79,58 @@ class TestArray : public ::testing::Test {
   MemoryPool* pool_;
 };
 
+template <class ScalarType>
+std::shared_ptr<ScalarType> MakeScalar(typename ScalarType::ValueType value) {
+  return std::make_shared<ScalarType>(value);
+}

Review Comment:
   I was having trouble getting it to work, but now I've realized I should have just passed `int32_t` instead of `Int32Type` to the template. :)
   
   So I'm removing it now in favor of scalar.h's version.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #34408:
URL: https://github.com/apache/arrow/pull/34408#discussion_r1164143465


##########
cpp/src/arrow/util/union_util.cc:
##########
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/util/union_util.h"
+
+#include <cstdint>
+
+#include "arrow/array/data.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/util/logging.h"
+
+namespace arrow::union_util {
+
+int64_t LogicalSparseUnionNullCount(const ArraySpan& span) {
+  const auto* sparse_union_type =
+      internal::checked_cast<const SparseUnionType*>(span.type);
+  DCHECK_LE(span.child_data.size(), 128);
+
+  const int8_t* types = span.GetValues<int8_t>(1);  // NOLINT
+  int64_t null_count = 0;
+  for (int64_t i = 0; i < span.length; i++) {
+    const int8_t child_id = sparse_union_type->child_ids()[types[span.offset + i]];
+
+    null_count += span.child_data[child_id].IsNull(i);

Review Comment:
   Did you test this with a sliced array? (don't directly see it in the tests) 
   Because in https://github.com/apache/arrow/pull/35036, I based my implementation on this function, and I am finding that I need to remove/add the offset like so:
   
   ```suggestion
       const int8_t child_id = sparse_union_type->child_ids()[types[i]];
   
       null_count += span.child_data[child_id].IsNull(span.offset + i);
   ```
   
   So `types`, which is the result of `Span.GetValues()` already takes into account the span's offset, but `IsNull()` is called on the child_data, and accessing child_data gives the original array data without offset already taken into account.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] zeroshade merged pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "zeroshade (via GitHub)" <gi...@apache.org>.
zeroshade merged PR #34408:
URL: https://github.com/apache/arrow/pull/34408


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34408:
URL: https://github.com/apache/arrow/pull/34408#discussion_r1165681287


##########
cpp/src/arrow/util/union_util.cc:
##########
@@ -0,0 +1,58 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/util/union_util.h"
+
+#include <cstdint>
+
+#include "arrow/array/data.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/util/logging.h"
+
+namespace arrow::union_util {
+
+int64_t LogicalSparseUnionNullCount(const ArraySpan& span) {
+  const auto* sparse_union_type =
+      internal::checked_cast<const SparseUnionType*>(span.type);
+  DCHECK_LE(span.child_data.size(), 128);
+
+  const int8_t* types = span.GetValues<int8_t>(1);  // NOLINT
+  int64_t null_count = 0;
+  for (int64_t i = 0; i < span.length; i++) {
+    const int8_t child_id = sparse_union_type->child_ids()[types[span.offset + i]];
+
+    null_count += span.child_data[child_id].IsNull(i);

Review Comment:
   `child_ids()` is unclear to me because it's a cache if I recall correctly.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34408:
URL: https://github.com/apache/arrow/pull/34408#discussion_r1136374002


##########
cpp/src/arrow/array/data.h:
##########
@@ -33,6 +33,19 @@
 namespace arrow {
 
 class Array;
+struct ArrayData;
+
+namespace internal {
+// ----------------------------------------------------------------------
+// Null handling for types without a validity bitmap
+
+ARROW_EXPORT bool IsNullSparseUnion(const ArrayData& data, int64_t i);
+ARROW_EXPORT bool IsNullDenseUnion(const ArrayData& data, int64_t i);
+ARROW_EXPORT bool IsNullRunEndEncoded(const ArrayData& data, int64_t i);
+
+ARROW_EXPORT bool UnionMayHaveLogicalNulls(const ArrayData& data);
+ARROW_EXPORT bool RunEndEncodedMayHaveLogicalNulls(const ArrayData& data);

Review Comment:
   They are called from `IsNull` and `IsValid` which are inline and having the switch on the type ID inlined allows the compiler to inline the highly predictable branches into the loop bodies from which these functions are called.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34408:
URL: https://github.com/apache/arrow/pull/34408#discussion_r1136406197


##########
cpp/src/arrow/array/data.h:
##########
@@ -33,6 +33,19 @@
 namespace arrow {
 
 class Array;
+struct ArrayData;
+
+namespace internal {
+// ----------------------------------------------------------------------
+// Null handling for types without a validity bitmap
+
+ARROW_EXPORT bool IsNullSparseUnion(const ArrayData& data, int64_t i);
+ARROW_EXPORT bool IsNullDenseUnion(const ArrayData& data, int64_t i);
+ARROW_EXPORT bool IsNullRunEndEncoded(const ArrayData& data, int64_t i);
+
+ARROW_EXPORT bool UnionMayHaveLogicalNulls(const ArrayData& data);
+ARROW_EXPORT bool RunEndEncodedMayHaveLogicalNulls(const ArrayData& data);

Review Comment:
   I could introduce an `IsValidFallback` function that handles the case when the validity buffer is `NULLPTR`, then all these could become `.cc` only.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #34408:
URL: https://github.com/apache/arrow/pull/34408#discussion_r1136402744


##########
cpp/src/arrow/array/array_test.cc:
##########
@@ -377,20 +416,23 @@ TEST_F(TestArray, TestMakeArrayOfNull) {
       ASSERT_EQ(array->length(), length);
       if (is_union(type->id())) {
         ASSERT_EQ(array->null_count(), 0);
+        ASSERT_EQ(array->ComputeLogicalNullCount(), length);

Review Comment:
   Testing unions here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #34408:
URL: https://github.com/apache/arrow/pull/34408#issuecomment-1468961852

   Benchmark runs are scheduled for baseline = fc950192ae7b801a76adc3b3de410dc22da83fd4 and contender = ceacb2ef16a455cf73d6ad31d82928f185646597. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/83e6648f8bae4aed82c17bd6070463d6...4a6e5f584273443c830ec830bbcacc8e/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/70a51126d35740449494eb37418701a9...75036fcbcf8042b2b5aa5f49692988a2/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/74c0a39e0c1b459fbc1ac2ed65f32e56...1e8b0b71552e468985304369803b2b18/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/8fa420ed32aa46ba982d44fa473dbca8...a74e754093424fb1b25a60c506d9a747/)
   Buildkite builds:
   [Scheduled] [`ceacb2ef` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2525)
   [Scheduled] [`ceacb2ef` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2554)
   [Scheduled] [`ceacb2ef` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2522)
   [Scheduled] [`ceacb2ef` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2545)
   [Finished] [`fc950192` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2484)
   [Failed] [`fc950192` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2514)
   [Finished] [`fc950192` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2482)
   [Finished] [`fc950192` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2505)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] amol- commented on pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "amol- (via GitHub)" <gi...@apache.org>.
amol- commented on PR #34408:
URL: https://github.com/apache/arrow/pull/34408#issuecomment-1490669312

   Closing because it has been untouched for a while, in case it's still relevant feel free to reopen and move it forward 👍


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #34408:
URL: https://github.com/apache/arrow/pull/34408#issuecomment-1492144225

   Benchmark runs are scheduled for baseline = fa4b17057570530bba913b92daeba186acb13cba and contender = a691cb9673d0dd3471bbeaac9529ec1c65195d43. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/c07dcadca46f47f2a50206733a0a390f...9141f3f6702b4897b7da6b2cce62fca7/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/268b3f58342e4590842f9703e836a8ba...fa416da6efcf443ebb15fc4f24fc09c1/)
   [Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/46c85f7880974523887f6898cde5c256...89dc7d351ce34aa3a219a9ea36c41cfa/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/bfd26f95a253412ba13c14029b6f3386...1bcc30f94e014b7bb12c8414377ad283/)
   Buildkite builds:
   [Scheduled] [`a691cb96` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2603)
   [Scheduled] [`a691cb96` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2633)
   [Scheduled] [`a691cb96` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2600)
   [Scheduled] [`a691cb96` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2624)
   [Finished] [`fa4b1705` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2598)
   [Failed] [`fa4b1705` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2628)
   [Finished] [`fa4b1705` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2596)
   [Finished] [`fa4b1705` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2619)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #34408:
URL: https://github.com/apache/arrow/pull/34408#issuecomment-1494394329

   Benchmark runs are scheduled for baseline = 34addbb5e169af368fa3df3e0b668c6a9403fd0f and contender = de5e3ded7ec728d99a39042b868b9a27f40d97e6. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Only ['Python'] langs are supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/4c4d46501b8042009d38d7644d4555d2...3c27a73fd49d45dba1714ae5e3344da5/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/114f41ff192a4d1cb3aa57001ce345a8...7ab47a3e85774e00bcd9e2995269c6e7/)
   [Skipped :warning: Only ['JavaScript', 'Python', 'R'] langs are supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/97e151d65fd14027880e1ac966302a78...2ddb56f384614b2496af672de3d32464/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/9a678753f51e4433a2f75b1ef55ae305...e3e16293fc6446158baf7e6eae444b12/)
   Buildkite builds:
   [Scheduled] [`de5e3ded` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2647)
   [Scheduled] [`de5e3ded` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2638)
   [Finished] [`34addbb5` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2615)
   [Failed] [`34addbb5` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2646)
   [Finished] [`34addbb5` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2613)
   [Failed] [`34addbb5` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2637)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on PR #34408:
URL: https://github.com/apache/arrow/pull/34408#issuecomment-1494557232

   Benchmarking investigations currently blocked by https://github.com/apache/arrow/issues/34861
   
   My current suspection is that there isn't really a regression, but to be sure, we need a successful run of `test-mac-arm` as for some reason that's the only benchmark that had regressions.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #34408:
URL: https://github.com/apache/arrow/pull/34408#issuecomment-1492513864

   Benchmark runs are scheduled for baseline = e45ac8db4aba02780ac17974f8df8a2efa812067 and contender = 52946ed80d9963e1a269cf28083ad6608ae3e21b. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Only ['Python'] langs are supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/c3d85c3328c34b1fb60d41b4c19fc4c9...3e4232bb06294105bb19cdcb901cf129/)
   [Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/d40221de9db54e2284f6c4fe945f4235...251badab34924726a78567cd303d25f0/)
   [Skipped :warning: Only ['JavaScript', 'Python', 'R'] langs are supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/35ae0750c138455e9655130a1da2def1...056e861b17b244cc8ffb30ebbc4223d2/)
   [Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/e4753d8a97f84822b94836cf54966d61...29164461db7f483599c89bc81142607a/)
   Buildkite builds:
   [Scheduled] [`52946ed8` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2637)
   [Scheduled] [`52946ed8` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2627)
   [Finished] [`e45ac8db` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2606)
   [Scheduled] [`e45ac8db` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2636)
   [Scheduled] [`e45ac8db` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2601)
   [Scheduled] [`e45ac8db` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2626)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #34408: GH-34361: [C++] Fix the handling of logical nulls for types without bitmaps like Unions and Run-End Encoded

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #34408:
URL: https://github.com/apache/arrow/pull/34408#issuecomment-1499013098

   Should we consider merging this? I think it's important to get this into 12.0 since it is a correctness issue (and for a new type), and if there was actually a performance regression, that's something that could still be fixed later.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org