You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "bkietz (via GitHub)" <gi...@apache.org> on 2023/06/09 21:03:24 UTC

[GitHub] [arrow] bkietz opened a new pull request, #36018: GH-35581: [C++] Store offsets in scalars

bkietz opened a new pull request, #36018:
URL: https://github.com/apache/arrow/pull/36018

   ArraySpan contained scratch space inside itself for storing offsets when viewing a scalar as a length=1 array. This could lead to dangling pointers in copies of the ArraySpan since copies' pointers will always refer to the original's scratch space, which may have been destroyed.
   
   This patch moves that scratch space into Scalar itself, which is more conformant to the contract of a view since the scratch space will be valid as long as the Scalar is alive, regardless of how ArraySpans are copied.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1259853471


##########
cpp/src/arrow/array/data.cc:
##########
@@ -228,22 +229,20 @@ void ArraySpan::SetMembers(const ArrayData& data) {
 namespace {
 
 template <typename offset_type>
-void SetOffsetsForScalar(ArraySpan* span, offset_type* buffer, int64_t value_size,
-                         int buffer_index = 1) {
-  buffer[0] = 0;
-  buffer[1] = static_cast<offset_type>(value_size);
-  span->buffers[buffer_index].data = reinterpret_cast<uint8_t*>(buffer);
-  span->buffers[buffer_index].size = 2 * sizeof(offset_type);
+BufferSpan OffsetsForScalar(uint8_t* scratch_space, offset_type value_size) {
+  auto* offsets = reinterpret_cast<offset_type*>(scratch_space);
+  offsets[0] = 0;
+  offsets[1] = static_cast<offset_type>(value_size);
+  return {scratch_space, sizeof(offset_type) * 2};
 }
 
 int GetNumBuffers(const DataType& type) {
   switch (type.id()) {
     case Type::NA:
     case Type::STRUCT:
     case Type::FIXED_SIZE_LIST:
-      return 1;
     case Type::RUN_END_ENCODED:
-      return 0;
+      return 1;

Review Comment:
   Many places in the codebase assume that `buffers.size() >= 1`, even if `buffers[0] == nullptr`. When I added test cases which exercised REE scalars those places segfaulted. I thought that requiring `buffers.size() >= 1` for REE (as we do for union) was the most expeditious fix



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231207633


##########
cpp/src/arrow/array/data.cc:
##########
@@ -298,6 +298,7 @@ void FillZeroLengthArray(const DataType* type, ArraySpan* span) {
 void ArraySpan::FillFromScalar(const Scalar& value) {
   static uint8_t kTrueBit = 0x01;
   static uint8_t kFalseBit = 0x00;
+  auto* scratch_space = const_cast<Scalar&>(value).scratch_space_.data();

Review Comment:
   This can be avoided by using a plan array instead of `std::array`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231375261


##########
cpp/src/arrow/scalar.h:
##########
@@ -111,6 +111,12 @@ struct ARROW_EXPORT Scalar : public std::enable_shared_from_this<Scalar>,
  protected:
   Scalar(std::shared_ptr<DataType> type, bool is_valid)
       : type(std::move(type)), is_valid(is_valid) {}
+
+  // 16 bytes of scratch space to enable ArraySpan to be a view onto any
+  // Scalar- including binary scalars where we need to create a buffer
+  // that looks like two 32-bit or 64-bit offsets.
+  alignas(int64_t) std::array<uint8_t, sizeof(int64_t) * 2> scratch_space_{0};

Review Comment:
   I'll use the c array, then
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #36018:
URL: https://github.com/apache/arrow/pull/36018#issuecomment-1629537074

   Revision: 4ea421ab4350274493c39ac11c963578efc8f70e
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-8dcf3ee3c6](https://github.com/ursacomputing/crossbow/branches/all?query=actions-8dcf3ee3c6)
   
   |Task|Status|
   |----|------|
   |test-alpine-linux-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-8dcf3ee3c6-github-test-alpine-linux-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5511876324/jobs/10047997340)|
   |test-build-cpp-fuzz|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-8dcf3ee3c6-github-test-build-cpp-fuzz)](https://github.com/ursacomputing/crossbow/actions/runs/5511877081/jobs/10047999131)|
   |test-conda-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-8dcf3ee3c6-github-test-conda-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5511875163/jobs/10047994458)|
   |test-conda-cpp-valgrind|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-8dcf3ee3c6-azure-test-conda-cpp-valgrind)](https://github.com/ursacomputing/crossbow/runs/14922063520)|
   |test-cuda-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-8dcf3ee3c6-github-test-cuda-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5511874683/jobs/10047993447)|
   |test-debian-11-cpp-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-8dcf3ee3c6-github-test-debian-11-cpp-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/5511876795/jobs/10047998540)|
   |test-debian-11-cpp-i386|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-8dcf3ee3c6-github-test-debian-11-cpp-i386)](https://github.com/ursacomputing/crossbow/actions/runs/5511874222/jobs/10047992354)|
   |test-fedora-35-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-8dcf3ee3c6-github-test-fedora-35-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5511876579/jobs/10047998032)|
   |test-ubuntu-20.04-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-8dcf3ee3c6-github-test-ubuntu-20.04-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5511874860/jobs/10047993887)|
   |test-ubuntu-20.04-cpp-bundled|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-8dcf3ee3c6-github-test-ubuntu-20.04-cpp-bundled)](https://github.com/ursacomputing/crossbow/actions/runs/5511877278/jobs/10047999832)|
   |test-ubuntu-20.04-cpp-minimal-with-formats|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-8dcf3ee3c6-github-test-ubuntu-20.04-cpp-minimal-with-formats)](https://github.com/ursacomputing/crossbow/actions/runs/5511875643/jobs/10047995634)|
   |test-ubuntu-20.04-cpp-thread-sanitizer|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-8dcf3ee3c6-github-test-ubuntu-20.04-cpp-thread-sanitizer)](https://github.com/ursacomputing/crossbow/actions/runs/5511875880/jobs/10047996209)|
   |test-ubuntu-22.04-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-8dcf3ee3c6-github-test-ubuntu-22.04-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5511874429/jobs/10047992720)|
   |test-ubuntu-22.04-cpp-20|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-8dcf3ee3c6-github-test-ubuntu-22.04-cpp-20)](https://github.com/ursacomputing/crossbow/actions/runs/5511876093/jobs/10047996734)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou merged pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou merged PR #36018:
URL: https://github.com/apache/arrow/pull/36018


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231097520


##########
cpp/src/arrow/array/data.cc:
##########
@@ -228,11 +228,12 @@ void ArraySpan::SetMembers(const ArrayData& data) {
 namespace {
 
 template <typename offset_type>
-void SetOffsetsForScalar(ArraySpan* span, offset_type* buffer, int64_t value_size,
+void SetOffsetsForScalar(ArraySpan* span, uint8_t* buffer, int64_t value_size,

Review Comment:
   Yes, but we always initialize value_size from `ArrayData::length` or `Buffer::length()`, which are both `int64_t`- so we'd still need the `static_cast<offset_type>` but it'd be at each call site to SetOffsetsForScalar. We would be able to call it without the explicit offset_type template argument, so that'd be a bit nicer
   
   If you think that's preferable, I'll do that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #36018:
URL: https://github.com/apache/arrow/pull/36018#issuecomment-1593291126

   There are CI failures though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #36018:
URL: https://github.com/apache/arrow/pull/36018#issuecomment-1593289461

   This LGTM on the principle.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231378365


##########
cpp/src/arrow/scalar.h:
##########
@@ -111,6 +111,12 @@ struct ARROW_EXPORT Scalar : public std::enable_shared_from_this<Scalar>,
  protected:
   Scalar(std::shared_ptr<DataType> type, bool is_valid)
       : type(std::move(type)), is_valid(is_valid) {}
+
+  // 16 bytes of scratch space to enable ArraySpan to be a view onto any
+  // Scalar- including binary scalars where we need to create a buffer
+  // that looks like two 32-bit or 64-bit offsets.
+  alignas(int64_t) std::array<uint8_t, sizeof(int64_t) * 2> scratch_space_;

Review Comment:
   If you'd prefer, the intermediate class wouldn't be hard. I'd think leaving the scratch space here is fine until we have a reason to move it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1254384855


##########
cpp/src/arrow/array/data.cc:
##########
@@ -384,26 +382,31 @@ void ArraySpan::FillFromScalar(const Scalar& value) {
       this->child_data[i].FillFromScalar(*scalar.value[i]);
     }
   } else if (is_union(type_id)) {
+    // Dense union needs scratch space to store both offsets and a type code
+    struct UnionScratchSpace {
+      alignas(int64_t) uint8_t type_code;
+      alignas(int64_t) uint8_t offsets[sizeof(int32_t) * 2];
+    };
+    static_assert(sizeof(UnionScratchSpace) <= sizeof(UnionScalar::scratch_space_));
+    auto* union_scratch_space =
+        new (checked_cast<const UnionScalar&>(value).scratch_space_) UnionScratchSpace{};

Review Comment:
   Why do you want to explicit create `UnionScratchSpace`? It should be a trivial type AFAICT.
   
   I was only mentioning padding bits in the event that your code was attempting to initialize them :-)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #36018:
URL: https://github.com/apache/arrow/pull/36018#issuecomment-1621781227

   Revision: bfeef64ab8b2386a12f2007145c7ce18fd1746e6
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-814207b368](https://github.com/ursacomputing/crossbow/branches/all?query=actions-814207b368)
   
   |Task|Status|
   |----|------|
   |test-alpine-linux-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-814207b368-github-test-alpine-linux-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5464983760/jobs/9947771055)|
   |test-build-cpp-fuzz|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-814207b368-github-test-build-cpp-fuzz)](https://github.com/ursacomputing/crossbow/actions/runs/5464984645/jobs/9947772960)|
   |test-conda-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-814207b368-github-test-conda-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5464983111/jobs/9947769670)|
   |test-conda-cpp-valgrind|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-814207b368-azure-test-conda-cpp-valgrind)](https://github.com/ursacomputing/crossbow/runs/14795444025)|
   |test-cuda-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-814207b368-github-test-cuda-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5464980811/jobs/9947764371)|
   |test-debian-11-cpp-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-814207b368-github-test-debian-11-cpp-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/5464981573/jobs/9947766034)|
   |test-debian-11-cpp-i386|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-814207b368-github-test-debian-11-cpp-i386)](https://github.com/ursacomputing/crossbow/actions/runs/5464981834/jobs/9947766730)|
   |test-fedora-35-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-814207b368-github-test-fedora-35-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5464983328/jobs/9947770145)|
   |test-ubuntu-20.04-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-814207b368-github-test-ubuntu-20.04-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5464983971/jobs/9947771627)|
   |test-ubuntu-20.04-cpp-bundled|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-814207b368-github-test-ubuntu-20.04-cpp-bundled)](https://github.com/ursacomputing/crossbow/actions/runs/5464981043/jobs/9947764775)|
   |test-ubuntu-20.04-cpp-minimal-with-formats|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-814207b368-github-test-ubuntu-20.04-cpp-minimal-with-formats)](https://github.com/ursacomputing/crossbow/actions/runs/5464982793/jobs/9947769099)|
   |test-ubuntu-20.04-cpp-thread-sanitizer|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-814207b368-github-test-ubuntu-20.04-cpp-thread-sanitizer)](https://github.com/ursacomputing/crossbow/actions/runs/5464984303/jobs/9947772413)|
   |test-ubuntu-22.04-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-814207b368-github-test-ubuntu-22.04-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5464985014/jobs/9947773881)|
   |test-ubuntu-22.04-cpp-20|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-814207b368-github-test-ubuntu-22.04-cpp-20)](https://github.com/ursacomputing/crossbow/actions/runs/5464981269/jobs/9947765396)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1253129938


##########
cpp/src/arrow/array/data.cc:
##########
@@ -384,26 +382,31 @@ void ArraySpan::FillFromScalar(const Scalar& value) {
       this->child_data[i].FillFromScalar(*scalar.value[i]);
     }
   } else if (is_union(type_id)) {
+    // Dense union needs scratch space to store both offsets and a type code
+    struct UnionScratchSpace {
+      alignas(int64_t) uint8_t type_code;
+      alignas(int64_t) uint8_t offsets[sizeof(int32_t) * 2];
+    };
+    static_assert(sizeof(UnionScratchSpace) <= sizeof(UnionScalar::scratch_space_));
+    auto* union_scratch_space =
+        new (checked_cast<const UnionScalar&>(value).scratch_space_) UnionScratchSpace{};

Review Comment:
   Why not `reinterpret_cast` the scratch space? Is it because of aliasing?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jorisvandenbossche commented on pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on PR #36018:
URL: https://github.com/apache/arrow/pull/36018#issuecomment-1621767703

   > This PR should probably wait until https://github.com/apache/arrow/pull/35036 is merged
   
   It's fine for this to be merged first, will update my PR later afterwards then (since I am using the scratch space in the other PR, fixing that first here also make sense)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231217444


##########
cpp/src/arrow/scalar.h:
##########
@@ -111,6 +111,12 @@ struct ARROW_EXPORT Scalar : public std::enable_shared_from_this<Scalar>,
  protected:
   Scalar(std::shared_ptr<DataType> type, bool is_valid)
       : type(std::move(type)), is_valid(is_valid) {}
+
+  // 16 bytes of scratch space to enable ArraySpan to be a view onto any
+  // Scalar- including binary scalars where we need to create a buffer
+  // that looks like two 32-bit or 64-bit offsets.
+  alignas(int64_t) std::array<uint8_t, sizeof(int64_t) * 2> scratch_space_{0};

Review Comment:
   @bkietz we don't fail CI build in `clang-tidy` failures though.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231506309


##########
cpp/src/arrow/array/data.cc:
##########
@@ -298,6 +298,7 @@ void FillZeroLengthArray(const DataType* type, ArraySpan* span) {
 void ArraySpan::FillFromScalar(const Scalar& value) {
   static uint8_t kTrueBit = 0x01;
   static uint8_t kFalseBit = 0x00;
+  auto* scratch_space = const_cast<Scalar&>(value).scratch_space_.data();

Review Comment:
   Even with `scratch_space_` declared as `mutable`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231206065


##########
cpp/src/arrow/array/data.cc:
##########
@@ -228,11 +228,12 @@ void ArraySpan::SetMembers(const ArrayData& data) {
 namespace {
 
 template <typename offset_type>
-void SetOffsetsForScalar(ArraySpan* span, offset_type* buffer, int64_t value_size,
+void SetOffsetsForScalar(ArraySpan* span, uint8_t* buffer, int64_t value_size,

Review Comment:
   I think the cast at the callsite is the right way to go because by looking at just `SetOffsetsForScalar` I can't say it never truncates values into a 32-bit `offset_type`.
   
   @pitrou but this is an offset buffer, not a single array length/offset.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231504922


##########
cpp/src/arrow/array/data.cc:
##########
@@ -381,24 +381,29 @@ void ArraySpan::FillFromScalar(const Scalar& value) {
       this->child_data[i].FillFromScalar(*scalar.value[i]);
     }
   } else if (is_union(type_id)) {
+    // Dense union needs scratch space to store both offsets and a type code
+    struct UnionScratchSpace {
+      uint8_t type_code;
+      alignas(int32_t) uint8_t offsets[sizeof(int32_t) * 2];

Review Comment:
   If, out of caution, we want these pointers to have 64-bit alignment like provided by `malloc`, the `type_code` should align to `int64_t`. As it is now, you're using the second and third 4-byte block of the scratch space, when you could use the third and fourth.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1259902839


##########
cpp/src/arrow/array/data.cc:
##########
@@ -130,7 +130,8 @@ std::shared_ptr<ArrayData> ArrayData::Make(std::shared_ptr<DataType> type, int64
 }
 
 std::shared_ptr<ArrayData> ArrayData::Slice(int64_t off, int64_t len) const {
-  ARROW_CHECK_LE(off, length) << "Slice offset greater than array length";
+  ARROW_CHECK_LE(off, length) << "Slice offset (" << off
+                              << ") greater than array length (" << length << ")";

Review Comment:
   I wonder if this has to always perform int to string conversion. Even when the check passes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #36018:
URL: https://github.com/apache/arrow/pull/36018#issuecomment-1623788282

   @github-actions crossbow submit -g cpp


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #36018:
URL: https://github.com/apache/arrow/pull/36018#issuecomment-1623869602

   It would be a good idea to add systematic tests for `FillFromScalar`. Some combination of random array generation and `Array::GetScalar` should help avoid per-type data wrangling.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1253129019


##########
cpp/src/arrow/array/data.cc:
##########
@@ -384,26 +382,31 @@ void ArraySpan::FillFromScalar(const Scalar& value) {
       this->child_data[i].FillFromScalar(*scalar.value[i]);
     }
   } else if (is_union(type_id)) {
+    // Dense union needs scratch space to store both offsets and a type code
+    struct UnionScratchSpace {
+      alignas(int64_t) uint8_t type_code;
+      alignas(int64_t) uint8_t offsets[sizeof(int32_t) * 2];
+    };
+    static_assert(sizeof(UnionScratchSpace) <= sizeof(UnionScalar::scratch_space_));
+    auto* union_scratch_space =
+        new (checked_cast<const UnionScalar&>(value).scratch_space_) UnionScratchSpace{};
+
     // First buffer is kept null since unions have no validity vector
     this->buffers[0] = {};
 
-    this->buffers[1].data = reinterpret_cast<uint8_t*>(this->scratch_space);
+    this->buffers[1].data = &union_scratch_space->type_code;
     this->buffers[1].size = 1;
-    int8_t* type_codes = reinterpret_cast<int8_t*>(this->scratch_space);
-    type_codes[0] = checked_cast<const UnionScalar&>(value).type_code;
+    new (&union_scratch_space->type_code)
+        int8_t{checked_cast<const UnionScalar&>(value).type_code};

Review Comment:
   I don't understand why this isn't simply:
   ```suggestion
       union_scratch_space->type_code = checked_cast<const UnionScalar&>(value).type_code;
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on PR #36018:
URL: https://github.com/apache/arrow/pull/36018#issuecomment-1585127822

   @felipecrv 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1224861844


##########
cpp/CMakePresets.json:
##########
@@ -11,6 +11,7 @@
       "hidden": true,
       "generator": "Ninja",
       "cacheVariables": {
+        "CMAKE_EXPORT_COMPILE_COMMANDS": "ON",

Review Comment:
   I don't think we want to merge this and generate the huge file in CI builds, right?



##########
cpp/src/arrow/array/data.cc:
##########
@@ -265,13 +266,12 @@ int GetNumBuffers(const DataType& type) {
 namespace internal {
 
 void FillZeroLengthArray(const DataType* type, ArraySpan* span) {
-  memset(span->scratch_space, 0x00, sizeof(span->scratch_space));
-
   span->type = type;
   span->length = 0;
   int num_buffers = GetNumBuffers(*type);
   for (int i = 0; i < num_buffers; ++i) {
-    span->buffers[i].data = reinterpret_cast<uint8_t*>(span->scratch_space);
+    static int64_t zero{0};
+    span->buffers[i].data = reinterpret_cast<uint8_t*>(&zero);

Review Comment:
   You're casting a pointer to static data to a non-const pointer. That could be bad if you needed this buffer to always be zero, but you don't really need that, so maybe rename it from `zero` to `scratch_buffer`?



##########
cpp/src/arrow/scalar.h:
##########
@@ -111,6 +111,12 @@ struct ARROW_EXPORT Scalar : public std::enable_shared_from_this<Scalar>,
  protected:
   Scalar(std::shared_ptr<DataType> type, bool is_valid)
       : type(std::move(type)), is_valid(is_valid) {}
+
+  // 16 bytes of scratch space to enable ArraySpan to be a view onto any
+  // Scalar- including binary scalars where we need to create a buffer
+  // that looks like two 32-bit or 64-bit offsets.
+  alignas(int64_t) std::array<uint8_t, sizeof(int64_t) * 2> scratch_space_{0};

Review Comment:
   Why not `alignas(uint64_t) uint8_t scratch_space_[16];`?
   
   I would leave out the 0 initialization as well. Leave it undefined for scalars that don't need the scratch buffers.



##########
cpp/src/arrow/array/data.cc:
##########
@@ -265,13 +266,12 @@ int GetNumBuffers(const DataType& type) {
 namespace internal {
 
 void FillZeroLengthArray(const DataType* type, ArraySpan* span) {
-  memset(span->scratch_space, 0x00, sizeof(span->scratch_space));
-
   span->type = type;
   span->length = 0;
   int num_buffers = GetNumBuffers(*type);
   for (int i = 0; i < num_buffers; ++i) {
-    span->buffers[i].data = reinterpret_cast<uint8_t*>(span->scratch_space);
+    static int64_t zero{0};
+    span->buffers[i].data = reinterpret_cast<uint8_t*>(&zero);

Review Comment:
   This would be new, but it's probably better to set buffer 0 (validity bitmap) to `nullptr` than to set it to a empty buffer.



##########
cpp/src/arrow/array/data.cc:
##########
@@ -228,11 +228,12 @@ void ArraySpan::SetMembers(const ArrayData& data) {
 namespace {
 
 template <typename offset_type>
-void SetOffsetsForScalar(ArraySpan* span, offset_type* buffer, int64_t value_size,
+void SetOffsetsForScalar(ArraySpan* span, uint8_t* buffer, int64_t value_size,

Review Comment:
   `value_size` could be declared as `offset_type` instead of `int64_t`.



##########
cpp/src/arrow/array/data.cc:
##########
@@ -354,6 +353,7 @@ void ArraySpan::FillFromScalar(const Scalar& value) {
     const auto& scalar = checked_cast<const BaseListScalar&>(value);
 
     int64_t value_length = 0;
+    (void)value_length;

Review Comment:
   Leftover?



##########
cpp/src/arrow/scalar.h:
##########
@@ -111,6 +111,12 @@ struct ARROW_EXPORT Scalar : public std::enable_shared_from_this<Scalar>,
  protected:
   Scalar(std::shared_ptr<DataType> type, bool is_valid)
       : type(std::move(type)), is_valid(is_valid) {}
+
+  // 16 bytes of scratch space to enable ArraySpan to be a view onto any
+  // Scalar- including binary scalars where we need to create a buffer
+  // that looks like two 32-bit or 64-bit offsets.
+  alignas(int64_t) std::array<uint8_t, sizeof(int64_t) * 2> scratch_space_{0};

Review Comment:
   This would also remove the need to call `.data()` above which was weird.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1261156907


##########
cpp/src/arrow/array/data.cc:
##########
@@ -228,22 +229,20 @@ void ArraySpan::SetMembers(const ArrayData& data) {
 namespace {
 
 template <typename offset_type>
-void SetOffsetsForScalar(ArraySpan* span, offset_type* buffer, int64_t value_size,
-                         int buffer_index = 1) {
-  buffer[0] = 0;
-  buffer[1] = static_cast<offset_type>(value_size);
-  span->buffers[buffer_index].data = reinterpret_cast<uint8_t*>(buffer);
-  span->buffers[buffer_index].size = 2 * sizeof(offset_type);
+BufferSpan OffsetsForScalar(uint8_t* scratch_space, offset_type value_size) {
+  auto* offsets = reinterpret_cast<offset_type*>(scratch_space);
+  offsets[0] = 0;
+  offsets[1] = static_cast<offset_type>(value_size);
+  return {scratch_space, sizeof(offset_type) * 2};
 }
 
 int GetNumBuffers(const DataType& type) {
   switch (type.id()) {
     case Type::NA:
     case Type::STRUCT:
     case Type::FIXED_SIZE_LIST:
-      return 1;
     case Type::RUN_END_ENCODED:
-      return 0;
+      return 1;

Review Comment:
   FWIW, this was the only place in the codebase which didn't already give REE at least one buffer; the constructors, the builder, ... already did so



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #36018:
URL: https://github.com/apache/arrow/pull/36018#issuecomment-1623792960

   Revision: 307aae9a688720c1a5e46b9784917668ab3792c9
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-02b630ffcd](https://github.com/ursacomputing/crossbow/branches/all?query=actions-02b630ffcd)
   
   |Task|Status|
   |----|------|
   |test-alpine-linux-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-02b630ffcd-github-test-alpine-linux-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5476739772/jobs/9974717450)|
   |test-build-cpp-fuzz|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-02b630ffcd-github-test-build-cpp-fuzz)](https://github.com/ursacomputing/crossbow/actions/runs/5476741775/jobs/9974722503)|
   |test-conda-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-02b630ffcd-github-test-conda-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5476741469/jobs/9974721618)|
   |test-conda-cpp-valgrind|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-02b630ffcd-azure-test-conda-cpp-valgrind)](https://github.com/ursacomputing/crossbow/runs/14830082458)|
   |test-cuda-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-02b630ffcd-github-test-cuda-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5476739174/jobs/9974715648)|
   |test-debian-11-cpp-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-02b630ffcd-github-test-debian-11-cpp-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/5476738498/jobs/9974713908)|
   |test-debian-11-cpp-i386|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-02b630ffcd-github-test-debian-11-cpp-i386)](https://github.com/ursacomputing/crossbow/actions/runs/5476739983/jobs/9974717896)|
   |test-fedora-35-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-02b630ffcd-github-test-fedora-35-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5476740346/jobs/9974718886)|
   |test-ubuntu-20.04-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-02b630ffcd-github-test-ubuntu-20.04-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5476738164/jobs/9974713084)|
   |test-ubuntu-20.04-cpp-bundled|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-02b630ffcd-github-test-ubuntu-20.04-cpp-bundled)](https://github.com/ursacomputing/crossbow/actions/runs/5476737432/jobs/9974711602)|
   |test-ubuntu-20.04-cpp-minimal-with-formats|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-02b630ffcd-github-test-ubuntu-20.04-cpp-minimal-with-formats)](https://github.com/ursacomputing/crossbow/actions/runs/5476741225/jobs/9974721068)|
   |test-ubuntu-20.04-cpp-thread-sanitizer|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-02b630ffcd-github-test-ubuntu-20.04-cpp-thread-sanitizer)](https://github.com/ursacomputing/crossbow/actions/runs/5476737784/jobs/9974712207)|
   |test-ubuntu-22.04-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-02b630ffcd-github-test-ubuntu-22.04-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5476740913/jobs/9974720359)|
   |test-ubuntu-22.04-cpp-20|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-02b630ffcd-github-test-ubuntu-22.04-cpp-20)](https://github.com/ursacomputing/crossbow/actions/runs/5476739483/jobs/9974716604)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1254451565


##########
cpp/src/arrow/array/data.cc:
##########
@@ -384,26 +382,31 @@ void ArraySpan::FillFromScalar(const Scalar& value) {
       this->child_data[i].FillFromScalar(*scalar.value[i]);
     }
   } else if (is_union(type_id)) {
+    // Dense union needs scratch space to store both offsets and a type code
+    struct UnionScratchSpace {
+      alignas(int64_t) uint8_t type_code;
+      alignas(int64_t) uint8_t offsets[sizeof(int32_t) * 2];
+    };
+    static_assert(sizeof(UnionScratchSpace) <= sizeof(UnionScalar::scratch_space_));
+    auto* union_scratch_space =
+        new (checked_cast<const UnionScalar&>(value).scratch_space_) UnionScratchSpace{};

Review Comment:
   It is a trivial type but those still have lifetimes, and accessing objects outside their lifetime is UB. That's probably excessively pedantic though; the c++20 addition of [implicit lifetime](https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p0593r6.html#when-to-create-objects) was mostly a formal acceptance of cases where we just cast the storage pointer to the object pointer and start accessing.
   
   TL;DR: if you prefer 
   ```suggestion
       auto* union_scratch_space =
           reinterpret_cast<UnionScratchSpace*>(checked_cast<const UnionScalar&>(value).scratch_space_);
   ```
   that should be fine



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231182209


##########
cpp/src/arrow/scalar.h:
##########
@@ -111,6 +111,12 @@ struct ARROW_EXPORT Scalar : public std::enable_shared_from_this<Scalar>,
  protected:
   Scalar(std::shared_ptr<DataType> type, bool is_valid)
       : type(std::move(type)), is_valid(is_valid) {}
+
+  // 16 bytes of scratch space to enable ArraySpan to be a view onto any
+  // Scalar- including binary scalars where we need to create a buffer
+  // that looks like two 32-bit or 64-bit offsets.
+  alignas(int64_t) std::array<uint8_t, sizeof(int64_t) * 2> scratch_space_;

Review Comment:
   Do we want to put this in the base `Scalar` or only in an intermediate subclass `ScalarWithScratchSpace` perhaps?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231097263


##########
cpp/src/arrow/array/data.cc:
##########
@@ -265,13 +266,12 @@ int GetNumBuffers(const DataType& type) {
 namespace internal {
 
 void FillZeroLengthArray(const DataType* type, ArraySpan* span) {
-  memset(span->scratch_space, 0x00, sizeof(span->scratch_space));
-
   span->type = type;
   span->length = 0;
   int num_buffers = GetNumBuffers(*type);
   for (int i = 0; i < num_buffers; ++i) {
-    span->buffers[i].data = reinterpret_cast<uint8_t*>(span->scratch_space);
+    static int64_t zero{0};
+    span->buffers[i].data = reinterpret_cast<uint8_t*>(&zero);

Review Comment:
   This buffer is not allowed to be null, since it holds offsets



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1232191690


##########
cpp/src/arrow/array/data.cc:
##########
@@ -381,24 +381,29 @@ void ArraySpan::FillFromScalar(const Scalar& value) {
       this->child_data[i].FillFromScalar(*scalar.value[i]);
     }
   } else if (is_union(type_id)) {
+    // Dense union needs scratch space to store both offsets and a type code
+    struct UnionScratchSpace {
+      uint8_t type_code;
+      alignas(int32_t) uint8_t offsets[sizeof(int32_t) * 2];

Review Comment:
   That also does the trick :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #36018:
URL: https://github.com/apache/arrow/pull/36018#issuecomment-1623833265

   @github-actions crossbow submit -g cpp


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1259908969


##########
cpp/src/arrow/array/data.cc:
##########
@@ -228,22 +229,20 @@ void ArraySpan::SetMembers(const ArrayData& data) {
 namespace {
 
 template <typename offset_type>
-void SetOffsetsForScalar(ArraySpan* span, offset_type* buffer, int64_t value_size,
-                         int buffer_index = 1) {
-  buffer[0] = 0;
-  buffer[1] = static_cast<offset_type>(value_size);
-  span->buffers[buffer_index].data = reinterpret_cast<uint8_t*>(buffer);
-  span->buffers[buffer_index].size = 2 * sizeof(offset_type);
+BufferSpan OffsetsForScalar(uint8_t* scratch_space, offset_type value_size) {
+  auto* offsets = reinterpret_cast<offset_type*>(scratch_space);
+  offsets[0] = 0;
+  offsets[1] = static_cast<offset_type>(value_size);
+  return {scratch_space, sizeof(offset_type) * 2};
 }
 
 int GetNumBuffers(const DataType& type) {
   switch (type.id()) {
     case Type::NA:
     case Type::STRUCT:
     case Type::FIXED_SIZE_LIST:
-      return 1;
     case Type::RUN_END_ENCODED:
-      return 0;
+      return 1;

Review Comment:
   `RUN_END_ENCODED` doesn't need any buffer but `NA` also does not and we return `1` here. :thinking: 
   
   It's a "once you start lying you can't stop lying" kind of problem for `GetNumBuffers`
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1259913048


##########
cpp/src/arrow/compute/kernels/test_util.cc:
##########
@@ -263,6 +263,23 @@ void CheckScalarUnary(std::string func_name, std::shared_ptr<DataType> in_ty,
 
 void CheckVectorUnary(std::string func_name, Datum input, Datum expected,
                       const FunctionOptions* options) {
+  // WTF debug
+  std::cout << "input: ";
+  switch (input.kind()) {
+    case Datum::ARRAY:
+      std::cout << input.make_array()->ToString();
+      break;
+    case Datum::CHUNKED_ARRAY:
+      std::cout << input.chunked_array()->ToString();
+      break;
+    case Datum::SCALAR:
+      std::cout << input.scalar()->ToString();
+      break;
+    default:
+      std::cout << input.ToString();
+      break;
+  }
+  std::cout << std::endl;

Review Comment:
   It seems to have been removed and I was confused by the GitHub notification showing me this code as green.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] conbench-apache-arrow[bot] commented on pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "conbench-apache-arrow[bot] (via GitHub)" <gi...@apache.org>.
conbench-apache-arrow[bot] commented on PR #36018:
URL: https://github.com/apache/arrow/pull/36018#issuecomment-1646081146

   After merging your PR, Conbench analyzed the 6 benchmarking runs that have been run so far on merge-commit 96ac514383bc33b02d19d739a6428ad9b3f453b7.
   
   There were no benchmark performance regressions. 🎉
   
   The [full Conbench report](https://github.com/apache/arrow/runs/15244079987) has more details. It also includes information about possible false positives for unstable benchmarks that are known to sometimes produce them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1254455956


##########
cpp/src/arrow/array/data.cc:
##########
@@ -384,26 +382,31 @@ void ArraySpan::FillFromScalar(const Scalar& value) {
       this->child_data[i].FillFromScalar(*scalar.value[i]);
     }
   } else if (is_union(type_id)) {
+    // Dense union needs scratch space to store both offsets and a type code
+    struct UnionScratchSpace {
+      alignas(int64_t) uint8_t type_code;
+      alignas(int64_t) uint8_t offsets[sizeof(int32_t) * 2];
+    };
+    static_assert(sizeof(UnionScratchSpace) <= sizeof(UnionScalar::scratch_space_));
+    auto* union_scratch_space =
+        new (checked_cast<const UnionScalar&>(value).scratch_space_) UnionScratchSpace{};

Review Comment:
   Well, we already do this with things such as `struct MonthDayNanos`, don't we? So, yes, I think we should keep things simple here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231206700


##########
cpp/src/arrow/array/data.cc:
##########
@@ -265,13 +266,12 @@ int GetNumBuffers(const DataType& type) {
 namespace internal {
 
 void FillZeroLengthArray(const DataType* type, ArraySpan* span) {
-  memset(span->scratch_space, 0x00, sizeof(span->scratch_space));
-
   span->type = type;
   span->length = 0;
   int num_buffers = GetNumBuffers(*type);
   for (int i = 0; i < num_buffers; ++i) {
-    span->buffers[i].data = reinterpret_cast<uint8_t*>(span->scratch_space);
+    static int64_t zero{0};
+    span->buffers[i].data = reinterpret_cast<uint8_t*>(&zero);

Review Comment:
   Got it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231097263


##########
cpp/src/arrow/array/data.cc:
##########
@@ -265,13 +266,12 @@ int GetNumBuffers(const DataType& type) {
 namespace internal {
 
 void FillZeroLengthArray(const DataType* type, ArraySpan* span) {
-  memset(span->scratch_space, 0x00, sizeof(span->scratch_space));
-
   span->type = type;
   span->length = 0;
   int num_buffers = GetNumBuffers(*type);
   for (int i = 0; i < num_buffers; ++i) {
-    span->buffers[i].data = reinterpret_cast<uint8_t*>(span->scratch_space);
+    static int64_t zero{0};
+    span->buffers[i].data = reinterpret_cast<uint8_t*>(&zero);

Review Comment:
   I'll try that



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231097520


##########
cpp/src/arrow/array/data.cc:
##########
@@ -228,11 +228,12 @@ void ArraySpan::SetMembers(const ArrayData& data) {
 namespace {
 
 template <typename offset_type>
-void SetOffsetsForScalar(ArraySpan* span, offset_type* buffer, int64_t value_size,
+void SetOffsetsForScalar(ArraySpan* span, uint8_t* buffer, int64_t value_size,

Review Comment:
   Okay



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231177455


##########
cpp/src/arrow/array/data.cc:
##########
@@ -298,6 +298,7 @@ void FillZeroLengthArray(const DataType* type, ArraySpan* span) {
 void ArraySpan::FillFromScalar(const Scalar& value) {
   static uint8_t kTrueBit = 0x01;
   static uint8_t kFalseBit = 0x00;
+  auto* scratch_space = const_cast<Scalar&>(value).scratch_space_.data();

Review Comment:
   Ouch :-(



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1259906174


##########
cpp/src/arrow/array/data.cc:
##########
@@ -130,7 +130,8 @@ std::shared_ptr<ArrayData> ArrayData::Make(std::shared_ptr<DataType> type, int64
 }
 
 std::shared_ptr<ArrayData> ArrayData::Slice(int64_t off, int64_t len) const {
-  ARROW_CHECK_LE(off, length) << "Slice offset greater than array length";
+  ARROW_CHECK_LE(off, length) << "Slice offset (" << off
+                              << ") greater than array length (" << length << ")";

Review Comment:
   It shouldn't. The `operator<<` is only invoked when the assertion fails. Though operator precedence makes me slightly uneasy:
   ```c++
   #define ARROW_CHECK_OR_LOG(condition, level) \
     ARROW_PREDICT_TRUE(condition)              \
     ? ARROW_IGNORE_EXPR(0)                     \
     : ::arrow::util::Voidify() & ARROW_LOG(level) << " Check failed: " #condition " "
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231223683


##########
cpp/src/arrow/scalar.h:
##########
@@ -111,6 +111,12 @@ struct ARROW_EXPORT Scalar : public std::enable_shared_from_this<Scalar>,
  protected:
   Scalar(std::shared_ptr<DataType> type, bool is_valid)
       : type(std::move(type)), is_valid(is_valid) {}
+
+  // 16 bytes of scratch space to enable ArraySpan to be a view onto any
+  // Scalar- including binary scalars where we need to create a buffer
+  // that looks like two 32-bit or 64-bit offsets.
+  alignas(int64_t) std::array<uint8_t, sizeof(int64_t) * 2> scratch_space_{0};

Review Comment:
   https://github.com/apache/arrow/pull/36102



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231117144


##########
cpp/src/arrow/array/data.cc:
##########
@@ -265,13 +266,12 @@ int GetNumBuffers(const DataType& type) {
 namespace internal {
 
 void FillZeroLengthArray(const DataType* type, ArraySpan* span) {
-  memset(span->scratch_space, 0x00, sizeof(span->scratch_space));
-
   span->type = type;
   span->length = 0;
   int num_buffers = GetNumBuffers(*type);
   for (int i = 0; i < num_buffers; ++i) {
-    span->buffers[i].data = reinterpret_cast<uint8_t*>(span->scratch_space);
+    static int64_t zero{0};
+    span->buffers[i].data = reinterpret_cast<uint8_t*>(&zero);

Review Comment:
   This buffer should never be read, let alone mutated. I tend to agree with you that this isn't ideal, but [BufferSpan is just always mutable](
   https://github.com/bkietz/arrow/blob/f13a76ac35dc3586e9c2bd37770ccf22d3d15eb4/cpp/src/arrow/array/data.h#L356-L359) and we have to trust that nobody will misuse it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231207633


##########
cpp/src/arrow/array/data.cc:
##########
@@ -298,6 +298,7 @@ void FillZeroLengthArray(const DataType* type, ArraySpan* span) {
 void ArraySpan::FillFromScalar(const Scalar& value) {
   static uint8_t kTrueBit = 0x01;
   static uint8_t kFalseBit = 0x00;
+  auto* scratch_space = const_cast<Scalar&>(value).scratch_space_.data();

Review Comment:
   This can be avoided by using a plain array instead of `std::array`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1259856252


##########
cpp/src/arrow/array/data.cc:
##########
@@ -228,22 +229,20 @@ void ArraySpan::SetMembers(const ArrayData& data) {
 namespace {
 
 template <typename offset_type>
-void SetOffsetsForScalar(ArraySpan* span, offset_type* buffer, int64_t value_size,
-                         int buffer_index = 1) {
-  buffer[0] = 0;
-  buffer[1] = static_cast<offset_type>(value_size);
-  span->buffers[buffer_index].data = reinterpret_cast<uint8_t*>(buffer);
-  span->buffers[buffer_index].size = 2 * sizeof(offset_type);
+BufferSpan OffsetsForScalar(uint8_t* scratch_space, offset_type value_size) {
+  auto* offsets = reinterpret_cast<offset_type*>(scratch_space);
+  offsets[0] = 0;
+  offsets[1] = static_cast<offset_type>(value_size);
+  return {scratch_space, sizeof(offset_type) * 2};
 }
 
 int GetNumBuffers(const DataType& type) {
   switch (type.id()) {
     case Type::NA:
     case Type::STRUCT:
     case Type::FIXED_SIZE_LIST:
-      return 1;
     case Type::RUN_END_ENCODED:
-      return 0;
+      return 1;

Review Comment:
   @felipecrv What do you think here? Should we require a one-element `buffers` vector for REE?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1259908221


##########
cpp/src/arrow/array/data.cc:
##########
@@ -130,7 +130,8 @@ std::shared_ptr<ArrayData> ArrayData::Make(std::shared_ptr<DataType> type, int64
 }
 
 std::shared_ptr<ArrayData> ArrayData::Slice(int64_t off, int64_t len) const {
-  ARROW_CHECK_LE(off, length) << "Slice offset greater than array length";
+  ARROW_CHECK_LE(off, length) << "Slice offset (" << off
+                              << ") greater than array length (" << length << ")";

Review Comment:
   We could actually add a test for that, IMHO.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1259255441


##########
cpp/src/arrow/array/data.cc:
##########
@@ -228,22 +229,20 @@ void ArraySpan::SetMembers(const ArrayData& data) {
 namespace {
 
 template <typename offset_type>
-void SetOffsetsForScalar(ArraySpan* span, offset_type* buffer, int64_t value_size,
-                         int buffer_index = 1) {
-  buffer[0] = 0;
-  buffer[1] = static_cast<offset_type>(value_size);
-  span->buffers[buffer_index].data = reinterpret_cast<uint8_t*>(buffer);
-  span->buffers[buffer_index].size = 2 * sizeof(offset_type);
+BufferSpan OffsetsForScalar(uint8_t* scratch_space, offset_type value_size) {
+  auto* offsets = reinterpret_cast<offset_type*>(scratch_space);
+  offsets[0] = 0;
+  offsets[1] = static_cast<offset_type>(value_size);
+  return {scratch_space, sizeof(offset_type) * 2};
 }
 
 int GetNumBuffers(const DataType& type) {
   switch (type.id()) {
     case Type::NA:
     case Type::STRUCT:
     case Type::FIXED_SIZE_LIST:
-      return 1;
     case Type::RUN_END_ENCODED:
-      return 0;
+      return 1;

Review Comment:
   I merged a bit quickly, why does `RUN_END_ENCODED` need one buffer here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1259902764


##########
cpp/src/arrow/compute/kernels/test_util.cc:
##########
@@ -263,6 +263,23 @@ void CheckScalarUnary(std::string func_name, std::shared_ptr<DataType> in_ty,
 
 void CheckVectorUnary(std::string func_name, Datum input, Datum expected,
                       const FunctionOptions* options) {
+  // WTF debug
+  std::cout << "input: ";
+  switch (input.kind()) {
+    case Datum::ARRAY:
+      std::cout << input.make_array()->ToString();
+      break;
+    case Datum::CHUNKED_ARRAY:
+      std::cout << input.chunked_array()->ToString();
+      break;
+    case Datum::SCALAR:
+      std::cout << input.scalar()->ToString();
+      break;
+    default:
+      std::cout << input.ToString();
+      break;
+  }
+  std::cout << std::endl;

Review Comment:
   Woops. Definitely not. Sorry, I should have been more careful here. @bkietz Do you want to prepare a fixup PR?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1254370004


##########
cpp/src/arrow/array/data.cc:
##########
@@ -384,26 +382,31 @@ void ArraySpan::FillFromScalar(const Scalar& value) {
       this->child_data[i].FillFromScalar(*scalar.value[i]);
     }
   } else if (is_union(type_id)) {
+    // Dense union needs scratch space to store both offsets and a type code
+    struct UnionScratchSpace {
+      alignas(int64_t) uint8_t type_code;
+      alignas(int64_t) uint8_t offsets[sizeof(int32_t) * 2];
+    };
+    static_assert(sizeof(UnionScratchSpace) <= sizeof(UnionScalar::scratch_space_));
+    auto* union_scratch_space =
+        new (checked_cast<const UnionScalar&>(value).scratch_space_) UnionScratchSpace{};

Review Comment:
   Do we need to zero-initialize padding bits? They're outside the range viewed by the BufferSpans- not even under a null bit



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231096874


##########
cpp/src/arrow/array/data.cc:
##########
@@ -354,6 +353,7 @@ void ArraySpan::FillFromScalar(const Scalar& value) {
     const auto& scalar = checked_cast<const BaseListScalar&>(value);
 
     int64_t value_length = 0;
+    (void)value_length;

Review Comment:
   Yep



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231534941


##########
cpp/src/arrow/array/data.cc:
##########
@@ -381,24 +381,29 @@ void ArraySpan::FillFromScalar(const Scalar& value) {
       this->child_data[i].FillFromScalar(*scalar.value[i]);
     }
   } else if (is_union(type_id)) {
+    // Dense union needs scratch space to store both offsets and a type code
+    struct UnionScratchSpace {
+      uint8_t type_code;
+      alignas(int32_t) uint8_t offsets[sizeof(int32_t) * 2];

Review Comment:
   ```suggestion
         uint8_t type_code;
         alignas(int64_t) uint8_t offsets[sizeof(int32_t) * 2];
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1254369336


##########
cpp/src/arrow/array/data.cc:
##########
@@ -384,26 +382,31 @@ void ArraySpan::FillFromScalar(const Scalar& value) {
       this->child_data[i].FillFromScalar(*scalar.value[i]);
     }
   } else if (is_union(type_id)) {
+    // Dense union needs scratch space to store both offsets and a type code
+    struct UnionScratchSpace {
+      alignas(int64_t) uint8_t type_code;
+      alignas(int64_t) uint8_t offsets[sizeof(int32_t) * 2];
+    };
+    static_assert(sizeof(UnionScratchSpace) <= sizeof(UnionScalar::scratch_space_));
+    auto* union_scratch_space =
+        new (checked_cast<const UnionScalar&>(value).scratch_space_) UnionScratchSpace{};

Review Comment:
   I was using `new` to explicitly create the UnionScratchSpace. The reinterpret_cast would probably be fine, but technically doesn't implicitly start the lifetime of the UnionScratchSpace object until c++20 where it's [part of the object model](https://eel.is/c++draft/basic.memobj#intro.object-10) that implicit lifetime types may be used this way. Would you prefer:
   ```suggestion
       auto* union_scratch_space = [&] {
         UnionScratchSpace union_scratch_space{...initializers...};
         std::memcpy(checked_cast<const UnionScalar&>(value).scratch_space_,
                     &union_scratch_space,
                     sizeof(UnionScratchSpace));
         return reinterpret_cast<UnionScratchSpace*>(checked_cast<const UnionScalar&>(value).scratch_space_);
       }();
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1254352313


##########
cpp/src/arrow/array/data.cc:
##########
@@ -384,26 +382,31 @@ void ArraySpan::FillFromScalar(const Scalar& value) {
       this->child_data[i].FillFromScalar(*scalar.value[i]);
     }
   } else if (is_union(type_id)) {
+    // Dense union needs scratch space to store both offsets and a type code
+    struct UnionScratchSpace {
+      alignas(int64_t) uint8_t type_code;
+      alignas(int64_t) uint8_t offsets[sizeof(int32_t) * 2];
+    };
+    static_assert(sizeof(UnionScratchSpace) <= sizeof(UnionScalar::scratch_space_));
+    auto* union_scratch_space =
+        new (checked_cast<const UnionScalar&>(value).scratch_space_) UnionScratchSpace{};
+
     // First buffer is kept null since unions have no validity vector
     this->buffers[0] = {};
 
-    this->buffers[1].data = reinterpret_cast<uint8_t*>(this->scratch_space);
+    this->buffers[1].data = &union_scratch_space->type_code;
     this->buffers[1].size = 1;
-    int8_t* type_codes = reinterpret_cast<int8_t*>(this->scratch_space);
-    type_codes[0] = checked_cast<const UnionScalar&>(value).type_code;
+    new (&union_scratch_space->type_code)
+        int8_t{checked_cast<const UnionScalar&>(value).type_code};

Review Comment:
   I'll do that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1259901071


##########
cpp/src/arrow/compute/kernels/test_util.cc:
##########
@@ -263,6 +263,23 @@ void CheckScalarUnary(std::string func_name, std::shared_ptr<DataType> in_ty,
 
 void CheckVectorUnary(std::string func_name, Datum input, Datum expected,
                       const FunctionOptions* options) {
+  // WTF debug
+  std::cout << "input: ";
+  switch (input.kind()) {
+    case Datum::ARRAY:
+      std::cout << input.make_array()->ToString();
+      break;
+    case Datum::CHUNKED_ARRAY:
+      std::cout << input.chunked_array()->ToString();
+      break;
+    case Datum::SCALAR:
+      std::cout << input.scalar()->ToString();
+      break;
+    default:
+      std::cout << input.ToString();
+      break;
+  }
+  std::cout << std::endl;

Review Comment:
   Should this have been merged?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on PR #36018:
URL: https://github.com/apache/arrow/pull/36018#issuecomment-1629483971

   @pitrou PTAL


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1232306819


##########
cpp/CMakePresets.json:
##########
@@ -11,6 +11,7 @@
       "hidden": true,
       "generator": "Ninja",
       "cacheVariables": {
+        "CMAKE_EXPORT_COMPILE_COMMANDS": "ON",

Review Comment:
   https://github.com/apache/arrow/issues/36124



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231100522


##########
cpp/src/arrow/scalar.h:
##########
@@ -111,6 +111,12 @@ struct ARROW_EXPORT Scalar : public std::enable_shared_from_this<Scalar>,
  protected:
   Scalar(std::shared_ptr<DataType> type, bool is_valid)
       : type(std::move(type)), is_valid(is_valid) {}
+
+  // 16 bytes of scratch space to enable ArraySpan to be a view onto any
+  // Scalar- including binary scalars where we need to create a buffer
+  // that looks like two 32-bit or 64-bit offsets.
+  alignas(int64_t) std::array<uint8_t, sizeof(int64_t) * 2> scratch_space_{0};

Review Comment:
   Alright, I'll remove the zero init.
   
   WRT std::array: I'm following the [linter](https://releases.llvm.org/13.0.0/tools/clang/tools/extra/docs/clang-tidy/checks/modernize-avoid-c-arrays.html) which recommends avoiding C arrays in favor of `std::array`. If we would prefer to disable that lint check (and correspondingly prefer usage of C arrays), let's do that in a follow up



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231172318


##########
cpp/CMakePresets.json:
##########
@@ -11,6 +11,7 @@
       "hidden": true,
       "generator": "Ninja",
       "cacheVariables": {
+        "CMAKE_EXPORT_COMPILE_COMMANDS": "ON",

Review Comment:
   I agree that the generated file is not huge, but I also think these presets should enable just what's necessary for development. They serve as documentation and starting points for users. It's easy to enable additional options on top of an existing preset, AFAIK.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231216861


##########
cpp/src/arrow/scalar.h:
##########
@@ -111,6 +111,12 @@ struct ARROW_EXPORT Scalar : public std::enable_shared_from_this<Scalar>,
  protected:
   Scalar(std::shared_ptr<DataType> type, bool is_valid)
       : type(std::move(type)), is_valid(is_valid) {}
+
+  // 16 bytes of scratch space to enable ArraySpan to be a view onto any
+  // Scalar- including binary scalars where we need to create a buffer
+  // that looks like two 32-bit or 64-bit offsets.
+  alignas(int64_t) std::array<uint8_t, sizeof(int64_t) * 2> scratch_space_{0};

Review Comment:
   @pitrou I think we should disable that `clang-tidy` rule. It's not a good fit for the kind of things we do in Arrow.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1253130934


##########
cpp/src/arrow/array/data.cc:
##########
@@ -384,26 +382,31 @@ void ArraySpan::FillFromScalar(const Scalar& value) {
       this->child_data[i].FillFromScalar(*scalar.value[i]);
     }
   } else if (is_union(type_id)) {
+    // Dense union needs scratch space to store both offsets and a type code
+    struct UnionScratchSpace {
+      alignas(int64_t) uint8_t type_code;
+      alignas(int64_t) uint8_t offsets[sizeof(int32_t) * 2];
+    };
+    static_assert(sizeof(UnionScratchSpace) <= sizeof(UnionScalar::scratch_space_));
+    auto* union_scratch_space =
+        new (checked_cast<const UnionScalar&>(value).scratch_space_) UnionScratchSpace{};

Review Comment:
   Also this might not zero-initialize padding bytes.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #36018:
URL: https://github.com/apache/arrow/pull/36018#issuecomment-1621776412

   @github-actions crossbow submit -g cpp


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231200848


##########
cpp/CMakePresets.json:
##########
@@ -11,6 +11,7 @@
       "hidden": true,
       "generator": "Ninja",
       "cacheVariables": {
+        "CMAKE_EXPORT_COMPILE_COMMANDS": "ON",

Review Comment:
   But which tools need this in CI? I use this file on my machine for LSP on my editor.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231381358


##########
cpp/CMakePresets.json:
##########
@@ -11,6 +11,7 @@
       "hidden": true,
       "generator": "Ninja",
       "cacheVariables": {
+        "CMAKE_EXPORT_COMPILE_COMMANDS": "ON",

Review Comment:
   The presets aren't just for CI builds, as @pitrou says it's a starting point for users. I would say this is an easy and useful default to provide for users- myself included. I'll move this to a different issue, though, since it's not directly pertinent to offset storage



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231376143


##########
cpp/src/arrow/array/data.cc:
##########
@@ -298,6 +298,7 @@ void FillZeroLengthArray(const DataType* type, ArraySpan* span) {
 void ArraySpan::FillFromScalar(const Scalar& value) {
   static uint8_t kTrueBit = 0x01;
   static uint8_t kFalseBit = 0x00;
+  auto* scratch_space = const_cast<Scalar&>(value).scratch_space_.data();

Review Comment:
   To be clear, we'll still need the const_cast
   ```suggestion
     auto* scratch_space = const_cast<Scalar&>(value).scratch_space_;
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1232215797


##########
cpp/src/arrow/array/data.cc:
##########
@@ -298,6 +298,7 @@ void FillZeroLengthArray(const DataType* type, ArraySpan* span) {
 void ArraySpan::FillFromScalar(const Scalar& value) {
   static uint8_t kTrueBit = 0x01;
   static uint8_t kFalseBit = 0x00;
+  auto* scratch_space = const_cast<Scalar&>(value).scratch_space_.data();

Review Comment:
   Nope, mutable spares us that. Forgot to remove the const_cast after I added it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231174726


##########
cpp/src/arrow/scalar.h:
##########
@@ -111,6 +111,12 @@ struct ARROW_EXPORT Scalar : public std::enable_shared_from_this<Scalar>,
  protected:
   Scalar(std::shared_ptr<DataType> type, bool is_valid)
       : type(std::move(type)), is_valid(is_valid) {}
+
+  // 16 bytes of scratch space to enable ArraySpan to be a view onto any
+  // Scalar- including binary scalars where we need to create a buffer
+  // that looks like two 32-bit or 64-bit offsets.
+  alignas(int64_t) std::array<uint8_t, sizeof(int64_t) * 2> scratch_space_{0};

Review Comment:
   We have tons of C arrays around, so I'm surprised the linter would warn about this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231175989


##########
cpp/src/arrow/array/data.cc:
##########
@@ -228,11 +228,12 @@ void ArraySpan::SetMembers(const ArrayData& data) {
 namespace {
 
 template <typename offset_type>
-void SetOffsetsForScalar(ArraySpan* span, offset_type* buffer, int64_t value_size,
+void SetOffsetsForScalar(ArraySpan* span, uint8_t* buffer, int64_t value_size,

Review Comment:
   We tend to use `int64_t` for sizes consistently, so let's continue to do that here, IMHO.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on PR #36018:
URL: https://github.com/apache/arrow/pull/36018#issuecomment-1594652669

   This PR should probably wait until https://github.com/apache/arrow/pull/35036 is merged


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1231075128


##########
cpp/CMakePresets.json:
##########
@@ -11,6 +11,7 @@
       "hidden": true,
       "generator": "Ninja",
       "cacheVariables": {
+        "CMAKE_EXPORT_COMPILE_COMMANDS": "ON",

Review Comment:
   It's less than a megabyte on my machine; I think it's not expensive and it's worthwhile to enable clang tools by default



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #36018:
URL: https://github.com/apache/arrow/pull/36018#discussion_r1259945806


##########
cpp/src/arrow/compute/kernels/test_util.cc:
##########
@@ -263,6 +263,23 @@ void CheckScalarUnary(std::string func_name, std::shared_ptr<DataType> in_ty,
 
 void CheckVectorUnary(std::string func_name, Datum input, Datum expected,
                       const FunctionOptions* options) {
+  // WTF debug
+  std::cout << "input: ";
+  switch (input.kind()) {
+    case Datum::ARRAY:
+      std::cout << input.make_array()->ToString();
+      break;
+    case Datum::CHUNKED_ARRAY:
+      std::cout << input.chunked_array()->ToString();
+      break;
+    case Datum::SCALAR:
+      std::cout << input.scalar()->ToString();
+      break;
+    default:
+      std::cout << input.ToString();
+      break;
+  }
+  std::cout << std::endl;

Review Comment:
   This is not on main: https://github.com/apache/arrow/blob/96ac514383bc33b02d19d739a6428ad9b3f453b7/cpp/src/arrow/array/data.cc#L266
   
   https://github.com/apache/arrow/commit/96ac514383bc33b02d19d739a6428ad9b3f453b7#diff-1d848527e898cd9da6a41bdbd68cab5d5826986128851220d20d1fdac9752d8b



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #36018:
URL: https://github.com/apache/arrow/pull/36018#issuecomment-1629530634

   @github-actions crossbow submit -g cpp


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #36018: GH-35581: [C++] Store offsets in scalars

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #36018:
URL: https://github.com/apache/arrow/pull/36018#issuecomment-1623837505

   Revision: 2d1510f1854b82a260b7195fa1ccd919341c212e
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-f5bd8a15b8](https://github.com/ursacomputing/crossbow/branches/all?query=actions-f5bd8a15b8)
   
   |Task|Status|
   |----|------|
   |test-alpine-linux-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f5bd8a15b8-github-test-alpine-linux-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5477004147/jobs/9975394586)|
   |test-build-cpp-fuzz|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f5bd8a15b8-github-test-build-cpp-fuzz)](https://github.com/ursacomputing/crossbow/actions/runs/5477006726/jobs/9975402072)|
   |test-conda-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f5bd8a15b8-github-test-conda-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5477003078/jobs/9975392032)|
   |test-conda-cpp-valgrind|[![Azure](https://dev.azure.com/ursacomputing/crossbow/_apis/build/status/ursacomputing.crossbow?branchName=actions-f5bd8a15b8-azure-test-conda-cpp-valgrind)](https://github.com/ursacomputing/crossbow/runs/14830971919)|
   |test-cuda-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f5bd8a15b8-github-test-cuda-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5477003939/jobs/9975394034)|
   |test-debian-11-cpp-amd64|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f5bd8a15b8-github-test-debian-11-cpp-amd64)](https://github.com/ursacomputing/crossbow/actions/runs/5477005709/jobs/9975399182)|
   |test-debian-11-cpp-i386|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f5bd8a15b8-github-test-debian-11-cpp-i386)](https://github.com/ursacomputing/crossbow/actions/runs/5477007020/jobs/9975402876)|
   |test-fedora-35-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f5bd8a15b8-github-test-fedora-35-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5477006424/jobs/9975401295)|
   |test-ubuntu-20.04-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f5bd8a15b8-github-test-ubuntu-20.04-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5477003607/jobs/9975393243)|
   |test-ubuntu-20.04-cpp-bundled|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f5bd8a15b8-github-test-ubuntu-20.04-cpp-bundled)](https://github.com/ursacomputing/crossbow/actions/runs/5477004455/jobs/9975395500)|
   |test-ubuntu-20.04-cpp-minimal-with-formats|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f5bd8a15b8-github-test-ubuntu-20.04-cpp-minimal-with-formats)](https://github.com/ursacomputing/crossbow/actions/runs/5477006254/jobs/9975400710)|
   |test-ubuntu-20.04-cpp-thread-sanitizer|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f5bd8a15b8-github-test-ubuntu-20.04-cpp-thread-sanitizer)](https://github.com/ursacomputing/crossbow/actions/runs/5477004657/jobs/9975396021)|
   |test-ubuntu-22.04-cpp|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f5bd8a15b8-github-test-ubuntu-22.04-cpp)](https://github.com/ursacomputing/crossbow/actions/runs/5477003351/jobs/9975392548)|
   |test-ubuntu-22.04-cpp-20|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-f5bd8a15b8-github-test-ubuntu-22.04-cpp-20)](https://github.com/ursacomputing/crossbow/actions/runs/5477005085/jobs/9975397341)|


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org