You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "felipecrv (via GitHub)" <gi...@apache.org> on 2023/04/14 13:05:15 UTC

[GitHub] [arrow] felipecrv opened a new pull request, #35129: GH-35059: [C++] Fix "hash_count" for run-end encoded inputs

felipecrv opened a new pull request, #35129:
URL: https://github.com/apache/arrow/pull/35129

   ### Rationale for this change
   
   Fixing a bug.
   
   ### What changes are included in this PR?
   
   Changes to the `"hash_count"` kernel implementation to handle REE and union arrays correctly.
   
   - [x] Generic (potentially slow) implementation
   - [ ] REE-specialized implementation
   
   ### Are these changes tested?
   
   Yes, by modifying the existing unit tests.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #35129: GH-35059: [C++] Fix "hash_count" for run-end encoded inputs

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #35129:
URL: https://github.com/apache/arrow/pull/35129#issuecomment-1530811382

   ['Python', 'R'] benchmarks have high level of regressions.
   [test-mac-arm](https://conbench.ursa.dev/compare/runs/6331adfa453343f1ab3db9d2d59c4f96...e33671654caa403daaf24724a74617bd/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #35129: GH-35059: [C++] Fix "hash_count" for run-end encoded inputs

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #35129:
URL: https://github.com/apache/arrow/pull/35129#discussion_r1179605685


##########
cpp/src/arrow/testing/gtest_util.cc:
##########
@@ -427,6 +428,24 @@ std::shared_ptr<Table> TableFromJSON(const std::shared_ptr<Schema>& schema,
   return *Table::FromRecordBatches(schema, std::move(batches));
 }
 
+Result<std::shared_ptr<Table>> RunEndEncodeTableColumns(
+    const Table& table, const std::vector<int>& column_indices) {
+  const int num_columns = table.num_columns();
+  std::vector<std::shared_ptr<ChunkedArray>> encoded_columns;
+  encoded_columns.reserve(num_columns);
+  for (int i = 0; i < num_columns; i++) {
+    if (std::find(column_indices.begin(), column_indices.end(), i) !=
+        column_indices.end()) {
+      EXPECT_OK_AND_ASSIGN(auto run_end_encoded, compute::RunEndEncode(table.column(i)));
+      EXPECT_EQ(run_end_encoded.kind(), Datum::CHUNKED_ARRAY);

Review Comment:
   Given you're returning a result anyways you could just use ARROW_ASSIGN_OR_RAISE and ARROW_RETURN_NOT_OK here right?



##########
cpp/src/arrow/compute/kernels/hash_aggregate.cc:
##########
@@ -305,6 +306,56 @@ struct GroupedCountImpl : public GroupedAggregator {
     return Status::OK();
   }
 
+  template <bool count_valid>
+  struct RunEndEncodedCountImpl {
+    /// Count the number of valid or invalid values in a run-end-encoded array.
+    ///
+    /// \param[in] input the run-end-encoded array
+    /// \param[out] counts the counts being accumulated
+    /// \param[in] g the group ids of the values in the array
+    template <typename RunEndCType>
+    void DoCount(const ArraySpan& input, int64_t* counts, const uint32_t* g) {
+      ree_util::RunEndEncodedArraySpan<RunEndCType> ree_span(input);
+      const auto* physical_validity = ree_util::ValuesArray(input).GetValues<uint8_t>(0);
+      auto end = ree_span.end();
+      for (auto it = ree_span.begin(); it != end; ++it) {
+        const bool is_valid = bit_util::GetBit(physical_validity, it.index_into_array());
+        if constexpr (count_valid) {
+          if (is_valid) {
+            for (int64_t i = 0; i < it.run_length(); ++i, ++g) {
+              counts[*g] += 1;
+            }
+          } else {
+            g += it.run_length();
+          }
+        } else {
+          if (!is_valid) {
+            for (int64_t i = 0; i < it.run_length(); ++i, ++g) {
+              counts[*g] += 1;
+            }
+          } else {
+            g += it.run_length();
+          }
+        }

Review Comment:
   Can you collapse these two branches?
   
   ```
   if (is_valid == count_valid) {
     ...
   }
   ```
   
   Or does that not somehow generate the same benefit?  Seems like either way there is a single if comparison.  I expect the compiler would generate the same for `if (x == true)` that it does for `if (x)`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace merged pull request #35129: GH-35059: [C++] Fix "hash_count" for run-end encoded inputs

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace merged PR #35129:
URL: https://github.com/apache/arrow/pull/35129


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #35129: GH-35059: [C++] Fix "hash_count" for run-end encoded inputs

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #35129:
URL: https://github.com/apache/arrow/pull/35129#discussion_r1179837738


##########
cpp/src/arrow/testing/gtest_util.cc:
##########
@@ -427,6 +428,24 @@ std::shared_ptr<Table> TableFromJSON(const std::shared_ptr<Schema>& schema,
   return *Table::FromRecordBatches(schema, std::move(batches));
 }
 
+Result<std::shared_ptr<Table>> RunEndEncodeTableColumns(
+    const Table& table, const std::vector<int>& column_indices) {
+  const int num_columns = table.num_columns();
+  std::vector<std::shared_ptr<ChunkedArray>> encoded_columns;
+  encoded_columns.reserve(num_columns);
+  for (int i = 0; i < num_columns; i++) {
+    if (std::find(column_indices.begin(), column_indices.end(), i) !=
+        column_indices.end()) {
+      EXPECT_OK_AND_ASSIGN(auto run_end_encoded, compute::RunEndEncode(table.column(i)));
+      EXPECT_EQ(run_end_encoded.kind(), Datum::CHUNKED_ARRAY);

Review Comment:
   Done. Pushed two new commits.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #35129: GH-35059: [C++] Fix "hash_count" for run-end encoded inputs

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #35129:
URL: https://github.com/apache/arrow/pull/35129#issuecomment-1530805441

   Benchmark runs are scheduled for baseline = 99ac74e6890767e1a3cf96d7aff0b96b9503153c and contender = be12888997c81b1fb7947f6284be1256edd4d3e4. be12888997c81b1fb7947f6284be1256edd4d3e4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/48452dd0b5a1435996191c282fe7a073...cbfbc9871a1846e3bb5326528317068b/)
   [Failed :arrow_down:3.03% :arrow_up:0.26%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/6331adfa453343f1ab3db9d2d59c4f96...e33671654caa403daaf24724a74617bd/)
   [Failed :arrow_down:0.0% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/5378984264184d6eb9445a08ae5abd9d...6d138698130349e496384450e7a95c04/)
   [Finished :arrow_down:1.11% :arrow_up:0.06%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/ca813469982a4b6ea2c6a5f0c8f2aa62...4bbf47eecfe94ee5a7bdd779d2fd1c9d/)
   Buildkite builds:
   [Finished] [`be128889` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2804)
   [Failed] [`be128889` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2838)
   [Failed] [`be128889` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2802)
   [Finished] [`be128889` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2829)
   [Finished] [`99ac74e6` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2803)
   [Failed] [`99ac74e6` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2837)
   [Failed] [`99ac74e6` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2801)
   [Finished] [`99ac74e6` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2828)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on pull request #35129: GH-35059: [C++] Fix "hash_count" for run-end encoded inputs

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on PR #35129:
URL: https://github.com/apache/arrow/pull/35129#issuecomment-1513666384

   @westonpace @jorisvandenbossche 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #35129: GH-35059: [C++] Fix "hash_count" for run-end encoded inputs

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #35129:
URL: https://github.com/apache/arrow/pull/35129#discussion_r1179831741


##########
cpp/src/arrow/compute/kernels/hash_aggregate.cc:
##########
@@ -305,6 +306,56 @@ struct GroupedCountImpl : public GroupedAggregator {
     return Status::OK();
   }
 
+  template <bool count_valid>
+  struct RunEndEncodedCountImpl {
+    /// Count the number of valid or invalid values in a run-end-encoded array.
+    ///
+    /// \param[in] input the run-end-encoded array
+    /// \param[out] counts the counts being accumulated
+    /// \param[in] g the group ids of the values in the array
+    template <typename RunEndCType>
+    void DoCount(const ArraySpan& input, int64_t* counts, const uint32_t* g) {
+      ree_util::RunEndEncodedArraySpan<RunEndCType> ree_span(input);
+      const auto* physical_validity = ree_util::ValuesArray(input).GetValues<uint8_t>(0);
+      auto end = ree_span.end();
+      for (auto it = ree_span.begin(); it != end; ++it) {
+        const bool is_valid = bit_util::GetBit(physical_validity, it.index_into_array());
+        if constexpr (count_valid) {
+          if (is_valid) {
+            for (int64_t i = 0; i < it.run_length(); ++i, ++g) {
+              counts[*g] += 1;
+            }
+          } else {
+            g += it.run_length();
+          }
+        } else {
+          if (!is_valid) {
+            for (int64_t i = 0; i < it.run_length(); ++i, ++g) {
+              counts[*g] += 1;
+            }
+          } else {
+            g += it.run_length();
+          }
+        }

Review Comment:
   Interesting catch. `count_valid` is a template param, so the compiler will generate the same code `if (is_valid)` in one instantiation and `if (!is_valid)` on another. No problem there. The final code is quite a logic exercise though :-) 
   
   ```cpp
             if (is_valid == count_valid) {
               for (int64_t i = 0; i < it.run_length(); ++i, ++g) {
                 counts[*g] += 1;
               }
             } else {
               g += it.run_length();
             }
    ```
   
   I will change it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #35129: GH-35059: [C++] Fix "hash_count" for run-end encoded inputs

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #35129:
URL: https://github.com/apache/arrow/pull/35129#discussion_r1179829093


##########
cpp/src/arrow/testing/gtest_util.cc:
##########
@@ -427,6 +428,24 @@ std::shared_ptr<Table> TableFromJSON(const std::shared_ptr<Schema>& schema,
   return *Table::FromRecordBatches(schema, std::move(batches));
 }
 
+Result<std::shared_ptr<Table>> RunEndEncodeTableColumns(
+    const Table& table, const std::vector<int>& column_indices) {
+  const int num_columns = table.num_columns();
+  std::vector<std::shared_ptr<ChunkedArray>> encoded_columns;
+  encoded_columns.reserve(num_columns);
+  for (int i = 0; i < num_columns; i++) {
+    if (std::find(column_indices.begin(), column_indices.end(), i) !=
+        column_indices.end()) {
+      EXPECT_OK_AND_ASSIGN(auto run_end_encoded, compute::RunEndEncode(table.column(i)));
+      EXPECT_EQ(run_end_encoded.kind(), Datum::CHUNKED_ARRAY);

Review Comment:
   Isn't this better for things used in tests? These are not really expected to fail, so these checks are more assert-like than regular error checking. I found this to be a pattern in this file.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #35129: GH-35059: [C++] Fix "hash_count" for run-end encoded inputs

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #35129:
URL: https://github.com/apache/arrow/pull/35129#discussion_r1179834592


##########
cpp/src/arrow/testing/gtest_util.cc:
##########
@@ -427,6 +428,24 @@ std::shared_ptr<Table> TableFromJSON(const std::shared_ptr<Schema>& schema,
   return *Table::FromRecordBatches(schema, std::move(batches));
 }
 
+Result<std::shared_ptr<Table>> RunEndEncodeTableColumns(
+    const Table& table, const std::vector<int>& column_indices) {
+  const int num_columns = table.num_columns();
+  std::vector<std::shared_ptr<ChunkedArray>> encoded_columns;
+  encoded_columns.reserve(num_columns);
+  for (int i = 0; i < num_columns; i++) {
+    if (std::find(column_indices.begin(), column_indices.end(), i) !=
+        column_indices.end()) {
+      EXPECT_OK_AND_ASSIGN(auto run_end_encoded, compute::RunEndEncode(table.column(i)));
+      EXPECT_EQ(run_end_encoded.kind(), Datum::CHUNKED_ARRAY);

Review Comment:
   If that's the case then change the return type to `std::shared_ptr<Table>` and change the end to...
   
   ```
   EXPECT_OK_AND_ASSIGN(auto table, Table::Make(...));
   return table;
   ```
   
   I don't really mind one pattern or the other.  It's just the mixing of patterns we should avoid.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #35129: GH-35059: [C++] Fix "hash_count" for run-end encoded inputs

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #35129:
URL: https://github.com/apache/arrow/pull/35129#discussion_r1179837366


##########
cpp/src/arrow/compute/kernels/hash_aggregate.cc:
##########
@@ -305,6 +306,56 @@ struct GroupedCountImpl : public GroupedAggregator {
     return Status::OK();
   }
 
+  template <bool count_valid>
+  struct RunEndEncodedCountImpl {
+    /// Count the number of valid or invalid values in a run-end-encoded array.
+    ///
+    /// \param[in] input the run-end-encoded array
+    /// \param[out] counts the counts being accumulated
+    /// \param[in] g the group ids of the values in the array
+    template <typename RunEndCType>
+    void DoCount(const ArraySpan& input, int64_t* counts, const uint32_t* g) {
+      ree_util::RunEndEncodedArraySpan<RunEndCType> ree_span(input);
+      const auto* physical_validity = ree_util::ValuesArray(input).GetValues<uint8_t>(0);
+      auto end = ree_span.end();
+      for (auto it = ree_span.begin(); it != end; ++it) {
+        const bool is_valid = bit_util::GetBit(physical_validity, it.index_into_array());
+        if constexpr (count_valid) {
+          if (is_valid) {
+            for (int64_t i = 0; i < it.run_length(); ++i, ++g) {
+              counts[*g] += 1;
+            }
+          } else {
+            g += it.run_length();
+          }
+        } else {
+          if (!is_valid) {
+            for (int64_t i = 0; i < it.run_length(); ++i, ++g) {
+              counts[*g] += 1;
+            }
+          } else {
+            g += it.run_length();
+          }
+        }

Review Comment:
   Done.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35129: GH-35059: [C++] Fix "hash_count" for run-end encoded inputs

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35129:
URL: https://github.com/apache/arrow/pull/35129#issuecomment-1508478153

   * Closes: #35059


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35129: GH-35059: [C++] Fix "hash_count" for run-end encoded inputs

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35129:
URL: https://github.com/apache/arrow/pull/35129#issuecomment-1508478227

   :warning: GitHub issue #35059 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org