You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "llama90 (via GitHub)" <gi...@apache.org> on 2023/10/28 13:33:11 UTC

[PR] GH-38490: [C++][Compute] Support ChunkedArray sorting for dictionary type [arrow]

llama90 opened a new pull request, #38502:
URL: https://github.com/apache/arrow/pull/38502

   <!--
   Thanks for opening a pull request!
   If this is your first pull request you can find detailed information on how 
   to contribute here:
     * [New Contributor's Guide](https://arrow.apache.org/docs/dev/developers/guide/step_by_step/pr_lifecycle.html#reviews-and-merge-of-the-pull-request)
     * [Contributing Overview](https://arrow.apache.org/docs/dev/developers/overview.html)
   
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
   
   Opening GitHub issues ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title in the following format?
   
       GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   In the case of PARQUET issues on JIRA the title also supports:
   
       PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   -->
   
   ### Rationale for this change
   
   Support `ChunkedArray` sorting for `DictionaryType`.
   
   By referencing the sorting algorithm for `DictionaryType` for `Array`, modifications were made to **unify** dictionaries stored within among ChunkedArrays, **compute** rankings, and **perform** a merge sort based on the calculated rankings.
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   ### What changes are included in this PR?
   
   * Updated `ChunkedArraySorter` to support `DictionaryType`.
   * Added unit test
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   ### Are these changes tested?
   
   Yes
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   ### Are there any user-facing changes?
   
   No
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please uncomment the line below and explain which changes are breaking.
   -->
   <!-- **This PR includes breaking changes to public APIs.** -->
   
   <!--
   Please uncomment the line below (and provide explanation) if the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld). We use this to highlight fixes to issues that may affect users without their knowledge. For this reason, fixing bugs that cause errors don't count, since those are usually obvious.
   -->
   <!-- **This PR contains a "Critical Fix".** -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [PR] GH-38490: [C++][Compute] Support ChunkedArray sorting for dictionary type [arrow]

Posted by "llama90 (via GitHub)" <gi...@apache.org>.

llama90 commented on PR #38502:
URL: https://github.com/apache/arrow/pull/38502#issuecomment-1803610189

   @js8544 Could you please confirm if this is the correct approach?
   
   Per your guidance, I've implemented changes using the `Visitor Pattern`. The following improvements appear to have been achieved:
   
   * Eliminated the need for branching when sorting Dictionary Types.
   * Removed unnecessary code that was previously part of the branches.
   
   I have some concerns regarding the cleanliness of the code. Additionally, while I have currently added unit tests for `string` and `int32` types, I'm considering extending them to cover all possible ranges. Would that be appropriate?
   
   Your input has been invaluable, and I've learned a great deal from this experience. Thank you. :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [PR] GH-38490: [C++][Compute] Support ChunkedArray sorting for dictionary type [arrow]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.

github-actions[bot] commented on PR #38502:
URL: https://github.com/apache/arrow/pull/38502#issuecomment-1783812152

   :warning: GitHub issue #38490 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [PR] GH-38490: [C++][Compute] Support ChunkedArray sorting for dictionary type [arrow]

Posted by "llama90 (via GitHub)" <gi...@apache.org>.

llama90 commented on PR #38502:
URL: https://github.com/apache/arrow/pull/38502#issuecomment-1783949064

   I noticed the errors in the workflow and am trying to fix them...
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [PR] GH-38490: [C++][Compute] Support ChunkedArray sorting for dictionary type [arrow]

Posted by "js8544 (via GitHub)" <gi...@apache.org>.

js8544 commented on code in PR #38502:
URL: https://github.com/apache/arrow/pull/38502#discussion_r1388920322


##########
cpp/src/arrow/compute/kernels/vector_sort.cc:
##########
@@ -118,13 +115,8 @@ class ChunkedArraySorter : public TypeVisitor {
       };
       auto merge_non_nulls = [&](uint64_t* range_begin, uint64_t* range_middle,
                                  uint64_t* range_end, uint64_t* temp_indices) {
-        if (is_dictionary(*physical_type_)) {
-          MergeNonNulls<DictionaryType>(range_begin, range_middle, range_end, arrays,
-                                        temp_indices);
-        } else {
-          MergeNonNulls<ArrayType>(range_begin, range_middle, range_end, arrays,
-                                   temp_indices);
-        }
+        MergeNonNulls<ArrayType>(range_begin, range_middle, range_end, arrays,

Review Comment:
   I'm sorry my intentions were not clear. I agree with your `ChunkedArraySorter::MergeNonNulls<DictionaryType>` approach, but you don't need to have a `if` clause to dispatch it, because `MergeNonNulls<ArrayType>` should be enough to dispatch it (here ArrayType==DictionaryType).



##########
cpp/src/arrow/compute/kernels/vector_sort.cc:
##########
@@ -236,67 +190,6 @@ class ChunkedArraySorter : public TypeVisitor {
   NullPartitionResult* output_;
 };
 
-template <>
-void ChunkedArraySorter::MergeNonNulls<DictionaryType>(

Review Comment:
   Here's an example of my intended idea (may not work as is, and may not even compile):
   ```cpp
   class DictionaryMergeSorter {
     template <typename T>
     std::enable_if_t<has_string_view<T>::value || has_c_type<T>::value, Status> Visit(
         const T& type, const SortOrder order, uint64_t* range_begin, uint64_t* range_middle,
         uint64_t* range_end, const std::vector<const Array*>& arrays,
         uint64_t* temp_indices) {
       using ArrayType = typename TypeTraits<T>::ArrayType;
   
       // This lambda functions should probably be refactored to codegen_internal.h or
       // chunked_internal.h
       auto GetView = [](const Array& array, uint64_t index) {
         const auto& dictionary_array = checked_cast<const DictionaryArray&>(array);
         // GetValueIndex is expensive and can be further optimized
         int64_t value_index = dictionary_array.GetValueIndex(index);
         return checked_cast<const ArrayType&>(*dictionary_array.dictionary())
             .GetView(value_index);
       };
   
       const ChunkedArrayResolver left_resolver(arrays);
       const ChunkedArrayResolver right_resolver(arrays);
       if (order == SortOrder::Ascending) {
         std::merge(range_begin, range_middle, range_middle, range_end, temp_indices,
                    [&](uint64_t left, uint64_t right) {
                      const auto chunk_left = left_resolver.Resolve<Array>(left);
                      const auto chunk_right = right_resolver.Resolve<Array>(right);
                      auto left_value = GetView(*chunk_left.array, chunk_left.index);
                      auto right_value = GetView(*chunk_right.array, chunk_right.index);
                      return left_value < right_value;
                    });
       } else {
         std::merge(range_begin, range_middle, range_middle, range_end, temp_indices,
                    [&](uint64_t left, uint64_t right) {
                      const auto chunk_left = left_resolver.Resolve<Array>(left);
                      const auto chunk_right = right_resolver.Resolve<Array>(right);
                      auto left_value = GetView(*chunk_left.array, chunk_left.index);
                      auto right_value = GetView(*chunk_right.array, chunk_right.index);
                      return left_value > right_value;
                    });
       }
     }
   
     Status Visit(const DataType& type, const SortOrder order, uint64_t* range_begin,
                  uint64_t* range_middle, uint64_t* range_end,
                  const std::vector<const Array*>& arrays, uint64_t* temp_indices) {
       return Status::TypeError("Unsupported type for Dictionary sorting: ",
                                type.ToString());
     }
   
    public:
     Status Merge(const SortOrder order, uint64_t* range_begin, uint64_t* range_middle,
                  uint64_t* range_end, const std::vector<const Array*>& arrays,
                  uint64_t* temp_indices) {
       auto dict_value_type =
           checked_cast<const DictionaryArray&>(arrays[0]).dictionary()->type();
       return VisitTypeInline(*dict_value_type, this, range_begin, range_middle, range_end,
                              arrays, temp_indices);
     }
   };
   
   template <>
   Status ChunkedArraySorter::MergeNonNulls<DictionaryType>(
       uint64_t* range_begin, uint64_t* range_middle, uint64_t* range_end,
       const std::vector<const Array*>& arrays, uint64_t* temp_indices) {
     DictionaryMergeSorter sorter;
     return sorter.Merge(order_, range_begin, range_middle, range_end, arrays, temp_indices);
   }
   ```
   In this way there is one type dispatch per chunk, rather than one per element when using a variant. The dispatching can even be moved up further, but I'm not sure how to do it elegantly.



##########
cpp/src/arrow/array/array_dict.cc:
##########
@@ -78,6 +78,13 @@ int64_t DictionaryArray::GetValueIndex(int64_t i) const {
   }
 }
 
+ViewType DictionaryArray::GetView(int64_t i) const {

Review Comment:
   I'm sorry that my suggestions were not clear. I meant to suggest a visitor pattern in the sort kernel implementation, not in DictionaryArray definition. 
   
   The dictionary value can be of any arrow type and not all types have a view. For example, `dictionary<list<string>>` can't have a proper view type because lists don't have view types. 
   
   And having a view of a variant type may be slow. Because variant's comparison is a dynamic dispatching process and it will happen for each pair of values. We can move the dispatching higher up to make it faster. I'll give an example below.



##########
cpp/src/arrow/array/array_dict.h:
##########
@@ -20,17 +20,23 @@
 #include <cstdint>
 #include <memory>
 
+#include "arrow/array.h"
 #include "arrow/array/array_base.h"
+#include "arrow/array/array_primitive.h"
 #include "arrow/array/data.h"
 #include "arrow/result.h"
 #include "arrow/scalar.h"
 #include "arrow/status.h"
 #include "arrow/type.h"
+#include "arrow/util/checked_cast.h"
+#include "arrow/util/logging.h"
 #include "arrow/util/macros.h"
 #include "arrow/util/visibility.h"
 
 namespace arrow {
 
+using ViewType = std::variant<std::string_view, int32_t>;

Review Comment:
   If we use a variant then it's going to have so many variants that it's essentially the `Scalar` class



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [PR] GH-38490: [C++][Compute] Support ChunkedArray sorting for dictionary type [arrow]

Posted by "llama90 (via GitHub)" <gi...@apache.org>.

llama90 commented on PR #38502:
URL: https://github.com/apache/arrow/pull/38502#issuecomment-1790499859

   Hello… I would like you to check my code in order for me to make any progress further.
   I would really appreciate your help in advance.
   
   cc @pitrou @js8544 @westonpace 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [PR] GH-38490: [C++][Compute] Support ChunkedArray sorting for dictionary type [arrow]

Posted by "llama90 (via GitHub)" <gi...@apache.org>.

llama90 commented on PR #38502:
URL: https://github.com/apache/arrow/pull/38502#issuecomment-1794899155

   @js8544 Thank you for your review!! I'll take your advice and try to improve it. Thank you once again!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [PR] GH-38490: [C++][Compute] Support ChunkedArray sorting for dictionary type [arrow]

Posted by "llama90 (via GitHub)" <gi...@apache.org>.

llama90 commented on PR #38502:
URL: https://github.com/apache/arrow/pull/38502#issuecomment-1803549940

   Some outdated reviews have been resolved.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [PR] GH-38490: [C++][Compute] Support ChunkedArray sorting for dictionary type [arrow]

Posted by "llama90 (via GitHub)" <gi...@apache.org>.

llama90 commented on PR #38502:
URL: https://github.com/apache/arrow/pull/38502#issuecomment-1784072583

   I believe I've resolved the C++ related issues in the `workflow`, so I've reopened the PR. I've also commented on the key parts I'd like to have reviewed in detail.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [PR] GH-38490: [C++][Compute] Support ChunkedArray sorting for dictionary type [arrow]

Posted by "js8544 (via GitHub)" <gi...@apache.org>.

js8544 commented on code in PR #38502:
URL: https://github.com/apache/arrow/pull/38502#discussion_r1383103558


##########
cpp/src/arrow/array/array_dict.h:
##########
@@ -111,6 +112,14 @@ class ARROW_EXPORT DictionaryArray : public Array {
   /// value is null or out-of-bounds.
   int64_t GetValueIndex(int64_t i) const;
 
+  std::string GetView(int64_t i) const {
+    // Obtain the index at position i
+    const auto index = GetValueIndex(i);
+
+    // Assuming the dictionary is a String

Review Comment:
   We can't really assume the dictionary is a string here as Arrow's Dictionary type can support any type.



##########
cpp/src/arrow/compute/kernels/codegen_internal.h:
##########
@@ -187,6 +187,14 @@ struct GetOutputType<Decimal256Type> {
   using T = Decimal256;
 };
 
+template <>
+struct GetViewType<DictionaryType> {

Review Comment:
   Again, we can't assume a string dictionary. It's probably necessary to dispatch your implementation based on the concrete dictionary type. See the [visitor pattern](https://arrow.apache.org/docs/cpp/datatypes.html#visitor-pattern) and the [`GetArraySorter`](https://github.com/apache/arrow/blob/main/cpp/src/arrow/compute/kernels/vector_array_sort.cc#L665) function for an example.



##########
cpp/src/arrow/array/array_dict.h:
##########
@@ -111,6 +112,14 @@ class ARROW_EXPORT DictionaryArray : public Array {
   /// value is null or out-of-bounds.
   int64_t GetValueIndex(int64_t i) const;
 
+  std::string GetView(int64_t i) const {
+    // Obtain the index at position i
+    const auto index = GetValueIndex(i);
+
+    // Assuming the dictionary is a String
+    return dictionary()->GetScalar(index).ValueOrDie()->ToString();

Review Comment:
   We don't use `ValueOrDie` except in tests or benchmarks because we don't want users' program to crash when they depend on Arrow. We generally use `ARROW_ASSIGN_OR_RAISE` and `ARROW_RETURN_NOT_OK` to propagate the `Status` to the caller.



##########
cpp/src/arrow/compute/kernels/vector_sort.cc:
##########
@@ -190,6 +236,67 @@ class ChunkedArraySorter : public TypeVisitor {
   NullPartitionResult* output_;
 };
 
+template <>
+void ChunkedArraySorter::MergeNonNulls<DictionaryType>(
+    uint64_t* range_begin, uint64_t* range_middle, uint64_t* range_end,
+    const std::vector<const Array*>& arrays, uint64_t* temp_indices) {
+  const ChunkedArrayResolver left_resolver(arrays);
+  const ChunkedArrayResolver right_resolver(arrays);
+
+  // concatenate all dictionary arrays to calculate rank
+  ArrayVector dicts_array;
+  for (const auto& array : arrays) {
+    const auto& dicts = checked_cast<const DictionaryArray&>(*array);
+    dicts_array.push_back(dicts.dictionary());
+  }
+  auto concat_dict = Concatenate(dicts_array).ValueOrDie();
+  auto ranks = RanksWithNulls(concat_dict).ValueOrDie();
+
+  // build unified rank map using dicts and rank array
+  std::unordered_map<std::string, int64_t> unified_rank_map;
+  for (int i = 0; i < concat_dict->length(); i++) {
+    auto dict = concat_dict->GetScalar(i).ValueOrDie();

Review Comment:
   We tend to avoid GetScalar because it's expensive (It unwraps the array as its real type, gets the value, and wraps it as the scalar class again). Instead we often do `checked_cast<ConcreteArrayType&>(*arr)->GetView(i)` to directly get its ctype value.



##########
cpp/src/arrow/compute/kernels/vector_sort.cc:
##########
@@ -115,8 +118,13 @@ class ChunkedArraySorter : public TypeVisitor {
       };
       auto merge_non_nulls = [&](uint64_t* range_begin, uint64_t* range_middle,
                                  uint64_t* range_end, uint64_t* temp_indices) {
-        MergeNonNulls<ArrayType>(range_begin, range_middle, range_end, arrays,
-                                 temp_indices);
+        if (is_dictionary(*physical_type_)) {
+          MergeNonNulls<DictionaryType>(range_begin, range_middle, range_end, arrays,

Review Comment:
   Can we merge these two branches? Is DictionaryType different from ArrayType?



##########
cpp/src/arrow/compute/kernels/vector_sort.cc:
##########
@@ -179,6 +187,44 @@ class ChunkedArraySorter : public TypeVisitor {
     std::copy(temp_indices, temp_indices + (range_end - range_begin), range_begin);
   }
 
+  // Duplicate of ArrayCompareSorter
+  static Result<std::shared_ptr<Array>> RanksWithNulls(
+      const std::shared_ptr<Array>& array) {
+    // Notes:
+    // * The order is always ascending here, since the goal is to produce
+    //   an exactly-equivalent-order of the dictionary values.
+    // * We're going to re-emit nulls in the output, so we can just always consider
+    //   them "at the end".  Note that choosing AtStart would merely shift other
+    //   ranks by 1 if there are any nulls...
+    RankOptions rank_options(SortOrder::Ascending, NullPlacement::AtEnd,
+                             RankOptions::Dense);
+
+    auto data = array->data();
+    std::shared_ptr<Buffer> null_bitmap;
+    if (array->null_count() > 0) {
+      null_bitmap = array->null_bitmap();
+      data = array->data()->Copy();
+      if (data->offset > 0) {
+        ARROW_ASSIGN_OR_RAISE(
+            null_bitmap,
+            arrow::internal::CopyBitmap(arrow::default_memory_pool(), null_bitmap->data(),
+                                        data->offset, data->length));
+      }
+      data->buffers[0] = nullptr;
+      data->null_count = 0;
+    }
+    ARROW_ASSIGN_OR_RAISE(auto rank_datum,
+                          CallFunction("rank", {std::move(data)}, &rank_options));

Review Comment:
   I think using rank here is expensive and perhaps unnecessary? IIUC The rank is only used to compare two elements during merge sorting. Can we compare them by their values directly? Since `rank` internally also sorts the array, we are essentially sorting it twice.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org