You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "benibus (via GitHub)" <gi...@apache.org> on 2023/04/21 22:29:57 UTC

[GitHub] [arrow] benibus opened a new pull request, #35280: GH-29887: [C++] Implement dictionary array sorting

benibus opened a new pull request, #35280:
URL: https://github.com/apache/arrow/pull/35280

   <!--
   Thanks for opening a pull request!
   If this is your first pull request you can find detailed information on how 
   to contribute here:
     * [New Contributor's Guide](https://arrow.apache.org/docs/dev/developers/guide/step_by_step/pr_lifecycle.html#reviews-and-merge-of-the-pull-request)
     * [Contributing Overview](https://arrow.apache.org/docs/dev/developers/overview.html)
   
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
   
   Opening GitHub issues ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title in the following format?
   
       GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   In the case of PARQUET issues on JIRA the title also supports:
   
       PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   -->
   
   ### Rationale for this change
   
   Sorting for `DictionaryArray`s is not currently supported.
   
   ### What changes are included in this PR?
   
   - Adds support for dictionaries in the `array_sort_indices` kernel
   - Adds tests and benchmarks
   - Alters the internal `ArraySortFunc` definition to return an error status and accept the caller's `ExecContext` as an argument.
   - Adds small utility method `IsNull` to the public `DictionaryArray` class.
   
   ### Are these changes tested?
   
   Yes
   
   ### Are there any user-facing changes?
   
   Yes
   
   <!--
   If there are any breaking changes to public APIs, please uncomment the line below and explain which changes are breaking.
   -->
   <!-- **This PR includes breaking changes to public APIs.** -->
   
   <!--
   Please uncomment the line below (and provide explanation) if the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld). We use this to highlight fixes to issues that may affect users without their knowledge. For this reason, fixing bugs that cause errors don't count, since those are usually obvious.
   -->
   <!-- **This PR contains a "Critical Fix".** -->
   
   ### Notes
   
   This picks up where https://github.com/apache/arrow/pull/13334 left off. Those commits have been squashed and included in this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #35280: GH-29887: [C++] Implement dictionary array sorting

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35280:
URL: https://github.com/apache/arrow/pull/35280#issuecomment-1550047309

   Just for the record, it's a pity that the all-or-mostly-null case is a bit slower:
   ```
   ArraySortIndicesInt64WideDict/32768/10000         254 us          254 us         2756 bytes_per_second=122.985M/s items_per_second=16.1199M/s null_percent=0.01 size=32.768k
   ArraySortIndicesInt64WideDict/32768/100           272 us          272 us         2615 bytes_per_second=115.005M/s items_per_second=15.0739M/s null_percent=1 size=32.768k
   ArraySortIndicesInt64WideDict/32768/10            324 us          324 us         2162 bytes_per_second=96.5235M/s items_per_second=12.6515M/s null_percent=10 size=32.768k
   ArraySortIndicesInt64WideDict/32768/2             386 us          386 us         1811 bytes_per_second=80.9335M/s items_per_second=10.6081M/s null_percent=50 size=32.768k
   ArraySortIndicesInt64WideDict/32768/1             357 us          357 us         1956 bytes_per_second=87.483M/s items_per_second=11.4666M/s null_percent=100 size=32.768k
   ArraySortIndicesInt64WideDict/32768/0             254 us          254 us         2751 bytes_per_second=122.957M/s items_per_second=16.1163M/s null_percent=0 size=32.768k
   ArraySortIndicesInt64WideDict/1048576/100        3398 us         3396 us          205 bytes_per_second=294.442M/s items_per_second=38.593M/s null_percent=1 size=1048.58k
   ArraySortIndicesInt64WideDict/8388608/100       31369 us        31341 us           22 bytes_per_second=255.253M/s items_per_second=33.4566M/s null_percent=1 size=8.38861M
   ```
   
   This can be left to another issue or PR, though. Overall the improvement is very nice, especially for strings.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #35280: GH-29887: [C++] Implement dictionary array sorting

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #35280:
URL: https://github.com/apache/arrow/pull/35280#issuecomment-1569138597

   Benchmark runs are scheduled for baseline = f3500f65c0b15e285cba2b1822385248b3424f84 and contender = 6ceb12f700aa1e4dcf6724e6bb1e2e714df46124. 6ceb12f700aa1e4dcf6724e6bb1e2e714df46124 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/d1c4c1f03c294ee4ba6354fc910f5618...b6784a310fcc4d97a49e4f33b4fb801f/)
   [Finished :arrow_down:0.56% :arrow_up:0.36%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/63878470fbfa44ea83079d73af2c6549...395700bedb034d948c4adfb39e009124/)
   [Finished :arrow_down:2.94% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/3d76cb5ffb8849bf8c3ea9b32d08b3b7...578b21fce98d46d0a7dc293eccfb7692/)
   [Finished :arrow_down:0.42% :arrow_up:0.24%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/385dcbb492f44244b6edae507d6d8018...4181bc5123224956ba4cff425b6b81e5/)
   Buildkite builds:
   [Finished] [`6ceb12f7` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2916)
   [Finished] [`6ceb12f7` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2952)
   [Finished] [`6ceb12f7` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2917)
   [Finished] [`6ceb12f7` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2942)
   [Finished] [`f3500f65` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2915)
   [Finished] [`f3500f65` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2951)
   [Finished] [`f3500f65` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2916)
   [Finished] [`f3500f65` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2941)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on a diff in pull request #35280: GH-29887: [C++] Implement dictionary array sorting

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on code in PR #35280:
URL: https://github.com/apache/arrow/pull/35280#discussion_r1185301484


##########
cpp/src/arrow/compute/kernels/vector_array_sort.cc:
##########
@@ -173,6 +173,116 @@ class ArrayCompareSorter {
   }
 };
 
+template <>
+class ArrayCompareSorter<DictionaryType> {
+ public:
+  Result<NullPartitionResult> operator()(uint64_t* indices_begin, uint64_t* indices_end,
+                                         const Array& array, int64_t offset,
+                                         const ArraySortOptions& options,
+                                         ExecContext* ctx) {
+    const auto& dict_array = checked_cast<const DictionaryArray&>(array);
+    // TODO: These methods should probably return a const&? They seem capable.

Review Comment:
   https://github.com/apache/arrow/issues/35437 (also included in the TODO)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou merged pull request #35280: GH-29887: [C++] Implement dictionary array sorting

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou merged PR #35280:
URL: https://github.com/apache/arrow/pull/35280


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35280: GH-29887: [C++] Implement dictionary array sorting

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35280:
URL: https://github.com/apache/arrow/pull/35280#issuecomment-1518392025

   :warning: GitHub issue #29887 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on pull request #35280: GH-29887: [C++] Implement dictionary array sorting

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on PR #35280:
URL: https://github.com/apache/arrow/pull/35280#issuecomment-1553757556

   After some ad-hoc testing, the benchmark' array lengths/sizes should be essentially the same now between the dict and non-dict versions. For strings, the dict versions _slightly_ overshoot the others (in terms of bytes) due to some null overlap between the index/value arrays.
   
   I also added some special-casing for the 100% null benchmarks - primarily to ensure that the input `DictionaryArray` was actually 0 bytes (like the `StringArray` equivalents). Caveat: it only tests the best-case scenario - i.e. where the index and/or value array is entirely null.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on a diff in pull request #35280: GH-29887: [C++] Implement dictionary array sorting

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on code in PR #35280:
URL: https://github.com/apache/arrow/pull/35280#discussion_r1200666243


##########
cpp/src/arrow/compute/kernels/vector_sort_test.cc:
##########
@@ -368,6 +368,138 @@ TEST(ArraySortIndicesFunction, Array) {
   AssertDatumsEqual(expected, actual, /*verbose=*/true);
 }
 
+TEST(ArraySortIndicesFunction, DictionaryArray) {

Review Comment:
   Sorry, do you mean a `DictionaryArray` where the values array has a non-zero offset - or something else? The RandomDictionaryArray test should cover that case, I think.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on a diff in pull request #35280: GH-29887: [C++] Implement dictionary array sorting

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on code in PR #35280:
URL: https://github.com/apache/arrow/pull/35280#discussion_r1181839716


##########
cpp/src/arrow/compute/kernels/vector_array_sort.cc:
##########
@@ -173,6 +173,116 @@ class ArrayCompareSorter {
   }
 };
 
+template <>
+class ArrayCompareSorter<DictionaryType> {
+ public:
+  Result<NullPartitionResult> operator()(uint64_t* indices_begin, uint64_t* indices_end,
+                                         const Array& array, int64_t offset,
+                                         const ArraySortOptions& options,
+                                         ExecContext* ctx) {
+    const auto& dict_array = checked_cast<const DictionaryArray&>(array);
+    // TODO: These methods should probably return a const&? They seem capable.
+    auto dict_values = dict_array.dictionary();
+    auto dict_indices = dict_array.indices();
+
+    // Algorithm:
+    // 1) Use the Rank function to get an exactly-equivalent-order array
+    //    of the dictionary values, but with a datatype that's friendlier to
+    //    sorting (uint64).
+    // 2) Act as if we were sorting a dictionary array with the same indices,
+    //    but with the ranks as dictionary values.
+    // 2a) Dictionary-decode the ranks by calling Take.
+    // 2b) Sort the decoded ranks. Not only those are uint64, they are dense
+    //     in a [0, k) range where k is the number of unique dictionary values.
+    //     Therefore, unless the dictionary is very large, a fast counting sort
+    //     will be used.
+    //
+    // The bottom line is that performance will usually be much better
+    // (potentially an order of magnitude faster) than by naively decoding
+    // the original dictionary and sorting the decoded version.
+
+    std::shared_ptr<Array> decoded_ranks;
+    // Skip the rank/take steps for cases with only nulls
+    if (IsAllNulls(dict_array)) {
+      ARROW_ASSIGN_OR_RAISE(decoded_ranks,
+                            MakeNullUInt64Array(dict_array.length(), ctx->memory_pool()));
+    } else {
+      ARROW_ASSIGN_OR_RAISE(auto ranks, RanksWithNulls(dict_values, ctx));
+
+      ARROW_ASSIGN_OR_RAISE(decoded_ranks,
+                            Take(*ranks, *dict_indices, TakeOptions::Defaults(), ctx));
+    }
+
+    DCHECK_EQ(decoded_ranks->type_id(), Type::UINT64);
+    DCHECK_EQ(decoded_ranks->length(), dict_array.length());
+    ARROW_ASSIGN_OR_RAISE(auto rank_sorter, GetArraySorter(*decoded_ranks->type()));
+
+    return rank_sorter(indices_begin, indices_end, *decoded_ranks, offset, options, ctx);
+  }
+
+ private:
+  static bool IsAllNulls(const DictionaryArray& dict_array) {
+    auto indices = dict_array.indices();
+    auto values = dict_array.dictionary();
+    if (values->null_count() == values->length() ||
+        indices->null_count() == indices->length()) {
+      return true;
+    }
+    for (int64_t i = 0; i < dict_array.length(); ++i) {
+      // TODO: Special handling for dictionary types in Array::IsValid/Null?
+      if (indices->IsValid(i) && values->IsValid(dict_array.GetValueIndex(i))) {
+        return false;
+      }
+    }
+    return true;
+  }
+
+  static Result<std::shared_ptr<Array>> MakeNullUInt64Array(int64_t length,
+                                                            MemoryPool* pool) {
+    ARROW_ASSIGN_OR_RAISE(auto bitmap, AllocateEmptyBitmap(length, pool));
+    auto data = ArrayData::Make(uint64(), length, {std::move(bitmap), nullptr},
+                                /*null_count=*/length);
+    return MakeArray(data);
+  }
+
+  static Result<std::shared_ptr<Array>> RanksWithNulls(
+      const std::shared_ptr<Array>& array, ExecContext* ctx) {
+    // Notes:
+    // * The order is always ascending here, since the goal is to produce
+    //   an exactly-equivalent-order of the dictionary values.
+    // * We're going to re-emit nulls in the output, so we can just always consider
+    //   them "at the end".  Note that choosing AtStart would merely shift other
+    //   ranks by 1 if there are any nulls...
+    RankOptions rank_options(SortOrder::Ascending, NullPlacement::AtEnd,
+                             RankOptions::Dense);
+
+    // XXX Should this support Type::NA?
+    auto data = array->data();
+    std::shared_ptr<Buffer> null_bitmap;
+    if (array->null_count() > 0) {
+      null_bitmap = array->null_bitmap();
+      data = array->data()->Copy();
+      if (data->offset > 0) {
+        ARROW_ASSIGN_OR_RAISE(null_bitmap, arrow::internal::CopyBitmap(
+                                               ctx->memory_pool(), null_bitmap->data(),
+                                               data->offset, data->length));
+      }
+      data->buffers[0] = nullptr;
+      data->null_count = 0;
+    }

Review Comment:
   Apparently not... (it should be passed to `rank`). Will fix.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on a diff in pull request #35280: GH-29887: [C++] Implement dictionary array sorting

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on code in PR #35280:
URL: https://github.com/apache/arrow/pull/35280#discussion_r1181839289


##########
cpp/src/arrow/compute/kernels/vector_array_sort.cc:
##########
@@ -173,6 +173,116 @@ class ArrayCompareSorter {
   }
 };
 
+template <>
+class ArrayCompareSorter<DictionaryType> {
+ public:
+  Result<NullPartitionResult> operator()(uint64_t* indices_begin, uint64_t* indices_end,
+                                         const Array& array, int64_t offset,
+                                         const ArraySortOptions& options,
+                                         ExecContext* ctx) {
+    const auto& dict_array = checked_cast<const DictionaryArray&>(array);
+    // TODO: These methods should probably return a const&? They seem capable.
+    auto dict_values = dict_array.dictionary();
+    auto dict_indices = dict_array.indices();
+
+    // Algorithm:
+    // 1) Use the Rank function to get an exactly-equivalent-order array
+    //    of the dictionary values, but with a datatype that's friendlier to
+    //    sorting (uint64).
+    // 2) Act as if we were sorting a dictionary array with the same indices,
+    //    but with the ranks as dictionary values.
+    // 2a) Dictionary-decode the ranks by calling Take.
+    // 2b) Sort the decoded ranks. Not only those are uint64, they are dense
+    //     in a [0, k) range where k is the number of unique dictionary values.
+    //     Therefore, unless the dictionary is very large, a fast counting sort
+    //     will be used.
+    //
+    // The bottom line is that performance will usually be much better
+    // (potentially an order of magnitude faster) than by naively decoding
+    // the original dictionary and sorting the decoded version.
+
+    std::shared_ptr<Array> decoded_ranks;
+    // Skip the rank/take steps for cases with only nulls
+    if (IsAllNulls(dict_array)) {
+      ARROW_ASSIGN_OR_RAISE(decoded_ranks,
+                            MakeNullUInt64Array(dict_array.length(), ctx->memory_pool()));
+    } else {
+      ARROW_ASSIGN_OR_RAISE(auto ranks, RanksWithNulls(dict_values, ctx));
+
+      ARROW_ASSIGN_OR_RAISE(decoded_ranks,
+                            Take(*ranks, *dict_indices, TakeOptions::Defaults(), ctx));
+    }
+
+    DCHECK_EQ(decoded_ranks->type_id(), Type::UINT64);
+    DCHECK_EQ(decoded_ranks->length(), dict_array.length());
+    ARROW_ASSIGN_OR_RAISE(auto rank_sorter, GetArraySorter(*decoded_ranks->type()));
+
+    return rank_sorter(indices_begin, indices_end, *decoded_ranks, offset, options, ctx);
+  }
+
+ private:
+  static bool IsAllNulls(const DictionaryArray& dict_array) {
+    auto indices = dict_array.indices();
+    auto values = dict_array.dictionary();
+    if (values->null_count() == values->length() ||
+        indices->null_count() == indices->length()) {
+      return true;
+    }
+    for (int64_t i = 0; i < dict_array.length(); ++i) {
+      // TODO: Special handling for dictionary types in Array::IsValid/Null?

Review Comment:
   Yeah, basically just an equivalent of the next few lines. In practice, it would need to be added to `Array` itself since `IsValid` isn't virtual.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35280: GH-29887: [C++] Implement dictionary array sorting

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35280:
URL: https://github.com/apache/arrow/pull/35280#discussion_r1200675007


##########
cpp/src/arrow/compute/kernels/vector_sort_test.cc:
##########
@@ -368,6 +368,138 @@ TEST(ArraySortIndicesFunction, Array) {
   AssertDatumsEqual(expected, actual, /*verbose=*/true);
 }
 
+TEST(ArraySortIndicesFunction, DictionaryArray) {

Review Comment:
   Oh, I had missed that. Yes, it's certainly enough! Thank you.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on pull request #35280: GH-29887: [C++] Implement dictionary array sorting

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on PR #35280:
URL: https://github.com/apache/arrow/pull/35280#issuecomment-1557538851

   I will note that the string benchmark numbers are lower with the latest benchmarking changes (which reduce the actual string array length), but that reflects the non-trivial fixed costs in the sorting routine:
   ```
   ArraySortIndicesStringWide/32768/10000            585 us          584 us         1202 bytes_per_second=53.4765M/s items_per_second=1.75232M/s null_percent=0.01 size=32.768k
   ArraySortIndicesStringWide/32768/100              606 us          605 us         1169 bytes_per_second=51.627M/s items_per_second=1.70824M/s null_percent=1 size=32.768k
   ArraySortIndicesStringWide/32768/10               610 us          610 us         1145 bytes_per_second=51.2671M/s items_per_second=1.86694M/s null_percent=10 size=32.768k
   ArraySortIndicesStringWide/32768/2                653 us          652 us         1072 bytes_per_second=47.8979M/s items_per_second=3.13904M/s null_percent=50 size=32.768k
   ArraySortIndicesStringWide/32768/1               79.4 us         79.4 us         8561 bytes_per_second=393.738M/s items_per_second=12.902M/s null_percent=100 size=32.768k
   ArraySortIndicesStringWide/32768/0                580 us          579 us         1203 bytes_per_second=53.9836M/s items_per_second=1.76893M/s null_percent=0 size=32.768k
   ArraySortIndicesStringWide/1048576/100          25200 us        25175 us           28 bytes_per_second=39.7218M/s items_per_second=1.31475M/s null_percent=1 size=1048.58k
   ArraySortIndicesStringWide/8388608/100         239121 us       238851 us            3 bytes_per_second=33.4937M/s items_per_second=1.10861M/s null_percent=1 size=8.38861M
   ArraySortIndicesStringWideDict/32768/10000        190 us          190 us         3683 bytes_per_second=164.852M/s items_per_second=5.40187M/s null_percent=0.01 size=32.768k
   ArraySortIndicesStringWideDict/32768/100          195 us          195 us         3591 bytes_per_second=160.284M/s items_per_second=5.30347M/s null_percent=1 size=32.768k
   ArraySortIndicesStringWideDict/32768/10           216 us          216 us         3244 bytes_per_second=144.701M/s items_per_second=5.26945M/s null_percent=10 size=32.768k
   ArraySortIndicesStringWideDict/32768/2            278 us          278 us         2504 bytes_per_second=112.403M/s items_per_second=7.36645M/s null_percent=50 size=32.768k
   ArraySortIndicesStringWideDict/32768/1            108 us          108 us         6456 bytes_per_second=288.614M/s items_per_second=9.45732M/s null_percent=100 size=32.768k
   ArraySortIndicesStringWideDict/32768/0            190 us          190 us         3676 bytes_per_second=164.204M/s items_per_second=5.38062M/s null_percent=0 size=32.768k
   ArraySortIndicesStringWideDict/1048576/100       1005 us         1005 us          695 bytes_per_second=995.211M/s items_per_second=32.9405M/s null_percent=1 size=1048.58k
   ArraySortIndicesStringWideDict/8388608/100       6826 us         6823 us          102 bytes_per_second=1.14501G/s items_per_second=38.8083M/s null_percent=1 size=8.38861M
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] felipecrv commented on a diff in pull request #35280: GH-29887: [C++] Implement dictionary array sorting

Posted by "felipecrv (via GitHub)" <gi...@apache.org>.
felipecrv commented on code in PR #35280:
URL: https://github.com/apache/arrow/pull/35280#discussion_r1183119986


##########
cpp/src/arrow/compute/kernels/vector_array_sort.cc:
##########
@@ -173,6 +173,116 @@ class ArrayCompareSorter {
   }
 };
 
+template <>
+class ArrayCompareSorter<DictionaryType> {
+ public:
+  Result<NullPartitionResult> operator()(uint64_t* indices_begin, uint64_t* indices_end,
+                                         const Array& array, int64_t offset,
+                                         const ArraySortOptions& options,
+                                         ExecContext* ctx) {
+    const auto& dict_array = checked_cast<const DictionaryArray&>(array);
+    // TODO: These methods should probably return a const&? They seem capable.
+    auto dict_values = dict_array.dictionary();
+    auto dict_indices = dict_array.indices();
+
+    // Algorithm:
+    // 1) Use the Rank function to get an exactly-equivalent-order array
+    //    of the dictionary values, but with a datatype that's friendlier to
+    //    sorting (uint64).
+    // 2) Act as if we were sorting a dictionary array with the same indices,
+    //    but with the ranks as dictionary values.
+    // 2a) Dictionary-decode the ranks by calling Take.
+    // 2b) Sort the decoded ranks. Not only those are uint64, they are dense
+    //     in a [0, k) range where k is the number of unique dictionary values.
+    //     Therefore, unless the dictionary is very large, a fast counting sort
+    //     will be used.
+    //
+    // The bottom line is that performance will usually be much better
+    // (potentially an order of magnitude faster) than by naively decoding
+    // the original dictionary and sorting the decoded version.
+
+    std::shared_ptr<Array> decoded_ranks;
+    // Skip the rank/take steps for cases with only nulls
+    if (IsAllNulls(dict_array)) {
+      ARROW_ASSIGN_OR_RAISE(decoded_ranks,
+                            MakeNullUInt64Array(dict_array.length(), ctx->memory_pool()));
+    } else {
+      ARROW_ASSIGN_OR_RAISE(auto ranks, RanksWithNulls(dict_values, ctx));
+
+      ARROW_ASSIGN_OR_RAISE(decoded_ranks,
+                            Take(*ranks, *dict_indices, TakeOptions::Defaults(), ctx));
+    }
+
+    DCHECK_EQ(decoded_ranks->type_id(), Type::UINT64);
+    DCHECK_EQ(decoded_ranks->length(), dict_array.length());
+    ARROW_ASSIGN_OR_RAISE(auto rank_sorter, GetArraySorter(*decoded_ranks->type()));
+
+    return rank_sorter(indices_begin, indices_end, *decoded_ranks, offset, options, ctx);
+  }
+
+ private:
+  static bool IsAllNulls(const DictionaryArray& dict_array) {
+    auto indices = dict_array.indices();
+    auto values = dict_array.dictionary();
+    if (values->null_count() == values->length() ||
+        indices->null_count() == indices->length()) {
+      return true;
+    }
+    for (int64_t i = 0; i < dict_array.length(); ++i) {
+      // TODO: Special handling for dictionary types in Array::IsValid/Null?

Review Comment:
   `Array::IsValid` is testing the type in conditionals to dispatch call to the appropriate implementation. As the special cases are less than ~7, this way of dispatching the calls is better than vtable dispatching [1] and with #35149 the compiler can even inline the conditional checks away.
   
   [1] https://twitter.com/_Felipe/status/1645237327780880385



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] benibus commented on pull request #35280: GH-29887: [C++] Implement dictionary array sorting

Posted by "benibus (via GitHub)" <gi...@apache.org>.
benibus commented on PR #35280:
URL: https://github.com/apache/arrow/pull/35280#issuecomment-1535161400

   I did some refactoring/reorganizing of the benchmarks so the new ones should hopefully make more sense. The original intent should still be intact (although I changed some of the string length parameters to be consistent with the existing non-dictionary string benchmarks).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35280: GH-29887: [C++] Implement dictionary array sorting

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35280:
URL: https://github.com/apache/arrow/pull/35280#discussion_r1200251175


##########
cpp/src/arrow/compute/kernels/vector_sort_test.cc:
##########
@@ -368,6 +368,138 @@ TEST(ArraySortIndicesFunction, Array) {
   AssertDatumsEqual(expected, actual, /*verbose=*/true);
 }
 
+TEST(ArraySortIndicesFunction, DictionaryArray) {

Review Comment:
   Perhaps add a test with a non-zero array indices and/or values offset somewhere?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #35280: GH-29887: [C++] Implement dictionary array sorting

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #35280:
URL: https://github.com/apache/arrow/pull/35280#discussion_r1181732100


##########
cpp/src/arrow/compute/kernels/vector_array_sort.cc:
##########
@@ -173,6 +173,116 @@ class ArrayCompareSorter {
   }
 };
 
+template <>
+class ArrayCompareSorter<DictionaryType> {
+ public:
+  Result<NullPartitionResult> operator()(uint64_t* indices_begin, uint64_t* indices_end,
+                                         const Array& array, int64_t offset,
+                                         const ArraySortOptions& options,
+                                         ExecContext* ctx) {
+    const auto& dict_array = checked_cast<const DictionaryArray&>(array);
+    // TODO: These methods should probably return a const&? They seem capable.

Review Comment:
   Seems reasonable. Do you want to open a separate issue for this follow up and tag that issue here?  E.g. `TODO (GH-123): These methods...`



##########
cpp/src/arrow/compute/kernels/vector_array_sort.cc:
##########
@@ -173,6 +173,116 @@ class ArrayCompareSorter {
   }
 };
 
+template <>
+class ArrayCompareSorter<DictionaryType> {
+ public:
+  Result<NullPartitionResult> operator()(uint64_t* indices_begin, uint64_t* indices_end,
+                                         const Array& array, int64_t offset,
+                                         const ArraySortOptions& options,
+                                         ExecContext* ctx) {
+    const auto& dict_array = checked_cast<const DictionaryArray&>(array);
+    // TODO: These methods should probably return a const&? They seem capable.
+    auto dict_values = dict_array.dictionary();
+    auto dict_indices = dict_array.indices();
+
+    // Algorithm:
+    // 1) Use the Rank function to get an exactly-equivalent-order array
+    //    of the dictionary values, but with a datatype that's friendlier to
+    //    sorting (uint64).
+    // 2) Act as if we were sorting a dictionary array with the same indices,
+    //    but with the ranks as dictionary values.
+    // 2a) Dictionary-decode the ranks by calling Take.
+    // 2b) Sort the decoded ranks. Not only those are uint64, they are dense
+    //     in a [0, k) range where k is the number of unique dictionary values.
+    //     Therefore, unless the dictionary is very large, a fast counting sort
+    //     will be used.
+    //
+    // The bottom line is that performance will usually be much better
+    // (potentially an order of magnitude faster) than by naively decoding
+    // the original dictionary and sorting the decoded version.
+
+    std::shared_ptr<Array> decoded_ranks;
+    // Skip the rank/take steps for cases with only nulls
+    if (IsAllNulls(dict_array)) {
+      ARROW_ASSIGN_OR_RAISE(decoded_ranks,
+                            MakeNullUInt64Array(dict_array.length(), ctx->memory_pool()));
+    } else {
+      ARROW_ASSIGN_OR_RAISE(auto ranks, RanksWithNulls(dict_values, ctx));
+
+      ARROW_ASSIGN_OR_RAISE(decoded_ranks,
+                            Take(*ranks, *dict_indices, TakeOptions::Defaults(), ctx));
+    }
+
+    DCHECK_EQ(decoded_ranks->type_id(), Type::UINT64);
+    DCHECK_EQ(decoded_ranks->length(), dict_array.length());
+    ARROW_ASSIGN_OR_RAISE(auto rank_sorter, GetArraySorter(*decoded_ranks->type()));
+
+    return rank_sorter(indices_begin, indices_end, *decoded_ranks, offset, options, ctx);
+  }
+
+ private:
+  static bool IsAllNulls(const DictionaryArray& dict_array) {
+    auto indices = dict_array.indices();
+    auto values = dict_array.dictionary();
+    if (values->null_count() == values->length() ||
+        indices->null_count() == indices->length()) {
+      return true;
+    }
+    for (int64_t i = 0; i < dict_array.length(); ++i) {
+      // TODO: Special handling for dictionary types in Array::IsValid/Null?
+      if (indices->IsValid(i) && values->IsValid(dict_array.GetValueIndex(i))) {
+        return false;
+      }
+    }
+    return true;
+  }
+
+  static Result<std::shared_ptr<Array>> MakeNullUInt64Array(int64_t length,

Review Comment:
   Use `arrow::MakeArrayOfNull` in `src/arrow/array/util.cc` instead?



##########
cpp/src/arrow/compute/kernels/vector_sort_test.cc:
##########
@@ -368,6 +368,129 @@ TEST(ArraySortIndicesFunction, Array) {
   AssertDatumsEqual(expected, actual, /*verbose=*/true);
 }
 
+TEST(ArraySortIndicesFunction, DictionaryArray) {
+  // Decoded dictionary array:
+  // (["b", "c", null, null, null, "b", "c", "c", "a"])
+
+  std::vector<std::string> expected_str = {
+      "[8, 0, 5, 1, 6, 7, 2, 3, 4]",  // SortOrder::Ascending NullPlacement::AtEnd
+      "[2, 3, 4, 8, 0, 5, 1, 6, 7]",  // SortOrder::Ascending NullPlacement::AtStart
+      "[1, 6, 7, 0, 5, 8, 2, 3, 4]",  // SortOrder::Descending NullPlacement::AtEnd
+      "[2, 3, 4, 1, 6, 7, 0, 5, 8]"   // SortOrder::Descending NullPlacement::AtStart
+  };
+
+  for (auto index_type : all_dictionary_index_types()) {

Review Comment:
   ```suggestion
     for (const auto& index_type : all_dictionary_index_types()) {
   ```



##########
cpp/src/arrow/compute/kernels/vector_array_sort.cc:
##########
@@ -173,6 +173,116 @@ class ArrayCompareSorter {
   }
 };
 
+template <>
+class ArrayCompareSorter<DictionaryType> {
+ public:
+  Result<NullPartitionResult> operator()(uint64_t* indices_begin, uint64_t* indices_end,
+                                         const Array& array, int64_t offset,
+                                         const ArraySortOptions& options,
+                                         ExecContext* ctx) {
+    const auto& dict_array = checked_cast<const DictionaryArray&>(array);
+    // TODO: These methods should probably return a const&? They seem capable.
+    auto dict_values = dict_array.dictionary();
+    auto dict_indices = dict_array.indices();
+
+    // Algorithm:
+    // 1) Use the Rank function to get an exactly-equivalent-order array
+    //    of the dictionary values, but with a datatype that's friendlier to
+    //    sorting (uint64).
+    // 2) Act as if we were sorting a dictionary array with the same indices,
+    //    but with the ranks as dictionary values.
+    // 2a) Dictionary-decode the ranks by calling Take.
+    // 2b) Sort the decoded ranks. Not only those are uint64, they are dense
+    //     in a [0, k) range where k is the number of unique dictionary values.
+    //     Therefore, unless the dictionary is very large, a fast counting sort
+    //     will be used.
+    //
+    // The bottom line is that performance will usually be much better
+    // (potentially an order of magnitude faster) than by naively decoding
+    // the original dictionary and sorting the decoded version.
+
+    std::shared_ptr<Array> decoded_ranks;
+    // Skip the rank/take steps for cases with only nulls
+    if (IsAllNulls(dict_array)) {
+      ARROW_ASSIGN_OR_RAISE(decoded_ranks,
+                            MakeNullUInt64Array(dict_array.length(), ctx->memory_pool()));
+    } else {
+      ARROW_ASSIGN_OR_RAISE(auto ranks, RanksWithNulls(dict_values, ctx));
+
+      ARROW_ASSIGN_OR_RAISE(decoded_ranks,
+                            Take(*ranks, *dict_indices, TakeOptions::Defaults(), ctx));
+    }
+
+    DCHECK_EQ(decoded_ranks->type_id(), Type::UINT64);
+    DCHECK_EQ(decoded_ranks->length(), dict_array.length());
+    ARROW_ASSIGN_OR_RAISE(auto rank_sorter, GetArraySorter(*decoded_ranks->type()));
+
+    return rank_sorter(indices_begin, indices_end, *decoded_ranks, offset, options, ctx);
+  }
+
+ private:
+  static bool IsAllNulls(const DictionaryArray& dict_array) {
+    auto indices = dict_array.indices();
+    auto values = dict_array.dictionary();
+    if (values->null_count() == values->length() ||
+        indices->null_count() == indices->length()) {
+      return true;
+    }
+    for (int64_t i = 0; i < dict_array.length(); ++i) {
+      // TODO: Special handling for dictionary types in Array::IsValid/Null?

Review Comment:
   I'm not sure what this todo is referring to.  Are you suggesting we move this logic into `DictionaryArray`?  @felipecrv has been handling special cases for `IsNull/IsValid` recently and might be interested too.



##########
cpp/src/arrow/compute/kernels/vector_sort_benchmark.cc:
##########
@@ -82,6 +86,55 @@ static void ArraySortFuncInt64Benchmark(benchmark::State& state, const Runner& r
   ArraySortFuncBenchmark(state, runner, values);
 }
 
+template <typename Runner>
+static void ArraySortFuncInt64DictBenchmark(benchmark::State& state, const Runner& runner,
+                                            int64_t min, int64_t max) {
+  RegressionArgs args(state);
+
+  const int64_t array_size = args.size / sizeof(int64_t);
+
+  auto rand = random::RandomArrayGenerator(kSeed);
+  auto dict_values = rand.Int64(kDictionarySize, min, max, args.null_proportion / 2);
+  auto dict_indices =
+      rand.Int64(array_size, 0, kDictionarySize - 1, args.null_proportion / 2);
+  auto dict_array = *DictionaryArray::FromArrays(dict_indices, dict_values);
+
+  ArraySortFuncBenchmark(state, runner, dict_array);
+}
+
+template <typename Runner>
+static void ArraySortFuncStringBenchmark(benchmark::State& state, const Runner& runner) {

Review Comment:
   I'm not sure if I'm just confused or maybe this PR needs a rebase from main but it appears there is already an `ArraySortFuncStringBenchmark` at https://github.com/apache/arrow/blob/be12888997c81b1fb7947f6284be1256edd4d3e4/cpp/src/arrow/compute/kernels/vector_sort_benchmark.cc#L103
   
   Also, the version I linked to uses the actual length of the strings instead of `sizeof(int64_t)` which I think is more consistent.  Then `args.size` is the size in bytes and items processed is the number of records.



##########
cpp/src/arrow/compute/kernels/vector_array_sort.cc:
##########
@@ -173,6 +173,116 @@ class ArrayCompareSorter {
   }
 };
 
+template <>
+class ArrayCompareSorter<DictionaryType> {
+ public:
+  Result<NullPartitionResult> operator()(uint64_t* indices_begin, uint64_t* indices_end,
+                                         const Array& array, int64_t offset,
+                                         const ArraySortOptions& options,
+                                         ExecContext* ctx) {
+    const auto& dict_array = checked_cast<const DictionaryArray&>(array);
+    // TODO: These methods should probably return a const&? They seem capable.
+    auto dict_values = dict_array.dictionary();
+    auto dict_indices = dict_array.indices();
+
+    // Algorithm:
+    // 1) Use the Rank function to get an exactly-equivalent-order array
+    //    of the dictionary values, but with a datatype that's friendlier to
+    //    sorting (uint64).
+    // 2) Act as if we were sorting a dictionary array with the same indices,
+    //    but with the ranks as dictionary values.
+    // 2a) Dictionary-decode the ranks by calling Take.
+    // 2b) Sort the decoded ranks. Not only those are uint64, they are dense
+    //     in a [0, k) range where k is the number of unique dictionary values.
+    //     Therefore, unless the dictionary is very large, a fast counting sort
+    //     will be used.
+    //
+    // The bottom line is that performance will usually be much better
+    // (potentially an order of magnitude faster) than by naively decoding
+    // the original dictionary and sorting the decoded version.
+
+    std::shared_ptr<Array> decoded_ranks;
+    // Skip the rank/take steps for cases with only nulls
+    if (IsAllNulls(dict_array)) {
+      ARROW_ASSIGN_OR_RAISE(decoded_ranks,
+                            MakeNullUInt64Array(dict_array.length(), ctx->memory_pool()));
+    } else {
+      ARROW_ASSIGN_OR_RAISE(auto ranks, RanksWithNulls(dict_values, ctx));
+
+      ARROW_ASSIGN_OR_RAISE(decoded_ranks,
+                            Take(*ranks, *dict_indices, TakeOptions::Defaults(), ctx));
+    }
+
+    DCHECK_EQ(decoded_ranks->type_id(), Type::UINT64);
+    DCHECK_EQ(decoded_ranks->length(), dict_array.length());
+    ARROW_ASSIGN_OR_RAISE(auto rank_sorter, GetArraySorter(*decoded_ranks->type()));
+
+    return rank_sorter(indices_begin, indices_end, *decoded_ranks, offset, options, ctx);
+  }
+
+ private:
+  static bool IsAllNulls(const DictionaryArray& dict_array) {
+    auto indices = dict_array.indices();
+    auto values = dict_array.dictionary();
+    if (values->null_count() == values->length() ||
+        indices->null_count() == indices->length()) {
+      return true;
+    }
+    for (int64_t i = 0; i < dict_array.length(); ++i) {
+      // TODO: Special handling for dictionary types in Array::IsValid/Null?
+      if (indices->IsValid(i) && values->IsValid(dict_array.GetValueIndex(i))) {
+        return false;
+      }
+    }
+    return true;
+  }
+
+  static Result<std::shared_ptr<Array>> MakeNullUInt64Array(int64_t length,
+                                                            MemoryPool* pool) {
+    ARROW_ASSIGN_OR_RAISE(auto bitmap, AllocateEmptyBitmap(length, pool));
+    auto data = ArrayData::Make(uint64(), length, {std::move(bitmap), nullptr},
+                                /*null_count=*/length);
+    return MakeArray(data);
+  }
+
+  static Result<std::shared_ptr<Array>> RanksWithNulls(
+      const std::shared_ptr<Array>& array, ExecContext* ctx) {
+    // Notes:
+    // * The order is always ascending here, since the goal is to produce
+    //   an exactly-equivalent-order of the dictionary values.
+    // * We're going to re-emit nulls in the output, so we can just always consider
+    //   them "at the end".  Note that choosing AtStart would merely shift other
+    //   ranks by 1 if there are any nulls...
+    RankOptions rank_options(SortOrder::Ascending, NullPlacement::AtEnd,
+                             RankOptions::Dense);
+
+    // XXX Should this support Type::NA?
+    auto data = array->data();
+    std::shared_ptr<Buffer> null_bitmap;
+    if (array->null_count() > 0) {
+      null_bitmap = array->null_bitmap();
+      data = array->data()->Copy();
+      if (data->offset > 0) {
+        ARROW_ASSIGN_OR_RAISE(null_bitmap, arrow::internal::CopyBitmap(
+                                               ctx->memory_pool(), null_bitmap->data(),
+                                               data->offset, data->length));
+      }
+      data->buffers[0] = nullptr;
+      data->null_count = 0;
+    }

Review Comment:
   I'm not sure I follow what is happening here.  Is `data` used after this if statement?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #35280: GH-29887: [C++] Implement dictionary array sorting

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #35280:
URL: https://github.com/apache/arrow/pull/35280#discussion_r1195444005


##########
cpp/src/arrow/compute/kernels/vector_sort_internal.h:
##########
@@ -192,9 +192,9 @@ struct NullPartitionResult {
 // Move nulls (not null-like values) to end of array.
 //
 // `offset` is used when this is called on a chunk of a chunked array
-template <typename Partitioner>
+template <typename ArrayType, typename Partitioner>
 NullPartitionResult PartitionNullsOnly(uint64_t* indices_begin, uint64_t* indices_end,
-                                       const Array& values, int64_t offset,
+                                       const ArrayType& values, int64_t offset,

Review Comment:
   Hmm, is it necessary to pass the concrete array type? I tried to avoid combinatorial explosion here, as the `partitioner` call will generate a non-trivial amount of code.



##########
cpp/src/arrow/compute/kernels/vector_array_sort.cc:
##########
@@ -173,6 +173,109 @@ class ArrayCompareSorter {
   }
 };
 
+template <>
+class ArrayCompareSorter<DictionaryType> {
+ public:
+  Result<NullPartitionResult> operator()(uint64_t* indices_begin, uint64_t* indices_end,
+                                         const Array& array, int64_t offset,
+                                         const ArraySortOptions& options,
+                                         ExecContext* ctx) {
+    const auto& dict_array = checked_cast<const DictionaryArray&>(array);
+    // TODO: These methods should probably return a const&? They seem capable.
+    // https://github.com/apache/arrow/issues/35437
+    auto dict_values = dict_array.dictionary();
+    auto dict_indices = dict_array.indices();
+
+    // Algorithm:
+    // 1) Use the Rank function to get an exactly-equivalent-order array
+    //    of the dictionary values, but with a datatype that's friendlier to
+    //    sorting (uint64).
+    // 2) Act as if we were sorting a dictionary array with the same indices,
+    //    but with the ranks as dictionary values.
+    // 2a) Dictionary-decode the ranks by calling Take.
+    // 2b) Sort the decoded ranks. Not only those are uint64, they are dense
+    //     in a [0, k) range where k is the number of unique dictionary values.
+    //     Therefore, unless the dictionary is very large, a fast counting sort
+    //     will be used.
+    //
+    // The bottom line is that performance will usually be much better
+    // (potentially an order of magnitude faster) than by naively decoding
+    // the original dictionary and sorting the decoded version.
+
+    std::shared_ptr<Array> decoded_ranks;
+    // Skip the rank/take steps for cases with only nulls
+    if (IsAllNulls(dict_array)) {
+      ARROW_ASSIGN_OR_RAISE(decoded_ranks, MakeArrayOfNull(uint64(), dict_array.length(),
+                                                           ctx->memory_pool()));
+    } else {
+      ARROW_ASSIGN_OR_RAISE(auto ranks, RanksWithNulls(dict_values, ctx));
+
+      ARROW_ASSIGN_OR_RAISE(decoded_ranks,
+                            Take(*ranks, *dict_indices, TakeOptions::Defaults(), ctx));
+    }
+
+    DCHECK_EQ(decoded_ranks->type_id(), Type::UINT64);
+    DCHECK_EQ(decoded_ranks->length(), dict_array.length());
+    ARROW_ASSIGN_OR_RAISE(auto rank_sorter, GetArraySorter(*decoded_ranks->type()));
+
+    return rank_sorter(indices_begin, indices_end, *decoded_ranks, offset, options, ctx);
+  }
+
+ private:
+  static bool IsAllNulls(const DictionaryArray& dict_array) {
+    auto indices = dict_array.indices();
+    auto values = dict_array.dictionary();
+    if (values->null_count() == values->length() ||
+        indices->null_count() == indices->length()) {
+      return true;
+    }
+    for (int64_t i = 0; i < dict_array.length(); ++i) {

Review Comment:
   Is it actually worth doing this? This is just a performance optimization for all-null dictionary arrays, but not necessary for correctness?



##########
cpp/src/arrow/compute/kernels/vector_sort_benchmark.cc:
##########
@@ -82,6 +100,64 @@ static void ArraySortFuncInt64Benchmark(benchmark::State& state, const Runner& r
   ArraySortFuncBenchmark(state, runner, values);
 }
 
+template <typename Runner>
+static void ArraySortFuncInt64DictBenchmark(benchmark::State& state, const Runner& runner,
+                                            int64_t min, int64_t max, int32_t dict_size) {
+  RegressionArgs args(state);
+
+  const double null_proportion = args.null_proportion / 2;
+  const int64_t array_size = args.size / sizeof(int64_t);
+
+  auto rand = random::RandomArrayGenerator(kSeed);
+  auto dict_values = rand.Int64(dict_size, min, max, null_proportion);
+  auto dict_indices = rand.Int64(array_size, 0, dict_size - 1, null_proportion);
+  auto dict_array = *DictionaryArray::FromArrays(dict_indices, dict_values);
+
+  ArraySortFuncBenchmark(state, runner, dict_array);
+}
+
+template <typename Runner>
+static void ArraySortFuncStringBenchmark(benchmark::State& state, const Runner& runner,
+                                         int32_t min_length, int32_t max_length) {
+  RegressionArgs args(state);
+
+  const int64_t array_size =
+      GetStringArraySize(args.size, min_length, max_length, args.null_proportion);
+
+  auto rand = random::RandomArrayGenerator(kSeed);
+  auto values = rand.String(array_size, min_length, max_length, args.null_proportion);
+
+  ArraySortFuncBenchmark(state, runner, values);
+}
+
+template <typename Runner>
+static void ArraySortFuncStringDictBenchmark(benchmark::State& state,
+                                             const Runner& runner, int32_t min_length,
+                                             int32_t max_length, int32_t dict_size) {
+  RegressionArgs args(state);
+
+  const double null_proportion = args.null_proportion / 2;
+  const int64_t array_size = args.size / sizeof(int64_t);

Review Comment:
   This is debatable. If we want to compare against the sort speed of an equivalent decoded array, we would like the array size to match.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #35280: GH-29887: [C++] Implement dictionary array sorting

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35280:
URL: https://github.com/apache/arrow/pull/35280#issuecomment-1518392005

   * Closes: #29887


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] ursabot commented on pull request #35280: GH-29887: [C++] Implement dictionary array sorting

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #35280:
URL: https://github.com/apache/arrow/pull/35280#issuecomment-1569423571

   ['Python', 'R'] benchmarks have high level of regressions.
   [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/3d76cb5ffb8849bf8c3ea9b32d08b3b7...578b21fce98d46d0a7dc293eccfb7692/)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org