You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "js8544 (via GitHub)" <gi...@apache.org> on 2023/06/14 15:54:46 UTC

[GitHub] [arrow] js8544 opened a new pull request, #36067: GH-36059: [C++][Compute] Reserve space for hashtable for scalar lookup functions

js8544 opened a new pull request, #36067:
URL: https://github.com/apache/arrow/pull/36067

   
   
   <!--
   Thanks for opening a pull request!
   If this is your first pull request you can find detailed information on how 
   to contribute here:
     * [New Contributor's Guide](https://arrow.apache.org/docs/dev/developers/guide/step_by_step/pr_lifecycle.html#reviews-and-merge-of-the-pull-request)
     * [Contributing Overview](https://arrow.apache.org/docs/dev/developers/overview.html)
   
   
   If this is not a [minor PR](https://github.com/apache/arrow/blob/main/CONTRIBUTING.md#Minor-Fixes). Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose
   
   Opening GitHub issues ahead of time contributes to the [Openness](http://theapacheway.com/open/#:~:text=Openness%20allows%20new%20users%20the,must%20happen%20in%20the%20open.) of the Apache Arrow project.
   
   Then could you also rename the pull request title in the following format?
   
       GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   or
   
       MINOR: [${COMPONENT}] ${SUMMARY}
   
   In the case of PARQUET issues on JIRA the title also supports:
   
       PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}
   
   -->
   
   ### Rationale for this change
   
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   Performance improvement for scalar lookup functions.
   
   ### What changes are included in this PR?
   
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   1. Reserve space for hashtable for scalar lookup functions before the insertion process.
   2. A benchmark that demonstrates the improvement. Data is pasted in the comments.
   
   ### Are these changes tested?
   
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   4. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   By existing unit tests.
   
   ### Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please uncomment the line below and explain which changes are breaking.
   -->
   <!-- **This PR includes breaking changes to public APIs.** -->
   
   <!--
   Please uncomment the line below (and provide explanation) if the changes fix either (a) a security vulnerability, (b) a bug that caused incorrect or invalid data to be produced, or (c) a bug that causes a crash (even when the API contract is upheld). We use this to highlight fixes to issues that may affect users without their knowledge. For this reason, fixing bugs that cause errors don't count, since those are usually obvious.
   -->
   <!-- **This PR contains a "Critical Fix".** -->
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] js8544 commented on a diff in pull request #36067: GH-36059: [C++][Compute] Reserve space for hashtable for scalar lookup functions

Posted by "js8544 (via GitHub)" <gi...@apache.org>.
js8544 commented on code in PR #36067:
URL: https://github.com/apache/arrow/pull/36067#discussion_r1238611574


##########
cpp/src/arrow/compute/kernels/scalar_set_lookup_benchmark.cc:
##########
@@ -91,35 +91,51 @@ static void IsInStringLargeSet(benchmark::State& state) {
 }
 
 static void IndexInInt8SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int8Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int8Type>(state, "index_in_meta_binary", state.range(0),
+                                      1 << 18);
 }
 
 static void IndexInInt16SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int16Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int16Type>(state, "index_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IndexInInt32SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int32Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int32Type>(state, "index_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IndexInInt64SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int64Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int64Type>(state, "index_in_meta_binary", state.range(0),
+                                       1 << 18);
+}
+
+static void IndexInInt32LargeSet(benchmark::State& state) {
+  SetLookupBenchmarkNumeric<Int32Type>(state, "index_in_meta_binary", state.range(0), 10);
 }
 
 static void IsInInt8SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int8Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int8Type>(state, "is_in_meta_binary", state.range(0),
+                                      1 << 18);
 }
 
 static void IsInInt16SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int16Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int16Type>(state, "is_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IsInInt32SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int32Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int32Type>(state, "is_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IsInInt64SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int64Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int64Type>(state, "is_in_meta_binary", state.range(0),
+                                       1 << 18);
+}
+
+static void IsInInt32LargeSet(benchmark::State& state) {
+  SetLookupBenchmarkNumeric<Int32Type>(state, "is_in_meta_binary", state.range(0), 10);

Review Comment:
   Done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36067: GH-36059: [C++][Compute] Reserve space for hashtable for scalar lookup functions

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36067:
URL: https://github.com/apache/arrow/pull/36067#discussion_r1238646880


##########
cpp/src/arrow/compute/kernels/scalar_set_lookup_benchmark.cc:
##########
@@ -134,10 +151,12 @@ BENCHMARK(IndexInInt8SmallSet)->RangeMultiplier(4)->Range(2, 8);
 BENCHMARK(IndexInInt16SmallSet)->RangeMultiplier(4)->Range(2, 64);
 BENCHMARK(IndexInInt32SmallSet)->RangeMultiplier(4)->Range(2, 64);
 BENCHMARK(IndexInInt64SmallSet)->RangeMultiplier(4)->Range(2, 64);
+BENCHMARK(IndexInInt32LargeSet)->RangeMultiplier(10)->Range(100, 10000000);

Review Comment:
   Let's make the benchmarks shorter a bit? We don't need so many and so large value sets IMHO.
   ```suggestion
   BENCHMARK(IndexInInt32LargeSet)->RangeMultiplier(100)->Range(1000, 1000000);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on a diff in pull request #36067: GH-36059: [C++][Compute] Reserve space for hashtable for scalar lookup functions

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #36067:
URL: https://github.com/apache/arrow/pull/36067#discussion_r1235916639


##########
cpp/src/arrow/compute/kernels/scalar_set_lookup_benchmark.cc:
##########
@@ -91,35 +91,51 @@ static void IsInStringLargeSet(benchmark::State& state) {
 }
 
 static void IndexInInt8SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int8Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int8Type>(state, "index_in_meta_binary", state.range(0),
+                                      1 << 18);
 }
 
 static void IndexInInt16SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int16Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int16Type>(state, "index_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IndexInInt32SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int32Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int32Type>(state, "index_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IndexInInt64SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int64Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int64Type>(state, "index_in_meta_binary", state.range(0),
+                                       1 << 18);
+}
+
+static void IndexInInt32LargeSet(benchmark::State& state) {
+  SetLookupBenchmarkNumeric<Int32Type>(state, "index_in_meta_binary", state.range(0), 10);
 }
 
 static void IsInInt8SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int8Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int8Type>(state, "is_in_meta_binary", state.range(0),
+                                      1 << 18);
 }
 
 static void IsInInt16SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int16Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int16Type>(state, "is_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IsInInt32SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int32Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int32Type>(state, "is_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IsInInt64SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int64Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int64Type>(state, "is_in_meta_binary", state.range(0),
+                                       1 << 18);
+}
+
+static void IsInInt32LargeSet(benchmark::State& state) {
+  SetLookupBenchmarkNumeric<Int32Type>(state, "is_in_meta_binary", state.range(0), 10);

Review Comment:
   Why `10` here instead of `1 << 18`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36067: GH-36059: [C++][Compute] Reserve space for hashtable for scalar lookup functions

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36067:
URL: https://github.com/apache/arrow/pull/36067#discussion_r1238642308


##########
cpp/src/arrow/compute/kernels/scalar_set_lookup_benchmark.cc:
##########
@@ -56,8 +56,8 @@ static void SetLookupBenchmarkString(benchmark::State& state,
 template <typename Type>
 static void SetLookupBenchmarkNumeric(benchmark::State& state,
                                       const std::string& func_name,
-                                      const int64_t value_set_length) {
-  const int64_t array_length = 1 << 18;
+                                      const int64_t value_set_length,
+                                      const int64_t array_length) {

Review Comment:
   perhaps add `state.counters["value_set_length"] = value_set_length` somewhere below to make results more readable? (same in the other bench function)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36067: GH-36059: [C++][Compute] Reserve space for hashtable for scalar lookup functions

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36067:
URL: https://github.com/apache/arrow/pull/36067#discussion_r1238355926


##########
cpp/src/arrow/compute/kernels/scalar_set_lookup_benchmark.cc:
##########
@@ -91,35 +91,51 @@ static void IsInStringLargeSet(benchmark::State& state) {
 }
 
 static void IndexInInt8SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int8Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int8Type>(state, "index_in_meta_binary", state.range(0),
+                                      1 << 18);
 }
 
 static void IndexInInt16SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int16Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int16Type>(state, "index_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IndexInInt32SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int32Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int32Type>(state, "index_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IndexInInt64SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int64Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int64Type>(state, "index_in_meta_binary", state.range(0),
+                                       1 << 18);
+}
+
+static void IndexInInt32LargeSet(benchmark::State& state) {
+  SetLookupBenchmarkNumeric<Int32Type>(state, "index_in_meta_binary", state.range(0), 10);
 }
 
 static void IsInInt8SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int8Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int8Type>(state, "is_in_meta_binary", state.range(0),
+                                      1 << 18);
 }
 
 static void IsInInt16SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int16Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int16Type>(state, "is_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IsInInt32SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int32Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int32Type>(state, "is_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IsInInt64SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int64Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int64Type>(state, "is_in_meta_binary", state.range(0),
+                                       1 << 18);
+}
+
+static void IsInInt32LargeSet(benchmark::State& state) {
+  SetLookupBenchmarkNumeric<Int32Type>(state, "is_in_meta_binary", state.range(0), 10);

Review Comment:
   Well, we should keep the benchmarks meaningful and relevant. A 10-length input array does not really represent real-world use cases.
   
   We can also add this optimization without adding micro-benchmarks for it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] js8544 commented on a diff in pull request #36067: GH-36059: [C++][Compute] Reserve space for hashtable for scalar lookup functions

Posted by "js8544 (via GitHub)" <gi...@apache.org>.
js8544 commented on code in PR #36067:
URL: https://github.com/apache/arrow/pull/36067#discussion_r1238477686


##########
cpp/src/arrow/compute/kernels/scalar_set_lookup.cc:
##########
@@ -36,16 +36,23 @@ namespace {
 
 template <typename Type>
 struct SetLookupState : public KernelState {
-  explicit SetLookupState(MemoryPool* pool) : lookup_table(pool, 0) {}
+  explicit SetLookupState(MemoryPool* pool) : lookup_table(pool, 0), memory_pool(pool) {}

Review Comment:
   Good point. Changed it to `std::optional`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36067: GH-36059: [C++][Compute] Reserve space for hashtable for scalar lookup functions

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36067:
URL: https://github.com/apache/arrow/pull/36067#discussion_r1238618296


##########
cpp/src/arrow/compute/kernels/scalar_set_lookup.cc:
##########
@@ -54,8 +61,8 @@ struct SetLookupState : public KernelState {
     } else {
       return Status::Invalid("value_set should be an array or chunked array");
     }
-    if (!options.skip_nulls && lookup_table.GetNull() >= 0) {
-      null_index = memo_index_to_value_index[lookup_table.GetNull()];
+    if (!options.skip_nulls && lookup_table.value().GetNull() >= 0) {
+      null_index = memo_index_to_value_index[lookup_table.value().GetNull()];

Review Comment:
   Can probably use the dereference operator?
   ```suggestion
       if (!options.skip_nulls && lookup_table->GetNull() >= 0) {
         null_index = memo_index_to_value_index[lookup_table->GetNull()];
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou merged pull request #36067: GH-36059: [C++][Compute] Reserve space for hashtable for scalar lookup functions

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou merged PR #36067:
URL: https://github.com/apache/arrow/pull/36067


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36067: GH-36059: [C++][Compute] Reserve space for hashtable for scalar lookup functions

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36067:
URL: https://github.com/apache/arrow/pull/36067#discussion_r1238606021


##########
cpp/src/arrow/compute/kernels/scalar_set_lookup_benchmark.cc:
##########
@@ -91,35 +91,51 @@ static void IsInStringLargeSet(benchmark::State& state) {
 }
 
 static void IndexInInt8SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int8Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int8Type>(state, "index_in_meta_binary", state.range(0),
+                                      1 << 18);
 }
 
 static void IndexInInt16SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int16Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int16Type>(state, "index_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IndexInInt32SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int32Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int32Type>(state, "index_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IndexInInt64SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int64Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int64Type>(state, "index_in_meta_binary", state.range(0),
+                                       1 << 18);
+}
+
+static void IndexInInt32LargeSet(benchmark::State& state) {
+  SetLookupBenchmarkNumeric<Int32Type>(state, "index_in_meta_binary", state.range(0), 10);
 }
 
 static void IsInInt8SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int8Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int8Type>(state, "is_in_meta_binary", state.range(0),
+                                      1 << 18);
 }
 
 static void IsInInt16SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int16Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int16Type>(state, "is_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IsInInt32SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int32Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int32Type>(state, "is_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IsInInt64SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int64Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int64Type>(state, "is_in_meta_binary", state.range(0),
+                                       1 << 18);
+}
+
+static void IsInInt32LargeSet(benchmark::State& state) {
+  SetLookupBenchmarkNumeric<Int32Type>(state, "is_in_meta_binary", state.range(0), 10);

Review Comment:
   Well, we can keep the large set benchmarks, but with a "normal" array size.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] js8544 commented on a diff in pull request #36067: GH-36059: [C++][Compute] Reserve space for hashtable for scalar lookup functions

Posted by "js8544 (via GitHub)" <gi...@apache.org>.
js8544 commented on code in PR #36067:
URL: https://github.com/apache/arrow/pull/36067#discussion_r1238675399


##########
cpp/src/arrow/compute/kernels/scalar_set_lookup_benchmark.cc:
##########
@@ -56,8 +56,8 @@ static void SetLookupBenchmarkString(benchmark::State& state,
 template <typename Type>
 static void SetLookupBenchmarkNumeric(benchmark::State& state,
                                       const std::string& func_name,
-                                      const int64_t value_set_length) {
-  const int64_t array_length = 1 << 18;
+                                      const int64_t value_set_length,
+                                      const int64_t array_length) {

Review Comment:
   done



##########
cpp/src/arrow/compute/kernels/scalar_set_lookup_benchmark.cc:
##########
@@ -91,35 +91,52 @@ static void IsInStringLargeSet(benchmark::State& state) {
 }
 
 static void IndexInInt8SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int8Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int8Type>(state, "index_in_meta_binary", state.range(0),
+                                      1 << 18);

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36067: GH-36059: [C++][Compute] Reserve space for hashtable for scalar lookup functions

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36067:
URL: https://github.com/apache/arrow/pull/36067#discussion_r1238618585


##########
cpp/src/arrow/compute/kernels/scalar_set_lookup.cc:
##########
@@ -54,8 +61,8 @@ struct SetLookupState : public KernelState {
     } else {
       return Status::Invalid("value_set should be an array or chunked array");
     }
-    if (!options.skip_nulls && lookup_table.GetNull() >= 0) {
-      null_index = memo_index_to_value_index[lookup_table.GetNull()];
+    if (!options.skip_nulls && lookup_table.value().GetNull() >= 0) {
+      null_index = memo_index_to_value_index[lookup_table.value().GetNull()];

Review Comment:
   Same in other occurrences below.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] js8544 commented on a diff in pull request #36067: GH-36059: [C++][Compute] Reserve space for hashtable for scalar lookup functions

Posted by "js8544 (via GitHub)" <gi...@apache.org>.
js8544 commented on code in PR #36067:
URL: https://github.com/apache/arrow/pull/36067#discussion_r1238621255


##########
cpp/src/arrow/compute/kernels/scalar_set_lookup.cc:
##########
@@ -54,8 +61,8 @@ struct SetLookupState : public KernelState {
     } else {
       return Status::Invalid("value_set should be an array or chunked array");
     }
-    if (!options.skip_nulls && lookup_table.GetNull() >= 0) {
-      null_index = memo_index_to_value_index[lookup_table.GetNull()];
+    if (!options.skip_nulls && lookup_table.value().GetNull() >= 0) {
+      null_index = memo_index_to_value_index[lookup_table.value().GetNull()];

Review Comment:
   Done. They look nicer indeed!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] js8544 commented on a diff in pull request #36067: GH-36059: [C++][Compute] Reserve space for hashtable for scalar lookup functions

Posted by "js8544 (via GitHub)" <gi...@apache.org>.
js8544 commented on code in PR #36067:
URL: https://github.com/apache/arrow/pull/36067#discussion_r1238675747


##########
cpp/src/arrow/compute/kernels/scalar_set_lookup_benchmark.cc:
##########
@@ -134,10 +151,12 @@ BENCHMARK(IndexInInt8SmallSet)->RangeMultiplier(4)->Range(2, 8);
 BENCHMARK(IndexInInt16SmallSet)->RangeMultiplier(4)->Range(2, 64);
 BENCHMARK(IndexInInt32SmallSet)->RangeMultiplier(4)->Range(2, 64);
 BENCHMARK(IndexInInt64SmallSet)->RangeMultiplier(4)->Range(2, 64);
+BENCHMARK(IndexInInt32LargeSet)->RangeMultiplier(10)->Range(100, 10000000);

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] conbench-apache-arrow[bot] commented on pull request #36067: GH-36059: [C++][Compute] Reserve space for hashtable for scalar lookup functions

Posted by "conbench-apache-arrow[bot] (via GitHub)" <gi...@apache.org>.
conbench-apache-arrow[bot] commented on PR #36067:
URL: https://github.com/apache/arrow/pull/36067#issuecomment-1606361220

   Conbench analyzed the 6 benchmark runs on commit `2ce4a38f`.
   
   There were 9 benchmark results indicating a performance regression:
   
   - Commit Run on `ursa-thinkcentre-m75q` at [2023-06-25 05:44:29Z](http://conbench.ursa.dev/compare/runs/a7e4561927c347279aec1272a1d9f746...530d7cc6a6e3438996dc8ae202561b4a/)
     - [params=<Add, DoubleType>/size:524288/inverse_null_proportion:100, source=cpp-micro, suite=arrow-compute-scalar-arithmetic-benchmark](http://conbench.ursa.dev/compare/benchmarks/06497aef183a7675800026714e13d13d...06497d498b0e712b80003aa1dac08dc8)
     - [params=<Multiply, Int8Type>/size:524288/inverse_null_proportion:100, source=cpp-micro, suite=arrow-compute-scalar-arithmetic-benchmark](http://conbench.ursa.dev/compare/benchmarks/06497aef73607f918000944cfe75116b...06497d49e5c172b780008efad0b1d735)
   - and 7 more (see the report linked below)
   
   The [full Conbench report](https://github.com/apache/arrow/runs/14539738701) has more details.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36067: GH-36059: [C++][Compute] Reserve space for hashtable for scalar lookup functions

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36067:
URL: https://github.com/apache/arrow/pull/36067#discussion_r1238609790


##########
cpp/src/arrow/compute/kernels/scalar_set_lookup_benchmark.cc:
##########
@@ -91,35 +91,51 @@ static void IsInStringLargeSet(benchmark::State& state) {
 }
 
 static void IndexInInt8SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int8Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int8Type>(state, "index_in_meta_binary", state.range(0),
+                                      1 << 18);
 }
 
 static void IndexInInt16SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int16Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int16Type>(state, "index_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IndexInInt32SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int32Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int32Type>(state, "index_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IndexInInt64SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int64Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int64Type>(state, "index_in_meta_binary", state.range(0),
+                                       1 << 18);
+}
+
+static void IndexInInt32LargeSet(benchmark::State& state) {
+  SetLookupBenchmarkNumeric<Int32Type>(state, "index_in_meta_binary", state.range(0), 10);
 }
 
 static void IsInInt8SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int8Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int8Type>(state, "is_in_meta_binary", state.range(0),
+                                      1 << 18);
 }
 
 static void IsInInt16SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int16Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int16Type>(state, "is_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IsInInt32SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int32Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int32Type>(state, "is_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IsInInt64SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int64Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int64Type>(state, "is_in_meta_binary", state.range(0),
+                                       1 << 18);
+}
+
+static void IsInInt32LargeSet(benchmark::State& state) {
+  SetLookupBenchmarkNumeric<Int32Type>(state, "is_in_meta_binary", state.range(0), 10);

Review Comment:
   That sounds fine to me.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] js8544 commented on a diff in pull request #36067: GH-36059: [C++][Compute] Reserve space for hashtable for scalar lookup functions

Posted by "js8544 (via GitHub)" <gi...@apache.org>.
js8544 commented on code in PR #36067:
URL: https://github.com/apache/arrow/pull/36067#discussion_r1238490730


##########
cpp/src/arrow/compute/kernels/scalar_set_lookup_benchmark.cc:
##########
@@ -91,35 +91,51 @@ static void IsInStringLargeSet(benchmark::State& state) {
 }
 
 static void IndexInInt8SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int8Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int8Type>(state, "index_in_meta_binary", state.range(0),
+                                      1 << 18);
 }
 
 static void IndexInInt16SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int16Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int16Type>(state, "index_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IndexInInt32SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int32Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int32Type>(state, "index_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IndexInInt64SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int64Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int64Type>(state, "index_in_meta_binary", state.range(0),
+                                       1 << 18);
+}
+
+static void IndexInInt32LargeSet(benchmark::State& state) {
+  SetLookupBenchmarkNumeric<Int32Type>(state, "index_in_meta_binary", state.range(0), 10);
 }
 
 static void IsInInt8SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int8Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int8Type>(state, "is_in_meta_binary", state.range(0),
+                                      1 << 18);
 }
 
 static void IsInInt16SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int16Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int16Type>(state, "is_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IsInInt32SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int32Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int32Type>(state, "is_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IsInInt64SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int64Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int64Type>(state, "is_in_meta_binary", state.range(0),
+                                       1 << 18);
+}
+
+static void IsInInt32LargeSet(benchmark::State& state) {
+  SetLookupBenchmarkNumeric<Int32Type>(state, "is_in_meta_binary", state.range(0), 10);

Review Comment:
   Actually, we do use `IndexIn` in a similar fashion. In our recommender system, we need to find if the candidate items (25~50ish) appear in the user's action history (up to 100000 for really acrive users). However, I can remove this if you have a strong opinion.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] js8544 commented on pull request #36067: GH-36059: [C++][Compute] Reserve space for hashtable for scalar lookup functions

Posted by "js8544 (via GitHub)" <gi...@apache.org>.
js8544 commented on PR #36067:
URL: https://github.com/apache/arrow/pull/36067#issuecomment-1591552160

   Benchmark result:
   Before:
   ```
   ----------------------------------------------------------------------------------------
   Benchmark                              Time             CPU   Iterations UserCounters...
   ----------------------------------------------------------------------------------------
   IndexInInt32LargeSet/100            2883 ns         2883 ns       246728 bytes_per_second=13.2302M/s items_per_second=3.46823M/s
   IndexInInt32LargeSet/1000          10322 ns        10322 ns        63804 bytes_per_second=3.6958M/s items_per_second=968.833k/s
   IndexInInt32LargeSet/10000        139318 ns       139316 ns         4987 bytes_per_second=280.389k/s items_per_second=71.7795k/s
   IndexInInt32LargeSet/100000      4588807 ns      4588452 ns          165 bytes_per_second=8.51322k/s items_per_second=2.17938k/s
   IndexInInt32LargeSet/1000000    90538493 ns     90536500 ns           10 bytes_per_second=441.811/s items_per_second=110.453/s
   IndexInInt32LargeSet/10000000 1428412088 ns   1428369286 ns            1 bytes_per_second=28.004/s items_per_second=7.00099/s
   IsInInt32LargeSet/100               2833 ns         2833 ns       234016 bytes_per_second=13.4639M/s items_per_second=3.52948M/s
   IsInInt32LargeSet/1000             10376 ns        10376 ns        67001 bytes_per_second=3.6765M/s items_per_second=963.773k/s
   IsInInt32LargeSet/10000           148294 ns       148284 ns         4769 bytes_per_second=263.43k/s items_per_second=67.4381k/s
   IsInInt32LargeSet/100000         4463505 ns      4463293 ns          147 bytes_per_second=8.75195k/s items_per_second=2.2405k/s
   IsInInt32LargeSet/1000000       75740587 ns     75738687 ns           10 bytes_per_second=528.132/s items_per_second=132.033/s
   IsInInt32LargeSet/10000000    1317214434 ns   1317158744 ns            1 bytes_per_second=30.3684/s items_per_second=7.5921/s
   ```
   
   After
   ```
   ----------------------------------------------------------------------------------------
   Benchmark                              Time             CPU   Iterations UserCounters...
   ----------------------------------------------------------------------------------------
   IndexInInt32LargeSet/100            2659 ns         2658 ns       266800 bytes_per_second=14.3493M/s items_per_second=3.76159M/s
   IndexInInt32LargeSet/1000           9759 ns         9759 ns        70169 bytes_per_second=3.9091M/s items_per_second=1024.75k/s
   IndexInInt32LargeSet/10000         93311 ns        93309 ns         7477 bytes_per_second=418.634k/s items_per_second=107.17k/s
   IndexInInt32LargeSet/100000      1729160 ns      1729162 ns          453 bytes_per_second=22.5904k/s items_per_second=5.78315k/s
   IndexInInt32LargeSet/1000000    79806851 ns     79805041 ns            9 bytes_per_second=501.221/s items_per_second=125.305/s
   IndexInInt32LargeSet/10000000 1112193163 ns   1112056579 ns            1 bytes_per_second=35.9694/s items_per_second=8.99235/s
   IsInInt32LargeSet/100               2518 ns         2518 ns       280148 bytes_per_second=15.1526M/s items_per_second=3.97215M/s
   IsInInt32LargeSet/1000              9625 ns         9624 ns        72600 bytes_per_second=3.96356M/s items_per_second=1039.02k/s
   IsInInt32LargeSet/10000            93759 ns        93754 ns         7256 bytes_per_second=416.647k/s items_per_second=106.662k/s
   IsInInt32LargeSet/100000         1608221 ns      1608152 ns          495 bytes_per_second=24.2903k/s items_per_second=6.21832k/s
   IsInInt32LargeSet/1000000       73113133 ns     73112833 ns            9 bytes_per_second=547.1/s items_per_second=136.775/s
   IsInInt32LargeSet/10000000    1207785785 ns   1207731078 ns            1 bytes_per_second=33.12/s items_per_second=8.27999/s
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] js8544 commented on a diff in pull request #36067: GH-36059: [C++][Compute] Reserve space for hashtable for scalar lookup functions

Posted by "js8544 (via GitHub)" <gi...@apache.org>.
js8544 commented on code in PR #36067:
URL: https://github.com/apache/arrow/pull/36067#discussion_r1236275360


##########
cpp/src/arrow/compute/kernels/scalar_set_lookup_benchmark.cc:
##########
@@ -91,35 +91,51 @@ static void IsInStringLargeSet(benchmark::State& state) {
 }
 
 static void IndexInInt8SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int8Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int8Type>(state, "index_in_meta_binary", state.range(0),
+                                      1 << 18);
 }
 
 static void IndexInInt16SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int16Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int16Type>(state, "index_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IndexInInt32SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int32Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int32Type>(state, "index_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IndexInInt64SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int64Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int64Type>(state, "index_in_meta_binary", state.range(0),
+                                       1 << 18);
+}
+
+static void IndexInInt32LargeSet(benchmark::State& state) {
+  SetLookupBenchmarkNumeric<Int32Type>(state, "index_in_meta_binary", state.range(0), 10);
 }
 
 static void IsInInt8SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int8Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int8Type>(state, "is_in_meta_binary", state.range(0),
+                                      1 << 18);
 }
 
 static void IsInInt16SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int16Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int16Type>(state, "is_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IsInInt32SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int32Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int32Type>(state, "is_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IsInInt64SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int64Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int64Type>(state, "is_in_meta_binary", state.range(0),
+                                       1 << 18);
+}
+
+static void IsInInt32LargeSet(benchmark::State& state) {
+  SetLookupBenchmarkNumeric<Int32Type>(state, "is_in_meta_binary", state.range(0), 10);

Review Comment:
   If query length is `1 << 18`, the execution time would be dominated by the "lookup" phase and the optimization I implemented would have very little effect.  The idea is that the new `LargeSet` benchmarks are to measure the performance of the "preparation" phase, while the existing `SmallSet` benchmarks are to measure the "lookup" phase.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36067: GH-36059: [C++][Compute] Reserve space for hashtable for scalar lookup functions

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36067:
URL: https://github.com/apache/arrow/pull/36067#discussion_r1238359428


##########
cpp/src/arrow/compute/kernels/scalar_set_lookup.cc:
##########
@@ -36,16 +36,23 @@ namespace {
 
 template <typename Type>
 struct SetLookupState : public KernelState {
-  explicit SetLookupState(MemoryPool* pool) : lookup_table(pool, 0) {}
+  explicit SetLookupState(MemoryPool* pool) : lookup_table(pool, 0), memory_pool(pool) {}

Review Comment:
   Ok, but this makes an initial memory allocation for the lookup table which is immediately undone in `Init`.
   
   A possible way around that is to make `lookup_table` a `std::optional<MemoTable>`, to allow deferred initialization.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] js8544 commented on a diff in pull request #36067: GH-36059: [C++][Compute] Reserve space for hashtable for scalar lookup functions

Posted by "js8544 (via GitHub)" <gi...@apache.org>.
js8544 commented on code in PR #36067:
URL: https://github.com/apache/arrow/pull/36067#discussion_r1238609030


##########
cpp/src/arrow/compute/kernels/scalar_set_lookup_benchmark.cc:
##########
@@ -91,35 +91,51 @@ static void IsInStringLargeSet(benchmark::State& state) {
 }
 
 static void IndexInInt8SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int8Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int8Type>(state, "index_in_meta_binary", state.range(0),
+                                      1 << 18);
 }
 
 static void IndexInInt16SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int16Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int16Type>(state, "index_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IndexInInt32SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int32Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int32Type>(state, "index_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IndexInInt64SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int64Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int64Type>(state, "index_in_meta_binary", state.range(0),
+                                       1 << 18);
+}
+
+static void IndexInInt32LargeSet(benchmark::State& state) {
+  SetLookupBenchmarkNumeric<Int32Type>(state, "index_in_meta_binary", state.range(0), 10);
 }
 
 static void IsInInt8SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int8Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int8Type>(state, "is_in_meta_binary", state.range(0),
+                                      1 << 18);
 }
 
 static void IsInInt16SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int16Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int16Type>(state, "is_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IsInInt32SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int32Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int32Type>(state, "is_in_meta_binary", state.range(0),
+                                       1 << 18);
 }
 
 static void IsInInt64SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int64Type>(state, "is_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int64Type>(state, "is_in_meta_binary", state.range(0),
+                                       1 << 18);
+}
+
+static void IsInInt32LargeSet(benchmark::State& state) {
+  SetLookupBenchmarkNumeric<Int32Type>(state, "is_in_meta_binary", state.range(0), 10);

Review Comment:
   Ok, how about 1000?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] pitrou commented on a diff in pull request #36067: GH-36059: [C++][Compute] Reserve space for hashtable for scalar lookup functions

Posted by "pitrou (via GitHub)" <gi...@apache.org>.
pitrou commented on code in PR #36067:
URL: https://github.com/apache/arrow/pull/36067#discussion_r1238644123


##########
cpp/src/arrow/compute/kernels/scalar_set_lookup_benchmark.cc:
##########
@@ -91,35 +91,52 @@ static void IsInStringLargeSet(benchmark::State& state) {
 }
 
 static void IndexInInt8SmallSet(benchmark::State& state) {
-  SetLookupBenchmarkNumeric<Int8Type>(state, "index_in_meta_binary", state.range(0));
+  SetLookupBenchmarkNumeric<Int8Type>(state, "index_in_meta_binary", state.range(0),
+                                      1 << 18);

Review Comment:
   Hmm, can we avoid the repeated hardcoded value?
   You could define a constant such as `constexpr int64 kArrayLengthWithSmallSet = 1 << 18`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] github-actions[bot] commented on pull request #36067: GH-36059: [C++][Compute] Reserve space for hashtable for scalar lookup functions

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #36067:
URL: https://github.com/apache/arrow/pull/36067#issuecomment-1591531791

   :warning: GitHub issue #36059 **has been automatically assigned in GitHub** to PR creator.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org