You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "Tangruilin (via GitHub)" <gi...@apache.org> on 2023/10/08 12:23:53 UTC

[PR] [test] add fuzz test for topk [arrow-datafusion]

Tangruilin opened a new pull request, #7772:
URL: https://github.com/apache/arrow-datafusion/pull/7772

   ## Which issue does this PR close?
   Closes #7749 .
   
   ## Rationale for this change
   Add fuzz test for topk
   
   ## What changes are included in this PR?
   1. This PR add i32 type fuzz test for topk;
   2. This PR add fn to get f64 and string batched but now not use(for the fn batches_to_vec can only work for i32)
   
   TODO: This PR is not completed. I need some suggestions that if i need to modify `fn batches_to_vec`
   
   ## Are these changes tested?
   this PR is a test
   
   ## Are there any user-facing changes?
   no


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [test] add fuzz test for topk [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #7772:
URL: https://github.com/apache/arrow-datafusion/pull/7772#issuecomment-1755937388

   > I run ci/scripts/rust_clippy.sh on my mac but get
   
   
   You may have to do:
   1. `rustup update` to get the latest rust version
   2. merge up from `origin/main` to get the latest DataFusion code


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [test] add fuzz test for topk [arrow-datafusion]

Posted by "Tangruilin (via GitHub)" <gi...@apache.org>.
Tangruilin commented on code in PR #7772:
URL: https://github.com/apache/arrow-datafusion/pull/7772#discussion_r1352910050


##########
datafusion/core/tests/fuzz_cases/sort_fuzz.rs:
##########
@@ -22,89 +22,101 @@ use arrow::{
     compute::SortOptions,
     record_batch::RecordBatch,
 };
-use datafusion::execution::runtime_env::{RuntimeConfig, RuntimeEnv};
-use datafusion::physical_plan::expressions::{col, PhysicalSortExpr};
+use arrow_array::{Float64Array, StringArray};
+use core::iter;
+use datafusion::physical_plan::expressions::PhysicalSortExpr;
 use datafusion::physical_plan::memory::MemoryExec;
 use datafusion::physical_plan::sorts::sort::SortExec;
 use datafusion::physical_plan::{collect, ExecutionPlan};
 use datafusion::prelude::{SessionConfig, SessionContext};
+use datafusion::{
+    datasource::MemTable,
+    execution::runtime_env::{RuntimeConfig, RuntimeEnv},
+};
+use datafusion_common::{
+    cast::{as_float64_array, as_string_array},
+    TableReference,
+};
 use datafusion_execution::memory_pool::GreedyMemoryPool;
-use rand::Rng;
+use datafusion_physical_expr::expressions::col;
+use rand::{rngs::StdRng, Rng, SeedableRng};
 use std::sync::Arc;
-use test_utils::{batches_to_vec, partitions_to_sorted_vec};
+use test_utils::{batches_to_vec, partitions_to_sorted_vec, stagger_batch};
 
 #[tokio::test]
 #[cfg_attr(tarpaulin, ignore)]
 async fn test_sort_1k_mem() {
-    SortTest::new()
-        .with_int32_batches(5)
-        .with_pool_size(10240)
-        .with_should_spill(false)
-        .run()
-        .await;
-
-    SortTest::new()
-        .with_int32_batches(20000)
-        .with_pool_size(10240)
-        .with_should_spill(true)
-        .run()
-        .await;
-
-    SortTest::new()
-        .with_int32_batches(1000000)
-        .with_pool_size(10240)
-        .with_should_spill(true)
-        .run()
-        .await;
+    for (batch_size, should_spill) in [(5, false), (20000, true), (1000000, true)] {
+        SortTest::new()
+            .with_int32_batches(batch_size)
+            .with_pool_size(10240)
+            .with_should_spill(should_spill)
+            .run()
+            .await;
+    }
 }
 
 #[tokio::test]
 #[cfg_attr(tarpaulin, ignore)]
 async fn test_sort_100k_mem() {
-    SortTest::new()
-        .with_int32_batches(5)
-        .with_pool_size(102400)
-        .with_should_spill(false)
-        .run()
-        .await;
-
-    SortTest::new()
-        .with_int32_batches(20000)
-        .with_pool_size(102400)
-        .with_should_spill(false)
-        .run()
-        .await;
-
-    SortTest::new()
-        .with_int32_batches(1000000)
-        .with_pool_size(102400)
-        .with_should_spill(true)
-        .run()
-        .await;
+    for (batch_size, should_spill) in [(5, false), (20000, false), (1000000, true)] {
+        SortTest::new()
+            .with_int32_batches(batch_size)
+            .with_pool_size(102400)
+            .with_should_spill(should_spill)
+            .run()
+            .await;
+    }
 }
 
 #[tokio::test]
 async fn test_sort_unlimited_mem() {
-    SortTest::new()
-        .with_int32_batches(5)
-        .with_pool_size(usize::MAX)
-        .with_should_spill(false)
-        .run()
-        .await;
-
-    SortTest::new()
-        .with_int32_batches(20000)
-        .with_pool_size(usize::MAX)
-        .with_should_spill(false)
-        .run()
-        .await;
-
-    SortTest::new()
-        .with_int32_batches(1000000)
-        .with_pool_size(usize::MAX)
-        .with_should_spill(false)
-        .run()
-        .await;
+    for (batch_size, should_spill) in [(5, false), (20000, false), (1000000, false)] {
+        SortTest::new()
+            .with_int32_batches(batch_size)
+            .with_pool_size(usize::MAX)
+            .with_should_spill(should_spill)
+            .run()
+            .await;
+    }
+}
+
+#[tokio::test]
+async fn test_sort_topk_i32() {

Review Comment:
   A good suggestion! I'll do it tomorrow and push it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [test] add fuzz test for topk [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #7772:
URL: https://github.com/apache/arrow-datafusion/pull/7772#issuecomment-1765442760

   Thanks @Tangruilin  -- I plan to review this more carefully tomorrow


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [test] add fuzz test for topk [arrow-datafusion]

Posted by "Tangruilin (via GitHub)" <gi...@apache.org>.
Tangruilin commented on PR #7772:
URL: https://github.com/apache/arrow-datafusion/pull/7772#issuecomment-1769946299

   > Thanks @Tangruilin -- I plan to review this more carefully tomorrow
   
   Don't forget this~~~


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [test] add fuzz test for topk [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #7772:
URL: https://github.com/apache/arrow-datafusion/pull/7772#issuecomment-1761671342

   > except that. I found that the rust should be stable —— run rustup update && rust default stable then it is solved. Are there docs for this. If not, maybe I can add it.
   
   I think it is implicit in the [Testing setup:](https://arrow.apache.org/datafusion/contributor-guide/index.html#bootstrap-environment) part:
   
   > rustup update stable DataFusion uses the latest stable release of rust
   
   
   I agree this could be clearer in the documentation. Perhaps you can help here
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [test] add fuzz test for topk [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #7772:
URL: https://github.com/apache/arrow-datafusion/pull/7772


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [test] add fuzz test for topk [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #7772:
URL: https://github.com/apache/arrow-datafusion/pull/7772#issuecomment-1755755070

   Mark as draft to signify this PR is not waiting on feedback anymore


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [test] add fuzz test for topk [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #7772:
URL: https://github.com/apache/arrow-datafusion/pull/7772#issuecomment-1770504401

   I am sorry 😢  -- I have not forgotten, but I am backed up on reviews:
   
   <img width="728" alt="Screenshot 2023-10-19 at 6 14 21 AM" src="https://github.com/apache/arrow-datafusion/assets/490673/d9bfa9a3-ffd4-4e20-8f9a-df3fca752ca2">
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [test] add fuzz test for topk [arrow-datafusion]

Posted by "Tangruilin (via GitHub)" <gi...@apache.org>.
Tangruilin commented on PR #7772:
URL: https://github.com/apache/arrow-datafusion/pull/7772#issuecomment-1753294416

   > @alamb Please review it.
   
   Thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [test] add fuzz test for topk [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #7772:
URL: https://github.com/apache/arrow-datafusion/pull/7772#issuecomment-1773893623

   BTW I am working on an extesion of this test to support multiple columns


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [test] add fuzz test for topk [arrow-datafusion]

Posted by "Tangruilin (via GitHub)" <gi...@apache.org>.
Tangruilin commented on PR #7772:
URL: https://github.com/apache/arrow-datafusion/pull/7772#issuecomment-1759887266

   > > I run ci/scripts/rust_clippy.sh on my mac but get
   > 
   > You may have to do:
   > 
   > 1. `rustup update` to get the latest rust version
   > 2. merge up from `origin/main` to get the latest DataFusion code
   
   except that. I found that the rust should be stable —— run `rustup update && rust default stable` then it is solved. Are there docs for this. If not, maybe I can add it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [test] add fuzz test for topk [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #7772:
URL: https://github.com/apache/arrow-datafusion/pull/7772#issuecomment-1753923567

   Thank you @Tangruilin  -- I plan to review this tomorrow. I look forward to it!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [test] add fuzz test for topk [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #7772:
URL: https://github.com/apache/arrow-datafusion/pull/7772#discussion_r1367715835


##########
datafusion/core/tests/fuzz_cases/sort_fuzz.rs:
##########
@@ -138,6 +154,44 @@ impl SortTest {
         self
     }
 
+    async fn run_with_limit<'a>(
+        &self,
+        topk_scenario: &TopKScenario<'a>,
+    ) -> Vec<RecordBatch> {
+        let input = self.input.clone();
+        let schema = input
+            .iter()
+            .flat_map(|p| p.iter())
+            .next()
+            .expect("at least one batch")
+            .schema();
+
+        let table = MemTable::try_new(schema, input.clone()).unwrap();
+
+        let ctx = SessionContext::new();
+
+        ctx.register_table(
+            TableReference::Bare {
+                table: topk_scenario.table_name.into(),
+            },
+            Arc::new(table),
+        )
+        .unwrap();
+
+        let df = ctx
+            .table(topk_scenario.table_name)
+            .await
+            .unwrap()
+            .sort(vec![
+                datafusion_expr::col(topk_scenario.col_name).sort(true, true)
+            ])
+            .unwrap()
+            .limit(0, Some(topk_scenario.limit))

Review Comment:
   I verified the plan here has the expected `TopK` element:
   
   ```
   | physical_plan                                              | GlobalLimitExec: skip=0, fetch=10                    |
   |                                                            |   SortExec: TopK(fetch=10), expr=[x@0 ASC]           |
   |                                                            |     MemoryExec: partitions=1, partition_sizes=[38]   |
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [test] add fuzz test for topk [arrow-datafusion]

Posted by "Tangruilin (via GitHub)" <gi...@apache.org>.
Tangruilin commented on PR #7772:
URL: https://github.com/apache/arrow-datafusion/pull/7772#issuecomment-1755896754

   I run ci/scripts/rust_clippy.sh on my mac but get 
   <img width="1392" alt="image" src="https://github.com/apache/arrow-datafusion/assets/23651891/514847b4-9972-415c-9a08-2e4c376470b4">
   but the ci is 
   <img width="1007" alt="image" src="https://github.com/apache/arrow-datafusion/assets/23651891/a2b6db59-af5a-4cf5-80b5-ec16df83177d">
   
   the result is not some, I'm confused @alamb 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [test] add fuzz test for topk [arrow-datafusion]

Posted by "Weijun-H (via GitHub)" <gi...@apache.org>.
Weijun-H commented on code in PR #7772:
URL: https://github.com/apache/arrow-datafusion/pull/7772#discussion_r1352844382


##########
datafusion/core/tests/fuzz_cases/sort_fuzz.rs:
##########
@@ -22,89 +22,101 @@ use arrow::{
     compute::SortOptions,
     record_batch::RecordBatch,
 };
-use datafusion::execution::runtime_env::{RuntimeConfig, RuntimeEnv};
-use datafusion::physical_plan::expressions::{col, PhysicalSortExpr};
+use arrow_array::{Float64Array, StringArray};
+use core::iter;
+use datafusion::physical_plan::expressions::PhysicalSortExpr;
 use datafusion::physical_plan::memory::MemoryExec;
 use datafusion::physical_plan::sorts::sort::SortExec;
 use datafusion::physical_plan::{collect, ExecutionPlan};
 use datafusion::prelude::{SessionConfig, SessionContext};
+use datafusion::{
+    datasource::MemTable,
+    execution::runtime_env::{RuntimeConfig, RuntimeEnv},
+};
+use datafusion_common::{
+    cast::{as_float64_array, as_string_array},
+    TableReference,
+};
 use datafusion_execution::memory_pool::GreedyMemoryPool;
-use rand::Rng;
+use datafusion_physical_expr::expressions::col;
+use rand::{rngs::StdRng, Rng, SeedableRng};
 use std::sync::Arc;
-use test_utils::{batches_to_vec, partitions_to_sorted_vec};
+use test_utils::{batches_to_vec, partitions_to_sorted_vec, stagger_batch};
 
 #[tokio::test]
 #[cfg_attr(tarpaulin, ignore)]
 async fn test_sort_1k_mem() {
-    SortTest::new()
-        .with_int32_batches(5)
-        .with_pool_size(10240)
-        .with_should_spill(false)
-        .run()
-        .await;
-
-    SortTest::new()
-        .with_int32_batches(20000)
-        .with_pool_size(10240)
-        .with_should_spill(true)
-        .run()
-        .await;
-
-    SortTest::new()
-        .with_int32_batches(1000000)
-        .with_pool_size(10240)
-        .with_should_spill(true)
-        .run()
-        .await;
+    for (batch_size, should_spill) in [(5, false), (20000, true), (1000000, true)] {
+        SortTest::new()
+            .with_int32_batches(batch_size)
+            .with_pool_size(10240)

Review Comment:
   ```suggestion
               .with_pool_size(10 * KB)
   ```



##########
datafusion/core/tests/fuzz_cases/sort_fuzz.rs:
##########
@@ -22,89 +22,101 @@ use arrow::{
     compute::SortOptions,
     record_batch::RecordBatch,
 };
-use datafusion::execution::runtime_env::{RuntimeConfig, RuntimeEnv};
-use datafusion::physical_plan::expressions::{col, PhysicalSortExpr};
+use arrow_array::{Float64Array, StringArray};
+use core::iter;
+use datafusion::physical_plan::expressions::PhysicalSortExpr;
 use datafusion::physical_plan::memory::MemoryExec;
 use datafusion::physical_plan::sorts::sort::SortExec;
 use datafusion::physical_plan::{collect, ExecutionPlan};
 use datafusion::prelude::{SessionConfig, SessionContext};
+use datafusion::{
+    datasource::MemTable,
+    execution::runtime_env::{RuntimeConfig, RuntimeEnv},
+};
+use datafusion_common::{
+    cast::{as_float64_array, as_string_array},
+    TableReference,
+};
 use datafusion_execution::memory_pool::GreedyMemoryPool;
-use rand::Rng;
+use datafusion_physical_expr::expressions::col;
+use rand::{rngs::StdRng, Rng, SeedableRng};
 use std::sync::Arc;
-use test_utils::{batches_to_vec, partitions_to_sorted_vec};
+use test_utils::{batches_to_vec, partitions_to_sorted_vec, stagger_batch};
 

Review Comment:
   ```suggestion
   const KB: u64 = 1 << 10;
   ```



##########
datafusion/core/tests/fuzz_cases/sort_fuzz.rs:
##########
@@ -22,89 +22,101 @@ use arrow::{
     compute::SortOptions,
     record_batch::RecordBatch,
 };
-use datafusion::execution::runtime_env::{RuntimeConfig, RuntimeEnv};
-use datafusion::physical_plan::expressions::{col, PhysicalSortExpr};
+use arrow_array::{Float64Array, StringArray};
+use core::iter;
+use datafusion::physical_plan::expressions::PhysicalSortExpr;
 use datafusion::physical_plan::memory::MemoryExec;
 use datafusion::physical_plan::sorts::sort::SortExec;
 use datafusion::physical_plan::{collect, ExecutionPlan};
 use datafusion::prelude::{SessionConfig, SessionContext};
+use datafusion::{
+    datasource::MemTable,
+    execution::runtime_env::{RuntimeConfig, RuntimeEnv},
+};
+use datafusion_common::{
+    cast::{as_float64_array, as_string_array},
+    TableReference,
+};
 use datafusion_execution::memory_pool::GreedyMemoryPool;
-use rand::Rng;
+use datafusion_physical_expr::expressions::col;
+use rand::{rngs::StdRng, Rng, SeedableRng};
 use std::sync::Arc;
-use test_utils::{batches_to_vec, partitions_to_sorted_vec};
+use test_utils::{batches_to_vec, partitions_to_sorted_vec, stagger_batch};
 
 #[tokio::test]
 #[cfg_attr(tarpaulin, ignore)]
 async fn test_sort_1k_mem() {
-    SortTest::new()
-        .with_int32_batches(5)
-        .with_pool_size(10240)
-        .with_should_spill(false)
-        .run()
-        .await;
-
-    SortTest::new()
-        .with_int32_batches(20000)
-        .with_pool_size(10240)
-        .with_should_spill(true)
-        .run()
-        .await;
-
-    SortTest::new()
-        .with_int32_batches(1000000)
-        .with_pool_size(10240)
-        .with_should_spill(true)
-        .run()
-        .await;
+    for (batch_size, should_spill) in [(5, false), (20000, true), (1000000, true)] {
+        SortTest::new()
+            .with_int32_batches(batch_size)
+            .with_pool_size(10240)
+            .with_should_spill(should_spill)
+            .run()
+            .await;
+    }
 }
 
 #[tokio::test]
 #[cfg_attr(tarpaulin, ignore)]
 async fn test_sort_100k_mem() {
-    SortTest::new()
-        .with_int32_batches(5)
-        .with_pool_size(102400)
-        .with_should_spill(false)
-        .run()
-        .await;
-
-    SortTest::new()
-        .with_int32_batches(20000)
-        .with_pool_size(102400)
-        .with_should_spill(false)
-        .run()
-        .await;
-
-    SortTest::new()
-        .with_int32_batches(1000000)
-        .with_pool_size(102400)
-        .with_should_spill(true)
-        .run()
-        .await;
+    for (batch_size, should_spill) in [(5, false), (20000, false), (1000000, true)] {
+        SortTest::new()
+            .with_int32_batches(batch_size)
+            .with_pool_size(102400)

Review Comment:
   ```suggestion
               .with_pool_size(100 * KB)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [test] add fuzz test for topk [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #7772:
URL: https://github.com/apache/arrow-datafusion/pull/7772#discussion_r1352722190


##########
datafusion/core/tests/fuzz_cases/sort_fuzz.rs:
##########
@@ -138,6 +162,44 @@ impl SortTest {
         self
     }
 
+    async fn run_with_params(

Review Comment:
   Perhaps naming this `run_with_limit` would make it easier to understand what is happening



##########
datafusion/core/tests/fuzz_cases/sort_fuzz.rs:
##########
@@ -22,89 +22,101 @@ use arrow::{
     compute::SortOptions,
     record_batch::RecordBatch,
 };
-use datafusion::execution::runtime_env::{RuntimeConfig, RuntimeEnv};
-use datafusion::physical_plan::expressions::{col, PhysicalSortExpr};
+use arrow_array::{Float64Array, StringArray};
+use core::iter;
+use datafusion::physical_plan::expressions::PhysicalSortExpr;
 use datafusion::physical_plan::memory::MemoryExec;
 use datafusion::physical_plan::sorts::sort::SortExec;
 use datafusion::physical_plan::{collect, ExecutionPlan};
 use datafusion::prelude::{SessionConfig, SessionContext};
+use datafusion::{
+    datasource::MemTable,
+    execution::runtime_env::{RuntimeConfig, RuntimeEnv},
+};
+use datafusion_common::{
+    cast::{as_float64_array, as_string_array},
+    TableReference,
+};
 use datafusion_execution::memory_pool::GreedyMemoryPool;
-use rand::Rng;
+use datafusion_physical_expr::expressions::col;
+use rand::{rngs::StdRng, Rng, SeedableRng};
 use std::sync::Arc;
-use test_utils::{batches_to_vec, partitions_to_sorted_vec};
+use test_utils::{batches_to_vec, partitions_to_sorted_vec, stagger_batch};
 
 #[tokio::test]
 #[cfg_attr(tarpaulin, ignore)]
 async fn test_sort_1k_mem() {
-    SortTest::new()
-        .with_int32_batches(5)
-        .with_pool_size(10240)
-        .with_should_spill(false)
-        .run()
-        .await;
-
-    SortTest::new()
-        .with_int32_batches(20000)
-        .with_pool_size(10240)
-        .with_should_spill(true)
-        .run()
-        .await;
-
-    SortTest::new()
-        .with_int32_batches(1000000)
-        .with_pool_size(10240)
-        .with_should_spill(true)
-        .run()
-        .await;
+    for (batch_size, should_spill) in [(5, false), (20000, true), (1000000, true)] {
+        SortTest::new()
+            .with_int32_batches(batch_size)
+            .with_pool_size(10240)
+            .with_should_spill(should_spill)
+            .run()
+            .await;
+    }
 }
 
 #[tokio::test]
 #[cfg_attr(tarpaulin, ignore)]
 async fn test_sort_100k_mem() {
-    SortTest::new()
-        .with_int32_batches(5)
-        .with_pool_size(102400)
-        .with_should_spill(false)
-        .run()
-        .await;
-
-    SortTest::new()
-        .with_int32_batches(20000)
-        .with_pool_size(102400)
-        .with_should_spill(false)
-        .run()
-        .await;
-
-    SortTest::new()
-        .with_int32_batches(1000000)
-        .with_pool_size(102400)
-        .with_should_spill(true)
-        .run()
-        .await;
+    for (batch_size, should_spill) in [(5, false), (20000, false), (1000000, true)] {
+        SortTest::new()
+            .with_int32_batches(batch_size)
+            .with_pool_size(102400)
+            .with_should_spill(should_spill)
+            .run()
+            .await;
+    }
 }
 
 #[tokio::test]
 async fn test_sort_unlimited_mem() {
-    SortTest::new()
-        .with_int32_batches(5)
-        .with_pool_size(usize::MAX)
-        .with_should_spill(false)
-        .run()
-        .await;
-
-    SortTest::new()
-        .with_int32_batches(20000)
-        .with_pool_size(usize::MAX)
-        .with_should_spill(false)
-        .run()
-        .await;
-
-    SortTest::new()
-        .with_int32_batches(1000000)
-        .with_pool_size(usize::MAX)
-        .with_should_spill(false)
-        .run()
-        .await;
+    for (batch_size, should_spill) in [(5, false), (20000, false), (1000000, false)] {
+        SortTest::new()
+            .with_int32_batches(batch_size)
+            .with_pool_size(usize::MAX)
+            .with_should_spill(should_spill)
+            .run()
+            .await;
+    }
+}
+
+#[tokio::test]
+async fn test_sort_topk_i32() {

Review Comment:
   What would you think about encapsulating the limit data and expected value calculation in a structure? 
   
   So this test might look like
   
   ```rust
   let size: usize = ...; // pick a random size
   let scenario = TopKScenario::new()
      // tell the scenario to sort by one column
      .with_sort_column(["i32_column"])
      // specify a limit of 10 rows
       .with_limit(10);
   
   // stagger the batches in the scenario
   scenario.stagger()
   
   let collected = SortTest::new()
     // call Scenario::batches to get the input batches
     .with_input(scenario.batches());
   // run the test
     .run_with_limit("t", scenario.sort_cols(), scenario.limit()).await;
   
   // The scenario handles calculting expected output (as it knows the sort column and limit)
   let expected = scenario.expected()
   let actual =  batches_to_vec(&collected);
   assert_eq!(actual, &expected); 
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [test] add fuzz test for topk [arrow-datafusion]

Posted by "Tangruilin (via GitHub)" <gi...@apache.org>.
Tangruilin commented on PR #7772:
URL: https://github.com/apache/arrow-datafusion/pull/7772#issuecomment-1763879006

   I have tried some ways, but there is not a prefect solution. I will update the PR tonight


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [PR] [test] add fuzz test for topk [arrow-datafusion]

Posted by "Tangruilin (via GitHub)" <gi...@apache.org>.
Tangruilin commented on PR #7772:
URL: https://github.com/apache/arrow-datafusion/pull/7772#issuecomment-1753294100

   @alamb Please review it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org