You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@arrow.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2021/04/07 12:20:00 UTC

[jira] [Updated] (ARROW-12235) [Rust][DataFusion] LIMIT returns incorrect results when used with several small partitions

     [ https://issues.apache.org/jira/browse/ARROW-12235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-12235:
-----------------------------------
    Labels: pull-request-available  (was: )

> [Rust][DataFusion] LIMIT returns incorrect results when used with several small partitions
> ------------------------------------------------------------------------------------------
>
>                 Key: ARROW-12235
>                 URL: https://issues.apache.org/jira/browse/ARROW-12235
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Rust - DataFusion
>            Reporter: Andrew Lamb
>            Assignee: Andrew Lamb
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I noticed when I was running some queries locally that `LIMIT` was not behaving correctly. For my case, a query with `LIMIT 10` was always returning zero rows. 
> I spent some time and I have found a self contained reproducer. If you put the following test in `rust/src/datafusion/execution/context.rs` it will fail.
> {code}
>     /// Return a RecordBatch with a single Int32 array with values (0..sz)
>     fn make_partition(sz: i32) -> RecordBatch {
>         let seq_start = 0;
>         let seq_end =  sz;
>         let values = (seq_start..seq_end).collect::<Vec<_>>();
>         let schema = Arc::new(Schema::new(vec![Field::new("i", DataType::Int32, true)]));
>         let arr = Arc::new(Int32Array::from(values));
>         let arr = arr as ArrayRef;
>         RecordBatch::try_new(schema.clone(),vec![arr]).unwrap()
>     }
>     #[tokio::test]
>     async fn limit_multi_partitions() -> Result<()> {
>         let tmp_dir = TempDir::new()?;
>         let mut ctx = create_ctx(&tmp_dir, 1)?;
>         let partitions = vec![
>             vec![make_partition(0)],
>             vec![make_partition(1)],
>             vec![make_partition(2)],
>             vec![make_partition(3)],
>             vec![make_partition(4)],
>             vec![make_partition(5)],
>         ];
>         let schema = partitions[0][0].schema();
>         let provider = Arc::new(MemTable::try_new(schema, partitions).unwrap());
>         ctx.register_table("t", provider)
>             .unwrap();
>         // select all rows
>         let results = plan_and_collect(&mut ctx, "SELECT i FROM t")
>             .await
>             .unwrap();
>         let num_rows: usize = results.into_iter().map(|b| b.num_rows()).sum();
>         assert_eq!(num_rows, 15);
>         for limit in 1..10 {
>             let query = format!("SELECT i FROM t limit {}", limit);
>             let results = plan_and_collect(&mut ctx, &query)
>                 .await
>                 .unwrap();
>             let num_rows: usize = results.into_iter().map(|b| b.num_rows()).sum();
>             assert_eq!(num_rows, limit, "mismatch with query {}", query);
>         }
>         Ok(())
>     }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)