Posted to github@arrow.apache.org by "mustafasrepo (via GitHub)" <gi...@apache.org> on 2023/05/02 13:13:17 UTC

[GitHub] [arrow-datafusion] mustafasrepo opened a new issue, #6190: Utilize PRIMARY KEY information better

mustafasrepo opened a new issue, #6190:
URL: https://github.com/apache/arrow-datafusion/issues/6190

   ### Is your feature request related to a problem or challenge?
   
   Consider the query below:
   ```sql
   SELECT s.sn, s.amount
               FROM sales_global AS s
               GROUP BY sn
   ```
   When `sn` is the `PRIMARY KEY`, each group contains exactly one row, so every column of `sales_global` (aliased as `s`) can be emitted after the aggregation, since each column has a single value per group. PostgreSQL accepts this query. DataFusion, however, can only emit `s.sn` from the original table after the aggregation.
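   
   As a rough illustration of the rule being requested, the sketch below checks whether a set of selected columns could be emitted after a `GROUP BY`. It is not DataFusion code: the function name `can_emit_after_group_by` and the plain string column names are hypothetical, and it assumes a single table whose primary key (possibly composite) is known.
   
   ```rust
   use std::collections::HashSet;

   /// Hypothetical helper, not part of DataFusion.
   /// Returns true when every selected column may be emitted after grouping:
   /// either each one is itself a GROUP BY column, or the GROUP BY columns
   /// contain the whole primary key, so each group holds exactly one row.
   fn can_emit_after_group_by(
       primary_key: &[&str],
       group_by: &[&str],
       selected: &[&str],
   ) -> bool {
       let group: HashSet<&str> = group_by.iter().copied().collect();
       // An empty key means no uniqueness constraint is known.
       let key_covered = !primary_key.is_empty()
           && primary_key.iter().all(|c| group.contains(c));
       key_covered || selected.iter().all(|c| group.contains(c))
   }
   ```
   
   For the query above, `can_emit_after_group_by(&["sn"], &["sn"], &["sn", "amount"])` would be true, while the same call with an empty primary key would be false, which matches the current behavior described here.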
   
   ### Describe the solution you'd like
   
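   I would like DataFusion to support this: when the `GROUP BY` columns include the primary key of a table, the other columns of that table should be selectable without being aggregated.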
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   To reproduce the problem, one can use the test below:
   ```rust
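   use datafusion::arrow::util::pretty::print_batches;
   use datafusion::error::Result;
   use datafusion::physical_plan::collect;
   use datafusion::prelude::*;

   #[tokio::test]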
   async fn test_primary_key_aggregation() -> Result<()> {
       let config = SessionConfig::new()
           .with_target_partitions(1);
       let ctx = SessionContext::with_config(config);
       ctx.sql("CREATE TABLE sales_global (
         sn INT PRIMARY KEY,
         ts TIMESTAMP,
         currency VARCHAR(3),
         amount INT
       ) as VALUES
         (1, '2022-01-01 08:00:00'::timestamp, 'EUR', 50.00),
         (2, '2022-01-01 11:30:00'::timestamp, 'EUR', 75.00),
         (3, '2022-01-02 12:00:00'::timestamp, 'EUR', 200.00),
         (4, '2022-01-03 10:00:00'::timestamp, 'EUR', 100.00)").await?;
       let sql = "SELECT s.sn, s.amount
           FROM sales_global AS s
           GROUP BY sn";
   
       let msg = format!("Creating logical plan for '{sql}'");
       let dataframe: DataFrame = ctx.sql(sql).await.expect(&msg);
       let physical_plan = dataframe.create_physical_plan().await?;
       let batches = collect(physical_plan, ctx.task_ctx()).await?;
       print_batches(&batches)?;
       Ok(())
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb closed issue #6190: Utilize PRIMARY KEY information better

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb closed issue #6190: Utilize PRIMARY KEY information better
URL: https://github.com/apache/arrow-datafusion/issues/6190



[GitHub] [arrow-datafusion] mustafasrepo commented on issue #6190: Utilize PRIMARY KEY information better

Posted by "mustafasrepo (via GitHub)" <gi...@apache.org>.

mustafasrepo commented on issue #6190:
URL: https://github.com/apache/arrow-datafusion/issues/6190#issuecomment-1532571673

   Thanks @alamb, I will explore how we can use the distinct count statistic to support this feature. Thanks for the suggestion.



[GitHub] [arrow-datafusion] alamb commented on issue #6190: Utilize PRIMARY KEY information better

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #6190:
URL: https://github.com/apache/arrow-datafusion/issues/6190#issuecomment-1532106529

   I think primary key could be modeled by the existing "distinct count" in the statistics: https://docs.rs/datafusion/latest/datafusion/physical_plan/struct.ColumnStatistics.html
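   
   A minimal sketch of that idea, under stated assumptions: the `ColumnStats` struct below is a simplified, hypothetical stand-in rather than the real `ColumnStatistics` type, and `column_is_unique` is an illustrative name. The sketch treats a column as key-like only when the statistics are exact, the distinct count equals the row count, and there are no nulls.
   
   ```rust
   /// Simplified, hypothetical stand-in for per-column statistics.
   /// It is not the actual DataFusion type.
   struct ColumnStats {
       distinct_count: Option<usize>,
       null_count: Option<usize>,
   }

   /// Returns true only when the statistics prove that every row has a
   /// distinct, non-null value in this column. Missing or inexact
   /// statistics are treated as "unknown", so the answer is false.
   fn column_is_unique(num_rows: Option<usize>, is_exact: bool, col: &ColumnStats) -> bool {
       match (num_rows, col.distinct_count, col.null_count) {
           (Some(rows), Some(distinct), Some(0)) => is_exact && distinct == rows,
           _ => false,
       }
   }
   ```
   
   Requiring exact statistics matters here: an estimated distinct count could equal the row count by coincidence, and relying on it would make the rewritten query return wrong results.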
   
   



[GitHub] [arrow-datafusion] alamb commented on issue #6190: Utilize PRIMARY KEY information better

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #6190:
URL: https://github.com/apache/arrow-datafusion/issues/6190#issuecomment-1532107309

   I am not sure that datafusion itself should have any notion of "primary key" per se -- that seems like something that would be engine specific

