You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "haohuaijin (via GitHub)" <gi...@apache.org> on 2023/12/01 02:33:43 UTC

[I] windows function don't support `partition by null` and `order by null` and cause error [arrow-datafusion]

haohuaijin opened a new issue, #8386:
URL: https://github.com/apache/arrow-datafusion/issues/8386

   ### Describe the bug
   
   During work on #8371, @comphead proposed that we can test `partition by null` and `order by null`
   
   I test this and find  `partition by null` and `order by null` will fail in all windows function(maybe, I'm not test all)
   ```
   ❯ CREATE TABLE t1 (a int) AS VALUES (1), (2), (3);
   0 rows in set. Query took 0.018 seconds.
   
   ❯ select rank() over (partition by null) from t1;
   thread 'tokio-runtime-worker' panicked at /home/hhj/datafusion/datafusion/physical-plan/src/repartition/mod.rs:208:79:
   called `Result::unwrap()` on an `Err` value: InvalidArgumentError("must either specify a row count or at least one column")
   Execution error: Expects PARTITION BY expression to be ordered
   
   ❯ select rank() over (order by null) from t1;
   type_coercion
   caused by
   Internal error: Cannot run range queries on datatype: Null.
   This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
   
   ❯ select sum(a) over (order by null) from t1;
   type_coercion
   caused by
   Internal error: Cannot run range queries on datatype: Null.
   This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
   
   ❯ select sum(a) over (partition by null) from t1;
   Internal error: All partition by columns should have an ordering.
   This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
   ```
   
   ### To Reproduce
   
   _No response_
   
   ### Expected behavior
   
   correct work like duckdb
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [BUG]: windows frame don't support scalars in ORDER BY/PARTITION BY clauses [arrow-datafusion]

Posted by "comphead (via GitHub)" <gi...@apache.org>.

comphead commented on issue #8386:
URL: https://github.com/apache/arrow-datafusion/issues/8386#issuecomment-1858516235

   Thank you guys, great work to close multiple issues related to this corner case, I created another PR for `PARTITION BY` which has been failed before, and there likely will be followup for RANK function which somehow works not correctly


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [BUG]: windows frame don't support scalars in ORDER BY/PARTITION BY clauses [arrow-datafusion]

Posted by "comphead (via GitHub)" <gi...@apache.org>.

comphead commented on issue #8386:
URL: https://github.com/apache/arrow-datafusion/issues/8386#issuecomment-1856695669

   Moving discussion from PR to the ticket
   
   @mustafasrepo  can I have your input please. The reason is the query fails
   
   ```
   select rank() over (order by 1) rnk from (select 1 a union all select 2 a) x
   Arrow error: Invalid argument error: number of columns(2) must match number of fields(1) in schema
   ```
   The reason for that is optimize_projections removes the unused fields and this breaks the consistency. The question for you, what do you think is expected behavior for such rare case?
   
   PS. If add a column name to projection or to ORDER BY it will expectedly work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [BUG]: windows frame don't support scalars in ORDER BY/PARTITION BY clauses [arrow-datafusion]

Posted by "mustafasrepo (via GitHub)" <gi...@apache.org>.

mustafasrepo closed issue #8386: [BUG]: windows frame don't support scalars in ORDER BY/PARTITION BY clauses
URL: https://github.com/apache/arrow-datafusion/issues/8386


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [BUG]: windows frame don't support scalars in ORDER BY/PARTITION BY clauses [arrow-datafusion]

Posted by "mustafasrepo (via GitHub)" <gi...@apache.org>.

mustafasrepo commented on issue #8386:
URL: https://github.com/apache/arrow-datafusion/issues/8386#issuecomment-1857399233

   > Moving discussion from PR to the ticket
   > 
   > @mustafasrepo can I have your input please. The reason is the query fails
   > 
   > ```
   > select rank() over (order by 1) rnk from (select 1 a union all select 2 a) x
   > Arrow error: Invalid argument error: number of columns(2) must match number of fields(1) in schema
   > ```
   > 
   > The reason for that is optimize_projections removes the unused fields and this breaks the consistency. The question for you, what do you think is expected behavior for such rare case?
   > 
   > PS. If add a column name to projection or to ORDER BY it will expectedly work.
   
   I expect this query to work. I understood the root cause of this problem (explained in the PR body). This [PR](https://github.com/apache/arrow-datafusion/pull/8553) solves this issue. Thanks @comphead for discovering this bug.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] windows function don't support `partition by null` and `order by null` and cause error [arrow-datafusion]

Posted by "comphead (via GitHub)" <gi...@apache.org>.

comphead commented on issue #8386:
URL: https://github.com/apache/arrow-datafusion/issues/8386#issuecomment-1836855380

   its not only null problem, DF cannot handle scalars as well
   
   ```
   ❯ select a, rank() over (partition by 1 order by 1) from (select 1 a union all select 2 a);
   Execution error: Sort operation is not applicable to scalar value 1
   ```
   For PG this query is valid. I'll take a look


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [BUG]: windows frame don't support scalars in ORDER BY/PARTITION BY clauses [arrow-datafusion]

Posted by "mustafasrepo (via GitHub)" <gi...@apache.org>.

mustafasrepo commented on issue #8386:
URL: https://github.com/apache/arrow-datafusion/issues/8386#issuecomment-1857458176

   I want to ask a question regarding the scalars inside `PARTITION BY` AND `ORDER BY` clauses. Maybe someone familiar with the standard can answer this. 
   
   Is there any difference between results of `RANK() OVER()` AND `RANK() OVER(PARTITION BY <literal>, ORDER BY <literal>)`. If there is no difference maybe we can rewrite expression `RANK() OVER(PARTITION BY <literal>, ORDER BY <literal>)` into `RANK() OVER()`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [BUG]: windows frame don't support scalars in ORDER BY/PARTITION BY clauses [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #8386:
URL: https://github.com/apache/arrow-datafusion/issues/8386#issuecomment-1857772359

   > Is there any difference between results of RANK() OVER() AND RANK() OVER(PARTITION BY <literal>, ORDER BY <literal>). If there is no difference maybe we can rewrite expression RANK() OVER(PARTITION BY <literal>, ORDER BY <literal>) into RANK() OVER().
   
   I do not know of any difference in these queries
   
   Rewriting the query sounds like an elegant solution to me.
   
   I suggest we do the rewrite  in the analyzer pass (rather than SQL planner) so such queries can also be run via the DataFrame API


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] windows function don't support `partition by null` and `order by null` and cause error [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #8386:
URL: https://github.com/apache/arrow-datafusion/issues/8386#issuecomment-1836737533

   This is not yet fixed (I had hoped it would be in https://github.com/apache/arrow-datafusion/pull/8371)
   
   
   
   ```
   ❯  CREATE TABLE t1 (a int) AS VALUES (1), (2), (3);
   0 rows in set. Query took 0.008 seconds.
   
   ❯ select rank() over (order by null) from t1;
   type_coercion
   caused by
   Internal error: Cannot run range queries on datatype: Null.
   This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org