You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@arrow.apache.org by "milevin (via GitHub)" <gi...@apache.org> on 2023/03/22 18:57:03 UTC

[GitHub] [arrow-datafusion] milevin opened a new issue, #5695: Mismatch between schema and batches on a CREATE TABLE with a windowing query

milevin opened a new issue, #5695:
URL: https://github.com/apache/arrow-datafusion/issues/5695

   ### Describe the bug
   
   This doesn't work but should:
   
   ```
   DataFusion CLI v20.0.0
   ❯ create table temp as with orders as (
     select 1 as o_custkey
   )
   SELECT RANK() OVER (PARTITION BY o_custkey)
   FROM orders;
   Error during planning: Mismatch between schema and batches
   ```
   
   ### To Reproduce
   
   See above
   
   ### Expected behavior
   
   This should not throw the mismatch error
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] milevin commented on issue #5695: Mismatch between schema and batches on a CREATE TABLE with a windowing query

Posted by "milevin (via GitHub)" <gi...@apache.org>.

milevin commented on issue #5695:
URL: https://github.com/apache/arrow-datafusion/issues/5695#issuecomment-1483886761

   In this last case, the issue is with inferring nullability incorrectly.
   
   ```
   Schema {
       fields: [
           Field {
               name: "col2",
               data_type: Utf8,
               nullable: false,
               dict_id: 0,
               dict_is_ordered: false,
               metadata: {},
           },
           Field {
               name: "col4",
               data_type: Utf8,
               nullable: false,
               dict_id: 0,
               dict_is_ordered: false,
               metadata: {},
           },
       ],
       metadata: {},
   }
   
   
   [
       Schema {
           fields: [
               Field {
                   name: "col2",
                   data_type: Utf8,
                   nullable: false,
                   dict_id: 0,
                   dict_is_ordered: false,
                   metadata: {},
               },
               Field {
                   name: "col4",
                   data_type: Utf8,
                   nullable: true,
                   dict_id: 0,
                   dict_is_ordered: false,
                   metadata: {},
               },
           ],
           metadata: {},
       },
   ]
   ```
   
   It incorrectly infers that col4 is non-nullable. It should be nullable. 
   
   Any idea which code is responsible for this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] comphead commented on issue #5695: Mismatch between schema and batches on a CREATE TABLE with a windowing query

Posted by "comphead (via GitHub)" <gi...@apache.org>.

comphead commented on issue #5695:
URL: https://github.com/apache/arrow-datafusion/issues/5695#issuecomment-1482116361

   Yes @ozankabak the problem is broader, for the case above the diff is: 
   ```
   Schema {
       fields: [
           Field {
               name: "count",
               data_type: Int64,
               nullable: true,
               dict_id: 0,
               dict_is_ordered: false,
               metadata: {},
           },
           Field {
               name: "n_regionkey",
               data_type: Int64,
               nullable: true,
               dict_id: 0,
               dict_is_ordered: false,
               metadata: {},
           },
       ],
       metadata: {},
   }
   [arrow-datafusion/datafusion/core/src/datasource/memory.rs:61] &batches.schema() = Schema {
       fields: [
           Field {
               name: "count",
               data_type: Int64,
               nullable: true,
               dict_id: 0,
               dict_is_ordered: false,
               metadata: {},
           },
           Field {
               name: "y",
               data_type: Int64,
               nullable: true,
               dict_id: 0,
               dict_is_ordered: false,
               metadata: {},
           },
       ],
       metadata: {},
   }
   Error during planning: Mismatch between schema and batches
   ```
   
   I'm checking if its related to WITH
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] comphead commented on issue #5695: Mismatch between schema and batches on a CREATE TABLE with a windowing query

Posted by "comphead (via GitHub)" <gi...@apache.org>.

comphead commented on issue #5695:
URL: https://github.com/apache/arrow-datafusion/issues/5695#issuecomment-1480248977

   Actually the problem is the column name mismatches  between schema and the batch
   
   ```
   [datafusion/core/src/datasource/memory.rs:60] &schema = Schema {
       fields: [
           Field {
               name: "RANK() PARTITION BY [orders.o_custkey] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING",
               data_type: UInt64,
               nullable: true,
               dict_id: 0,
               dict_is_ordered: false,
               metadata: {},
           },
       ],
       metadata: {},
   }
   [datafusion/core/src/datasource/memory.rs:61] &batches.schema() = Schema {
       fields: [
           Field {
               name: "RANK()",
               data_type: UInt64,
               nullable: false,
               dict_id: 0,
               dict_is_ordered: false,
               metadata: {},
           },
       ],
       metadata: {},
   }
   thread 'sql::parquet::parquet_query_with_max_min1' panicked at 'called `Result::unwrap()` on an `Err` value: Plan("Mismatch between schema and batches")', datafusion/core/tests/sql/mod.rs:1163:33
   ```
   
   The reason for this I believe in for window function the name is shortened and looks not consistent
   
   ```
   SELECT RANK() OVER (PARTITION BY x order by x) FROM (select 1 as x);
   +--------+
   | RANK() |
   +--------+
   | 1      |
   +--------+
   ```
   
   once you wrap it into the function it stops name shorten
   
   ```
   SELECT abs(RANK() OVER (PARTITION BY x order by x)) FROM (select 1 as x);
   +------------------------------------------------------------------------------------------------------------+
   | abs(RANK() PARTITION BY [x] ORDER BY [x ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) |
   +------------------------------------------------------------------------------------------------------------+
   | 1.0                                                                                                        |
   +------------------------------------------------------------------------------------------------------------+
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] milevin commented on issue #5695: Mismatch between schema and batches on a CREATE TABLE with a windowing query

Posted by "milevin (via GitHub)" <gi...@apache.org>.

milevin commented on issue #5695:
URL: https://github.com/apache/arrow-datafusion/issues/5695#issuecomment-1480264523

   Great, thank you for looking into this! I also discovered a similar workaround (and added it into the Additional context).
   
   We might have observed the same issue outside of windowing functions; I'll see if I can create more repros. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] ozankabak commented on issue #5695: Mismatch between schema and batches on a CREATE TABLE with a windowing query

Posted by "ozankabak (via GitHub)" <gi...@apache.org>.

ozankabak commented on issue #5695:
URL: https://github.com/apache/arrow-datafusion/issues/5695#issuecomment-1482089903

   > Note: no windowing here, and all the columns are aliased
   
   You beat me to it, I looked around and was about to write that this seems to be a more general issue than windowing 🙂 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] milevin commented on issue #5695: Mismatch between schema and batches on a CREATE TABLE with a windowing query

Posted by "milevin (via GitHub)" <gi...@apache.org>.

milevin commented on issue #5695:
URL: https://github.com/apache/arrow-datafusion/issues/5695#issuecomment-1483091625

   Is there a bug in the logic of inferring output column names? Note how in this example, DF is assigning `y` to the second output column -- pretty randomly! -- instead of using the correct `n_regionkey`.
   
   (It is my understanding that in a SELECT with a UNION clause, the output column names are derived from the first SELECT sub-clause; perhaps somebody can dig through the standard to confirm that.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb closed issue #5695: Mismatch between schema and batches on a CREATE TABLE with a windowing query

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb closed issue #5695: Mismatch between schema and batches on a CREATE TABLE with a windowing query 
URL: https://github.com/apache/arrow-datafusion/issues/5695


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #5695: Mismatch between schema and batches on a CREATE TABLE with a windowing query

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #5695:
URL: https://github.com/apache/arrow-datafusion/issues/5695#issuecomment-1481578121

   > The code currently uses alias if its given or shortened the name to prevent huge unreadable names.
   @alamb I'm not sure tbh if we should revert
   
   I don't know the context of why this was done / if we should do something more fancy
   
   Perhaps @ozankabak  and @mustafasrepo have some input into this
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] milevin commented on issue #5695: Mismatch between schema and batches on a CREATE TABLE with a windowing query

Posted by "milevin (via GitHub)" <gi...@apache.org>.

milevin commented on issue #5695:
URL: https://github.com/apache/arrow-datafusion/issues/5695#issuecomment-1482086910

   @comphead Here is another instance
   
   ```
   ❯ create table temp as WITH w1 AS (select 1 as x , max(10) as y), w2 AS (select 5 as n_regionkey)
   select count(*) count, n_regionkey from w2 group by n_regionkey
   union all
   select x, y from w1 order by n_regionkey, count desc;
   Error during planning: Mismatch between schema and batches
   ```
   
   Note: no windowing here, and all the columns are aliased


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] comphead commented on issue #5695: Mismatch between schema and batches on a CREATE TABLE with a windowing query

Posted by "comphead (via GitHub)" <gi...@apache.org>.

comphead commented on issue #5695:
URL: https://github.com/apache/arrow-datafusion/issues/5695#issuecomment-1481413999

   @milevin I have looked into the code and another workaround, more natural is to give an alias
   
   ```
   ❯ create table temp1 as with orders as (
     select 1 as o_custkey
   )
   SELECT RANK() OVER (PARTITION BY o_custkey) as a
   FROM orders;
   0 rows in set. Query took 0.010 seconds.
   ```
   
   The code currently uses alias if its given or shortened the name to prevent huge unreadable names. 
   @alamb I'm not sure tbh if we should revert 
   https://github.com/apache/arrow-datafusion/blob/26e1b20ea3362ea62cb713004a0636b8af6a16d7/datafusion/core/src/physical_plan/planner.rs#L1630


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] comphead commented on issue #5695: Mismatch between schema and batches on a CREATE TABLE with a windowing query

Posted by "comphead (via GitHub)" <gi...@apache.org>.

comphead commented on issue #5695:
URL: https://github.com/apache/arrow-datafusion/issues/5695#issuecomment-1481327006

   I'm looking into this today!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] comphead commented on issue #5695: Mismatch between schema and batches on a CREATE TABLE with a windowing query

Posted by "comphead (via GitHub)" <gi...@apache.org>.

comphead commented on issue #5695:
URL: https://github.com/apache/arrow-datafusion/issues/5695#issuecomment-1483894531

   Thanks @milevin for reporting all of this. Tbh I think there are multiple places. I’ll start with Union all first. For nullability check we probably need to weaken the check itself to test names and data types only


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] milevin commented on issue #5695: Mismatch between schema and batches on a CREATE TABLE with a windowing query

Posted by "milevin (via GitHub)" <gi...@apache.org>.

milevin commented on issue #5695:
URL: https://github.com/apache/arrow-datafusion/issues/5695#issuecomment-1483877234

   @comphead Another day another "Mismatch" instance:
   
   ```
   DataFusion CLI v20.0.0
   ❯ create table temp as
   with t1 as (select 1 as col1, 'asd' as col2), t2 as (select 1 as col3, 'sdf' as col4)
   select col2, col4 from t1 left join t2 on col1 = col3;
   Error during planning: Mismatch between schema and batches
   ```
   
   Looks unrelated to the other two cases I provided.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] comphead commented on issue #5695: Mismatch between schema and batches on a CREATE TABLE with a windowing query

Posted by "comphead (via GitHub)" <gi...@apache.org>.

comphead commented on issue #5695:
URL: https://github.com/apache/arrow-datafusion/issues/5695#issuecomment-1480252635

   as a workarounds you may want to use
   
   ```
   ❯ create table temp1 as with orders as (
     select 1 as o_custkey
   )
   SELECT 0+RANK() OVER (PARTITION BY o_custkey)
   FROM orders;
   0 rows in set. Query took 0.008 seconds.
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] comphead commented on issue #5695: Mismatch between schema and batches on a CREATE TABLE with a windowing query

Posted by "comphead (via GitHub)" <gi...@apache.org>.

comphead commented on issue #5695:
URL: https://github.com/apache/arrow-datafusion/issues/5695#issuecomment-1483550835

   you are right, col names from first union all branch are the driving
   
   This case is not correct, col names has to be `count, n_regionkey`
   ```
   ❯ WITH w1 AS (select 1 as x , max(10) as y), w2 AS (select 5 as n_regionkey)
   select count(*) count, n_regionkey from w2 group by n_regionkey
   union all
   select x, y from w1 order by n_regionkey, count desc;
   +-------+----+
   | count | y  |
   +-------+----+
   | 1     | 5  |
   | 1     | 10 |
   +-------+----+
   ```
   If I remove order by I'm getting even more surprising
   ```
   ❯ WITH w1 AS (select 1 as x , max(10) as y), w2 AS (select 5 as n_regionkey)
   select count(*) count, n_regionkey from w2 group by n_regionkey
   union all
   select x, y from w1;
   +---+----+
   | x | y  |
   +---+----+
   | 1 | 10 |
   | 1 | 5  |
   +---+----+
   ```
   
   The bug partially related to wrong col name derivation in UNION ALL
   
   
   ```
   ❯ select  1 a, 2 b union all select 3 c, 4 d
   ;
   +---+---+
   | c | d |
   +---+---+
   | 3 | 4 |
   | 1 | 2 |
   +---+---+
   ```
   
   I will prepare a fix for UNION ALL first and then test out other scenarios, like not deterministic column naming with and without ORDER BY
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] alamb commented on issue #5695: Mismatch between schema and batches on a CREATE TABLE with a windowing query

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on issue #5695:
URL: https://github.com/apache/arrow-datafusion/issues/5695#issuecomment-1486668268

   Update is that @comphead  is tracking the issues on https://github.com/apache/arrow-datafusion/issues/5747 (thanks @comphead !)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org