You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "Kyle McCarthy (Jira)" <ji...@apache.org> on 2019/12/30 22:43:00 UTC

[jira] [Updated] (ARROW-7480) [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns don't match the selected columns

     [ https://issues.apache.org/jira/browse/ARROW-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kyle McCarthy updated ARROW-7480:
---------------------------------
    Description: 
There are two scenarios that cause problems but are related to the queries with aggregate expressions and the SQL planner.

 

*Scenario 1*

Columns are grouped by but not selected.
{noformat}
ArrowError(InvalidArgumentError("number of columns must match number of fields in schema")){noformat}
You can reproduce with the aggregate_test_100 dataset with the query:
{code:java}
SELECT c1, MIN(c12) FROM aggregate_test_100 GROUP BY c1, c13{code}
 

*Senario 2*

Columns are selected, but not grouped or part of an aggregate function. This query actually will run, but the wrong schema is produced.

Query: 
{code:java}
SELECT c1, c13, MIN(c12) FROM aggregate_test_100 GROUP BY c1{code}
Schema generated:
{code:java}
Schema {
    fields: [
        Field {
            name: "c0",
            data_type: Utf8,
            nullable: true,
        },
        Field {
            name: "c1",
            data_type: Float64,
            nullable: true,
        },
        Field {
            name: "c1",
            data_type: Float64,
            nullable: true,
        },
    ],
    metadata: {},
} {code}
This should actually be Field(c1, Utf8), Field(c13, Utf8), Field(MAX, Float64).

 
----
Schema 2 is questionable since some DBMS will run the query (ex MySQL) but others will require that all the columns must be in the GROUP BY to be used in an aggregate function.

  was:
There are two scenarios that cause problems but are related to the queries with aggregate expressions and the SQL planner.

 

*Scenario 1*

Columns are grouped by but not selected.
{noformat}
ArrowError(InvalidArgumentError("number of columns must match number of fields in schema")){noformat}
You can reproduce with the aggregate_test_100 dataset with the query:
{code:java}
SELECT c1, MIN(c12) FROM aggregate_test_100 GROUP BY c1, c13{code}
 

*Senario 2*

Columns are selected, but not grouped or part of an aggregate function. This query actually will run, but the wrong schema is produced.

Query: 
{code:java}
SELECT c1, c13, MIN(c12) FROM aggregate_test_100 GROUP BY c1{code}
Schema generated:
{code:java}
Schema {
    fields: [
        Field {
            name: "c0",
            data_type: Utf8,
            nullable: true,
        },
        Field {
            name: "c1",
            data_type: Float64,
            nullable: true,
        },
        Field {
            name: "c1",
            data_type: Float64,
            nullable: true,
        },
    ],
    metadata: {},
} {code}
This should actually be Field(c0, Utf8), Field(c13, Utf8), Field(MAX, Float64).

 
----
Schema 2 is questionable since some DBMS will run the query (ex MySQL) but others will require that all the columns must be in the GROUP BY to be used in an aggregate function.


> [Rust] [DataFusion] Query fails/incorrect when aggregated + grouped columns don't match the selected columns
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-7480
>                 URL: https://issues.apache.org/jira/browse/ARROW-7480
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Rust, Rust - DataFusion
>            Reporter: Kyle McCarthy
>            Priority: Major
>
> There are two scenarios that cause problems but are related to the queries with aggregate expressions and the SQL planner.
>  
> *Scenario 1*
> Columns are grouped by but not selected.
> {noformat}
> ArrowError(InvalidArgumentError("number of columns must match number of fields in schema")){noformat}
> You can reproduce with the aggregate_test_100 dataset with the query:
> {code:java}
> SELECT c1, MIN(c12) FROM aggregate_test_100 GROUP BY c1, c13{code}
>  
> *Senario 2*
> Columns are selected, but not grouped or part of an aggregate function. This query actually will run, but the wrong schema is produced.
> Query: 
> {code:java}
> SELECT c1, c13, MIN(c12) FROM aggregate_test_100 GROUP BY c1{code}
> Schema generated:
> {code:java}
> Schema {
>     fields: [
>         Field {
>             name: "c0",
>             data_type: Utf8,
>             nullable: true,
>         },
>         Field {
>             name: "c1",
>             data_type: Float64,
>             nullable: true,
>         },
>         Field {
>             name: "c1",
>             data_type: Float64,
>             nullable: true,
>         },
>     ],
>     metadata: {},
> } {code}
> This should actually be Field(c1, Utf8), Field(c13, Utf8), Field(MAX, Float64).
>  
> ----
> Schema 2 is questionable since some DBMS will run the query (ex MySQL) but others will require that all the columns must be in the GROUP BY to be used in an aggregate function.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)