Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/12/14 18:49:43 UTC

[GitHub] [arrow-datafusion] alamb commented on pull request #1448: Fix bug in projection: "column types must match schema types, expected XXX but found YYY"

alamb commented on pull request #1448:
URL: https://github.com/apache/arrow-datafusion/pull/1448#issuecomment-993877514


   > I have a perhaps naive question, shouldn't aliasing a query to an already existing column just fail outright? Wouldn't that reasonably be something of a "which column do you really want here?" confusion?
   
   
   @hntd187 -- I agree that aliasing one column to another column is unlikely to be actually useful 😆. It was just the minimal reproducer I could come up with.
   
   Beyond selecting columns, `Projection` (despite its slightly misleading name) is also used to evaluate expressions and to control the names of the fields in the output schema, which is what IOx was doing when it triggered this bug.
   
   Specifically, the plan in the IOx test was:
   
   ```
   2021-12-13T14:38:49.774041Z DEBUG datafusion::execution::context: Logical plan:
    Projection: #cpu.cpu, #cpu.host, CAST(#usage_system AS Int64) AS usage_system, CAST(#usage_user AS Int64) AS usage_user, #time
     Sort: #cpu.cpu ASC NULLS FIRST, #cpu.host ASC NULLS FIRST
       Projection: #cpu.cpu, #cpu.host, #usage_system, #usage_user, #time
         Aggregate: groupBy=[[#cpu.cpu, #cpu.host]], aggr=[[COUNT(#cpu.usage_system AS usage_system) AS usage_system, COUNT(#cpu.usage_user AS usage_user) AS usage_user, MAX(#cpu.time) AS time]]
           Filter: TimestampNanosecond(0) <= #cpu.time AND #cpu.time < TimestampNanosecond(2001)
             TableScan: cpu projection=None
   ```
   
   This plan was built programmatically, but it is approximately what would come out of this query:
   
   ```sql
   SELECT
     cpu,
     host,
     count(usage_system) as usage_system,
     count(usage_user) as usage_user,
     max(time) as time
   FROM
     cpu
   WHERE 0 <= time AND time < 2001
   GROUP BY cpu, host
   ORDER BY cpu, host
   ```
   
   In this case, the consumer of the output expects the columns to be named a certain way. Without the alias, `count(usage_system)` results in a column named something like `count(usage_system)` rather than the expected `usage_system`, and thus the alias is added.
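
   To illustrate why the alias matters for the output schema, here is a minimal sketch of how an output field name might be derived from an expression. It is a toy model only: the `Expr` enum, `output_name`, `alias`, and `projected_names` are hypothetical names made up for this example and are not DataFusion's actual types or API.

   ```rust
   // Toy model only: these types are hypothetical, NOT DataFusion's API.
   #[derive(Debug, Clone)]
   enum Expr {
       Column(String),
       Count(Box<Expr>),
       Alias(Box<Expr>, String),
   }

   impl Expr {
       /// Name the field would get in the projection's output schema.
       fn output_name(&self) -> String {
           match self {
               Expr::Column(name) => name.clone(),
               // Without an alias, the name is derived from the expression text.
               Expr::Count(inner) => format!("COUNT({})", inner.output_name()),
               // An alias overrides whatever name the inner expression had.
               Expr::Alias(_, alias) => alias.clone(),
           }
       }

       /// Wrap this expression in an alias.
       fn alias(self, name: &str) -> Expr {
           Expr::Alias(Box::new(self), name.to_string())
       }
   }

   /// Output field names of a projection over `exprs`.
   fn projected_names(exprs: &[Expr]) -> Vec<String> {
       exprs.iter().map(Expr::output_name).collect()
   }

   fn main() {
       let count = Expr::Count(Box::new(Expr::Column("usage_system".to_string())));
       let names = projected_names(&[
           Expr::Column("cpu".to_string()),
           count.clone(),               // unaliased aggregate
           count.alias("usage_system"), // aliased aggregate
       ]);
       // prints ["cpu", "COUNT(usage_system)", "usage_system"]
       println!("{:?}", names);
   }
   ```

   In this sketch the aliased aggregate ends up with the same name as the input column it counts, which is the situation the consumer in the IOx test relies on.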

