You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/06/24 12:24:15 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue, #6758: Improve default output column names

alamb opened a new issue, #6758:
URL: https://github.com/apache/arrow-datafusion/issues/6758

   ### Is your feature request related to a problem or challenge?
   
   The way that DataFusion names the output columns can be quite messy. For example, the query below produces a column named `"FIRST_VALUE(foo.free) PARTITION BY [foo.host] ORDER BY [foo.time DESC NULLS FIRST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + Int64(5) / Int64(12)"` (yes that whole thing)
   
   Arrow requires column names to be distinct strings and DataFusion creates the name based on the content of the expression. 
   
   Here is the entire query
   
   ```sql
   ❯ create table foo(host varchar, free int, time int) as values ('a', 1, 2), ('a', 3, 4);
   0 rows in set. Query took 0.003 seconds.
   ❯ select first_value(free) over (partition by host order by time desc) + 5 / 12 from foo limit 10;
   +-------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | FIRST_VALUE(foo.free) PARTITION BY [foo.host] ORDER BY [foo.time DESC NULLS FIRST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW + Int64(5) / Int64(12) |
   +-------------------------------------------------------------------------------------------------------------------------------------------------------------+
   | 3                                                                                                                                                           |
   | 3                                                                                                                                                           |
   +-------------------------------------------------------------------------------------------------------------------------------------------------------------+
   2 rows in set. Query took 0.003 seconds.
   ```
   
   It also means you can't select the same expression without an alias, which while annoying is likely not a critical usecase
   
   ```sql
   ❯ select free / 1, free / 1 from foo;
   Error during planning: Projections require unique expression names but the expression "foo.free / Int64(1)" at position 0 and "foo.free / Int64(1)" at position 1 have the same name. Consider aliasing ("AS") one of them.
   ❯ select free / 1, free / 1 as col2 from foo;
   +---------------------+------+
   | foo.free / Int64(1) | col2 |
   +---------------------+------+
   | 1                   | 1    |
   | 3                   | 3    |
   +---------------------+------+
   ```
   
   ### Describe the solution you'd like
   
   While the output column names can be controlled via adding an explicit alias,  I think it would be good to have a more concise way of naming the default outputs, especially for window functions
   
   Perhaps something like not adding the `OVER` clause's contents so `FIRST_VALUE(foo.free)` in the above example.
   
   If we do this we need to be careful to allow multiple calls to `FIRST_VALUE` with different `OVER` clauses in the same query
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   If we make the names more concise, we will probably need a solution like https://github.com/apache/arrow-datafusion/issues/6543 to ensure they remain unique


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org