Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/06/22 17:40:33 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue #600: Allow User Defined Aggregates to return multiple values / structs

alamb opened a new issue #600:
URL: https://github.com/apache/arrow-datafusion/issues/600


   # Usecase
   I want to implement a user defined aggregate function that produces more than one column (that is, more than one logical value).
   
   Specifically I am trying to implement the InfluxDB 'selector' functions `first`, `last`, `min`, and `max` as DataFusion aggregate functions.
   
   I can't use the built in aggregate functions in DataFusion as selector functions aren't exactly like normal aggregate functions – they return both the actual aggregate value as well as a timestamp. In addition, `first` and `last` pick a row in the value column based on the value in the timestamp column.
   
   After some investigation, I realize I can't elegantly use the built in user defined aggregate framework in DataFusion either. As an example of what is going on here, let's take
   
   ```
   value | time
   ------+------
       3 | 1000
       2 | 2000
       1 | 3000
   ```
   
   The result of `last(value)` should be two columns, `1 | 3000`. However, modeling this as a DataFusion aggregate does not seem to be possible at this time: each aggregate function can return a single columnar value, but we need to return two (the `.value` and `.time` fields).
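
To make the expected semantics concrete, here is a minimal pure-Python sketch of the four selectors applied to the table above. It is not DataFusion code: `Selected`, `_select`, `min_selector`, and `max_selector` are hypothetical names chosen for illustration, and tie-breaking (which row wins when two rows share the same key) is an assumption, since the issue does not specify it.

```python
from typing import Callable, NamedTuple, Sequence


class Selected(NamedTuple):
    """Hypothetical struct-like result: the chosen value and its timestamp."""
    value: float
    time: int


def _select(values: Sequence[float], times: Sequence[int],
            key: Callable[[int], float], pick: Callable) -> Selected:
    """Pick one row index with `pick` (min or max) over `key`, return that row."""
    if len(values) != len(times) or len(values) == 0:
        raise ValueError("values and times must be non-empty and of equal length")
    # On ties, Python's min/max return the first matching index (an assumption here).
    i = pick(range(len(values)), key=key)
    return Selected(values[i], times[i])


def first(values, times):
    """Row with the smallest timestamp."""
    return _select(values, times, key=lambda i: times[i], pick=min)


def last(values, times):
    """Row with the largest timestamp."""
    return _select(values, times, key=lambda i: times[i], pick=max)


def min_selector(values, times):
    """Row with the smallest value, together with its timestamp."""
    return _select(values, times, key=lambda i: values[i], pick=min)


def max_selector(values, times):
    """Row with the largest value, together with its timestamp."""
    return _select(values, times, key=lambda i: values[i], pick=max)


values = [3, 2, 1]
times = [1000, 2000, 3000]
result = last(values, times)
print(result.value, result.time)  # 1 3000
```

Each selector returns both fields together, which is the part a single-column aggregate result cannot express.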
   
   
   See additional detail and context on https://github.com/influxdata/influxdb_iox/issues/448#issuecomment-744601824
   **Describe the solution you'd like**
   Ideally I was thinking that the UDF could produce a Struct (with named fields `value` and `time`), but the evaluate function ([code](https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/mod.rs#L238)) returns a `ScalarValue`, and at the moment `ScalarValue` [doesn't have support for Structs](https://github.com/apache/arrow/blob/master/rust/datafusion/src/scalar.rs#L44).
   
   I suspect that we would also need to add support in DataFusion for selecting fields from structs
   
   **Additional context**
   Ported from original JIRA: https://issues.apache.org/jira/browse/ARROW-10945
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] jorgecarleitao commented on issue #600: Allow User Defined Aggregates to return multiple values / structs

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on issue #600:
URL: https://github.com/apache/arrow-datafusion/issues/600#issuecomment-866233559


   When I implemented today's aggs, I initially thought about multiple return values, and then concluded that a Struct is sufficient and desirable. What I like about the struct is that it enables named fields, which imo makes the statements rather expressive. E.g.
   
   ```
   df = df.agg(udaf.call("a").alias("a"))
   df.select(df["a"]["min"], df["a"]["max"])
   ```
   
   vs 
   
   ```
   df = df.agg(udaf.call("a").alias("a"))
   df.select(df["a"][0], df["a"][1])
   ```
   
   The names "min" and "max" imo help the user read what they are extracting from the column.
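
The same contrast can be shown outside any DataFrame API with a minimal Python sketch. `MinMax` and `min_max` are hypothetical names, and a `NamedTuple` only stands in for an Arrow struct here; this is not DataFusion code.

```python
from typing import NamedTuple, Sequence


class MinMax(NamedTuple):
    """Hypothetical struct-like aggregate result with two named fields."""
    min: float
    max: float


def min_max(values: Sequence[float]) -> MinMax:
    """Aggregate a column into a single struct-like value."""
    return MinMax(min(values), max(values))


agg = min_max([3, 2, 1])

# Access by name: the field name documents what is being extracted.
by_name = (agg.min, agg.max)

# Access by position: same data, but the reader has to know the field order.
by_position = (agg[0], agg[1])
```

Both forms read the same two values; only the named form tells the reader which is which.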
   
   Would supporting structs for ScalarValues solve this nicely?





[GitHub] [arrow-datafusion] jorgecarleitao commented on issue #600: Allow User Defined Aggregates to return multiple values / structs

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on issue #600:
URL: https://github.com/apache/arrow-datafusion/issues/600#issuecomment-866246437


   It makes a lot of sense.
   
   Would it be ok for you if we add two issues, one for supporting structs on ScalarValues, the other for supporting accessing struct fields by name on SQL, as a replacement for this one?
   
   I can work on both of them over this weekend.





[GitHub] [arrow-datafusion] alamb commented on issue #600: Allow User Defined Aggregates to return multiple values / structs

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #600:
URL: https://github.com/apache/arrow-datafusion/issues/600#issuecomment-866240917


   > Would supporting structs for ScalarValues solve this nicely?
   
   I think so @jorgecarleitao  - what is also needed is some way in SQL to refer to the structs
   
   So like if the UDA could return a struct with fields `min` and `max` I would want to be able to do something like
   
   ```sql
   select t.min, t.max from (select my_udagg(col1, col2) from my_table) as t
   ```
   
   or something





[GitHub] [arrow-datafusion] jorgecarleitao commented on issue #600: Allow User Defined Aggregates to return multiple values / structs

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on issue #600:
URL: https://github.com/apache/arrow-datafusion/issues/600#issuecomment-866337998


   Perfect. I created #602 and #603. 👍





[GitHub] [arrow-datafusion] alamb commented on issue #600: Allow User Defined Aggregates to return multiple values / structs

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #600:
URL: https://github.com/apache/arrow-datafusion/issues/600#issuecomment-866254025


   > Would it be ok for you if we add two issues, one for supporting structs on ScalarValues, the other for supporting accessing struct fields by name on SQL, as a replacement for this one?
   
   Sure that would be great! Would you like me to file them?
   
   > I can work on both of them over this weekend.
   
   Thank you!
   

