Posted to jira@arrow.apache.org by "Andrew Lamb (Jira)" <ji...@apache.org> on 2020/12/16 18:46:00 UTC

[jira] [Created] (ARROW-10945) [Rust] [DataFusion] Allow User Defined Aggregates to return multiple values / structs

Andrew Lamb created ARROW-10945:
-----------------------------------

             Summary: [Rust] [DataFusion] Allow User Defined Aggregates to return multiple values / structs
                 Key: ARROW-10945
                 URL: https://issues.apache.org/jira/browse/ARROW-10945
             Project: Apache Arrow
          Issue Type: New Feature
            Reporter: Andrew Lamb



Use case:
I want to implement a user defined aggregate function that produces more than one column (more than one logical value).

Specifically I am trying to implement the InfluxDB 'selector' functions `first`, `last`, `min`, and `max` as DataFusion aggregate functions.

I can't use the built in aggregate functions in DataFusion as selector functions aren't exactly like normal aggregate functions -- they return both the actual aggregate value as well as a timestamp. In addition, `first` and `last` pick a row in the value column based on the value in the timestamp column.

After some investigation, I realize I can't elegantly use the built in user defined aggregate framework in DataFusion either. As an example of what is going on here, let's take

```
value | time
------+------
  3   | 1000
  2   | 2000
  1   | 3000
```

The result of `last(value)` should be two columns `1 | 3000` -- however, modeling this as a DataFusion aggregate does not seem to be possible at this time. Each aggregate function can return only a single columnar value, but we need to return two (the `.value` and `.time` fields).
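To make the desired semantics concrete, here is a minimal sketch of a `last` selector in plain Rust. It does not use any DataFusion API; the names `Selected` and `LastSelector`, and the `update` / `merge` / `evaluate` method names, are hypothetical and chosen only for illustration. Values are assumed to be `f64` and timestamps `i64`.

```rust
/// Hypothetical selector output: the chosen value plus the timestamp of its row.
#[derive(Debug, Clone, PartialEq)]
struct Selected {
    value: f64,
    time: i64,
}

/// Hypothetical accumulator for `last`: remembers the row with the greatest timestamp.
#[derive(Debug, Default)]
struct LastSelector {
    current: Option<Selected>,
}

impl LastSelector {
    /// Feed one (value, time) row into the accumulator.
    /// On equal timestamps the most recently seen row wins (an assumption).
    fn update(&mut self, value: f64, time: i64) {
        let replace = match &self.current {
            Some(best) => time >= best.time,
            None => true,
        };
        if replace {
            self.current = Some(Selected { value, time });
        }
    }

    /// Fold in the partial state of another accumulator (e.g. from another partition).
    fn merge(&mut self, other: &LastSelector) {
        if let Some(s) = &other.current {
            self.update(s.value, s.time);
        }
    }

    /// Produce the final result: both fields together, or None if no rows were seen.
    fn evaluate(&self) -> Option<Selected> {
        self.current.clone()
    }
}
```

Feeding the three rows from the table above through `update` and then calling `evaluate` yields the pair `value = 1`, `time = 3000`. The point of the sketch is that the result is inherently a two-field record, which is what a single-column aggregate result cannot express.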

Ideally I was thinking that the UDF could produce a Struct (with named fields `value` and `time`), but the evaluate function ([code](https://github.com/apache/arrow/blob/master/rust/datafusion/src/physical_plan/mod.rs#L238)) returns a `ScalarValue`, and at the moment they [don't have support for Structs](https://github.com/apache/arrow/blob/master/rust/datafusion/src/scalar.rs#L44).

I suspect that we would also need to add support in DataFusion for selecting fields from structs.
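As a rough illustration of the two pieces being proposed (a struct-valued scalar and field selection on it), here is a self-contained sketch. The `Scalar` enum, its variants, the `field` method, and the `last_result` helper are all hypothetical; this is not DataFusion's `ScalarValue` and says nothing about how the real type would be extended.

```rust
/// Hypothetical scalar type for illustration only; NOT DataFusion's `ScalarValue`.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Int64(i64),
    Float64(f64),
    /// Named fields, kept in declaration order.
    Struct(Vec<(String, Scalar)>),
}

impl Scalar {
    /// Select a field by name. Returns None if `self` is not a struct
    /// or if no field with that name exists.
    fn field(&self, name: &str) -> Option<&Scalar> {
        match self {
            Scalar::Struct(fields) => fields
                .iter()
                .find(|(n, _)| n.as_str() == name)
                .map(|(_, v)| v),
            _ => None,
        }
    }
}

/// Package a selector result as a single struct-valued scalar
/// with the named fields `value` and `time`.
fn last_result(value: f64, time: i64) -> Scalar {
    Scalar::Struct(vec![
        ("value".to_string(), Scalar::Float64(value)),
        ("time".to_string(), Scalar::Int64(time)),
    ])
}
```

With something along these lines, an aggregate could still return one scalar per group, and a query could pull out `.value` and `.time` afterwards through field selection.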

See additional detail and context on https://github.com/influxdata/influxdb_iox/issues/448#issuecomment-744601824




--
This message was sent by Atlassian Jira
(v8.3.4#803005)