Posted to dev@arrow.apache.org by Jorge Cardoso Leitão <jo...@gmail.com> on 2020/08/23 17:16:43 UTC

[DataFusion] Proposal to change how UDFs are called in DataFrame API

Hi,

I ran into a limitation that I would like to propose a resolution to.

TL;DR: currently, users plan UDF calls via a call of the form

let e = scalar_functions("my_udf", vec![col("a")], DataType::Float64);
df.select(vec![e])

The proposal is to use instead:

let f = df.registry();

let e = f.udf("my_udf", vec![col("a")])?;

// note: no DataType::Float64

df.select(vec![e])

so that users do not have to know the return type of the UDF they are using
(they still need to set it during registration). This will make our lives
easier, and will also enable our own UDFs (e.g. sqrt) to support multiple
input types (e.g. float32 and float64). This will be important for functions
that return composite objects, such as array(), whose return type depends
heavily on its input types.
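
To make the idea concrete, here is a minimal, self-contained sketch of a registry that stores a return-type rule at registration time and resolves it when the call is planned. Everything in it (DataType, Expr, Registry, ReturnTypeFn, sqrt_return_type, array_return_type, and a col() that carries its own type) is a hypothetical stand-in written for illustration, not DataFusion's actual API; in particular, a real column would get its type from the schema rather than from the caller.

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real types; names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Float32,
    Float64,
    Utf8,
    List(Box<DataType>),
}

#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    // Assumption: the column carries its type; a real planner would look it up in a schema.
    Column { name: String, data_type: DataType },
    ScalarUdf { name: String, args: Vec<Expr>, return_type: DataType },
}

impl Expr {
    pub fn data_type(&self) -> &DataType {
        match self {
            Expr::Column { data_type, .. } => data_type,
            Expr::ScalarUdf { return_type, .. } => return_type,
        }
    }
}

/// Maps the argument types of a call to its return type, or rejects them.
pub type ReturnTypeFn = fn(&[DataType]) -> Result<DataType, String>;

#[derive(Default)]
pub struct Registry {
    udfs: HashMap<String, ReturnTypeFn>,
}

impl Registry {
    /// The return-type rule is supplied once, at registration.
    pub fn register(&mut self, name: &str, return_type: ReturnTypeFn) {
        self.udfs.insert(name.to_string(), return_type);
    }

    /// Plans a call; the caller never states the return type.
    pub fn udf(&self, name: &str, args: Vec<Expr>) -> Result<Expr, String> {
        let return_type_fn = self
            .udfs
            .get(name)
            .ok_or_else(|| format!("unknown UDF: {}", name))?;
        let arg_types: Vec<DataType> = args.iter().map(|e| e.data_type().clone()).collect();
        let return_type = return_type_fn(&arg_types)?;
        Ok(Expr::ScalarUdf { name: name.to_string(), args, return_type })
    }
}

/// sqrt keeps the float width of its single argument.
pub fn sqrt_return_type(args: &[DataType]) -> Result<DataType, String> {
    match args {
        [DataType::Float32] => Ok(DataType::Float32),
        [DataType::Float64] => Ok(DataType::Float64),
        other => Err(format!("sqrt does not support {:?}", other)),
    }
}

/// array() returns a list of its (uniform) argument type.
pub fn array_return_type(args: &[DataType]) -> Result<DataType, String> {
    match args.split_first() {
        Some((first, rest)) if rest.iter().all(|t| t == first) => {
            Ok(DataType::List(Box::new(first.clone())))
        }
        _ => Err(format!("array needs one or more arguments of one type, got {:?}", args)),
    }
}

pub fn col(name: &str, data_type: DataType) -> Expr {
    Expr::Column { name: name.to_string(), data_type }
}
```

With such a registry, registering sqrt once with sqrt_return_type lets a call on a float32 column plan as float32 and a call on a float64 column plan as float64, while a call on a string column fails at planning time instead of at execution.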

Proposal:
https://docs.google.com/document/d/1Kzz642ScizeKXmVE1bBlbLvR663BKQaGqVIyy9cAscY/edit?usp=sharing

Issue: https://issues.apache.org/jira/browse/ARROW-9836
PR: https://github.com/apache/arrow/pull/8032

Best,
Jorge

Re: [DataFusion] Proposal to change how UDFs are called in DataFrame API

Posted by Andrew Lamb <al...@influxdata.com>.
I think this is a good proposal and I support its implementation, for
whatever that is worth.
