You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spark.apache.org by Hamel Kothari <ha...@gmail.com> on 2016/04/15 17:44:53 UTC

Skipping Type Conversion and using InternalRows for UDF

Hi all,

So we have these UDFs which take <1ms to operate and we're seeing pretty
poor performance around them in practice, the overhead being >10ms for the
projections (this data is deeply nested with ArrayTypes and MapTypes so
that could be the cause). Looking at the logs and code for ScalaUDF, I
noticed that there are a series of projections which take place before and
after in order to make the Rows safe and then unsafe again. Is there any
way to opt out of this and input/return InternalRows to skip the
performance hit of the type conversion? It doesn't immediately appear to be
possible but I'd like to make sure that I'm not missing anything.

I suspect we could make this possible by checking if typetags in the
register function are all internal types, if they are, passing a false
value for "needs[Input|Output]Conversion" to ScalaUDF and then in ScalaUDF
checking for that flag to figure out if the conversion process needs to
take place. We're still left with the issue of missing a schema in the case
of outputting InternalRows, but we could expose the DataType parameter
rather than inferring it in the register function. Is there anything else
in the code that would prevent this from working?

Regards,
Hamel

Re: Skipping Type Conversion and using InternalRows for UDF

Posted by Michael Armbrust <mi...@databricks.com>.
This would also probably improve performance:
https://github.com/apache/spark/pull/9565

On Fri, Apr 15, 2016 at 8:44 AM, Hamel Kothari <ha...@gmail.com>
wrote:

> Hi all,
>
> So we have these UDFs which take <1ms to operate and we're seeing pretty
> poor performance around them in practice, the overhead being >10ms for the
> projections (this data is deeply nested with ArrayTypes and MapTypes so
> that could be the cause). Looking at the logs and code for ScalaUDF, I
> noticed that there are a series of projections which take place before and
> after in order to make the Rows safe and then unsafe again. Is there any
> way to opt out of this and input/return InternalRows to skip the
> performance hit of the type conversion? It doesn't immediately appear to be
> possible but I'd like to make sure that I'm not missing anything.
>
> I suspect we could make this possible by checking if typetags in the
> register function are all internal types, if they are, passing a false
> value for "needs[Input|Output]Conversion" to ScalaUDF and then in ScalaUDF
> checking for that flag to figure out if the conversion process needs to
> take place. We're still left with the issue of missing a schema in the case
> of outputting InternalRows, but we could expose the DataType parameter
> rather than inferring it in the register function. Is there anything else
> in the code that would prevent this from working?
>
> Regards,
> Hamel
>