Posted to dev@spark.apache.org by em...@yeikel.com on 2020/02/05 03:11:29 UTC

RE: [SQL] Is it worth it (and advisable) to implement native UDFs?

Is there any documentation or sample about this besides the pull requests merged into Spark core?

 

It seems that I need to create my custom functions under the package org.apache.spark.sql.* in order to access some of the internal classes I saw in [1], such as Column [2].
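For illustration, here is roughly what I am experimenting with (a minimal
sketch; the column names "a" and "b" are placeholders). Column's primary
constructor takes a catalyst Expression, which is where the internal
classes come into play:

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.catalyst.expressions.Levenshtein
    import org.apache.spark.sql.functions.col

    // Hand-wrap a catalyst expression in a Column so it can be used from
    // the DataFrame API; here the built-in Levenshtein expression stands
    // in for a custom one.
    val lev: Column = new Column(Levenshtein(col("a").expr, col("b").expr))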

 

Could you please confirm whether that's how it should be done?

 

Thanks!

 

[1] https://github.com/apache/spark/pull/7214

[2] https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L37 

 

From: Reynold Xin <rx...@databricks.com> 
Sent: Wednesday, January 22, 2020 2:22 AM
To: email@yeikel.com
Cc: dev@spark.apache.org
Subject: Re: [SQL] Is it worth it (and advisable) to implement native UDFs?

 


If your UDF itself is very CPU intensive, it probably won't make that much of a difference, because the UDF itself will dwarf the serialization/deserialization overhead.

 

If your UDF is cheap, it will help tremendously.
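To make the overhead concrete: a plain Scala UDF forces Spark to convert
each row's internal values (e.g. UTF8String) into Java objects, invoke the
closure, and convert the result back. A minimal sketch of the cheap-UDF
case (df and the column name are placeholders):

    import org.apache.spark.sql.functions.{col, udf}

    // The closure is trivial, so the per-row UTF8String <-> String
    // conversion around it can cost more than the function body itself.
    val firstChar = udf((s: String) => if (s.isEmpty) "" else s.substring(0, 1))
    df.select(firstChar(col("name")))

A native expression works on the internal representation directly (and can
take part in codegen), which is why the gap is largest for cheap functions.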

 

 

On Mon, Jan 20, 2020 at 6:33 PM, <email@yeikel.com> wrote:

Hi, 

 

I read online [1] that for the best UDF performance it is possible to implement them using internal Spark expressions, and I also saw a couple of pull requests, such as [2] and [3], where this was put into practice (not sure if for that reason or just to extend the API).

 

We have an algorithm that computes a score similar to the Levenshtein distance, and it takes about 30%-40% of the overall time of our job. We are looking for ways to speed it up without adding more resources.

 

I was wondering whether it would be advisable to implement it by extending BinaryExpression, as in [1], and whether that would result in any performance gains.
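To make the question concrete, below is a rough sketch of what I have in
mind, modeled on the built-in Levenshtein expression (ScoreDistance is a
placeholder name, and the levenshteinDistance call stands in for our own
scoring logic; these are internal APIs that can change between Spark
releases, so this is only an illustration):

    import org.apache.spark.sql.catalyst.expressions.{BinaryExpression, Expression, ImplicitCastInputTypes}
    import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
    import org.apache.spark.sql.types.{AbstractDataType, DataType, IntegerType, StringType}
    import org.apache.spark.unsafe.types.UTF8String

    case class ScoreDistance(left: Expression, right: Expression)
      extends BinaryExpression with ImplicitCastInputTypes {

      override def inputTypes: Seq[AbstractDataType] = Seq(StringType, StringType)

      override def dataType: DataType = IntegerType

      // Called only when both inputs are non-null; operates on UTF8String
      // directly, with no conversion to java.lang.String.
      protected override def nullSafeEval(l: Any, r: Any): Any =
        l.asInstanceOf[UTF8String].levenshteinDistance(r.asInstanceOf[UTF8String])

      // Emits Java source for whole-stage codegen instead of a per-row
      // virtual call into interpreted eval.
      override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
        nullSafeCodeGen(ctx, ev, (l, r) => s"${ev.value} = $l.levenshteinDistance($r);")
    }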

 

Thanks for your help!

 

[1] https://hackernoon.com/apache-spark-tips-and-tricks-for-better-performance-cf2397cac11 

[2] https://github.com/apache/spark/pull/7214

[3] https://github.com/apache/spark/pull/7236

 


Re: [SQL] Is it worth it (and advisable) to implement native UDFs?

Posted by Walaa Eldin Moustafa <wa...@gmail.com>.
For a general-purpose code example, you may take a look at the class we
defined in Transport UDFs to express all Expression UDFs [1]. This is an
internal class, though, not a user-facing API. A user-facing UDF example is
in [2]; it leverages [1] behind the scenes.

[1]
https://github.com/linkedin/transport/blob/master/transportable-udfs-spark/src/main/scala/com/linkedin/transport/spark/StdUdfWrapper.scala
[2]
https://github.com/linkedin/transport/blob/master/transportable-udfs-examples/transportable-udfs-example-udfs/src/main/java/com/linkedin/transport/examples/MapFromTwoArraysFunction.java

Thanks,
Walaa.


Re: [SQL] Is it worth it (and advisable) to implement native UDFs?

Posted by Wenchen Fan <cl...@gmail.com>.
This is really a hack, and we don't recommend that users access internal
classes directly. That's why there is no public documentation.

If you really need to do it and are aware of the risks, you can read the
source code. All expressions (the so-called "native UDFs") extend the
base class `Expression`. You can read the code comments and look at some
implementations.
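For example, once you have such an expression, you can make it callable
from SQL by registering a builder with the session's function registry.
Note this is also internal API whose signature has changed across
releases; something like the following worked on 2.4-era versions
(ScoreDistance stands for whatever expression class you wrote, spark is
the SparkSession, and the pairs table is a placeholder):

    import org.apache.spark.sql.catalyst.expressions.Expression

    // Internal, unstable API: the registry's method signatures differ
    // between Spark versions.
    spark.sessionState.functionRegistry.createOrReplaceTempFunction(
      "score_distance",
      (children: Seq[Expression]) => ScoreDistance(children(0), children(1)))

    spark.sql("SELECT score_distance(a, b) FROM pairs")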
