Posted to user@spark.apache.org by RD <rd...@gmail.com> on 2017/06/15 04:52:19 UTC

[Spark Sql/ UDFs] Spark and Hive UDFs parity

Hi Spark folks,

    Is there any plan to support the richer UDF API that Hive supports for
Spark UDFs? Hive supports the GenericUDF API which has, among other methods,
initialize() and configure() (called once on the cluster), which a lot of our
users use. We now have a lot of Hive UDFs which make use of these methods. We
plan to migrate them to Spark UDFs but are limited by the lack of similar
lifecycle methods.
   Are there plans to address this? Or do people usually adopt some sort of
workaround?
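
For context, this is roughly the shape of the GenericUDFs we have today (a
trimmed-down sketch; the class name and the config key are made up):

import org.apache.hadoop.hive.ql.exec.MapredContext
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF.DeferredObject
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory

// Made-up example of the lifecycle we rely on: one-time setup in
// configure()/initialize(), then per-row work in evaluate().
class NormalizeUDF extends GenericUDF {

  @transient private var toLower: Boolean = true

  // configure() receives the job context; we use it to read a runtime
  // setting (the key here is made up).
  override def configure(context: MapredContext): Unit = {
    toLower = context.getJobConf.get("example.normalize.mode", "lower") == "lower"
  }

  // initialize() is called once before any rows; argument checking and
  // any expensive one-time setup go here.
  override def initialize(arguments: Array[ObjectInspector]): ObjectInspector = {
    if (arguments.length != 1) {
      throw new UDFArgumentLengthException("normalize() takes exactly one argument")
    }
    PrimitiveObjectInspectorFactory.javaStringObjectInspector
  }

  // evaluate() runs per row.
  override def evaluate(arguments: Array[DeferredObject]): AnyRef = {
    val v = arguments(0).get()
    if (v == null) null
    else if (toLower) v.toString.toLowerCase
    else v.toString.toUpperCase
  }

  override def getDisplayString(children: Array[String]): String =
    "normalize(" + children.mkString(", ") + ")"
}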

   If we use the Hive UDFs directly in Spark we pay a performance penalty. I
think Spark does a conversion from InternalRow to Row and back to InternalRow
for native Spark UDFs, and for Hive it does InternalRow to Hive object and
back to InternalRow, but somehow the conversion in native UDFs is more
performant.
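
A sketch of how one could compare the two paths (the Hive class name is a
placeholder and its jar would have to be on the classpath):

import org.apache.spark.sql.SparkSession

object UdfComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("udf-comparison")
      .master("local[*]")
      .enableHiveSupport() // needed here to register a Hive UDF class via SQL
      .getOrCreate()
    import spark.implicits._

    (1 to 1000000).toDF("id").selectExpr("cast(id as string) as s")
      .createOrReplaceTempView("t")

    // Native Spark UDF.
    spark.udf.register("native_lower", (s: String) => if (s == null) null else s.toLowerCase)

    // Hive UDF registered by class name; 'com.example.MyGenericUDF' is a placeholder.
    spark.sql("CREATE TEMPORARY FUNCTION hive_lower AS 'com.example.MyGenericUDF'")

    def time(label: String)(body: => Unit): Unit = {
      val start = System.nanoTime()
      body
      println(s"$label took ${(System.nanoTime() - start) / 1e6} ms")
    }

    time("native udf") { spark.sql("SELECT count(native_lower(s)) FROM t").collect() }
    time("hive udf")   { spark.sql("SELECT count(hive_lower(s)) FROM t").collect() }

    spark.stop()
  }
}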

-Best,
R.

Re: [Spark Sql/ UDFs] Spark and Hive UDFs parity

Posted by Yong Zhang <ja...@hotmail.com>.
I assume you use Scala to implement your UDFs.


In this case, the Scala language itself already provides some options for you.


If you want more control over the logic that runs when your UDFs are initialized, you can define a Scala object and define your UDF as part of it; a Scala object behaves like the singleton pattern for you.


So the Scala object's constructor logic can be treated as the init/configure contract from Hive. It will be called once per JVM, to initialize your Scala object. That should meet your requirement.
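
For example, something along these lines (a minimal sketch; the names are made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// The object body runs once per JVM, lazily, on first use, which plays the
// role of Hive's initialize().
object GeoLookup {
  // Expensive one-time setup, e.g. loading a reference table into memory.
  private val table: Map[String, String] =
    Map("US" -> "United States", "DE" -> "Germany")

  def lookup(code: String): String = table.getOrElse(code, "unknown")
}

object SingletonUdfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("udf-singleton")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The UDF closes over the singleton; each executor JVM initializes
    // GeoLookup exactly once.
    val countryName = udf((code: String) => GeoLookup.lookup(code))

    Seq("US", "DE", "FR").toDF("code")
      .select($"code", countryName($"code").as("name"))
      .show()

    spark.stop()
  }
}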


The only tricky part is the context reference in the configure() method, which allows you to pass some configuration to your UDF dynamically at runtime. Since a Scala object is fixed at compile time, you cannot pass any parameters to its constructor. But there is nothing stopping you from building a Scala class/companion object that accepts parameters at constructor/init time, which can control your UDF's behavior.
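
A rough sketch of that pattern (again, the names and the config key are made up):

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// The constructor plays the role of configure(): it receives runtime
// settings before the UDF is built.
class Normalizer(mode: String) extends Serializable {
  private val toLower = mode == "lower"

  def normalize(s: String): String =
    if (s == null) null
    else if (toLower) s.toLowerCase
    else s.toUpperCase

  def asUdf: UserDefinedFunction = udf((s: String) => normalize(s))
}

// Usage (the config key is made up):
//   val mode = spark.conf.get("example.normalize.mode", "lower")
//   val normalize = new Normalizer(mode).asUdf
//   df.select(normalize($"name"))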


If you have a concrete example of something you cannot do with a Spark Scala UDF, you can post it here.


Yong


Re: [Spark Sql/ UDFs] Spark and Hive UDFs parity

Posted by Georg Heiler <ge...@gmail.com>.
I assume you want this lifecycle in order to create big/heavy/complex
objects only once (per partition). mapPartitions should fit this use case
pretty well.
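
Something along these lines (a minimal sketch; the lookup map stands in for
whatever expensive object you need):

import org.apache.spark.sql.SparkSession

object MapPartitionsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("map-partitions")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val codes = Seq("US", "DE", "FR").toDS()

    // The heavy object is built once per partition and then reused for
    // every row in that partition.
    val names = codes.mapPartitions { rows =>
      val lookup = Map("US" -> "United States", "DE" -> "Germany") // stand-in for an expensive resource
      rows.map(code => lookup.getOrElse(code, "unknown"))
    }

    names.show()
    spark.stop()
  }
}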

Re: [Spark Sql/ UDFs] Spark and Hive UDFs parity

Posted by RD <rd...@gmail.com>.
Thanks Georg. But I'm not sure how mapPartitions is relevant here.  Can you
elaborate?




Re: [Spark Sql/ UDFs] Spark and Hive UDFs parity

Posted by Georg Heiler <ge...@gmail.com>.
What about using mapPartitions instead?