Posted to dev@spark.apache.org by "daniel.mescheder" <da...@realimpactanalytics.com> on 2015/04/23 10:30:18 UTC

In Spark-SQL, is there support for distributed execution of native Hive UDAFs?

Hi everyone,

I was playing with the integration of Hive UDAFs in Spark-SQL and noticed that the terminatePartial and merge methods of custom UDAFs were never called. This made me curious, as those two methods are the ones responsible for distributing UDAF execution in Hive.
Looking at the code of HiveUdafFunction, which seems to be the wrapper for all native Hive functions that have no Spark-SQL-specific implementation, I noticed that it

a) extends AggregateFunction and not PartialAggregate
b) only contains calls to iterate and evaluate, but never to the merge method of the underlying UDAFEvaluator object

My question is thus twofold: is my observation correct that, to achieve distributed execution of a UDAF, I have to add a custom implementation at the Spark-SQL layer (like the examples in aggregates.scala)? And if that is the case, how difficult would it be to use the terminatePartial and merge methods provided by the UDAFEvaluator to make Hive UDAFs distributed by default?
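To illustrate what I mean, here is a minimal, self-contained sketch of the old-style Hive evaluator convention (iterate / terminatePartial / merge / terminate). It does not depend on Hive at all; the class and method names merely mirror the reflective convention, using a hypothetical average aggregate. It compares the distributed path (per-partition terminatePartial, then merge) against the single-pass path a wrapper that never calls merge is forced into:

```java
import java.util.Arrays;
import java.util.List;

// Hedged sketch: mimics the old-style Hive UDAFEvaluator method convention
// without any Hive dependency. All names here are illustrative.
public class UdafSketch {

    /** Partial aggregation state for an average: running (sum, count). */
    static final class AvgState {
        double sum;
        long count;
    }

    /** Mirrors the iterate/terminatePartial/merge/terminate convention. */
    static final class AvgEvaluator {
        private final AvgState state = new AvgState();

        public void init() { state.sum = 0; state.count = 0; }

        /** Called once per input row on each executor. */
        public boolean iterate(double value) {
            state.sum += value;
            state.count += 1;
            return true;
        }

        /** Map side: emit this partition's partial state. */
        public AvgState terminatePartial() { return state; }

        /** Reduce side: fold another partition's partial state into ours. */
        public boolean merge(AvgState other) {
            state.sum += other.sum;
            state.count += other.count;
            return true;
        }

        /** Final result from the merged state. */
        public double terminate() {
            return state.count == 0 ? Double.NaN : state.sum / state.count;
        }
    }

    public static void main(String[] args) {
        // Simulate two partitions, each handled by its own evaluator.
        List<Double> partition1 = Arrays.asList(1.0, 2.0, 3.0);
        List<Double> partition2 = Arrays.asList(4.0, 5.0);

        AvgEvaluator eval1 = new AvgEvaluator();
        eval1.init();
        partition1.forEach(eval1::iterate);

        AvgEvaluator eval2 = new AvgEvaluator();
        eval2.init();
        partition2.forEach(eval2::iterate);

        // Distributed path: merge the partials, terminate once.
        AvgEvaluator finalEval = new AvgEvaluator();
        finalEval.init();
        finalEval.merge(eval1.terminatePartial());
        finalEval.merge(eval2.terminatePartial());
        double distributed = finalEval.terminate();

        // Single-pass path: all rows must reach one evaluator, since
        // terminatePartial and merge are never invoked.
        AvgEvaluator single = new AvgEvaluator();
        single.init();
        partition1.forEach(single::iterate);
        partition2.forEach(single::iterate);
        double centralized = single.terminate();

        System.out.println(distributed + " " + centralized); // both 3.0
    }
}
```

Both paths give the same answer; the point is that without merge being wired up, the second (centralized) path is the only one available.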


Cheers,

Daniel



--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/In-Spark-SQL-is-there-support-for-distributed-execution-of-native-Hive-UDAFs-tp11753.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: In Spark-SQL, is there support for distributed execution of native Hive UDAFs?

Posted by Reynold Xin <rx...@databricks.com>.
Your understanding is correct -- there is currently no partial aggregation
for Hive UDAFs.

However, there is a PR to fix that:
https://github.com/apache/spark/pull/5542


