Posted to user@spark.apache.org by Pedro Rodriguez <sk...@gmail.com> on 2016/07/11 20:40:39 UTC

Spark SQL: Merge Arrays/Sets

Is it possible with Spark SQL to merge columns whose types are Arrays or
Sets?

My use case would be something like this:

DF types
id: String
words: Array[String]

I would want to do something like

df.groupBy('id).agg(merge_arrays('words)) -> list of all words
df.groupBy('id).agg(merge_sets('words)) -> list of distinct words
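In other words, something equivalent to the explode-and-collect pattern below, but without blowing each array out into one row per element first (a rough sketch, assuming a DataFrame df with the schema above; the variable names are just illustrative):

import org.apache.spark.sql.functions.{col, explode, collect_list, collect_set}

// Flatten each array into one row per (id, word) pair
val exploded = df.select(col("id"), explode(col("words")).as("word"))

// collect_list keeps duplicates; collect_set keeps only distinct words
val allWords = exploded.groupBy("id").agg(collect_list("word").as("words"))
val distinctWords = exploded.groupBy("id").agg(collect_set("word").as("words"))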

Thanks,
-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience

Re: Spark SQL: Merge Arrays/Sets

Posted by Pedro Rodriguez <sk...@gmail.com>.
I saw that answer before, but as the response mentions, it's quite expensive.
I was able to do this with a UDAF, but was curious whether I was just missing
something.
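
For reference, the UDAF I ended up with looks roughly like the sketch below (simplified, not the exact code; the MergeArrays name is just illustrative):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Concatenates Array[String] values within each group
class MergeArrays extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("words", ArrayType(StringType)) :: Nil)
  def bufferSchema: StructType = StructType(StructField("acc", ArrayType(StringType)) :: Nil)
  def dataType: DataType = ArrayType(StringType)
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = Seq.empty[String]

  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getSeq[String](0) ++ input.getSeq[String](0)

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0)

  def evaluate(buffer: Row): Any = buffer.getSeq[String](0)
}

// Usage: df.groupBy("id").agg(new MergeArrays()(col("words")))
// A "merge_sets" variant would just deduplicate in evaluate (or union as sets in update/merge).

It works, but everything goes through the untyped Row interface, which ties into the question below.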

A more general question: what are the requirements for deciding that a new
Spark SQL function should be added? Being able to write UDAFs is great, but
they don't benefit from native code generation and don't support generics.

Pedro

On Mon, Jul 11, 2016 at 11:52 PM, Yash Sharma <ya...@gmail.com> wrote:

> This answers exactly what you are looking for -
>
> http://stackoverflow.com/a/34204640/1562474
>
> On Tue, Jul 12, 2016 at 6:40 AM, Pedro Rodriguez <sk...@gmail.com>
> wrote:
>
>> Is it possible with Spark SQL to merge columns whose types are Arrays or
>> Sets?
>>
>> My use case would be something like this:
>>
>> DF types
>> id: String
>> words: Array[String]
>>
>> I would want to do something like
>>
>> df.groupBy('id).agg(merge_arrays('words)) -> list of all words
>> df.groupBy('id).agg(merge_sets('words)) -> list of distinct words
>>
>> Thanks,
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
>> Github: github.com/EntilZha | LinkedIn:
>> https://www.linkedin.com/in/pedrorodriguezscience
>>
>>
>


-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience

Re: Spark SQL: Merge Arrays/Sets

Posted by Yash Sharma <ya...@gmail.com>.
This answers exactly what you are looking for -

http://stackoverflow.com/a/34204640/1562474

On Tue, Jul 12, 2016 at 6:40 AM, Pedro Rodriguez <sk...@gmail.com>
wrote:

> Is it possible with Spark SQL to merge columns whose types are Arrays or
> Sets?
>
> My use case would be something like this:
>
> DF types
> id: String
> words: Array[String]
>
> I would want to do something like
>
> df.groupBy('id).agg(merge_arrays('words)) -> list of all words
> df.groupBy('id).agg(merge_sets('words)) -> list of distinct words
>
> Thanks,
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>