Posted to user@spark.apache.org by Pedro Rodriguez <sk...@gmail.com> on 2016/07/11 20:40:39 UTC
Spark SQL: Merge Arrays/Sets
Is it possible with Spark SQL to merge columns whose types are Arrays or
Sets?
My use case would be something like this:
DF types
id: String
words: Array[String]
I would want to do something like
df.groupBy('id).agg(merge_arrays('words)) -> list of all words
df.groupBy('id).agg(merge_sets('words)) -> list of distinct words
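For concreteness, here is the result I'd expect, sketched with plain Scala
collections (merge_arrays/merge_sets above are made-up names; this is just the
intended semantics, not Spark code):

```scala
// Rows of (id, words), matching the DF schema above.
val rows = Seq(
  ("a", Seq("x", "y")),
  ("a", Seq("y", "z")),
  ("b", Seq("x"))
)

// merge_arrays: concatenate every words array per id (duplicates kept).
val mergedArrays: Map[String, Seq[String]] =
  rows.groupBy(_._1).map { case (id, rs) => id -> rs.flatMap(_._2) }

// merge_sets: the same, but with duplicates dropped.
val mergedSets: Map[String, Seq[String]] =
  mergedArrays.map { case (id, ws) => id -> ws.distinct }
```

So id "a" would give all words ("x", "y", "y", "z") in the first case and the
distinct words ("x", "y", "z") in the second.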
Thanks,
--
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni
ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience
Re: Spark SQL: Merge Arrays/Sets
Posted by Pedro Rodriguez <sk...@gmail.com>.
I saw that answer before, but as the response mentions it's quite expensive.
I was able to do it with a UDAF, but was curious if I was just missing
something.
A more general question: what are the requirements for deciding that a new
Spark SQL function should be added? Being able to write UDAFs is great, but
they don't benefit from native code generation and they don't support
generics.
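For reference, a UDAF for this boils down to three small functions. Sketched
here on plain Scala collections (Spark's UserDefinedAggregateFunction
essentially wraps this update/merge/evaluate contract around a mutable
buffer; the function names are mine, not Spark's):

```scala
// Per-row step: fold one input row's array into the aggregation buffer.
def update(buffer: Seq[String], input: Seq[String]): Seq[String] =
  buffer ++ input

// Combine step: merge partial buffers coming from different partitions.
def merge(b1: Seq[String], b2: Seq[String]): Seq[String] =
  b1 ++ b2

// Final step: identity for the "merge arrays" variant...
def evaluateArray(buffer: Seq[String]): Seq[String] = buffer

// ...and de-duplication for the "merge sets" variant.
def evaluateSet(buffer: Seq[String]): Seq[String] = buffer.distinct

// Aggregating two rows for one id:
val result =
  evaluateSet(merge(update(Nil, Seq("x", "y")), update(Nil, Seq("y", "z"))))
```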
Pedro
On Mon, Jul 11, 2016 at 11:52 PM, Yash Sharma <ya...@gmail.com> wrote:
> This answers exactly what you are looking for -
>
> http://stackoverflow.com/a/34204640/1562474
>
> On Tue, Jul 12, 2016 at 6:40 AM, Pedro Rodriguez <sk...@gmail.com>
> wrote:
>
>> Is it possible with Spark SQL to merge columns whose types are Arrays or
>> Sets?
>>
>> My use case would be something like this:
>>
>> DF types
>> id: String
>> words: Array[String]
>>
>> I would want to do something like
>>
>> df.groupBy('id).agg(merge_arrays('words)) -> list of all words
>> df.groupBy('id).agg(merge_sets('words)) -> list of distinct words
>>
>> Thanks,
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
>> Github: github.com/EntilZha | LinkedIn:
>> https://www.linkedin.com/in/pedrorodriguezscience
>>
>>
>
--
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni
ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience
Re: Spark SQL: Merge Arrays/Sets
Posted by Yash Sharma <ya...@gmail.com>.
This answers exactly what you are looking for -
http://stackoverflow.com/a/34204640/1562474
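In short, the approach there is to explode the array column into one row per
(id, word) pair and then re-aggregate with collect_list/collect_set. The same
steps on plain Scala collections (a sketch of the idea, not the actual Spark
execution):

```scala
val rows = Seq(("a", Seq("x", "y")), ("a", Seq("y", "z")))

// explode('words): one (id, word) pair per array element.
val exploded = rows.flatMap { case (id, words) => words.map(id -> _) }

// groupBy('id).agg(collect_list('word)): all words per id.
val allWords: Map[String, Seq[String]] =
  exploded.groupBy(_._1).map { case (id, pairs) => id -> pairs.map(_._2) }

// collect_set('word) instead: distinct words per id.
val distinctWords: Map[String, Seq[String]] =
  allWords.map { case (id, ws) => id -> ws.distinct }
```

As noted in the answer, the explode blows each array out into many rows
before regrouping, which is where the cost comes from.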
On Tue, Jul 12, 2016 at 6:40 AM, Pedro Rodriguez <sk...@gmail.com>
wrote:
> Is it possible with Spark SQL to merge columns whose types are Arrays or
> Sets?
>
> My use case would be something like this:
>
> DF types
> id: String
> words: Array[String]
>
> I would want to do something like
>
> df.groupBy('id).agg(merge_arrays('words)) -> list of all words
> df.groupBy('id).agg(merge_sets('words)) -> list of distinct words
>
> Thanks,
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodriguez@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>