Posted to user@spark.apache.org by Chengi Liu <ch...@gmail.com> on 2014/03/24 18:21:37 UTC

distinct in data frame in spark

Hi,
  I have a very simple use case:

I have an rdd as following:

d = [[1,2,3,4],[1,5,2,3],[2,3,4,5]]

Now, I want to remove all rows that share a duplicate value in a given column
and return the remaining frame.
For example:
If I want to deduplicate based on column 1, then I would drop either row 1 or
row 2 from my final result, because column 1 of both the first and second
rows holds the same element (1), so one of them is a duplicate.
So, a possible result is:
So, a possible result is:

output = [[1,2,3,4],[2,3,4,5]]

How do I do this in spark?
Thanks

Re: distinct in data frame in spark

Posted by Andrew Ash <an...@andrewash.com>.
My thought would be to key by the first item in each array, then take just
one array for each key.  Something like the below:

v = sc.parallelize(Seq(Seq(1,2,3,4),Seq(1,5,2,3),Seq(2,3,4,5)))
val col = 0
// reduceByKey needs a two-argument function literal: (a, b) => a
// keeps an arbitrary one of the rows sharing a key.
val output = v.keyBy(_(col)).reduceByKey((a, b) => a).values
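Spark aside, the key-by-then-keep-one logic is easy to check locally. Here is a
minimal sketch in plain Python (the function name `dedupe_by_column` is just an
illustration, not a Spark API); it keeps the first row seen per key, whereas
reduceByKey((a, b) => a) may keep an arbitrary one depending on partitioning:

```python
def dedupe_by_column(rows, col):
    """Keep one row per distinct value in column `col` (first seen wins)."""
    seen = {}
    for row in rows:
        # setdefault only stores the row if this key hasn't appeared yet
        seen.setdefault(row[col], row)
    return list(seen.values())

d = [[1, 2, 3, 4], [1, 5, 2, 3], [2, 3, 4, 5]]
print(dedupe_by_column(d, 0))  # -> [[1, 2, 3, 4], [2, 3, 4, 5]]
```

If you need "first wins" semantics in Spark itself, you'd have to carry an
ordering (e.g. zipWithIndex) into the reduce, since reduceByKey alone makes no
ordering guarantee.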


On Tue, Mar 25, 2014 at 1:21 AM, Chengi Liu <ch...@gmail.com> wrote:

> Hi,
>   I have a very simple use case:
>
> I have an rdd as following:
>
> d = [[1,2,3,4],[1,5,2,3],[2,3,4,5]]
>
> Now, I want to remove all the duplicates from a column and return the
> remaining frame..
> For example:
> If i want to remove the duplicate based on column 1.
> Then basically I would remove either row 1 or row 2 in my final result..
> because the column 1 of both first and second row is the same element (1)
> .. and hence the duplicate..
> So, a possible result is:
>
> output = [[1,2,3,4],[2,3,4,5]]
>
> How do I do this in spark?
> Thanks
>