Posted to user@spark.apache.org by lu...@sina.com on 2016/07/06 11:07:17 UTC

how to select first 50 value of each group after group by?

Hi there,

I have a DataFrame with 3 columns: id, pv, location. (The rows are already grouped by location and sorted by pv in descending order.) I want to get the first 50 id values for each location. I checked the APIs of DataFrame, GroupedData, and PairRDD and found no match.

Is there a way to do this naturally? Any info will be appreciated.


--------------------------------

 

Thanks & Best regards!
San.Luo

Re: how to select first 50 value of each group after group by?

Posted by Anton Okolnychyi <an...@gmail.com>.

The following resources should be useful:

https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-windows.html

The last link should have the exact solution.

2016-07-06 16:55 GMT+02:00 Tal Grynbaum <ta...@gmail.com>:

> You can use the rank window function to rank each row within its group, and
> then filter the rows with rank <= 50

Re: how to select first 50 value of each group after group by?

Posted by Tal Grynbaum <ta...@gmail.com>.

You can use the rank window function to rank each row within its group, and
then filter the rows with rank <= 50.
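
For what it's worth, here is a minimal sketch of that idea written as a SQL window query. To keep it runnable without a cluster it uses Python's built-in sqlite3 module as a stand-in engine (this assumes the bundled SQLite is 3.25 or newer, which is when window functions were added). The table name `visits`, the sample rows, and `TOP_N = 2` are made up for illustration; the thread itself asks for 50. The assumption is that the same `OVER (PARTITION BY location ORDER BY pv DESC)` clause carries over to Spark SQL once the DataFrame is registered as a table. The sketch also shows one thing to watch for: RANK() gives tied rows the same rank, so it can return more than N rows per group, while ROW_NUMBER() returns at most N.

```python
import sqlite3

# Hypothetical sample data: (id, pv, location). Rows "b" and "c" tie on pv.
rows = [
    ("a", 90, "NY"), ("b", 80, "NY"), ("c", 80, "NY"), ("d", 10, "NY"),
    ("e", 70, "LA"), ("f", 60, "LA"), ("g", 50, "LA"),
]

TOP_N = 2  # the thread asks for 50; 2 keeps the sample small

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (id TEXT, pv INTEGER, location TEXT)")
conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", rows)

# Number the rows inside each location by descending pv, then keep
# only the rows whose number is at most N.
QUERY = """
SELECT id, pv, location
FROM (
    SELECT id, pv, location,
           {fn}() OVER (PARTITION BY location ORDER BY pv DESC) AS rnk
    FROM visits
) ranked
WHERE rnk <= ?
ORDER BY location, pv DESC, id
"""


def top_n(fn, n):
    """Return the top-n rows per location using the given ranking function."""
    return conn.execute(QUERY.format(fn=fn), (n,)).fetchall()


# ROW_NUMBER(): at most TOP_N rows per location. Which of the tied rows
# ("b" or "c") is kept is not defined unless a tiebreaker is added to ORDER BY.
by_row_number = top_n("ROW_NUMBER", TOP_N)

# RANK(): tied rows share a rank, so "NY" yields three rows here, not two.
by_rank = top_n("RANK", TOP_N)

print(by_row_number)
print(by_rank)
```

If exactly 50 rows per location are required, ROW_NUMBER() (ideally with a tiebreaker column such as id added to the ORDER BY) looks like the safer choice; RANK() is fine when keeping all tied rows is acceptable.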
