Posted to user@spark.apache.org by lu...@sina.com on 2016/07/06 11:07:17 UTC
how to select first 50 value of each group after group by?
hi there
I have a DF with 3 columns: id, pv, location. (The rows are already grouped by location and sorted by pv in descending order.) I want to get the first 50 id values per location. I checked the API of DataFrame, GroupedData, and PairRDD, and found no match. Is there a way to do this naturally? Any info will be appreciated.
--------------------------------
Thanks&Best regards!
San.Luo
Re: how to select first 50 value of each group after group by?
Posted by Anton Okolnychyi <an...@gmail.com>.
The following resources should be useful:
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-windows.html
The last link should have the exact solution.
2016-07-06 16:55 GMT+02:00 Tal Grynbaum <ta...@gmail.com>:
> You can use the rank window function to rank each row in the group, and
> then filter the rows with rank <= 50
>
> On Wed, Jul 6, 2016, 14:07 <lu...@sina.com> wrote:
>
>> hi there
>> I have a DF with 3 columns: id, pv, location. (The rows are already
>> grouped by location and sorted by pv in descending order.) I want to get
>> the first 50 id values per location. I checked the API of DataFrame,
>> GroupedData, and PairRDD, and found no match.
>> Is there a way to do this naturally?
>> Any info will be appreciated.
>>
>>
>>
>> --------------------------------
>>
>> Thanks&Best regards!
>> San.Luo
>>
>
Re: how to select first 50 value of each group after group by?
Posted by Tal Grynbaum <ta...@gmail.com>.
You can use the rank window function to rank each row in the group, and then
filter the rows with rank <= 50
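A minimal sketch of this approach, using a tiny made-up dataset with the three columns from the question (the ids, pv values, and locations below are assumptions for illustration only). It uses row_number rather than rank, since row_number returns exactly N rows per group even when pv values tie, whereas rank can return more:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val spark = SparkSession.builder().master("local[*]").appName("topNPerGroup").getOrCreate()
import spark.implicits._

// hypothetical sample data: id, pv, location
val df = Seq(
  ("a1", 100, "us"), ("a2", 90, "us"), ("a3", 80, "us"),
  ("b1", 70, "eu"), ("b2", 60, "eu")
).toDF("id", "pv", "location")

// partition by location, order by pv descending,
// then keep only the first 50 rows of each partition
val w = Window.partitionBy("location").orderBy($"pv".desc)
val topN = df.withColumn("rn", row_number().over(w))
  .where($"rn" <= 50)
  .drop("rn")

topN.show()
```

Since the rows are already sorted by pv within each location, the orderBy in the window spec just makes that ordering explicit to the optimizer; the window is still evaluated from scratch, as Spark does not reuse a pre-existing sort order across a shuffle.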
On Wed, Jul 6, 2016, 14:07 <lu...@sina.com> wrote:
> hi there
> I have a DF with 3 columns: id, pv, location. (The rows are already
> grouped by location and sorted by pv in descending order.) I want to get
> the first 50 id values per location. I checked the API of DataFrame,
> GroupedData, and PairRDD, and found no match.
> Is there a way to do this naturally?
> Any info will be appreciated.
>
>
>
> --------------------------------
>
> Thanks&Best regards!
> San.Luo
>