Posted to user@spark.apache.org by Wei Chen <we...@gmail.com> on 2016/01/06 01:56:31 UTC

pyspark dataframe: row with a minimum value of a column for each group

Hi,

I am trying to retrieve, for each group, the row with the minimum value of a
column. For example, given the following dataframe:

a | b | c
----------
1 | 1 | 1
1 | 2 | 2
1 | 3 | 3
2 | 1 | 4
2 | 2 | 5
2 | 3 | 6
3 | 1 | 7
3 | 2 | 8
3 | 3 | 9
----------

I group by 'a' and want, from each group, the row with the smallest 'b'; that
is, I want to return the following dataframe:

a | b | c
----------
1 | 1 | 1
2 | 1 | 4
3 | 1 | 7
----------

The dataframe I have is huge, so getting the minimum value of b for each group
and then joining back onto the original dataframe is very expensive. Is there a
better way to do this?


Thanks,
Wei

Re: pyspark dataframe: row with a minimum value of a column for each group

Posted by Kristina Rogale Plazonic <kp...@gmail.com>.
Try redefining your window without the orderBy part. In other words, rerun
your code with

window = Window.partitionBy("a")

The thing is that the window frame is defined differently in the two cases. In
your example, in the group where "a" is 1:

  - If you include the orderBy clause, the frame grows row by row (a cumulative
window):
       - the 1st min is computed over the first row in this group,
       - the 2nd min is computed over the first 2 rows in this group,
       - the 3rd min is computed over the first 3 rows in this group.

  - If you don't include orderBy, every min is computed over the whole group,
i.e. a fixed window of all 3 rows.
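
For what it's worth, here is a minimal sketch of both variants, assuming the
same df as in your snippet below. The min_c/max_c aliases are just
illustrative, and the Window.unboundedPreceding / unboundedFollowing constants
assume a recent PySpark (on older releases you can pass large negative and
positive longs instead):

import pyspark.sql.functions as f
from pyspark.sql import Window

# Whole-group window: the aggregate sees every row that shares the same "a".
group_window = Window.partitionBy("a")

# Ordered window with an explicit full-group frame; without rowsBetween the
# default frame is (unboundedPreceding, currentRow), i.e. cumulative.
full_frame = (Window.partitionBy("a").orderBy("c")
              .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

df.select('a', 'b', 'c',
          f.min('c').over(group_window).alias('min_c'),
          f.max('c').over(full_frame).alias('max_c')).show()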

On Wed, Jan 6, 2016 at 2:34 PM, Wei Chen <we...@gmail.com> wrote:

> Thank you. I have tried the window function as follows:
>
> import pyspark.sql.functions as f
> sqc = sqlContext
> from pyspark.sql import Window
> import pandas as pd
>
> DF = pd.DataFrame({'a': [1,1,1,2,2,2,3,3,3],
>                    'b': [1,2,3,1,2,3,1,2,3],
>                    'c': [1,2,3,4,5,6,7,8,9]
>                   })
>
> df = sqc.createDataFrame(DF)
>
> window = Window.partitionBy("a").orderBy("c")
>
> df.select('a', 'b', 'c', f.min('c').over(window).alias('y')).show()
>
> I got the following result which is understandable:
>
> +---+---+---+---+
> |  a|  b|  c|  y|
> +---+---+---+---+
> |  1|  1|  1|  1|
> |  1|  2|  2|  1|
> |  1|  3|  3|  1|
> |  2|  1|  4|  4|
> |  2|  2|  5|  4|
> |  2|  3|  6|  4|
> |  3|  1|  7|  7|
> |  3|  2|  8|  7|
> |  3|  3|  9|  7|
> +---+---+---+---+
>
>
> However, if I change min to max, the result is not what I expected:
>
> df.select('a', 'b', 'c', f.max('c').over(window).alias('y')).show() gives
>
> +---+---+---+---+
> |  a|  b|  c|  y|
> +---+---+---+---+
> |  1|  1|  1|  1|
> |  1|  2|  2|  2|
> |  1|  3|  3|  3|
> |  2|  1|  4|  4|
> |  2|  2|  5|  5|
> |  2|  3|  6|  6|
> |  3|  1|  7|  7|
> |  3|  2|  8|  8|
> |  3|  3|  9|  9|
> +---+---+---+---+
>
>
>
> Thanks,
>
> Wei
>
>
> On Tue, Jan 5, 2016 at 8:30 PM, ayan guha <gu...@gmail.com> wrote:
>
>> Yes, there is. It is called a window function over partitions.
>>
>> Equivalent SQL would be:
>>
>> select * from
>>          (select a,b,c, rank() over (partition by a order by b) r from
>> df) x
>> where r = 1
>>
>> You can register your DF as a temp table and use the SQL form. Or, in
>> Spark 1.4+, you can use the window methods and their variants in the Spark
>> SQL module.
>>
>> HTH....
>>
>> On Wed, Jan 6, 2016 at 11:56 AM, Wei Chen <we...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I am trying to retrieve, for each group, the row with the minimum value of
>>> a column. For example, given the following dataframe:
>>>
>>> a | b | c
>>> ----------
>>> 1 | 1 | 1
>>> 1 | 2 | 2
>>> 1 | 3 | 3
>>> 2 | 1 | 4
>>> 2 | 2 | 5
>>> 2 | 3 | 6
>>> 3 | 1 | 7
>>> 3 | 2 | 8
>>> 3 | 3 | 9
>>> ----------
>>>
>>> I group by 'a', and want the rows with the smallest 'b', that is, I want
>>> to return the following dataframe:
>>>
>>> a | b | c
>>> ----------
>>> 1 | 1 | 1
>>> 2 | 1 | 4
>>> 3 | 1 | 7
>>> ----------
>>>
>>> The dataframe I have is huge, so getting the minimum value of b for each
>>> group and then joining back onto the original dataframe is very expensive.
>>> Is there a better way to do this?
>>>
>>>
>>> Thanks,
>>> Wei
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
>
> --
> Wei Chen, Ph.D.
> Astronomer and Data Scientist
> Phone: (832)646-7124
> Email: wei.chen.riceu@gmail.com
> LinkedIn: https://www.linkedin.com/in/weichen1984
>

Re: pyspark dataframe: row with a minimum value of a column for each group

Posted by Wei Chen <we...@gmail.com>.
Thank you. I have tried the window function as follows:

import pyspark.sql.functions as f
sqc = sqlContext
from pyspark.sql import Window
import pandas as pd

DF = pd.DataFrame({'a': [1,1,1,2,2,2,3,3,3],
                   'b': [1,2,3,1,2,3,1,2,3],
                   'c': [1,2,3,4,5,6,7,8,9]
                  })

df = sqc.createDataFrame(DF)

window = Window.partitionBy("a").orderBy("c")

df.select('a', 'b', 'c', f.min('c').over(window).alias('y')).show()

I got the following result which is understandable:

+---+---+---+---+
|  a|  b|  c|  y|
+---+---+---+---+
|  1|  1|  1|  1|
|  1|  2|  2|  1|
|  1|  3|  3|  1|
|  2|  1|  4|  4|
|  2|  2|  5|  4|
|  2|  3|  6|  4|
|  3|  1|  7|  7|
|  3|  2|  8|  7|
|  3|  3|  9|  7|
+---+---+---+---+


However, if I change min to max, the result is not what I expected:

df.select('a', 'b', 'c', f.max('c').over(window).alias('y')).show() gives

+---+---+---+---+
|  a|  b|  c|  y|
+---+---+---+---+
|  1|  1|  1|  1|
|  1|  2|  2|  2|
|  1|  3|  3|  3|
|  2|  1|  4|  4|
|  2|  2|  5|  5|
|  2|  3|  6|  6|
|  3|  1|  7|  7|
|  3|  2|  8|  8|
|  3|  3|  9|  9|
+---+---+---+---+



Thanks,

Wei


On Tue, Jan 5, 2016 at 8:30 PM, ayan guha <gu...@gmail.com> wrote:

> Yes, there is. It is called a window function over partitions.
>
> Equivalent SQL would be:
>
> select * from
>          (select a,b,c, rank() over (partition by a order by b) r from df)
> x
> where r = 1
>
> You can register your DF as a temp table and use the SQL form. Or, in Spark
> 1.4+, you can use the window methods and their variants in the Spark SQL
> module.
>
> HTH....
>
> On Wed, Jan 6, 2016 at 11:56 AM, Wei Chen <we...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am trying to retrieve, for each group, the row with the minimum value of
>> a column. For example, given the following dataframe:
>>
>> a | b | c
>> ----------
>> 1 | 1 | 1
>> 1 | 2 | 2
>> 1 | 3 | 3
>> 2 | 1 | 4
>> 2 | 2 | 5
>> 2 | 3 | 6
>> 3 | 1 | 7
>> 3 | 2 | 8
>> 3 | 3 | 9
>> ----------
>>
>> I group by 'a', and want the rows with the smallest 'b', that is, I want
>> to return the following dataframe:
>>
>> a | b | c
>> ----------
>> 1 | 1 | 1
>> 2 | 1 | 4
>> 3 | 1 | 7
>> ----------
>>
>> The dataframe I have is huge, so getting the minimum value of b for each
>> group and then joining back onto the original dataframe is very expensive.
>> Is there a better way to do this?
>>
>>
>> Thanks,
>> Wei
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>



-- 
Wei Chen, Ph.D.
Astronomer and Data Scientist
Phone: (832)646-7124
Email: wei.chen.riceu@gmail.com
LinkedIn: https://www.linkedin.com/in/weichen1984

Re: pyspark dataframe: row with a minimum value of a column for each group

Posted by ayan guha <gu...@gmail.com>.
Yes, there is. It is called a window function over partitions.

Equivalent SQL would be:

select * from
         (select a,b,c, rank() over (partition by a order by b) r from df) x
where r = 1

You can register your DF as a temp table and use the SQL form. Or, in Spark
1.4+, you can use the window methods and their variants in the Spark SQL
module.
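
For example, a minimal sketch of both forms (the temp table name "df_table" and
the rank column "r" are just illustrative, and registerTempTable / sqlContext
follow the pre-2.0 API):

import pyspark.sql.functions as f
from pyspark.sql import Window

# SQL form: register the DataFrame, rank within each group, keep rank 1.
df.registerTempTable("df_table")
result = sqlContext.sql("""
    select a, b, c from
        (select a, b, c, rank() over (partition by a order by b) r
         from df_table) x
    where r = 1""")

# DataFrame form: the same idea with the window API.
w = Window.partitionBy("a").orderBy("b")
result = (df.select("a", "b", "c", f.rank().over(w).alias("r"))
            .where(f.col("r") == 1)
            .drop("r"))

result.show()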

HTH....

On Wed, Jan 6, 2016 at 11:56 AM, Wei Chen <we...@gmail.com> wrote:

> Hi,
>
> I am trying to retrieve, for each group, the row with the minimum value of a
> column. For example, given the following dataframe:
>
> a | b | c
> ----------
> 1 | 1 | 1
> 1 | 2 | 2
> 1 | 3 | 3
> 2 | 1 | 4
> 2 | 2 | 5
> 2 | 3 | 6
> 3 | 1 | 7
> 3 | 2 | 8
> 3 | 3 | 9
> ----------
>
> I group by 'a', and want the rows with the smallest 'b', that is, I want
> to return the following dataframe:
>
> a | b | c
> ----------
> 1 | 1 | 1
> 2 | 1 | 4
> 3 | 1 | 7
> ----------
>
> The dataframe I have is huge, so getting the minimum value of b for each
> group and then joining back onto the original dataframe is very expensive. Is
> there a better way to do this?
>
>
> Thanks,
> Wei
>
>


-- 
Best Regards,
Ayan Guha