Posted to user@spark.apache.org by Davies Liu <da...@databricks.com> on 2016/06/10 00:00:52 UTC

Re: pyspark.GroupedData.agg works incorrectly when one column is aggregated twice?

This one works as expected:

```
>>> spark.range(10).selectExpr("id", "id as k").groupBy("k").agg({"k": "count", "id": "sum"}).show()
+---+--------+-------+
|  k|count(k)|sum(id)|
+---+--------+-------+
|  0|       1|      0|
|  7|       1|      7|
|  6|       1|      6|
|  9|       1|      9|
|  5|       1|      5|
|  1|       1|      1|
|  3|       1|      3|
|  8|       1|      8|
|  2|       1|      2|
|  4|       1|      4|
+---+--------+-------+
```

Have you tried removing the orderBy? That part looks odd.
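
If you need both a count and a sum of the same column, the dict form of agg() cannot express that, since each column name can appear only once as a dict key. A minimal sketch of the usual workaround, assuming the same `spark` session as in the example above: pass explicit Column expressions from pyspark.sql.functions instead of a dict, giving each aggregate its own alias.

```
from pyspark.sql import functions as fn

df = spark.range(10).selectExpr("id", "id as k")

# Two aggregations of the same column, each with its own alias
(df.groupBy("k")
   .agg(fn.count("id").alias("count_id"),
        fn.sum("id").alias("sum_id"))
   .show())
```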


On Fri, May 27, 2016 at 4:28 AM, Andrew Vykhodtsev <yo...@gmail.com> wrote:
> Dear list,
>
> I am trying to calculate sum and count on the same column:
>
> user_id_books_clicks = (
>     sqlContext.read.parquet('hdfs:///projects/kaggle-expedia/input/train.parquet')
>               .groupby('user_id')
>               .agg({'is_booking': 'count', 'is_booking': 'sum'})
>               .orderBy(fn.desc('count(user_id)'))
>               .cache()
> )
>
> If I do it like that, it only gives me one (last) aggregate -
> sum(is_booking)
>
> But if I change to .agg({'user_id':'count', 'is_booking':'sum'})  -  it
> gives me both. I am on 1.6.1. Is it fixed in 2.+? Or should I report it to
> JIRA?
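
For what it's worth, the single-aggregate result described above is plain Python behaviour rather than a Spark bug: a dict literal keeps only the last value bound to a repeated key, so Spark only ever receives {'is_booking': 'sum'}. A quick check in any Python shell:

```
>>> {'is_booking': 'count', 'is_booking': 'sum'}
{'is_booking': 'sum'}
```

So the fix is either to count a different column (as in the example at the top) or to pass explicit fn.count(...)/fn.sum(...) expressions as sketched earlier.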
