Posted to user@spark.apache.org by Davies Liu <da...@databricks.com> on 2016/06/10 00:00:52 UTC
Re: pyspark.GroupedData.agg works incorrectly when one column is aggregated twice?
This one works as expected:
```
>>> spark.range(10).selectExpr("id", "id as k").groupBy("k").agg({"k": "count", "id": "sum"}).show()
+---+--------+-------+
| k|count(k)|sum(id)|
+---+--------+-------+
| 0| 1| 0|
| 7| 1| 7|
| 6| 1| 6|
| 9| 1| 9|
| 5| 1| 5|
| 1| 1| 1|
| 3| 1| 3|
| 8| 1| 8|
| 2| 1| 2|
| 4| 1| 4|
+---+--------+-------+
```
Have you tried removing the orderBy? That looks weird.
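The behaviour described below is most likely a property of Python dict literals rather than of Spark: duplicate keys in a dict literal overwrite each other at construction time, so `.agg()` only ever receives one entry for `is_booking`. A minimal sketch in plain Python:

```python
# A dict literal with a repeated key keeps only the LAST value --
# this happens in Python itself, before Spark ever sees the dict.
spec = {'is_booking': 'count',
        'is_booking': 'sum'}

print(spec)       # {'is_booking': 'sum'}
print(len(spec))  # 1
```

To apply two aggregates to the same column, one option is to pass explicit Column expressions instead of a dict, e.g. `.agg(fn.count('is_booking'), fn.sum('is_booking'))` (assuming `fn` is `pyspark.sql.functions`, as in the code below).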
On Fri, May 27, 2016 at 4:28 AM, Andrew Vykhodtsev <yo...@gmail.com> wrote:
> Dear list,
>
> I am trying to calculate sum and count on the same column:
>
> user_id_books_clicks = (
>     sqlContext.read.parquet('hdfs:///projects/kaggle-expedia/input/train.parquet')
>     .groupby('user_id')
>     .agg({'is_booking': 'count',
>           'is_booking': 'sum'})
>     .orderBy(fn.desc('count(user_id)'))
>     .cache()
> )
>
> If I do it like that, it only gives me one (the last) aggregate:
> sum(is_booking).
>
> But if I change to .agg({'user_id':'count', 'is_booking':'sum'}) - it
> gives me both. I am on 1.6.1. Is it fixed in 2.+? Or should I report it to
> JIRA?
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org