Posted to user@spark.apache.org by 崔苗 <cu...@danale.com> on 2018/08/25 02:55:07 UTC

Fw:multiple group by action

-------- Forwarding messages --------
From: "崔苗" <cu...@danale.com>
Date: 2018-08-25 10:54:31
To: dev@spark.apache.org
Subject: multiple group by action
Hi,
We have some user data with columns (userId, company, client, country, region, city).
We want to count distinct userId values grouped by several column combinations, such as:
select count(distinct userId) group by company
select count(distinct userId) group by company,client
select count(distinct userId) group by company,client,country
select count(distinct userId) group by company,client,country,region
etc.
Each of these actions triggers its own shuffle stage, even though the grouping columns are nested: (company, client) contains company, and so on.
Is there a way to reduce the number of shuffle stages?


Thanks for any replies.

Re: Fw:multiple group by action

Posted by Reynold Xin <rx...@databricks.com>.
Use rollup and cube.
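
For illustration, here is a rough sketch of how the four queries from the question might be folded into a single Spark SQL statement. The table name `users` is an assumption (the original message does not name the table), and the column aliases are made up for the example. Whether this actually lowers the shuffle cost depends on the data and the Spark version, so it is worth comparing the physical plans with EXPLAIN.

```sql
-- Assumed table (hypothetical name): users(userId, company, client, country, region, city)

-- Option 1: ROLLUP over the nested column list. Besides the four requested
-- groupings, it also emits a grand-total row where every grouping column is NULL.
SELECT company, client, country, region,
       COUNT(DISTINCT userId) AS distinct_users
FROM users
GROUP BY company, client, country, region WITH ROLLUP;

-- Option 2: GROUPING SETS names only the groupings that were asked for,
-- so no grand-total row is produced. grouping_id() distinguishes the
-- aggregation levels, since rolled-up columns show up as NULL.
SELECT company, client, country, region,
       grouping_id() AS gid,
       COUNT(DISTINCT userId) AS distinct_users
FROM users
GROUP BY company, client, country, region
GROUPING SETS ((company),
               (company, client),
               (company, client, country),
               (company, client, country, region));
```

CUBE would instead aggregate over every combination of the listed columns, which is more than the question asks for; ROLLUP or explicit GROUPING SETS matches the nested pattern described.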
