You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Sonu Jyotshna <so...@gmail.com> on 2019/02/09 04:46:49 UTC

Multiple column aggregations

Hello,

I have a requirement where I need to group by multiple columns and
aggregate them not at same time .. I mean I have a structure which contains
accountid, some cols, order id . I need to calculate some scenarios like
account having multiple orders so group by account and aggregate will work
here but I need to find orderid associated to multiple accounts so may be
group by orderid will work here but for better performance on the dataset
level can we do in single step? Where both will work or any better approach
I can follow . Can you help


Regards,
Sonu

RE: Multiple column aggregations

Posted by Shiva Prashanth Vallabhaneni <sh...@tavant.com>.
Hi Sonu,

You could use a query that is similar to the below one. You could further optimize the below query by adding a WHERE clause. I would suggest that you benchmark the performance of both approaches (multiple group-by queries vs single query with multiple window functions), before choosing one of these options. Before running the benchmark, I would ensure that the underlying data is stored in a columnar storage format with compression enabled. For instance, you could use parquet file format with block-level compression using Snappy.

SELECT  SUM(CASE WHEN accountRank =2 THEN 1 ELSE 0 END) AS accountsWithMoreThanOneOrder,
SUM(CASE WHEN orderRank =2 THEN 1 ELSE 0 END) AS ordersWithMoreThanOneAccount,
FROM   (
                  SELECT  accountNo,
                               orderNo,
  rank() OVER (PARTITION BY orderNo ORDER BY accountNo) AS orderRank,
 rank() OVER (PARTITION BY accountNo ORDER BY orderNo) AS accountRank
                  FROM   accountOrders
                )

P.S – You will need to check the above query for any syntax errors.

– Shiva

From: Sonu Jyotshna <so...@gmail.com>
Sent: Saturday, February 9, 2019 10:17 AM
To: user@spark.apache.org
Subject: Multiple column aggregations


Hello,

I have a requirement where I need to group by multiple columns and aggregate them not at same time .. I mean I have a structure which contains accountid, some cols, order id . I need to calculate some scenarios like account having multiple orders so group by account and aggregate will work here but I need to find orderid associated to multiple accounts so may be group by orderid will work here but for better performance on the dataset level can we do in single step? Where both will work or any better approach I can follow . Can you help


Regards,
Sonu
________________________________
Any comments or statements made in this email are not necessarily those of Tavant Technologies. The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. If you have received this in error, please contact the sender and delete the material from any computer. All emails sent from or to Tavant Technologies may be subject to our monitoring procedures.