Posted to user@spark.apache.org by lsn24 <le...@gmail.com> on 2020/06/26 16:53:06 UTC

Data Explosion and repartition before group bys

Hi,

We have a use case where one record needs to be counted in two different
aggregations.

Say, for example, a credit card transaction "A" belongs to both the ATM
and the crossborder transaction categories.

If I need the count of ATM transactions, I need to include transaction A.
For the count of crossborder transactions, I need to include transaction A
as well.

Since this has to run in parallel, we decided to go with data explosion, so
that transaction A can be aggregated twice (once per category).
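
To make "data explosion" concrete, here is a minimal plain-Python sketch of
the idea. It is not our Spark code: the field names (id, is_atm,
is_cross_border) and the sample records are made up for illustration. Each
transaction is emitted once per category it belongs to, and the counts are
then taken per category.

```python
from collections import Counter

# Hypothetical sample records; the field names are assumptions, not the real schema.
transactions = [
    {"id": "A", "is_atm": True, "is_cross_border": True},
    {"id": "B", "is_atm": True, "is_cross_border": False},
    {"id": "C", "is_atm": False, "is_cross_border": True},
]

def explode(txn):
    """Emit one (category, id) row for every category the transaction is in."""
    if txn["is_atm"]:
        yield ("ATM", txn["id"])
    if txn["is_cross_border"]:
        yield ("crossborder", txn["id"])

# Transaction A appears twice after the explosion, once per category.
exploded = [row for txn in transactions for row in explode(txn)]

# Group by category and count, as the aggregation step would.
counts = Counter(category for category, _ in exploded)
print(exploded)
print(dict(counts))
```

The exploded data set is larger than the input (here 4 rows from 3
records), and every row of a popular category ends up in the same group,
which is where the skew hurts.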

Questions:
   1. Is data explosion the only way to address this?
   2. The data is skewed, so the job runs out of executor memory when we try
to aggregate. Repartitioning after the data explosion to address the skew is
killing us.

In what other ways can we address this problem?

Note: A transaction is marked as an ATM transaction or a crossborder
transaction by a boolean value.
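
Because each category is a boolean flag, one alternative that may be worth
considering is conditional aggregation: count each flag in a single pass
over the original records, without duplicating rows. The sketch below is
plain Python, not Spark, and only illustrates the idea under stated
assumptions: the field names and the two "partitions" are hypothetical, and
it assumes the only aggregates needed are per-category counts. Each
partition produces a small partial result, and the partials are merged at
the end.

```python
from collections import Counter

# Hypothetical partitions of the original (non-exploded) records.
partitions = [
    [
        {"id": "A", "is_atm": True, "is_cross_border": True},
        {"id": "B", "is_atm": True, "is_cross_border": False},
    ],
    [
        {"id": "C", "is_atm": False, "is_cross_border": True},
    ],
]

def partial_counts(rows):
    """Count every category in one pass over a partition; no row is duplicated."""
    counts = Counter()
    for txn in rows:
        counts["ATM"] += int(txn["is_atm"])
        counts["crossborder"] += int(txn["is_cross_border"])
    return counts

# Merge the small per-partition results into the final counts.
total = sum((partial_counts(rows) for rows in partitions), Counter())
print(dict(total))
```

With this shape, the category itself is never a grouping key, so a large
category does not pile its rows into one group. If the real job also groups
by another key (for example per customer or per day), skew on that key
would still need separate handling.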
