Posted to user@spark.apache.org by Amit Mor <am...@gmail.com> on 2013/10/28 19:15:01 UTC

Modeling and implementation

Hello friends. Newbie here, at least when it comes to Spark. I would be very
thankful for data modeling suggestions for this scenario: I have 3 types of
logs, each with more than 48 columns. For simplicity I modeled each record as
Tuple(PKsTuple, FinanceDataTuple, AuxData), i.e. a tuple of tuples.
Eventually I want to join the 3 RDDs by the PKsTuple and run a few
calculations on the FinanceDataTuple. Is this the right path? I also
considered a tuple of case classes just for more readable access to the
fields, but I was worried that the ser/de overhead would be wasteful.
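
To make this concrete, here is roughly what I mean; all field names, the
delimiter, and the PK layout are invented for illustration:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._  // implicit conversions for (K, V) RDD ops

// The case-class alternative, trimmed to 3 of the 48+ columns:
case class FinanceData(price: Double, quantity: Double, fee: Double)

// Key each record by its PK tuple, keep the finance columns as the value.
def parseLine(line: String): ((String, Long), FinanceData) = {
  val c = line.split('\t')
  ((c(0), c(1).toLong),
    FinanceData(c(2).toDouble, c(3).toDouble, c(4).toDouble))
}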
By the way, is there a recommended operation for this multi-way join? The
data is heavily skewed: one log produces far more data than the others.
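
Sketched out, with "hdfs:///logs/..." as placeholder paths, sc an existing
SparkContext, and parseLineA/B/C assumed to look like parseLine above:

val a = sc.textFile("hdfs:///logs/logA").map(parseLineA)  // the big, skewed log
val b = sc.textFile("hdfs:///logs/logB").map(parseLineB)
val c = sc.textFile("hdfs:///logs/logC").map(parseLineC)

// Option 1: chained pairwise joins -> (pk, ((finA, finB), finC))
val joined = a.join(b).join(c)

// Option 2: one cogroup -> (pk, (values from a, values from b, values from c))
val grouped = a.cogroup(b, c)

Is one of these clearly preferable when one side dominates?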

Thanks,
Amit Mor