Posted to user@spark.apache.org by Kumar sp <kr...@gmail.com> on 2019/02/13 16:07:42 UTC

Design recommendation

Hello, I need a design recommendation.

I need to run a couple of calculations with minimal shuffling and good
performance. I have a nested structure in which, say, a class has n students,
and the structure is similar to this:

{ classId: String,
  StudentId: String,
  Score: Int,
  AreaCode: String }

Now I have to validate two things: first, that all the students in a class
belong to the same area code, and second, whether any student is taking more
than one class. I can group by classId and count(AreaCode) to get
(classId, count), and similarly group by StudentId and count(classId) to get
another aggregated structure, then join the two on, say, StudentId. But this
takes multiple shuffles, and the data is huge, so I can't really use a
broadcast join.
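
To make the two checks concrete, here is a minimal sketch on a tiny in-memory
sample. It is plain Python, not Spark code, and it says nothing about
shuffles; it only pins down what the validations are meant to compute. The
sample rows and the name validate are hypothetical, and it assumes "count"
above means a count of distinct values.

```python
from collections import defaultdict

# Hypothetical sample rows: (classId, StudentId, Score, AreaCode)
rows = [
    ("c1", "s1", 90, "A"),
    ("c1", "s2", 85, "A"),
    ("c2", "s1", 70, "A"),
    ("c2", "s3", 60, "B"),
]


def validate(rows):
    """Return (classes with mixed area codes, students in more than one class)."""
    areas_by_class = defaultdict(set)      # classId -> distinct AreaCode values
    classes_by_student = defaultdict(set)  # StudentId -> distinct classId values
    for class_id, student_id, _score, area_code in rows:
        areas_by_class[class_id].add(area_code)
        classes_by_student[student_id].add(class_id)

    # Check 1: a class fails if its students span more than one area code.
    mixed_area_classes = sorted(
        c for c, areas in areas_by_class.items() if len(areas) > 1
    )
    # Check 2: flag students who appear in more than one class.
    multi_class_students = sorted(
        s for s, classes in classes_by_student.items() if len(classes) > 1
    )
    return mixed_area_classes, multi_class_students


mixed, multi = validate(rows)
print(mixed)  # ['c2']
print(multi)  # ['s1']
```

In this sample, class c2 mixes area codes A and B, and student s1 is enrolled
in both c1 and c2.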

Can you please suggest a better approach?

Regards,
sk