Posted to user@spark.apache.org by Kumar sp <kr...@gmail.com> on 2019/02/13 16:07:42 UTC
Design recommendation
Hello, I need a design recommendation.
I need to compute a couple of aggregations with minimal shuffling and better
performance. I have a nested structure in which, say, a class has n students,
and the structure is similar to this:
{ classId: String,
StudentId: String,
Score: Int,
AreaCode: String }
Now I have to validate two things: first, that all students in a class are
part of the same area code, and second, whether any student is taking more
than one class. I can group by classId and count(AreaCode) to get
(classId, count), and similarly group by StudentId and count(classId) to get
an aggregated structure, then join these two on, say, StudentId. But this
takes multiple shuffles, and the data is huge, so I can't really use a
broadcast join.
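To make the two checks concrete, here is a minimal sketch in plain Python
(not Spark) over a few made-up records. The sample data and the function
names are hypothetical; the sketch only illustrates the logic of the two
validations on a small in-memory list and says nothing about how they would
perform on a cluster.

```python
from collections import defaultdict

# Hypothetical sample records using the fields from the structure above.
records = [
    {"classId": "c1", "StudentId": "s1", "Score": 80, "AreaCode": "A"},
    {"classId": "c1", "StudentId": "s2", "Score": 75, "AreaCode": "A"},
    {"classId": "c2", "StudentId": "s1", "Score": 90, "AreaCode": "A"},
    {"classId": "c2", "StudentId": "s3", "Score": 60, "AreaCode": "B"},
]


def classes_with_mixed_area_codes(rows):
    """Return the classes whose students span more than one area code."""
    areas_by_class = defaultdict(set)
    for row in rows:
        areas_by_class[row["classId"]].add(row["AreaCode"])
    return sorted(c for c, areas in areas_by_class.items() if len(areas) > 1)


def students_in_multiple_classes(rows):
    """Return the students who appear in more than one class."""
    classes_by_student = defaultdict(set)
    for row in rows:
        classes_by_student[row["StudentId"]].add(row["classId"])
    return sorted(s for s, cls in classes_by_student.items() if len(cls) > 1)


mixed = classes_with_mixed_area_codes(records)
multi = students_in_multiple_classes(records)
print(mixed)
print(multi)
```

In this sketch each check is grouped by a different key (classId for the
area-code check, StudentId for the multi-class check), and each produces its
own result list.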
Can you please suggest a better approach?
Regards,
sk