Posted to user@spark.apache.org by YaoPau <jo...@gmail.com> on 2014/11/18 18:26:16 UTC

RDD needs several JOINs and COUNTs ... how do I optimize?

Each row of my RDD looks like this:

line = (key1, key2, x1, x2, x3, x4)

I have another table with rows of (key1, y1, y2, y3) and another with (key2, z1,
z2).  I want to JOIN them all together and then take COUNTs of x2, x4, y1,
y2, y3, z1, z2 every minute in a Spark Streaming job.  Here is my
current process (and I think there's probably a more efficient way):

map line to: (key1, (key2, x1, x2, x3, x4))
join #1 to create (key1, ((key2, x1, x2, x3, x4), (y1, y2, y3)))
map line to: (key2, (key1, x1, x2, x3, x4, y1, y2, y3))
join #2 to create (key2, ((key1, x1, x2, x3, x4, y1, y2, y3), (z1, z2)))
map line to: (x2, 1)
group by key and count.
map line to: (x4,1)
group by key and count.
repeat for y1, y2, y3, z1, z2
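
To make the data flow concrete, here is a plain-Python sketch of the steps above, using lists and dicts in place of RDDs. All field values and keys are made up for illustration; the dict lookups stand in for the two joins:

```python
from collections import Counter

# Rows of the main stream: (key1, key2, x1, x2, x3, x4)
lines = [
    ("a", "p", 1, "x2v", 3, "x4v"),
    ("b", "q", 1, "x2w", 3, "x4v"),
]
# The two lookup tables: (key1 -> (y1, y2, y3)) and (key2 -> (z1, z2))
table_y = {"a": ("y1v", "y2v", "y3v"), "b": ("y1w", "y2v", "y3v")}
table_z = {"p": ("z1v", "z2v"), "q": ("z1v", "z2w")}

# Joins #1 and #2: attach the y fields by key1 and the z fields by key2,
# yielding (key1, key2, x1, x2, x3, x4, y1, y2, y3, z1, z2)
joined = [
    (key1, key2, x1, x2, x3, x4) + table_y[key1] + table_z[key2]
    for (key1, key2, x1, x2, x3, x4) in lines
]

# One pass per field, as in the steps above: map to (value, 1), then count
x2_counts = Counter(row[3] for row in joined)
x4_counts = Counter(row[5] for row in joined)
# ... repeated for y1, y2, y3, z1, z2
```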

That's a whole lot of re-mapping and creating of new objects.  Is there a
faster way?
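
For context, one pattern that avoids the seven separate group-by passes is to emit one ((field, value), 1) pair per counted field per row and reduce by key once (flatMap + reduceByKey in Spark terms). This is only an illustrative plain-Python sketch, not something stated in the post; the field indices below assume the joined row layout (key1, key2, x1, x2, x3, x4, y1, y2, y3, z1, z2):

```python
from collections import Counter

# Two sample joined rows (made-up values, same layout as after join #2)
joined = [
    ("a", "p", 1, "x2v", 3, "x4v", "y1v", "y2v", "y3v", "z1v", "z2v"),
    ("b", "q", 1, "x2w", 3, "x4v", "y1w", "y2v", "y3v", "z1v", "z2w"),
]

# Positions of the seven fields to count, keyed by a label
fields = {"x2": 3, "x4": 5, "y1": 6, "y2": 7, "y3": 8, "z1": 9, "z2": 10}

# flatMap-style: one ((field, value), 1) pair per field per row,
# then a single reduce-by-key instead of seven group-by passes
counts = Counter(
    (name, row[idx]) for row in joined for name, idx in fields.items()
)
```

In Spark this shape would be a single `flatMap` followed by `reduceByKey(add)`, so every per-field count comes out of one shuffle.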

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-needs-several-JOINs-and-COUNTs-how-do-I-optimize-tp19203.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org