Posted to user@spark.apache.org by Jason White <ja...@shopify.com> on 2016/01/08 17:56:06 UTC

Efficient join multiple times

I'm trying to join a constant large-ish RDD to each RDD in a DStream, and I'm
trying to keep the join as efficient as possible so each batch finishes
within the batch window. I'm using PySpark on Spark 1.6.

I've tried the trick of keying the large RDD into (k, v) pairs and using
.partitionBy(100).persist() to pre-partition it ahead of each join. This works,
and definitely cuts down on the time. The DStream is also mapped to matching
(k, v) pairs.
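
Concretely, the setup looks roughly like this (keying on row[0], the variable
names, and the choice of 100 partitions are just placeholders for the real job):

    # big_rdd: the constant large-ish RDD, however it gets loaded.
    # Key it as (k, v), pre-partition it once, and keep it cached.
    big_rdd = big_rdd.map(lambda row: (row[0], row)) \
                     .partitionBy(100) \
                     .persist()

    # The DStream side gets keyed the same way, every batch.
    keyed_dstream = my_dstream.map(lambda row: (row[0], row))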

Then I'm joining the data using:
my_dstream.transform(lambda rdd: rdd.leftOuterJoin(big_rdd))
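
For context, with the pieces above the whole thing is roughly:

    # Join each micro-batch RDD against the pre-partitioned big_rdd
    joined_dstream = keyed_dstream.transform(
        lambda rdd: rdd.leftOuterJoin(big_rdd))

    # Some output action so the join actually runs every batch
    # (the count is just a stand-in for the real processing)
    joined_dstream.foreachRDD(lambda rdd: rdd.count())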

What I'm seeing is that, while the right side is partitioned exactly once
(saving me an expensive shuffle of it each batch), the data is still being
transferred across the network every batch. This is putting me up to or beyond
my batch window.

I thought the point of the .partitionBy() call was to keep the persisted data
at a fixed set of nodes, so that only the data from the smaller RDD would be
shuffled to those nodes?
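
In case it's useful, here's a rough way to see what each side of the join
thinks its partitioning is (check_and_join is just a throwaway debugging
helper, not part of the real job):

    def check_and_join(rdd):
        # Print partition count and partitioner for both sides of the join
        print("big_rdd:", big_rdd.getNumPartitions(), big_rdd.partitioner)
        print("batch:  ", rdd.getNumPartitions(), rdd.partitioner)
        return rdd.leftOuterJoin(big_rdd)

    joined_dstream = keyed_dstream.transform(check_and_join)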

I've also tried using a .rightOuterJoin instead, but it appears to make no
difference. Any suggestions?



