You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by Eric Beabes <ma...@gmail.com> on 2021/08/12 23:10:13 UTC

Replacing BroadcastNestedLoopJoin

We’ve two datasets that look like this:

Dataset A: App specific data that contains (among other fields): ip_address


Dataset B: Location data that contains start_ip_address_int,
end_ip_address_int, latitude, longitude

We’re (left) joining these two datasets as: A.ip_address >=
B.start_ip_address_int AND A.ip_address <= B.end_ip_address_int. When
there's a match, we pick latitude & longitude from Dataset B.

This works fine but it takes a LONG time (over 20 minutes) to complete for
SMALL datasets.

Dataset A => Usually contains 110,000

Dataset B => Contains 12.5 Million rows. This is “static” data. Hasn’t
changed since August 2020.

When we looked at the DAG, it seems a BroadcastNestedLoopJoin is getting
used which supposedly is very slow. It seems Spark selects it by default
when we have “in equal” conditions such as “greater than”, “less than”.

What’s the best way to speed up this process? Thanks in advance.