Posted to user@spark.apache.org by Rick Moritz <ra...@gmail.com> on 2017/04/11 17:15:50 UTC

Feasibility limits of joins in SparkSQL (Why does my driver explode with a large number of joins?)

Hi List,

I'm currently trying a naive implementation of a Data-Vault-style data
warehouse using SparkSQL, and I'm wondering whether there's an inherent
practical limit to query complexity beyond which SparkSQL stops functioning,
even for relatively small amounts of data.

I'm currently looking at a query that has 19 joins (including some
cartesian joins) in the main query, plus another instance of the same 19
joins in a subquery.
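To make the shape concrete, here is a stripped-down sketch of what such a
query looks like. The table and column names below are made up for
illustration only (not our actual schema), and the real query chains 19 such
joins and then repeats them in a subquery:

    import org.apache.spark.sql.SparkSession

    // Hypothetical Data-Vault-style tables (hubs, satellites, a link table).
    val spark = SparkSession.builder().appName("join-shape-sketch").getOrCreate()

    val joined = spark.sql("""
      SELECT l.order_bk, s1.customer_name, s2.product_name, c.rate
      FROM link_order l
      JOIN hub_customer h1 ON l.customer_hk = h1.customer_hk
      JOIN sat_customer s1 ON h1.customer_hk = s1.customer_hk
      JOIN hub_product  h2 ON l.product_hk  = h2.product_hk
      JOIN sat_product  s2 ON h2.product_hk = s2.product_hk
      CROSS JOIN ref_currency c
      WHERE s1.load_ts >= '2017-01-01'
    """)

    // Even the analyzed/optimized plan text becomes enormous once ~19 joins
    // are chained like this.
    joined.explain(true)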
What I'm seeing is that even with very restrictive filtering, which does get
pushed down, I run out of driver memory (36G) just a few minutes into a
~4900-task stage.
In fact, quite often merely using the Spark UI pushes the driver over the
GC overhead limit, and the job then fails.

Obviously, this way of organizing the data isn't ideal, and we're looking
into moving most of the joins into a relational DB. Nonetheless, the way
the driver blows up for no apparent reason is pretty worrying, and the
behaviour is largely independent of how much memory I give the driver.
I'm currently looking into getting a heap dump of the driver, to figure
out which objects are hogging its memory. Given that I don't consciously
collect() any significant amount of data, this behaviour surprises me.
I even suspect that the large query graph might be causing issues in the
Spark UI alone - I'll have to retry with the UI disabled.
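For reference, this is roughly how I plan to run that retry: turning the UI
off (or at least capping its retention) when building the session. These are
standard Spark config settings; driver memory itself is still passed via
--driver-memory on spark-submit as before.

    import org.apache.spark.sql.SparkSession

    // Retry with the web UI disabled, to rule out UI-side retention as the culprit.
    val spark = SparkSession.builder()
      .appName("datavault-join-retry")
      .config("spark.ui.enabled", "false")            // drop the UI entirely for this run
      // alternatively, keep the UI but limit what it retains:
      //.config("spark.ui.retainedJobs", "100")
      //.config("spark.ui.retainedStages", "100")
      //.config("spark.sql.ui.retainedExecutions", "10")
      .getOrCreate()

    // For the heap dump, I'll grab the driver JVM with plain jmap, e.g.:
    //   jmap -dump:live,format=b,file=driver-heap.hprof <driver-pid>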

If you have any experience with a significant number of joins in a single
query, I'd love to hear from you - maybe someone has also experienced
exploding-driver syndrome with non-obvious causes in this context.

Thanks for any input,

Rick