Posted to dev@spark.apache.org by David Smith <da...@gmail.com> on 2015/08/27 03:10:43 UTC

Differing performance in self joins

I've noticed that two queries, which return identical results, have very
different performance. I'd be interested in any hints about how to avoid
problems like this.

The DataFrame df contains a string field "series" and an integer field "eday",
the number of days since (or before) the 1970-01-01 epoch.

I'm doing some analysis over a sliding date window and, for now, avoiding
UDAFs. I'm therefore using a self join. First, I create 

val laggard = df
  .withColumnRenamed("series", "p_series")
  .withColumnRenamed("eday", "p_eday")

Then, the following query runs in 16s:

df.join(laggard,
  (df("series") === laggard("p_series")) &&
  (df("eday") === (laggard("p_eday") + 1))
).count

while the following query runs in 4 - 6 minutes:

df.join(laggard,
  (df("series") === laggard("p_series")) &&
  ((df("eday") - laggard("p_eday")) === 1)
).count

It's worth noting that the series term is necessary to keep the query from
doing a complete Cartesian product over the data.

Ideally, I'd like to look at lags of more than one day, but the following is
equally slow:

df.join(laggard,
  (df("series") === laggard("p_series")) &&
  (df("eday") - laggard("p_eday")).between(1, 7)
).count

Any advice about the general principle at work here would be welcome.

Thanks,
David



Re: Differing performance in self joins

Posted by Michael Armbrust <mi...@databricks.com>.
-dev +user

I'd suggest running .explain() on both DataFrames to understand the
performance better. The problem is likely that we have a pattern that
looks for equality predicates where each side can be evaluated using
columns from only one side of the join; we turn those into a hash join.
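
For example (a sketch; the exact operator names in the output depend on the
Spark version), comparing the two physical plans should make the difference
visible:

// Fast query: the eday equality can likely be used as part of the join key.
df.join(laggard,
  (df("series") === laggard("p_series")) &&
  (df("eday") === (laggard("p_eday") + 1))).explain()

// Slow query: only the series equality is usable as a join key; the eday
// arithmetic is left as a condition checked after matching on series.
df.join(laggard,
  (df("series") === laggard("p_series")) &&
  ((df("eday") - laggard("p_eday")) === 1)).explain()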

(df("eday") - laggard("p_eday")) === 1) is pretty tricky for us to
understand, and so the pattern misses the possible optimized plan.
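
One way to keep a multi-day lag in a form the pattern can recognize (a sketch
using the names from the original message, untested here) is to union one
exact-lag equi-join per offset:

// Sketch: express the 1..7-day lag as a union of exact-lag joins, so each
// individual join keeps an equality predicate on both series and eday.
val laggedPairs = (1 to 7)
  .map(k => df.join(laggard,
    (df("series") === laggard("p_series")) &&
    (df("eday") === (laggard("p_eday") + k))))
  .reduce(_ unionAll _)
laggedPairs.count

For any pair of rows the day difference matches at most one offset k, so this
should return the same rows as the between(1, 7) version while every
individual join stays an equi-join on (series, eday).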
