You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by Michael Armbrust <mi...@databricks.com> on 2015/08/27 03:27:37 UTC
Re: Differing performance in self joins

-dev +user

I'd suggest running .explain() on both dataframes to understand the
performance better.  The problem is likely that we have a pattern that
looks for cases where you have an equality predicate where either side can
be evaluated using one side of the join.  We turn this into a hash join.

(df("eday") - laggard("p_eday")) === 1) is pretty tricky for us to
understand, and so the pattern misses the possible optimized plan.

On Wed, Aug 26, 2015 at 6:10 PM, David Smith <da...@gmail.com> wrote:

> I've noticed that two queries, which return identical results, have very
> different performance. I'd be interested in any hints about how avoid
> problems like this.
>
> The DataFrame df contains a string field "series" and an integer "eday",
> the
> number of days since (or before) the 1970-01-01 epoch.
>
> I'm doing some analysis over a sliding date window and, for now, avoiding
> UDAFs. I'm therefore using a self join. First, I create
>
> val laggard = df.withColumnRenamed("series",
> "p_series").withColumnRenamed("eday", "p_eday")
>
> Then, the following query runs in 16s:
>
> df.join(laggard, (df("series") === laggard("p_series")) && (df("eday") ===
> (laggard("p_eday") + 1))).count
>
> while the following query runs in 4 - 6 minutes:
>
> df.join(laggard, (df("series") === laggard("p_series")) && ((df("eday") -
> laggard("p_eday")) === 1)).count
>
> It's worth noting that the series term is necessary to keep the query from
> doing a complete cartesian product over the data.
>
> Ideally, I'd like to look at lags of more than one day, but the following
> is
> equally slow:
>
> df.join(laggard, (df("series") === laggard("p_series")) && (df("eday") -
> laggard("p_eday")).between(1,7)).count
>
> Any advice about the general principle at work here would be welcome.
>
> Thanks,
> David
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Differing-performance-in-self-joins-tp13864.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
> For additional commands, e-mail: dev-help@spark.apache.org
>
>