Posted to issues@spark.apache.org by "Nikolas Vanderhoof (JIRA)" <ji...@apache.org> on 2019/04/03 22:59:00 UTC

[jira] [Created] (SPARK-27359) Joins on some array functions can be optimized

Nikolas Vanderhoof created SPARK-27359:
------------------------------------------

             Summary: Joins on some array functions can be optimized
                 Key: SPARK-27359
                 URL: https://issues.apache.org/jira/browse/SPARK-27359
             Project: Spark
          Issue Type: Improvement
          Components: Optimizer, SQL
    Affects Versions: 3.0.0
            Reporter: Nikolas Vanderhoof


I encounter these cases frequently and have implemented the optimization manually each time, as shown below. If others run into this as well, it may be worth adding the corresponding tree transformations to Catalyst. I can put together rough draft implementations, but I expect to need help resolving the generator expressions in the logical plan.

h1. Case 1
A join like this:
{code:scala}
left.join(
  right,
  arrays_overlap(left("a"), right("b"))     // Creates a cartesian product in the logical plan
)
{code}

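will produce the same results as the following, provided the inputs contain no duplicate rows (the final distinct would collapse them):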
{code:scala}
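{
  import org.apache.spark.sql.functions.{col, explode}

  // One output row per array element. Rows with null or empty arrays are dropped,
  // which is safe because they could never satisfy arrays_overlap.
  val leftPrime = left.withColumn("exploded_a", explode(col("a")))
  val rightPrime = right.withColumn("exploded_b", explode(col("b")))

  leftPrime.join(
    rightPrime,
    leftPrime("exploded_a") === rightPrime("exploded_b") // Equijoin, no cartesian product
  ).drop("exploded_a", "exploded_b").distinct // distinct removes pairs matched on more than one element
}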
{code}
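
To illustrate why the rewrite in Case 1 matches the original join, here is a minimal sketch that models both forms with plain Scala collections rather than Spark. The names OverlapJoinModel, L, R, overlapJoin and explodeJoin are hypothetical and exist only for this illustration, and the model ignores null handling. It also shows the one place where the two forms differ: when an input contains duplicate rows, the final distinct collapses them.
{code:scala}
// Plain-Scala model of Case 1; no Spark required. All names are hypothetical.
object OverlapJoinModel {
  final case class L(id: Int, a: Seq[Int])
  final case class R(id: Int, b: Seq[Int])

  // Reference semantics: test every pair of rows, as the cartesian plan does.
  def overlapJoin(left: Seq[L], right: Seq[R]): Seq[(L, R)] =
    for {
      l <- left
      r <- right
      if l.a.exists(x => r.b.contains(x))
    } yield (l, r)

  // Rewrite: explode both arrays, join on element equality, drop the element, dedupe.
  def explodeJoin(left: Seq[L], right: Seq[R]): Seq[(L, R)] = {
    val leftPrime = for (l <- left; x <- l.a) yield (x, l)
    val rightPrime = for (r <- right; y <- r.b) yield (y, r)

    // Hash lookup by element stands in for the equijoin.
    val rightByElem: Map[Int, Seq[R]] =
      rightPrime.groupBy(_._1).map { case (elem, rows) => elem -> rows.map(_._2) }

    val joined = for {
      (x, l) <- leftPrime
      r <- rightByElem.getOrElse(x, Seq.empty[R])
    } yield (l, r)

    joined.distinct
  }

  def main(args: Array[String]): Unit = {
    val left = Seq(L(1, Seq(1, 2, 3)), L(2, Seq(9)), L(3, Seq.empty))
    val right = Seq(R(10, Seq(2, 3)), R(11, Seq(7)), R(12, Seq(3, 9)))

    // With no duplicate input rows, both forms return the same pairs.
    assert(overlapJoin(left, right).toSet == explodeJoin(left, right).toSet)

    // With duplicate input rows, the reference keeps both matches but distinct keeps one.
    val dupLeft = Seq(L(1, Seq(1)), L(1, Seq(1)))
    val single = Seq(R(10, Seq(1)))
    assert(overlapJoin(dupLeft, single).size == 2)
    assert(explodeJoin(dupLeft, single).size == 1)
  }
}
{code}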

h1. Case 2
A join like this:
{code:scala}
left.join(
  right,
  array_contains(left("arr"), right("value")) // Cartesian product in logical plan
)
{code}

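will produce the same results as the following, provided the inputs contain no duplicate rows (the final distinct would collapse them):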
{code:scala}
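{
  import org.apache.spark.sql.functions.{col, explode}

  // One row per array element; null or empty arrays yield no rows and could not match anyway
  val leftPrime = left.withColumn("exploded_arr", explode(col("arr")))

  leftPrime.join(
    right,
    leftPrime("exploded_arr") === right("value") // Fast equijoin
  ).drop("exploded_arr").distinct // distinct removes repeats caused by repeated array elements
}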
{code}

h1. Case 3
A join like this:
{code:scala}
left.join(
  right,
  array_contains(right("arr"), left("value")) // Cartesian product in logical plan
)
{code}

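will produce the same results as the following, provided the inputs contain no duplicate rows (the final distinct would collapse them):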
{code:scala}
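{
  import org.apache.spark.sql.functions.{col, explode}

  // One row per array element; null or empty arrays yield no rows and could not match anyway
  val rightPrime = right.withColumn("exploded_arr", explode(col("arr")))

  left.join(
    rightPrime,
    left("value") === rightPrime("exploded_arr") // Fast equijoin
  ).drop("exploded_arr").distinct // distinct removes repeats caused by repeated array elements
}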
{code}
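
For Cases 2 and 3, one possible variation, assuming Spark 2.4 or later (where array_distinct is available), is to de-duplicate each array before exploding it. A given pair of rows can then match at most once, so the trailing distinct is no longer needed and duplicate input rows are preserved. The sketch below shows this for Case 2; Case 3 is the mirror image. ContainsJoinSketch and containsJoin are hypothetical names, not Spark APIs, and the column name exploded_arr is assumed not to clash with an existing column.
{code:scala}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_contains, array_distinct, explode}

object ContainsJoinSketch {
  // Hypothetical helper (not a Spark API): Case 2 without the trailing distinct.
  def containsJoin(left: DataFrame, right: DataFrame, arrCol: String, valueCol: String): DataFrame = {
    // De-duplicate each array first so one (left row, right row) pair matches at most once.
    val leftPrime = left.withColumn("exploded_arr", explode(array_distinct(left(arrCol))))
    leftPrime
      .join(right, leftPrime("exploded_arr") === right(valueCol)) // Equijoin
      .drop("exploded_arr")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("contains-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Two identical left rows, to check that duplicate input rows survive the rewrite.
    val left = Seq(Seq(1, 1, 2), Seq(1, 1, 2), Seq(5)).toDF("arr")
    val right = Seq(1, 2, 3).toDF("value")

    val reference = left.join(right, array_contains(left("arr"), right("value")))
    val rewritten = containsJoin(left, right, "arr", "value")

    // Compare the two results as multisets of rows.
    val expected = reference.collect().toSeq.map(_.toString).sorted
    val actual = rewritten.collect().toSeq.map(_.toString).sorted
    assert(expected == actual)

    spark.stop()
  }
}
{code}
This trick does not carry over to Case 1 as is, because two arrays can still share more than one distinct element.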


