You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Nikolas Vanderhoof (JIRA)" <ji...@apache.org> on 2019/05/09 00:57:00 UTC
[jira] [Updated] (SPARK-27359) Joins on some array functions can be optimized

     [ https://issues.apache.org/jira/browse/SPARK-27359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nikolas Vanderhoof updated SPARK-27359:
---------------------------------------
    Description: 
I encounter these cases frequently, and implemented the optimization manually (as shown here). If others experience this as well, perhaps it would be good to add appropriate tree transformations into catalyst. 

h1. Case 1
A join like this:
{code:scala}
left.join(
  right,
  arrays_overlap(left("a"), right("b"))     // Creates a cartesian product in the logical plan
)
{code}

will produce the same results as:
{code:scala}
{
  val leftPrime = left.withColumn("exploded_a", explode(col("a")))
  val rightPrime = right.withColumn("exploded_b", explode(col("b")))

  leftPrime.join(
    rightPrime,
    leftPrime("exploded_a") === rightPrime("exploded_b")
      // Equijoin doesn't produce cartesian product
  ).drop("exploded_a", "exploded_b").distinct
}
{code}

h1. Case 2
A join like this:
{code:scala}
left.join(
  right,
  array_contains(left("arr"), right("value")) // Cartesian product in logical plan
)
{code}

will produce the same results as:
{code:scala}
{
  val leftPrime = left.withColumn("exploded_arr", explode(col("arr")))

  leftPrime.join(
    right,
    leftPrime("exploded_arr") === right("value") // Fast equijoin
  ).drop("exploded_arr").distinct
}
{code}

h1. Case 3
A join like this:
{code:scala}
left.join(
  right,
  array_contains(right("arr"), left("value")) // Cartesian product in logical plan
)
{code}

will produce the same results as:
{code:scala}
{
  val rightPrime = right.withColumn("exploded_arr", explode(col("arr")))

  left.join(
    rightPrime,
    left("value") === rightPrime("exploded_arr") // Fast equijoin
  ).drop("exploded_arr").distinct
}
{code}

  was:
I encounter these cases frequently, and implemented the optimization manually (as shown here). If others experience this as well, perhaps it would be good to add appropriate tree transformations into catalyst. I can create some rough draft implementations but expect I will need assistance when it comes to resolving the generating expressions in the logical plan.

h1. Case 1
A join like this:
{code:scala}
left.join(
  right,
  arrays_overlap(left("a"), right("b"))     // Creates a cartesian product in the logical plan
)
{code}

will produce the same results as:
{code:scala}
{
  val leftPrime = left.withColumn("exploded_a", explode(col("a")))
  val rightPrime = right.withColumn("exploded_b", explode(col("b")))

  leftPrime.join(
    rightPrime,
    leftPrime("exploded_a") === rightPrime("exploded_b")
      // Equijoin doesn't produce cartesian product
  ).drop("exploded_a", "exploded_b").distinct
}
{code}

h1. Case 2
A join like this:
{code:scala}
left.join(
  right,
  array_contains(left("arr"), right("value")) // Cartesian product in logical plan
)
{code}

will produce the same results as:
{code:scala}
{
  val leftPrime = left.withColumn("exploded_arr", explode(col("arr")))

  leftPrime.join(
    right,
    leftPrime("exploded_arr") === right("value") // Fast equijoin
  ).drop("exploded_arr").distinct
}
{code}

h1. Case 3
A join like this:
{code:scala}
left.join(
  right,
  array_contains(right("arr"), left("value")) // Cartesian product in logical plan
)
{code}

will produce the same results as:
{code:scala}
{
  val rightPrime = right.withColumn("exploded_arr", explode(col("arr")))

  left.join(
    rightPrime,
    left("value") === rightPrime("exploded_arr") // Fast equijoin
  ).drop("exploded_arr").distinct
}
{code}


> Joins on some array functions can be optimized
> ----------------------------------------------
>
>                 Key: SPARK-27359
>                 URL: https://issues.apache.org/jira/browse/SPARK-27359
>             Project: Spark
>          Issue Type: Improvement
>          Components: Optimizer, SQL
>    Affects Versions: 3.0.0
>            Reporter: Nikolas Vanderhoof
>            Priority: Minor
>
> I encounter these cases frequently, and implemented the optimization manually (as shown here). If others experience this as well, perhaps it would be good to add appropriate tree transformations into catalyst. 
> h1. Case 1
> A join like this:
> {code:scala}
> left.join(
>   right,
>   arrays_overlap(left("a"), right("b"))     // Creates a cartesian product in the logical plan
> )
> {code}
> will produce the same results as:
> {code:scala}
> {
>   val leftPrime = left.withColumn("exploded_a", explode(col("a")))
>   val rightPrime = right.withColumn("exploded_b", explode(col("b")))
>   leftPrime.join(
>     rightPrime,
>     leftPrime("exploded_a") === rightPrime("exploded_b")
>       // Equijoin doesn't produce cartesian product
>   ).drop("exploded_a", "exploded_b").distinct
> }
> {code}
> h1. Case 2
> A join like this:
> {code:scala}
> left.join(
>   right,
>   array_contains(left("arr"), right("value")) // Cartesian product in logical plan
> )
> {code}
> will produce the same results as:
> {code:scala}
> {
>   val leftPrime = left.withColumn("exploded_arr", explode(col("arr")))
>   leftPrime.join(
>     right,
>     leftPrime("exploded_arr") === right("value") // Fast equijoin
>   ).drop("exploded_arr").distinct
> }
> {code}
> h1. Case 3
> A join like this:
> {code:scala}
> left.join(
>   right,
>   array_contains(right("arr"), left("value")) // Cartesian product in logical plan
> )
> {code}
> will produce the same results as:
> {code:scala}
> {
>   val rightPrime = right.withColumn("exploded_arr", explode(col("arr")))
>   left.join(
>     rightPrime,
>     left("value") === rightPrime("exploded_arr") // Fast equijoin
>   ).drop("exploded_arr").distinct
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org