You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by davies <gi...@git.apache.org> on 2016/03/15 23:27:23 UTC

[GitHub] spark pull request: [SPARK-13918] [SQL] Merge SortMergeJoin and So...

GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/11743

    [SPARK-13918] [SQL] Merge SortMergeJoin and SortMergerOuterJoin

    ## What changes were proposed in this pull request?
    
    This PR just move some code from SortMergeOuterJoin into SortMergeJoin.
    
    This is for support codegen for outer join.
    
    ## How was this patch tested?
    
    existing tests.
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark gen_smjouter

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11743.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11743
    
----
commit 73028b8bd5b56803e17f57400f70007e6e2894be
Author: Davies Liu <da...@databricks.com>
Date:   2016-03-15T22:22:43Z

    Merge SortMergeJoin and SortMergerOuterJoin

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13918] [SQL] Merge SortMergeJoin and So...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/11743


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13918] [SQL] Merge SortMergeJoin and So...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11743#issuecomment-197054032
  
    **[Test build #53236 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53236/consoleFull)** for PR 11743 at commit [`73028b8`](https://github.com/apache/spark/commit/73028b8bd5b56803e17f57400f70007e6e2894be).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13918] [SQL] Merge SortMergeJoin and So...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11743#issuecomment-197071955
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53236/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13918] [SQL] Merge SortMergeJoin and So...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/11743#issuecomment-197126897
  
    cc @sameeragarwal for review


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13918] [SQL] Merge SortMergeJoin and So...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11743#issuecomment-197073607
  
    **[Test build #53245 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53245/consoleFull)** for PR 11743 at commit [`4008dcd`](https://github.com/apache/spark/commit/4008dcd9a083e501310bf0adb358cdde86bc0b76).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13918] [SQL] Merge SortMergeJoin and So...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/11743#issuecomment-197128362
  
    Actually this is a straightforward change. Merging in master.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13918] [SQL] Merge SortMergeJoin and So...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/11743#issuecomment-197052748
  
    cc @rxin 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13918] [SQL] Merge SortMergeJoin and So...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11743#issuecomment-197102131
  
    **[Test build #53245 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53245/consoleFull)** for PR 11743 at commit [`4008dcd`](https://github.com/apache/spark/commit/4008dcd9a083e501310bf0adb358cdde86bc0b76).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13918] [SQL] Merge SortMergeJoin and So...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/11743#issuecomment-197128428
  
    Can you fix all the indenting issues in your next pr?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13918] [SQL] Merge SortMergeJoin and So...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/11743#issuecomment-197157236
  
    Will do that when implement codegen support.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13918] [SQL] Merge SortMergeJoin and So...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11743#issuecomment-197102550
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13918] [SQL] Merge SortMergeJoin and So...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11743#issuecomment-197071954
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13918] [SQL] Merge SortMergeJoin and So...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/11743#issuecomment-197071817
  
    **[Test build #53236 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/53236/consoleFull)** for PR 11743 at commit [`73028b8`](https://github.com/apache/spark/commit/73028b8bd5b56803e17f57400f70007e6e2894be).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13918] [SQL] Merge SortMergeJoin and So...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11743#discussion_r56275827
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoin.scala ---
    @@ -570,3 +659,307 @@ private[joins] class SortMergeJoinScanner(
         } while (bufferedRow != null && keyOrdering.compare(streamedRowKey, bufferedRowKey) == 0)
       }
     }
    +
    +/**
    + * An iterator for outputting rows in left outer join.
    + */
    +private class LeftOuterIterator(
    +  smjScanner: SortMergeJoinScanner,
    +  rightNullRow: InternalRow,
    +  boundCondition: InternalRow => Boolean,
    +  resultProj: InternalRow => InternalRow,
    +  numOutputRows: LongSQLMetric)
    +  extends OneSideOuterIterator(
    +    smjScanner, rightNullRow, boundCondition, resultProj, numOutputRows) {
    +
    +  protected override def setStreamSideOutput(row: InternalRow): Unit = joinedRow.withLeft(row)
    +  protected override def setBufferedSideOutput(row: InternalRow): Unit = joinedRow.withRight(row)
    +}
    +
    +/**
    + * An iterator for outputting rows in right outer join.
    + */
    +private class RightOuterIterator(
    +  smjScanner: SortMergeJoinScanner,
    --- End diff --
    
    4 space here too


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13918] [SQL] Merge SortMergeJoin and So...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11743#discussion_r56275821
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoin.scala ---
    @@ -570,3 +659,307 @@ private[joins] class SortMergeJoinScanner(
         } while (bufferedRow != null && keyOrdering.compare(streamedRowKey, bufferedRowKey) == 0)
       }
     }
    +
    +/**
    + * An iterator for outputting rows in left outer join.
    + */
    +private class LeftOuterIterator(
    +  smjScanner: SortMergeJoinScanner,
    --- End diff --
    
    4 space here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13918] [SQL] Merge SortMergeJoin and So...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/11743#issuecomment-197102554
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/53245/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-13918] [SQL] Merge SortMergeJoin and So...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/11743#discussion_r56275439
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoin.scala ---
    @@ -570,3 +659,307 @@ private[joins] class SortMergeJoinScanner(
         } while (bufferedRow != null && keyOrdering.compare(streamedRowKey, bufferedRowKey) == 0)
       }
     }
    +
    +/**
    + * An iterator for outputting rows in left outer join.
    + */
    +private class LeftOuterIterator(
    +  smjScanner: SortMergeJoinScanner,
    +  rightNullRow: InternalRow,
    +  boundCondition: InternalRow => Boolean,
    +  resultProj: InternalRow => InternalRow,
    +  numOutputRows: LongSQLMetric)
    +  extends OneSideOuterIterator(
    +    smjScanner, rightNullRow, boundCondition, resultProj, numOutputRows) {
    +
    +  protected override def setStreamSideOutput(row: InternalRow): Unit = joinedRow.withLeft(row)
    +  protected override def setBufferedSideOutput(row: InternalRow): Unit = joinedRow.withRight(row)
    +}
    +
    +/**
    + * An iterator for outputting rows in right outer join.
    + */
    +private class RightOuterIterator(
    +  smjScanner: SortMergeJoinScanner,
    +  leftNullRow: InternalRow,
    +  boundCondition: InternalRow => Boolean,
    +  resultProj: InternalRow => InternalRow,
    +  numOutputRows: LongSQLMetric)
    +  extends OneSideOuterIterator(
    +    smjScanner, leftNullRow, boundCondition, resultProj, numOutputRows) {
    +
    +  protected override def setStreamSideOutput(row: InternalRow): Unit = joinedRow.withRight(row)
    +  protected override def setBufferedSideOutput(row: InternalRow): Unit = joinedRow.withLeft(row)
    +}
    +
    +/**
    + * An abstract iterator for sharing code between [[LeftOuterIterator]] and [[RightOuterIterator]].
    + *
    + * Each [[OneSideOuterIterator]] has a streamed side and a buffered side. Each row on the
    + * streamed side will output 0 or many rows, one for each matching row on the buffered side.
    + * If there are no matches, then the buffered side of the joined output will be a null row.
    + *
    + * In left outer join, the left is the streamed side and the right is the buffered side.
    + * In right outer join, the right is the streamed side and the left is the buffered side.
    + *
    + * @param smjScanner a scanner that streams rows and buffers any matching rows
    + * @param bufferedSideNullRow the default row to return when a streamed row has no matches
    + * @param boundCondition an additional filter condition for buffered rows
    + * @param resultProj how the output should be projected
    + * @param numOutputRows an accumulator metric for the number of rows output
    + */
    +private abstract class OneSideOuterIterator(
    +  smjScanner: SortMergeJoinScanner,
    +  bufferedSideNullRow: InternalRow,
    +  boundCondition: InternalRow => Boolean,
    +  resultProj: InternalRow => InternalRow,
    +  numOutputRows: LongSQLMetric) extends RowIterator {
    +
    +  // A row to store the joined result, reused many times
    +  protected[this] val joinedRow: JoinedRow = new JoinedRow()
    +
    +  // Index of the buffered rows, reset to 0 whenever we advance to a new streamed row
    +  private[this] var bufferIndex: Int = 0
    +
    +  // This iterator is initialized lazily so there should be no matches initially
    +  assert(smjScanner.getBufferedMatches.length == 0)
    +
    +  // Set output methods to be overridden by subclasses
    +  protected def setStreamSideOutput(row: InternalRow): Unit
    +  protected def setBufferedSideOutput(row: InternalRow): Unit
    +
    +  /**
    +   * Advance to the next row on the stream side and populate the buffer with matches.
    +   * @return whether there are more rows in the stream to consume.
    +   */
    +  private def advanceStream(): Boolean = {
    +    bufferIndex = 0
    +    if (smjScanner.findNextOuterJoinRows()) {
    +      setStreamSideOutput(smjScanner.getStreamedRow)
    +      if (smjScanner.getBufferedMatches.isEmpty) {
    +        // There are no matching rows in the buffer, so return the null row
    +        setBufferedSideOutput(bufferedSideNullRow)
    +      } else {
    +        // Find the next row in the buffer that satisfied the bound condition
    +        if (!advanceBufferUntilBoundConditionSatisfied()) {
    +          setBufferedSideOutput(bufferedSideNullRow)
    +        }
    +      }
    +      true
    +    } else {
    +      // Stream has been exhausted
    +      false
    +    }
    +  }
    +
    +  /**
    +   * Advance to the next row in the buffer that satisfies the bound condition.
    +   * @return whether there is such a row in the current buffer.
    +   */
    +  private def advanceBufferUntilBoundConditionSatisfied(): Boolean = {
    +    var foundMatch: Boolean = false
    +    while (!foundMatch && bufferIndex < smjScanner.getBufferedMatches.length) {
    +      setBufferedSideOutput(smjScanner.getBufferedMatches(bufferIndex))
    +      foundMatch = boundCondition(joinedRow)
    +      bufferIndex += 1
    +    }
    +    foundMatch
    +  }
    +
    +  override def advanceNext(): Boolean = {
    +    val r = advanceBufferUntilBoundConditionSatisfied() || advanceStream()
    +    if (r) numOutputRows += 1
    +    r
    +  }
    +
    +  override def getRow: InternalRow = resultProj(joinedRow)
    +}
    +
    +private class SortMergeFullOuterJoinScanner(
    +  leftKeyGenerator: Projection,
    --- End diff --
    
    4 sapce indent here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org