You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by gczsjdy <gi...@git.apache.org> on 2017/12/01 17:04:30 UTC

[GitHub] spark pull request #19862: Make SortMergeJoin read less data when wholeStage...

GitHub user gczsjdy opened a pull request:

    https://github.com/apache/spark/pull/19862

    Make SortMergeJoin read less data when wholeStageCodegen is off

    ## What changes were proposed in this pull request?
    
    In SortMergeJoin(with wholeStageCodegen), an optimization already exists: if the left table of a partition is empty then there is no need to read the right table of this corresponding partition. This benefits the case in which many partitions of left table is empty and the right table is big.
    
    While in the code path without wholeStageCodegen, this optimization doesn't happen. This is mainly due to the lack of optimization in codegen-SortMergeJoin & the lack of support in `SortExec`, which doesn't do lazy evaluation. UI when wholeStageCodegen is off:
    <img width="908" alt="off_wholestage_before" src="https://user-images.githubusercontent.com/7685352/33493586-8311ac58-d6fb-11e7-816c-7a0fb2065345.PNG">
    
    When the switch is on: 
    ![image](https://user-images.githubusercontent.com/7685352/33493675-c821b81a-d6fb-11e7-8cf8-2e5baff75be3.png)
    
    This PR lazy evaluates sort, and only trigger the right table read after reading the left table and find it's not empty.
    
    After this PR, with wholeStageCodegen off:
    ![image](https://user-images.githubusercontent.com/7685352/33493784-2e1ee89a-d6fc-11e7-8201-71273de0b857.png)
    
    ## How was this patch tested?
    
    Existed test suites.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gczsjdy/spark smj_less_read

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19862.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19862
    
----
commit a8511e8b7e0240c471f88e703545301425960901
Author: GuoChenzhao <ch...@intel.com>
Date:   2017-11-28T09:01:34Z

    Init solution

commit a5c14b473636e571c6d4f17798f220be080a27a1
Author: GuoChenzhao <ch...@intel.com>
Date:   2017-11-28T09:31:07Z

    Style

commit a26ca57d56b0b2df81daa82da49f4ff564fc10f5
Author: GuoChenzhao <ch...@intel.com>
Date:   2017-11-28T09:37:54Z

    Comments

commit 6d875f84de95d67f84ef9774cb6b6ee8273d46a6
Author: GuoChenzhao <ch...@intel.com>
Date:   2017-12-01T11:32:06Z

    lazy evaluation in SortExec

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle rea...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19862#discussion_r154560524
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala ---
    @@ -674,8 +674,9 @@ private[joins] class SortMergeJoinScanner(
       private[this] val bufferedMatches =
         new ExternalAppendOnlyUnsafeRowArray(inMemoryThreshold, spillThreshold)
     
    -  // Initialization (note: do _not_ want to advance streamed here).
    -  advancedBufferedToRowWithNullFreeJoinKey()
    +  // Initialization (note: do _not_ want to advance streamed here). This is made lazy to prevent
    +  // unnecessary trigger of calculation.
    +  private lazy val advancedBufferedIterRes = advancedBufferedToRowWithNullFreeJoinKey()
    --- End diff --
    
    Isn't it the same to simply call `advancedBufferedToRowWithNullFreeJoinKey` at needed places?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle rea...

Posted by gczsjdy <gi...@git.apache.org>.

Github user gczsjdy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19862#discussion_r154564488
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala ---
    @@ -699,39 +700,44 @@ private[joins] class SortMergeJoinScanner(
           matchJoinKey = null
           bufferedMatches.clear()
           false
    -    } else if (matchJoinKey != null && keyOrdering.compare(streamedRowKey, matchJoinKey) == 0) {
    -      // The new streamed row has the same join key as the previous row, so return the same matches.
    -      true
    -    } else if (bufferedRow == null) {
    -      // The streamed row's join key does not match the current batch of buffered rows and there are
    -      // no more rows to read from the buffered iterator, so there can be no more matches.
    -      matchJoinKey = null
    -      bufferedMatches.clear()
    -      false
         } else {
    -      // Advance both the streamed and buffered iterators to find the next pair of matching rows.
    -      var comp = keyOrdering.compare(streamedRowKey, bufferedRowKey)
    -      do {
    -        if (streamedRowKey.anyNull) {
    -          advancedStreamed()
    -        } else {
    -          assert(!bufferedRowKey.anyNull)
    -          comp = keyOrdering.compare(streamedRowKey, bufferedRowKey)
    -          if (comp > 0) advancedBufferedToRowWithNullFreeJoinKey()
    -          else if (comp < 0) advancedStreamed()
    -        }
    -      } while (streamedRow != null && bufferedRow != null && comp != 0)
    -      if (streamedRow == null || bufferedRow == null) {
    -        // We have either hit the end of one of the iterators, so there can be no more matches.
    +      // To make sure vars like bufferedRow is set
    +      advancedBufferedIterRes
    +      if (matchJoinKey != null && keyOrdering.compare(streamedRowKey, matchJoinKey) == 0) {
    --- End diff --
    
    I agree with you.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    **[Test build #84840 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84840/testReport)** for PR 19862 at commit [`80231ab`](https://github.com/apache/spark/commit/80231ab670d5bf1640fad3a9741b6315dba9d1bb).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84836/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle rea...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19862#discussion_r154560155
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala ---
    @@ -699,39 +700,44 @@ private[joins] class SortMergeJoinScanner(
           matchJoinKey = null
           bufferedMatches.clear()
           false
    -    } else if (matchJoinKey != null && keyOrdering.compare(streamedRowKey, matchJoinKey) == 0) {
    -      // The new streamed row has the same join key as the previous row, so return the same matches.
    -      true
    -    } else if (bufferedRow == null) {
    -      // The streamed row's join key does not match the current batch of buffered rows and there are
    -      // no more rows to read from the buffered iterator, so there can be no more matches.
    -      matchJoinKey = null
    -      bufferedMatches.clear()
    -      false
         } else {
    -      // Advance both the streamed and buffered iterators to find the next pair of matching rows.
    -      var comp = keyOrdering.compare(streamedRowKey, bufferedRowKey)
    -      do {
    -        if (streamedRowKey.anyNull) {
    -          advancedStreamed()
    -        } else {
    -          assert(!bufferedRowKey.anyNull)
    -          comp = keyOrdering.compare(streamedRowKey, bufferedRowKey)
    -          if (comp > 0) advancedBufferedToRowWithNullFreeJoinKey()
    -          else if (comp < 0) advancedStreamed()
    -        }
    -      } while (streamedRow != null && bufferedRow != null && comp != 0)
    -      if (streamedRow == null || bufferedRow == null) {
    -        // We have either hit the end of one of the iterators, so there can be no more matches.
    +      // To make sure vars like bufferedRow is set
    +      advancedBufferedIterRes
    --- End diff --
    
    To advance buffer iterator here, won't we miss the `bufferedRow` advanced before?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84840/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle rea...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19862#discussion_r154567319
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala ---
    @@ -674,8 +674,9 @@ private[joins] class SortMergeJoinScanner(
       private[this] val bufferedMatches =
         new ExternalAppendOnlyUnsafeRowArray(inMemoryThreshold, spillThreshold)
     
    -  // Initialization (note: do _not_ want to advance streamed here).
    -  advancedBufferedToRowWithNullFreeJoinKey()
    +  // Initialization (note: do _not_ want to advance streamed here). This is made lazy to prevent
    +  // unnecessary trigger of calculation.
    +  private lazy val advancedBufferedIterRes = advancedBufferedToRowWithNullFreeJoinKey()
    --- End diff --
    
    `This is made lazy to run the initialization only once when accessing it.`


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19862: [WIP][SPARK-22671][SQL] Make SortMergeJoin shuffl...

Posted by gczsjdy <gi...@git.apache.org>.

Github user gczsjdy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19862#discussion_r154581844
  
    --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java ---
    @@ -159,6 +154,12 @@ public boolean hasNext() {
             @Override
             public UnsafeRow next() {
               try {
    +            if (!alreadyCalculated) {
    --- End diff --
    
    Yes. There are mistakes here, I will change.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    **[Test build #84836 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84836/testReport)** for PR 19862 at commit [`57550fb`](https://github.com/apache/spark/commit/57550fbd0c42c1616dee0197af6dedbd57a8da89).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    **[Test build #84839 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84839/testReport)** for PR 19862 at commit [`e40c2f1`](https://github.com/apache/spark/commit/e40c2f138a8640487a18665e2caf62fce1ce5c8a).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    ok to test


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle rea...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19862#discussion_r154566374
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala ---
    @@ -699,39 +700,44 @@ private[joins] class SortMergeJoinScanner(
           matchJoinKey = null
           bufferedMatches.clear()
           false
    -    } else if (matchJoinKey != null && keyOrdering.compare(streamedRowKey, matchJoinKey) == 0) {
    -      // The new streamed row has the same join key as the previous row, so return the same matches.
    -      true
    -    } else if (bufferedRow == null) {
    -      // The streamed row's join key does not match the current batch of buffered rows and there are
    -      // no more rows to read from the buffered iterator, so there can be no more matches.
    -      matchJoinKey = null
    -      bufferedMatches.clear()
    -      false
         } else {
    -      // Advance both the streamed and buffered iterators to find the next pair of matching rows.
    -      var comp = keyOrdering.compare(streamedRowKey, bufferedRowKey)
    -      do {
    -        if (streamedRowKey.anyNull) {
    -          advancedStreamed()
    -        } else {
    -          assert(!bufferedRowKey.anyNull)
    -          comp = keyOrdering.compare(streamedRowKey, bufferedRowKey)
    -          if (comp > 0) advancedBufferedToRowWithNullFreeJoinKey()
    -          else if (comp < 0) advancedStreamed()
    -        }
    -      } while (streamedRow != null && bufferedRow != null && comp != 0)
    -      if (streamedRow == null || bufferedRow == null) {
    -        // We have either hit the end of one of the iterators, so there can be no more matches.
    +      // To make sure vars like bufferedRow is set
    +      advancedBufferedIterRes
    --- End diff --
    
    Once we advance both the streamed and buffered iterators, and call `bufferMatchingRows` at the last turn, it will advance buffered iterator until the `bufferedRow` doesn't match with current `streamedRowKey`. 
    
    In the next turn, the call of ``advancedBufferedIterRes` here will advance buffered iterator and so the `bufferedRow` will be missed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: Make SortMergeJoin read less data when wholeStageCodegen...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84842/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle rea...

Posted by gczsjdy <gi...@git.apache.org>.

Github user gczsjdy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19862#discussion_r156581645
  
    --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java ---
    @@ -159,6 +154,12 @@ public boolean hasNext() {
             @Override
             public UnsafeRow next() {
               try {
    +            if (!alreadyCalculated) {
    +              while (inputIterator.hasNext()) {
    +                insertRow(inputIterator.next());
    +              }
    +              alreadyCalculated = true;
    +            }
                 sortedIterator.loadNext();
    --- End diff --
    
    Yes, you are right. Now I modified the `sortedIterator` after inserting rows. Due to I can only access an outer final field inside an inner class, so I used an array, is there better solution?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    I don't agree this is a small change, and users using spark prior to 2.0 won't get this patch, as we don't backport performance improvement patches.
    
    Overall this patch won't bring much benifit to Spark and may not worth the time to review it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle rea...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19862#discussion_r154567693
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala ---
    @@ -750,6 +756,8 @@ private[joins] class SortMergeJoinScanner(
           bufferedMatches.clear()
           false
         } else {
    +      // To make sure vars like bufferedRow is set
    +      advancedBufferedIterRes
    --- End diff --
    
    Add a comment like https://github.com/apache/spark/pull/19862/files#r154567168.



---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle rea...

Posted by gczsjdy <gi...@git.apache.org>.

Github user gczsjdy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19862#discussion_r154635850
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala ---
    @@ -699,39 +700,44 @@ private[joins] class SortMergeJoinScanner(
           matchJoinKey = null
           bufferedMatches.clear()
           false
    -    } else if (matchJoinKey != null && keyOrdering.compare(streamedRowKey, matchJoinKey) == 0) {
    -      // The new streamed row has the same join key as the previous row, so return the same matches.
    -      true
    -    } else if (bufferedRow == null) {
    -      // The streamed row's join key does not match the current batch of buffered rows and there are
    -      // no more rows to read from the buffered iterator, so there can be no more matches.
    -      matchJoinKey = null
    -      bufferedMatches.clear()
    -      false
         } else {
    -      // Advance both the streamed and buffered iterators to find the next pair of matching rows.
    -      var comp = keyOrdering.compare(streamedRowKey, bufferedRowKey)
    -      do {
    -        if (streamedRowKey.anyNull) {
    -          advancedStreamed()
    -        } else {
    -          assert(!bufferedRowKey.anyNull)
    -          comp = keyOrdering.compare(streamedRowKey, bufferedRowKey)
    -          if (comp > 0) advancedBufferedToRowWithNullFreeJoinKey()
    -          else if (comp < 0) advancedStreamed()
    -        }
    -      } while (streamedRow != null && bufferedRow != null && comp != 0)
    -      if (streamedRow == null || bufferedRow == null) {
    -        // We have either hit the end of one of the iterators, so there can be no more matches.
    +      // To make sure vars like bufferedRow is set
    +      advancedBufferedIterRes
    --- End diff --
    
    Good advice. Thx.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle rea...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19862#discussion_r154556774
  
    --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java ---
    @@ -159,6 +159,12 @@ public boolean hasNext() {
             @Override
             public UnsafeRow next() {
               try {
    +            if (!alreadyCalculated) {
    +              while (inputIterator.hasNext()) {
    +                insertRow(inputIterator.next());
    +              }
    +              alreadyCalculated = true;
    --- End diff --
    
    We have cleaned up resources when we have an empty iterator at L144. We should still follow it.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    **[Test build #84471 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84471/testReport)** for PR 19862 at commit [`c2563d2`](https://github.com/apache/spark/commit/c2563d29be96a36ad5e419b3086007dd7c6f5aa0).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    **[Test build #84842 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84842/testReport)** for PR 19862 at commit [`4571b08`](https://github.com/apache/spark/commit/4571b08678d180fefacfffbf0de2e5289066dd73).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    **[Test build #84836 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84836/testReport)** for PR 19862 at commit [`57550fb`](https://github.com/apache/spark/commit/57550fbd0c42c1616dee0197af6dedbd57a8da89).
     * This patch **fails due to an unknown error code, -9**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle rea...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19862#discussion_r154566463
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala ---
    @@ -700,38 +701,43 @@ private[joins] class SortMergeJoinScanner(
           bufferedMatches.clear()
           false
         } else if (matchJoinKey != null && keyOrdering.compare(streamedRowKey, matchJoinKey) == 0) {
    -      // The new streamed row has the same join key as the previous row, so return the same matches.
    +      // The new streamed row has the same join key as the previous row, so return the same
    +      // matches.
    --- End diff --
    
    Unnecessary change.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by gczsjdy <gi...@git.apache.org>.

Github user gczsjdy commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    This is actually a small change, but it can provide not small optimization for users who don't use `WholeStageCodegen`, for example there're still some users who use under 2.0 version of Spark.
    
    Also, the behavior of codegen and non-codegen code paths are supposed to be the same, so in this way it is a 'bug'.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    **[Test build #84471 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84471/testReport)** for PR 19862 at commit [`c2563d2`](https://github.com/apache/spark/commit/c2563d29be96a36ad5e419b3086007dd7c6f5aa0).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle rea...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19862#discussion_r154568281
  
    --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java ---
    @@ -159,6 +154,12 @@ public boolean hasNext() {
             @Override
             public UnsafeRow next() {
               try {
    +            if (!alreadyCalculated) {
    +              while (inputIterator.hasNext()) {
    +                insertRow(inputIterator.next());
    +              }
    +              alreadyCalculated = true;
    +            }
                 sortedIterator.loadNext();
    --- End diff --
    
    `sortedIterator` is already assigned at L143. When you insert rows when first time to call `next`, can the `sortedIterator` correctly return sorted elements?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle rea...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19862#discussion_r154558106
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala ---
    @@ -699,39 +700,44 @@ private[joins] class SortMergeJoinScanner(
           matchJoinKey = null
           bufferedMatches.clear()
           false
    -    } else if (matchJoinKey != null && keyOrdering.compare(streamedRowKey, matchJoinKey) == 0) {
    -      // The new streamed row has the same join key as the previous row, so return the same matches.
    -      true
    -    } else if (bufferedRow == null) {
    -      // The streamed row's join key does not match the current batch of buffered rows and there are
    -      // no more rows to read from the buffered iterator, so there can be no more matches.
    -      matchJoinKey = null
    -      bufferedMatches.clear()
    -      false
         } else {
    -      // Advance both the streamed and buffered iterators to find the next pair of matching rows.
    -      var comp = keyOrdering.compare(streamedRowKey, bufferedRowKey)
    -      do {
    -        if (streamedRowKey.anyNull) {
    -          advancedStreamed()
    -        } else {
    -          assert(!bufferedRowKey.anyNull)
    -          comp = keyOrdering.compare(streamedRowKey, bufferedRowKey)
    -          if (comp > 0) advancedBufferedToRowWithNullFreeJoinKey()
    -          else if (comp < 0) advancedStreamed()
    -        }
    -      } while (streamedRow != null && bufferedRow != null && comp != 0)
    -      if (streamedRow == null || bufferedRow == null) {
    -        // We have either hit the end of one of the iterators, so there can be no more matches.
    +      // To make sure vars like bufferedRow is set
    +      advancedBufferedIterRes
    +      if (matchJoinKey != null && keyOrdering.compare(streamedRowKey, matchJoinKey) == 0) {
    --- End diff --
    
    This block can be excluded from this else block. It can be at the original position. We don't need to advance buffer rows too if this condition is hit.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle rea...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19862#discussion_r154566562
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala ---
    @@ -699,39 +700,44 @@ private[joins] class SortMergeJoinScanner(
           matchJoinKey = null
           bufferedMatches.clear()
           false
    -    } else if (matchJoinKey != null && keyOrdering.compare(streamedRowKey, matchJoinKey) == 0) {
    -      // The new streamed row has the same join key as the previous row, so return the same matches.
    -      true
    -    } else if (bufferedRow == null) {
    -      // The streamed row's join key does not match the current batch of buffered rows and there are
    -      // no more rows to read from the buffered iterator, so there can be no more matches.
    -      matchJoinKey = null
    -      bufferedMatches.clear()
    -      false
         } else {
    -      // Advance both the streamed and buffered iterators to find the next pair of matching rows.
    -      var comp = keyOrdering.compare(streamedRowKey, bufferedRowKey)
    -      do {
    -        if (streamedRowKey.anyNull) {
    -          advancedStreamed()
    -        } else {
    -          assert(!bufferedRowKey.anyNull)
    -          comp = keyOrdering.compare(streamedRowKey, bufferedRowKey)
    -          if (comp > 0) advancedBufferedToRowWithNullFreeJoinKey()
    -          else if (comp < 0) advancedStreamed()
    -        }
    -      } while (streamedRow != null && bufferedRow != null && comp != 0)
    -      if (streamedRow == null || bufferedRow == null) {
    -        // We have either hit the end of one of the iterators, so there can be no more matches.
    +      // To make sure vars like bufferedRow is set
    +      advancedBufferedIterRes
    --- End diff --
    
    Oh. I see. `advancedBufferedIterRes` is a lazy val.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84835/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    **[Test build #84842 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84842/testReport)** for PR 19862 at commit [`4571b08`](https://github.com/apache/spark/commit/4571b08678d180fefacfffbf0de2e5289066dd73).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle rea...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19862#discussion_r154567585
  
    --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java ---
    @@ -182,18 +183,14 @@ public UnsafeRow next() {
             }
           };
         } catch (IOException e) {
    -      cleanupResources();
           throw e;
    +    } finally {
    +      // Since we won't ever call next() on an empty iterator, we need to clean up resources
    +      // here in order to prevent memory leaks.
    +      cleanupResources();
    --- End diff --
    
    This makes the resource cleaned up when we return iterator too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84471/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    **[Test build #84839 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84839/testReport)** for PR 19862 at commit [`e40c2f1`](https://github.com/apache/spark/commit/e40c2f138a8640487a18665e2caf62fce1ce5c8a).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/84839/
    Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    I'm not sure we should spend time to optimize non-whole-stage-codegen path, which is only used as a safe fallback if whole stage codegen is not applicable. I think the more important thing to do is to support whole stage codegen for more queries.  cc @hvanhovell  


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle rea...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19862#discussion_r154568554
  
    --- Diff: sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java ---
    @@ -159,6 +154,12 @@ public boolean hasNext() {
             @Override
             public UnsafeRow next() {
               try {
    +            if (!alreadyCalculated) {
    --- End diff --
    
    When `hasNext` is called, doesn't `sortedIterator` return no element anymore since we haven't added rows into it?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    **[Test build #84840 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84840/testReport)** for PR 19862 at commit [`80231ab`](https://github.com/apache/spark/commit/80231ab670d5bf1640fad3a9741b6315dba9d1bb).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin read less data whe...

Posted by gczsjdy <gi...@git.apache.org>.

Github user gczsjdy commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    cc @cloud-fan @viirya @ConeyLiu 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by gczsjdy <gi...@git.apache.org>.

Github user gczsjdy commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    cc @cloud-fan @hvanhovell  @viirya 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    Merged build finished. Test FAILed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle rea...

Posted by gczsjdy <gi...@git.apache.org>.

Github user gczsjdy closed the pull request at:

    https://github.com/apache/spark/pull/19862


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by gczsjdy <gi...@git.apache.org>.

Github user gczsjdy commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    @cloud-fan Ok, thanks for your time, I will close this.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle rea...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19862#discussion_r154567168
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala ---
    @@ -699,39 +700,44 @@ private[joins] class SortMergeJoinScanner(
           matchJoinKey = null
           bufferedMatches.clear()
           false
    -    } else if (matchJoinKey != null && keyOrdering.compare(streamedRowKey, matchJoinKey) == 0) {
    -      // The new streamed row has the same join key as the previous row, so return the same matches.
    -      true
    -    } else if (bufferedRow == null) {
    -      // The streamed row's join key does not match the current batch of buffered rows and there are
    -      // no more rows to read from the buffered iterator, so there can be no more matches.
    -      matchJoinKey = null
    -      bufferedMatches.clear()
    -      false
         } else {
    -      // Advance both the streamed and buffered iterators to find the next pair of matching rows.
    -      var comp = keyOrdering.compare(streamedRowKey, bufferedRowKey)
    -      do {
    -        if (streamedRowKey.anyNull) {
    -          advancedStreamed()
    -        } else {
    -          assert(!bufferedRowKey.anyNull)
    -          comp = keyOrdering.compare(streamedRowKey, bufferedRowKey)
    -          if (comp > 0) advancedBufferedToRowWithNullFreeJoinKey()
    -          else if (comp < 0) advancedStreamed()
    -        }
    -      } while (streamedRow != null && bufferedRow != null && comp != 0)
    -      if (streamedRow == null || bufferedRow == null) {
    -        // We have either hit the end of one of the iterators, so there can be no more matches.
    +      // To make sure vars like bufferedRow is set
    +      advancedBufferedIterRes
    --- End diff --
    
    Then add a comment like `Initialization at the first time reaching here`.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle rea...

Posted by viirya <gi...@git.apache.org>.

Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19862#discussion_r154560474
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala ---
    @@ -750,6 +756,8 @@ private[joins] class SortMergeJoinScanner(
           bufferedMatches.clear()
           false
         } else {
    +      // To make sure vars like bufferedRow is set
    +      advancedBufferedIterRes
    --- End diff --
    
    ditto. We may miss the bufferedRow advanced before.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    **[Test build #84835 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84835/testReport)** for PR 19862 at commit [`012c9ee`](https://github.com/apache/spark/commit/012c9ee61d03c0e8fa8dff1a7a84e0adcda2c67c).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle rea...

Posted by gczsjdy <gi...@git.apache.org>.

Github user gczsjdy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19862#discussion_r154563897
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala ---
    @@ -674,8 +674,9 @@ private[joins] class SortMergeJoinScanner(
       private[this] val bufferedMatches =
         new ExternalAppendOnlyUnsafeRowArray(inMemoryThreshold, spillThreshold)
     
    -  // Initialization (note: do _not_ want to advance streamed here).
    -  advancedBufferedToRowWithNullFreeJoinKey()
    +  // Initialization (note: do _not_ want to advance streamed here). This is made lazy to prevent
    +  // unnecessary trigger of calculation.
    +  private lazy val advancedBufferedIterRes = advancedBufferedToRowWithNullFreeJoinKey()
    --- End diff --
    
    This function should be called (to try to set `BufferedRow`) before `BufferedRow` is checked, and it should be only once. This is the original requirement due to the logic. While to add this optimization, I think this is the best way.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    **[Test build #84835 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/84835/testReport)** for PR 19862 at commit [`012c9ee`](https://github.com/apache/spark/commit/012c9ee61d03c0e8fa8dff1a7a84e0adcda2c67c).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle rea...

Posted by gczsjdy <gi...@git.apache.org>.

Github user gczsjdy commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19862#discussion_r154564327
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala ---
    @@ -699,39 +700,44 @@ private[joins] class SortMergeJoinScanner(
           matchJoinKey = null
           bufferedMatches.clear()
           false
    -    } else if (matchJoinKey != null && keyOrdering.compare(streamedRowKey, matchJoinKey) == 0) {
    -      // The new streamed row has the same join key as the previous row, so return the same matches.
    -      true
    -    } else if (bufferedRow == null) {
    -      // The streamed row's join key does not match the current batch of buffered rows and there are
    -      // no more rows to read from the buffered iterator, so there can be no more matches.
    -      matchJoinKey = null
    -      bufferedMatches.clear()
    -      false
         } else {
    -      // Advance both the streamed and buffered iterators to find the next pair of matching rows.
    -      var comp = keyOrdering.compare(streamedRowKey, bufferedRowKey)
    -      do {
    -        if (streamedRowKey.anyNull) {
    -          advancedStreamed()
    -        } else {
    -          assert(!bufferedRowKey.anyNull)
    -          comp = keyOrdering.compare(streamedRowKey, bufferedRowKey)
    -          if (comp > 0) advancedBufferedToRowWithNullFreeJoinKey()
    -          else if (comp < 0) advancedStreamed()
    -        }
    -      } while (streamedRow != null && bufferedRow != null && comp != 0)
    -      if (streamedRow == null || bufferedRow == null) {
    -        // We have either hit the end of one of the iterators, so there can be no more matches.
    +      // To make sure vars like bufferedRow is set
    +      advancedBufferedIterRes
    --- End diff --
    
    This advance function is only called once actually, so no `bufferedRow` will be missed. Or maybe I didn't understand your meaning?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #19862: [SPARK-22671][SQL] Make SortMergeJoin shuffle read less ...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/19862
  
    Can one of the admins verify this patch?


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org