You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by davies <gi...@git.apache.org> on 2016/01/05 21:19:48 UTC

[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/10605

    [SPARK-12295] [SQL] external spilling for window functions

    This PR manage the memory used by window functions (buffered rows), also enable external spilling.
    
    After this PR, we can run window functions on a partition with hundreds of millions of rows with only 1G.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark unsafe_window

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10605.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #10605
    
----
commit 6b98593fb969e9c0dfdde3ced313dff101791d1c
Author: Davies Liu <da...@databricks.com>
Date:   2016-01-05T08:30:27Z

    external spilling for window functions

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10605#discussion_r48966227
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/Window.scala ---
    @@ -307,27 +314,63 @@ case class Window(
               // Collect all the rows in the current partition.
               // Before we start to fetch new input rows, make a copy of nextGroup.
               val currentGroup = nextGroup.copy()
    -          rows.clear()
    +
    +          // clear last partition
    +          if (sorter != null) {
    +            // the last sorter of this task will be cleaned up via task completion listener
    +            sorter.cleanupResources()
    +            sorter = null
    +          } else {
    +            rows.clear()
    +          }
    +
               while (nextRowAvailable && nextGroup == currentGroup) {
    -            rows += nextRow.copy()
    +            if (sorter == null) {
    +              rows += nextRow.copy()
    +
    +              if (rows.length >= 4096) {
    --- End diff --
    
    Should we make this configurable?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169164110
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169227159
  
    **[Test build #48806 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48806/consoleFull)** for PR 10605 at commit [`f464cf6`](https://github.com/apache/spark/commit/f464cf6645fccbeed912859cd57594ba1c1fc1df).
     * This patch **fails from timeout after a configured wait of \`250m\`**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10605#discussion_r49219114
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/Window.scala ---
    @@ -307,27 +314,63 @@ case class Window(
               // Collect all the rows in the current partition.
               // Before we start to fetch new input rows, make a copy of nextGroup.
               val currentGroup = nextGroup.copy()
    -          rows.clear()
    +
    +          // clear last partition
    +          if (sorter != null) {
    +            // the last sorter of this task will be cleaned up via task completion listener
    +            sorter.cleanupResources()
    +            sorter = null
    +          } else {
    +            rows.clear()
    +          }
    +
               while (nextRowAvailable && nextGroup == currentGroup) {
    -            rows += nextRow.copy()
    +            if (sorter == null) {
    +              rows += nextRow.copy()
    +
    +              if (rows.length >= 4096) {
    +                // We will not sort the rows, so prefixComparator and recordComparator are null.
    +                sorter = UnsafeExternalSorter.create(
    +                  TaskContext.get().taskMemoryManager(),
    +                  SparkEnv.get.blockManager,
    +                  TaskContext.get(),
    +                  null,
    +                  null,
    +                  1024,
    +                  SparkEnv.get.memoryManager.pageSizeBytes)
    +                rows.foreach { r =>
    +                  sorter.insertRecord(r.getBaseObject, r.getBaseOffset, r.getSizeInBytes, 0)
    +                }
    +                rows.clear()
    +              }
    +            } else {
    +              sorter.insertRecord(nextRow.getBaseObject, nextRow.getBaseOffset,
    +                nextRow.getSizeInBytes, 0)
    +            }
                 fetchNextRow()
               }
    +          if (sorter != null) {
    +            rowBuffer = new ExternalRowBuffer(sorter, inputFields)
    +          } else {
    +            rowBuffer = new ArrayRowBuffer(rows)
    +          }
     
               // Setup the frames.
               var i = 0
               while (i < numFrames) {
    -            frames(i).prepare(rows)
    +            frames(i).prepare(rowBuffer.copy())
    --- End diff --
    
    @tedyu both buffers implementations are statefull (they contain an iterator). Each frame traverses the partition in a different way, so we cannot share iterators between them.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by tedyu <gi...@git.apache.org>.
Github user tedyu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10605#discussion_r49219913
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/Window.scala ---
    @@ -307,27 +314,63 @@ case class Window(
               // Collect all the rows in the current partition.
               // Before we start to fetch new input rows, make a copy of nextGroup.
               val currentGroup = nextGroup.copy()
    -          rows.clear()
    +
    +          // clear last partition
    +          if (sorter != null) {
    +            // the last sorter of this task will be cleaned up via task completion listener
    +            sorter.cleanupResources()
    +            sorter = null
    +          } else {
    +            rows.clear()
    +          }
    +
               while (nextRowAvailable && nextGroup == currentGroup) {
    -            rows += nextRow.copy()
    +            if (sorter == null) {
    +              rows += nextRow.copy()
    +
    +              if (rows.length >= 4096) {
    --- End diff --
    
    +1 on making this configurable


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169163892
  
    **[Test build #48790 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48790/consoleFull)** for PR 10605 at commit [`2f8705a`](https://github.com/apache/spark/commit/2f8705a9e5f90539c11a7487f842ebe7e79fc170).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10605#discussion_r48902027
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/Window.scala ---
    @@ -661,29 +719,35 @@ private[execution] final class UnboundedFollowingWindowFunctionFrame(
         lbound: BoundOrdering) extends WindowFunctionFrame {
     
       /** Rows of the partition currently being processed. */
    -  private[this] var input: ArrayBuffer[InternalRow] = null
    +  private[this] var input: ExternalRowBuffer = null
     
       /** Index of the first input row with a value equal to or greater than the lower bound of the
        * current output row. */
       private[this] var inputIndex = 0
     
    -  /** Index of the row we are currently writing. */
    -  private[this] var outputIndex = 0
    -
       /** Prepare the frame for calculating a new partition. */
    -  override def prepare(rows: ArrayBuffer[InternalRow]): Unit = {
    +  override def prepare(rows: ExternalRowBuffer): Unit = {
         input = rows
         inputIndex = 0
    -    outputIndex = 0
       }
     
       /** Write the frame columns for the current row to the given target row. */
    -  override def write(): Unit = {
    -    var bufferUpdated = outputIndex == 0
    +  override def write(index: Int, current: InternalRow): Unit = {
    +    var bufferUpdated = index == 0
    +
    +    // Duplicate the input to have a new iterator
    +    val tmp = input.copy()
     
         // Drop all rows from the buffer for which the input row value is smaller than
         // the output row lower bound.
    -    while (inputIndex < input.size && lbound.compare(input, inputIndex, outputIndex) < 0) {
    +    var i = 0
    +    while (i < inputIndex) {
    +      tmp.next()
    --- End diff --
    
    Yes, it doubles the time. If N is large, it's kind of like double of infinity is still infinity :-) 
    
    Actually I tried to add a `skip(n)` API for UnsafeSortIterator to improve this a little bit, but it seems not worth that complexity, so removed.
    
    I think the next step could be have a fast path for those functions that has communitativity, we could reduce the complexity to `O(N * 2)`
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169220440
  
    **[Test build #48812 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48812/consoleFull)** for PR 10605 at commit [`60edf13`](https://github.com/apache/spark/commit/60edf1325cd95372038922858c186ff4029965d3).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169227195
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48806/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169203354
  
    @hvanhovell Here is the result after having a fast path for small partition:
    
    ```
    +-----+----------------+------+-----------+-----+
    | Size|            Name|Master|SPARK-12295| Diff|
    +-----+----------------+------+-----------+-----+
    |    2|Entire Partition|  2007|       1933|0.963|
    |    2|         Growing|  2576|       2193|0.851|
    |    2|       Shrinking|  2945|       2829|0.961|
    |    2|         Sliding|  2206|       2320|1.052|
    |    8|Entire Partition|  2097|       1913|0.912|
    |    8|         Growing|  1959|       2235|1.141|
    |    8|       Shrinking|  2099|       2359|1.124|
    |    8|         Sliding|  2124|       2040|0.960|
    |   32|Entire Partition|  2261|       2260|1.000|
    |   32|         Growing|  2143|       2218|1.035|
    |   32|       Shrinking|  2357|       2401|1.019|
    |   32|         Sliding|  2691|       2370|0.881|
    |  128|Entire Partition|  2146|       1907|0.889|
    |  128|         Growing|  2019|       1928|0.955|
    |  128|       Shrinking|  3085|       3725|1.207|
    |  128|         Sliding|  2040|       2110|1.034|
    |  256|Entire Partition|  1962|       2042|1.041|
    |  256|         Growing|  2777|       2335|0.841|
    |  256|       Shrinking|  4179|       5027|1.203|
    |  256|         Sliding|  2106|       2155|1.023|
    | 1024|Entire Partition|  1710|       1734|1.014|
    | 1024|         Growing|  1873|       2022|1.080|
    | 1024|       Shrinking| 10456|      14588|1.395|
    | 1024|         Sliding|  1928|       1917|0.994|
    | 4096|Entire Partition|  1769|       1670|0.944|
    | 4096|         Growing|  1986|       1844|0.928|
    | 4096|       Shrinking| 39184|      67151|1.714|
    | 4096|         Sliding|  2080|       2108|1.013|
    |16192|Entire Partition|  2835|       1805|0.637|
    |16192|         Growing|  2189|       1851|0.846|
    |16192|       Shrinking|151624|     337517|2.226|
    |16192|         Sliding|  4734|       5615|1.186|
    +-----+----------------+------+-----------+-----+
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169121426
  
    **[Test build #48787 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48787/consoleFull)** for PR 10605 at commit [`6b98593`](https://github.com/apache/spark/commit/6b98593fb969e9c0dfdde3ced313dff101791d1c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169235118
  
    **[Test build #48817 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48817/consoleFull)** for PR 10605 at commit [`18c7a22`](https://github.com/apache/spark/commit/18c7a22d38c710af98964c78e17f51252580bf93).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169126127
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48787/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169220564
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48812/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10605#discussion_r48900944
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/Window.scala ---
    @@ -661,29 +719,35 @@ private[execution] final class UnboundedFollowingWindowFunctionFrame(
         lbound: BoundOrdering) extends WindowFunctionFrame {
     
       /** Rows of the partition currently being processed. */
    -  private[this] var input: ArrayBuffer[InternalRow] = null
    +  private[this] var input: ExternalRowBuffer = null
     
       /** Index of the first input row with a value equal to or greater than the lower bound of the
        * current output row. */
       private[this] var inputIndex = 0
     
    -  /** Index of the row we are currently writing. */
    -  private[this] var outputIndex = 0
    -
       /** Prepare the frame for calculating a new partition. */
    -  override def prepare(rows: ArrayBuffer[InternalRow]): Unit = {
    +  override def prepare(rows: ExternalRowBuffer): Unit = {
         input = rows
         inputIndex = 0
    -    outputIndex = 0
       }
     
       /** Write the frame columns for the current row to the given target row. */
    -  override def write(): Unit = {
    -    var bufferUpdated = outputIndex == 0
    +  override def write(index: Int, current: InternalRow): Unit = {
    +    var bufferUpdated = index == 0
    +
    +    // Duplicate the input to have a new iterator
    +    val tmp = input.copy()
     
         // Drop all rows from the buffer for which the input row value is smaller than
         // the output row lower bound.
    -    while (inputIndex < input.size && lbound.compare(input, inputIndex, outputIndex) < 0) {
    +    var i = 0
    +    while (i < inputIndex) {
    +      tmp.next()
    --- End diff --
    
    We are skipping all the records we have already processed here; this shouldn't be to expensive; but it still adds some complexity (```O(n * (n - 1) /2)``` towards ```O(n * n)```). I'd like to see/do a few benchmarks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10605#discussion_r48901213
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/Window.scala ---
    @@ -498,7 +548,13 @@ private[execution] final class SlidingWindowFunctionFrame(
         ubound: BoundOrdering) extends WindowFunctionFrame {
     
       /** Rows of the partition currently being processed. */
    -  private[this] var input: ArrayBuffer[InternalRow] = null
    +  private[this] var input: ExternalRowBuffer = null
    +
    +  /** The next row from `input`. */
    +  private[this] var nextRow: InternalRow = null
    +
    +  /** The rows within current sliding window. */
    +  private[this] val buffer: util.Queue[InternalRow] = new util.ArrayDeque[InternalRow]()
    --- End diff --
    
    Nit/Performance Paranoia: I would not type it as a ```Queue``` but as an ```ArrayDeque``` (or have scala infer the type). In order to ever prevent megamorphic calls.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169235926
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169134274
  
    **[Test build #48790 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48790/consoleFull)** for PR 10605 at commit [`2f8705a`](https://github.com/apache/spark/commit/2f8705a9e5f90539c11a7487f842ebe7e79fc170).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169204200
  
    **[Test build #48812 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48812/consoleFull)** for PR 10605 at commit [`60edf13`](https://github.com/apache/spark/commit/60edf1325cd95372038922858c186ff4029965d3).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169235927
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48817/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169394742
  
    @hvanhovell Before the fast path, tiny partitions look horrible:
    
    ```
    +-----+----------------+------+----------+-----+
    | Size|            Name|Master|This PR(before fast path) | Diff|
    +-----+----------------+------+----------+-----+
    |    2|Entire Partition|  2007|     12395|6.176|
    |    2|         Growing|  2576|     15729|6.106|
    |    2|       Shrinking|  2945|     13376|4.542|
    |    2|         Sliding|  2206|     13419|6.083|
    |    8|Entire Partition|  2097|      4869|2.322|
    |    8|         Growing|  1959|      4495|2.295|
    |    8|       Shrinking|  2099|      5021|2.392|
    |    8|         Sliding|  2124|      4739|2.231|
    |   32|Entire Partition|  2261|      2750|1.216|
    |   32|         Growing|  2143|      3062|1.429|
    |   32|       Shrinking|  2357|      3233|1.372|
    |   32|         Sliding|  2691|      3270|1.215|
    |  128|Entire Partition|  2146|      1964|0.915|
    |  128|         Growing|  2019|      2331|1.155|
    |  128|       Shrinking|  3085|      4845|1.571|
    |  128|         Sliding|  2040|      2406|1.179|
    |  256|Entire Partition|  1962|      2110|1.075|
    |  256|         Growing|  2777|      2479|0.893|
    |  256|       Shrinking|  4179|      7733|1.850|
    |  256|         Sliding|  2106|      2344|1.113|
    | 1024|Entire Partition|  1710|      2367|1.384|
    | 1024|         Growing|  1873|      2096|1.119|
    | 1024|       Shrinking| 10456|     22028|2.107|
    | 1024|         Sliding|  1928|      2456|1.274|
    | 4096|Entire Partition|  1769|      2062|1.166|
    | 4096|         Growing|  1986|      1941|0.977|
    | 4096|       Shrinking| 39184|     75981|1.939|
    | 4096|         Sliding|  2080|      2376|1.142|
    |16192|Entire Partition|  2835|      1630|0.575|
    |16192|         Growing|  2189|      1801|0.823|
    |16192|       Shrinking|151624|    305837|2.017|
    |16192|         Sliding|  4734|      4430|0.936|
    +-----+----------------+------+----------+-----+
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169149338
  
    @davies this is pretty awesome! I have taken long look at the window code and it looks solid. I am less of an expert on the Memory management front, so maybe someone else should take a look at that.
    
    I do have one small concern: I am absolutely convinced that in the case of large partition sizes this will outperform the current implementation by a margin. However, I am wondering what happens if we consider smaller partition sizes (e.g. n 2-32). We might take a small hit in these cases, because of the added complexity. Have you done some benchmarking on this? If you haven't this is a link to benchmark I used for my initial window prototype: https://issues.apache.org/jira/secure/attachment/12745984/perf_test3.scala
    
    I'd like to finish with something we should **not** address in this PR (thinking out loud if you will). The child node of a ```Window``` operator is allmost always an ```ExternalSort``` operator. Wouldn't it be cool if we could eliminate the row buffer of the ```Window``` by using the external sorts buffer?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169164112
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48790/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169227194
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10605#discussion_r48967411
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/Window.scala ---
    @@ -283,23 +287,26 @@ case class Window(
             val grouping = UnsafeProjection.create(partitionSpec, child.output)
     
             // Manage the stream and the grouping.
    -        var nextRow: InternalRow = EmptyRow
    -        var nextGroup: InternalRow = EmptyRow
    +        var nextRow: UnsafeRow = null
    +        var nextGroup: UnsafeRow = null
             var nextRowAvailable: Boolean = false
             private[this] def fetchNextRow() {
               nextRowAvailable = stream.hasNext
               if (nextRowAvailable) {
    -            nextRow = stream.next()
    +            nextRow = stream.next().asInstanceOf[UnsafeRow]
                 nextGroup = grouping(nextRow)
               } else {
    -            nextRow = EmptyRow
    -            nextGroup = EmptyRow
    +            nextRow = null
    +            nextGroup = null
               }
             }
             fetchNextRow()
     
             // Manage the current partition.
    -        val rows = ArrayBuffer.empty[InternalRow]
    +        val rows = ArrayBuffer.empty[UnsafeRow]
    --- End diff --
    
    Minor (tiny): We could set the initial size to the value at which we start using the external sorter (4096). Or we could use an array for that matter.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169220562
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169240430
  
    **[Test build #48835 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48835/consoleFull)** for PR 10605 at commit [`560ef90`](https://github.com/apache/spark/commit/560ef903ddaaa6783d2ca5121e705fb27e207556).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169186092
  
    **[Test build #48806 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48806/consoleFull)** for PR 10605 at commit [`f464cf6`](https://github.com/apache/spark/commit/f464cf6645fccbeed912859cd57594ba1c1fc1df).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169210423
  
    **[Test build #48817 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48817/consoleFull)** for PR 10605 at commit [`18c7a22`](https://github.com/apache/spark/commit/18c7a22d38c710af98964c78e17f51252580bf93).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169263417
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48835/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169263236
  
    **[Test build #48835 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48835/consoleFull)** for PR 10605 at commit [`560ef90`](https://github.com/apache/spark/commit/560ef903ddaaa6783d2ca5121e705fb27e207556).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169263416
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169379107
  
    @davies I left some final comments; feel free to use/ignore them.
    
    The benchmark looks good. The shrinking case was expected to perform horrible, and it does. The other figures are so close to the old figure that I don't think we can safely say that they are different (given the limited scope and duration of the test). I was wondering, how does this perform without the optimized (hybrid) approach?
    
    LGTM (pending review of the ```ExternalSort*``` modifications)



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169126081
  
    **[Test build #48787 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48787/consoleFull)** for PR 10605 at commit [`6b98593`](https://github.com/apache/spark/commit/6b98593fb969e9c0dfdde3ced313dff101791d1c).
     * This patch **fails MiMa tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class ExternalRowBuffer(sorter: UnsafeExternalSorter, numFields: Int) `


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/10605


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169581349
  
    I'm going to merge this to avoid conflict, any comments will be addressed by follow-up PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10605#discussion_r48893663
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/Window.scala ---
    @@ -380,8 +408,36 @@ private[execution] final case class RangeBoundOrdering(
         ordering: Ordering[InternalRow],
         current: Projection,
         bound: Projection) extends BoundOrdering {
    -  override def compare(input: Seq[InternalRow], inputIndex: Int, outputIndex: Int): Int =
    -    ordering.compare(current(input(inputIndex)), bound(input(outputIndex)))
    +  override def compare(
    +      inputRow: InternalRow,
    +      inputIndex: Int,
    +      outputRow: InternalRow,
    +      outputIndex: Int): Int =
    +    ordering.compare(current(inputRow), bound(outputRow))
    +}
    +
    +/**
    +  * An external buffer of rows based on UnsafeExternalSorter
    +  */
    +class ExternalRowBuffer(sorter: UnsafeExternalSorter, numFields: Int) {
    --- End diff --
    
    private[execution]


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10605#discussion_r48903674
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/Window.scala ---
    @@ -661,29 +719,35 @@ private[execution] final class UnboundedFollowingWindowFunctionFrame(
         lbound: BoundOrdering) extends WindowFunctionFrame {
     
       /** Rows of the partition currently being processed. */
    -  private[this] var input: ArrayBuffer[InternalRow] = null
    +  private[this] var input: ExternalRowBuffer = null
     
       /** Index of the first input row with a value equal to or greater than the lower bound of the
        * current output row. */
       private[this] var inputIndex = 0
     
    -  /** Index of the row we are currently writing. */
    -  private[this] var outputIndex = 0
    -
       /** Prepare the frame for calculating a new partition. */
    -  override def prepare(rows: ArrayBuffer[InternalRow]): Unit = {
    +  override def prepare(rows: ExternalRowBuffer): Unit = {
         input = rows
         inputIndex = 0
    -    outputIndex = 0
       }
     
       /** Write the frame columns for the current row to the given target row. */
    -  override def write(): Unit = {
    -    var bufferUpdated = outputIndex == 0
    +  override def write(index: Int, current: InternalRow): Unit = {
    +    var bufferUpdated = index == 0
    +
    +    // Duplicate the input to have a new iterator
    +    val tmp = input.copy()
     
         // Drop all rows from the buffer for which the input row value is smaller than
         // the output row lower bound.
    -    while (inputIndex < input.size && lbound.compare(input, inputIndex, outputIndex) < 0) {
    +    var i = 0
    +    while (i < inputIndex) {
    +      tmp.next()
    --- End diff --
    
    true that :)
    
    This is fine for now; it is not used that much.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169396852
  
    @JoshRosen Could you help to review the ExternalSort* parts?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/10605#issuecomment-169126122
  
    Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #10605: [SPARK-12295] [SQL] external spilling for window ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10605#discussion_r93116743
  
    --- Diff: core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java ---
    @@ -19,6 +19,8 @@
     
     import java.util.Comparator;
     
    +import org.apache.avro.reflect.Nullable;
    --- End diff --
    
    FYI this should probably have been ` import javax.annotation.Nullable`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-12295] [SQL] external spilling for wind...

Posted by tedyu <gi...@git.apache.org>.
Github user tedyu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10605#discussion_r49218188
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/Window.scala ---
    @@ -307,27 +314,63 @@ case class Window(
               // Collect all the rows in the current partition.
               // Before we start to fetch new input rows, make a copy of nextGroup.
               val currentGroup = nextGroup.copy()
    -          rows.clear()
    +
    +          // clear last partition
    +          if (sorter != null) {
    +            // the last sorter of this task will be cleaned up via task completion listener
    +            sorter.cleanupResources()
    +            sorter = null
    +          } else {
    +            rows.clear()
    +          }
    +
               while (nextRowAvailable && nextGroup == currentGroup) {
    -            rows += nextRow.copy()
    +            if (sorter == null) {
    +              rows += nextRow.copy()
    +
    +              if (rows.length >= 4096) {
    +                // We will not sort the rows, so prefixComparator and recordComparator are null.
    +                sorter = UnsafeExternalSorter.create(
    +                  TaskContext.get().taskMemoryManager(),
    +                  SparkEnv.get.blockManager,
    +                  TaskContext.get(),
    +                  null,
    +                  null,
    +                  1024,
    +                  SparkEnv.get.memoryManager.pageSizeBytes)
    +                rows.foreach { r =>
    +                  sorter.insertRecord(r.getBaseObject, r.getBaseOffset, r.getSizeInBytes, 0)
    +                }
    +                rows.clear()
    +              }
    +            } else {
    +              sorter.insertRecord(nextRow.getBaseObject, nextRow.getBaseOffset,
    +                nextRow.getSizeInBytes, 0)
    +            }
                 fetchNextRow()
               }
    +          if (sorter != null) {
    +            rowBuffer = new ExternalRowBuffer(sorter, inputFields)
    +          } else {
    +            rowBuffer = new ArrayRowBuffer(rows)
    +          }
     
               // Setup the frames.
               var i = 0
               while (i < numFrames) {
    -            frames(i).prepare(rows)
    +            frames(i).prepare(rowBuffer.copy())
    --- End diff --
    
    For ExternalRowBuffer, can this copy() be saved ?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org