Posted to reviews@spark.apache.org by davies <gi...@git.apache.org> on 2014/10/29 22:10:35 UTC

[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/3003

    [SPARK-3466] Limit size of results that a driver collects for each action

    Right now, operations like collect() and take() can crash the driver with an OOM if they bring back too much data.
    
    This PR introduces spark.driver.maxResultSize: once it is set, the driver will abort a job if its result is bigger than the limit.
    
    By default, it's unlimited, for backward compatibility.
    
    In local mode, the driver and executors share the same JVM, and the default JVM heap size is small, so we cannot pick a sensible default limit: either it fails to protect the JVM from OOM, or it is too small for normal usage.
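
    For illustration, here is a minimal sketch of how a user would opt in (assuming a standard SparkConf setup; the "1g" value is only an example):
    ```
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("max-result-size-demo")
      .set("spark.driver.maxResultSize", "1g") // "0" means unlimited

    val sc = new SparkContext(conf)
    // A collect() whose serialized results exceed 1g will now abort the
    // job with a SparkException instead of OOM-ing the driver.
    ```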

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark collect

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3003.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3003
    
----
commit ca8267d2b0fbbf7ebcf59e0f50862ed10433e9d4
Author: Davies Liu <da...@databricks.com>
Date:   2014-10-29T20:58:25Z

    limit the size of data by collect

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61186022
  
    @kayousterhout Keeping the details in the exception message would help at first, but it becomes noisy later. Putting the byte counts in the message would require TaskResultGetter to grab details from TaskSetManager, which I would like to avoid in order to keep the code more decoupled. Also, there are many cases where the final exception message does not carry the details of the failure; when such confusing behavior happens, users should look at the logs (at least the ERROR log).
    
    With a shared SparkContext (as in our product), multiple collect() calls may happen at the same time. Enforcing maxResultSize across all concurrent collect() calls would be hard to reason about, because a small collect() could be aborted by others. So I think the current per-stage approach, sketched below, is better.
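
    A minimal sketch of the per-stage accounting (simplified from this patch; assume maxResultSize is the parsed byte limit):
    ```
    // Inside TaskSetManager: each task set keeps its own running total,
    // so one oversized collect() cannot abort an unrelated small one.
    private var totalResultSize = 0L

    def canFetchMoreResults(size: Long): Boolean = synchronized {
      totalResultSize += size
      if (maxResultSize > 0 && totalResultSize > maxResultSize) {
        abort(s"Total size of serialized results " +
          s"(${Utils.bytesToString(totalResultSize)}) is bigger than " +
          s"maxResultSize (${Utils.bytesToString(maxResultSize)})")
        false
      } else {
        true
      }
    }
    ```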
    
    Thanks for your comments!


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61179156
  
    I was just referring to the comments I made throughout the code!  Thanks!!
    
    On Thu, Oct 30, 2014 at 3:11 PM, Davies Liu <no...@github.com>
    wrote:
    
    > Thanks for reviewing this, I'm working on addressing all the comments, but
    > one quick question:
    >
    > I think some small changes will help this commit avoid contributing to
    > these problems!
    >
    > Which kinds of changes would help avoid these problems? Sorry, I didn't
    > catch that.
    >
    > —
    > Reply to this email directly or view it on GitHub
    > <https://github.com/apache/spark/pull/3003#issuecomment-61178674>.
    >


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61133561
  
    Looks good to me modulo the way the error is propagated from the executor. @kayousterhout do you want to take a look at the TaskResultGetter code too?


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19620969
  
    --- Diff: docs/configuration.md ---
    @@ -112,6 +112,16 @@ of the most common options to set are:
       </td>
     </tr>
     <tr>
    +  <td><code>spark.driver.maxResultSize</code></td>
    +  <td>1g</td>
    +  <td>
    +    Limit of total size of serialized bytes of all partitions during collect, it should be at least 1M
    +    or 0 (means unlimited). The stage will be aborted if the total size go above this limit. 
    +    Having high limit may cause OOM in driver (depends on spark.driver.memory and memory overhead
    --- End diff --
    
    OOM -> out-of-memory errors


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61393459
  
      [Test build #22738 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22738/consoleFull) for   PR 3003 at commit [`272522e`](https://github.com/apache/spark/commit/272522e65177364adae742bed31df93271dc3bf2).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61396477
  
      [Test build #22745 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22745/consoleFull) for   PR 3003 at commit [`248ed5e`](https://github.com/apache/spark/commit/248ed5e62e8a262590baa8fb1613305907c46254).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19573001
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
    @@ -68,6 +68,10 @@ private[spark] class TaskSetManager(
       val SPECULATION_QUANTILE = conf.getDouble("spark.speculation.quantile", 0.75)
       val SPECULATION_MULTIPLIER = conf.getDouble("spark.speculation.multiplier", 1.5)
     
    +  // Limit of bytes for total size of results (default is 20MB)
    +  var MAX_RESULT_SIZE =
    +    Utils.memoryStringToMb(conf.get("spark.driver.maxResultSize", "0")) << 20
    --- End diff --
    
    Also you probably want to make it a Long before doing the shift by 20
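
    A standalone illustration of the overflow (not the PR's code): shifting an Int left by 20 wraps around once the configured value exceeds 2047 MB.
    ```
    val mb = 4096                      // e.g. "4g" parsed to megabytes
    val wrong: Int = mb << 20          // Int overflow: evaluates to 0
    val right: Long = mb.toLong << 20  // 4294967296 bytes, as intended
    ```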


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19636413
  
    --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
    @@ -210,25 +213,26 @@ private[spark] class Executor(
             val resultSize = serializedDirectResult.limit
     
             // directSend = sending directly back to the driver
    -        val (serializedResult, directSend) = {
    -          if (resultSize >= akkaFrameSize - AkkaUtils.reservedSizeBytes) {
    +        val serializedResult = {
    +          if (resultSize > maxResultSize) {
    +            logInfo(s"Finished $taskName (TID $taskId). result is too large (${resultSize} bytes),"
    --- End diff --
    
    If you think the logging will not ever be read by anyone, you should not add it all :).  If you're going to have a log message, please make it descriptive!


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19572955
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
    @@ -68,6 +68,10 @@ private[spark] class TaskSetManager(
       val SPECULATION_QUANTILE = conf.getDouble("spark.speculation.quantile", 0.75)
       val SPECULATION_MULTIPLIER = conf.getDouble("spark.speculation.multiplier", 1.5)
     
    +  // Limit of bytes for total size of results (default is 20MB)
    +  var MAX_RESULT_SIZE =
    +    Utils.memoryStringToMb(conf.get("spark.driver.maxResultSize", "0")) << 20
    --- End diff --
    
    This can be a val instead of a var


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61052816
  
      [Test build #22523 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22523/consoleFull) for   PR 3003 at commit [`47b144f`](https://github.com/apache/spark/commit/47b144f66badf9484966d3f2c74ccdb594350751).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19642119
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala ---
    @@ -64,9 +77,10 @@ private[spark] class TaskResultGetter(sparkEnv: SparkEnv, scheduler: TaskSchedul
                   val deserializedResult = serializer.get().deserialize[DirectTaskResult[_]](
                     serializedTaskResult.get)
                   sparkEnv.blockManager.master.removeBlock(blockId)
    -              deserializedResult
    +              (deserializedResult, size)
               }
    -          result.metrics.resultSize = serializedData.limit()
    +
    +          result.metrics.resultSize = size
    --- End diff --
    
    is the change here in how the size gets set still necessary?


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61181600
  
    @kayousterhout I should have addressed all of your comments, or left replies where I haven't. Thanks!


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/3003


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61025858
  
    Also, can we make TaskResultGetter avoid *fetching* more than maxResultSize for indirect task results? Right now it will fetch them (maybe in parallel) and it's only after we pass them to the TaskSetManager that we might realize there are too many bytes.
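
    A sketch of one way to do that (hypothetical shape: it assumes the indirect result carries its size, so the driver can check the quota before calling getRemoteBytes):
    ```
    case IndirectTaskResult(blockId, size) =>
      if (!taskSetManager.canFetchMoreResults(size)) {
        // Quota exceeded: canFetchMoreResults has already aborted the
        // task set, so drop the remote block instead of fetching it.
        sparkEnv.blockManager.master.removeBlock(blockId)
        return
      }
      val serializedTaskResult = sparkEnv.blockManager.getRemoteBytes(blockId)
    ```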


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19620795
  
    --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
    @@ -210,25 +213,26 @@ private[spark] class Executor(
             val resultSize = serializedDirectResult.limit
     
             // directSend = sending directly back to the driver
    -        val (serializedResult, directSend) = {
    -          if (resultSize >= akkaFrameSize - AkkaUtils.reservedSizeBytes) {
    +        val serializedResult = {
    +          if (resultSize > maxResultSize) {
    +            logInfo(s"Finished $taskName (TID $taskId). result is too large (${resultSize} bytes),"
    +              + " drop it")
    +            ser.serialize(new TooLargeTaskResult(resultSize))
    --- End diff --
    
    It's kind of weird that we send a new type of result (TooLargeTaskResult) in this case instead of failing the task. On the web UI, the task will look like it completed, but then the job will abort. Is there a way we could fail the task instead, and send a special TaskFailedReason that says the result is too large? Then the TaskSetManager can abort when it sees that, the same way it does for FetchFailed.
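
    For instance, something like this (a sketch of the suggestion, not merged code):
    ```
    // A dedicated failure reason, analogous to FetchFailed, that the
    // TaskSetManager could pattern-match on and abort the stage.
    case class TaskResultTooLarge(resultSize: Long, maxResultSize: Long)
      extends TaskFailedReason {
      override def toErrorString: String =
        s"Task result ($resultSize bytes) exceeds spark.driver.maxResultSize " +
        s"($maxResultSize bytes)"
    }
    ```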


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61165401
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22547/
    Test PASSed.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19635301
  
    --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
    @@ -210,25 +213,26 @@ private[spark] class Executor(
             val resultSize = serializedDirectResult.limit
     
             // directSend = sending directly back to the driver
    -        val (serializedResult, directSend) = {
    -          if (resultSize >= akkaFrameSize - AkkaUtils.reservedSizeBytes) {
    +        val serializedResult = {
    +          if (resultSize > maxResultSize) {
    +            logInfo(s"Finished $taskName (TID $taskId). result is too large (${resultSize} bytes),"
    --- End diff --
    
    result --> Result.
    
    Also I think it would be helpful to mention the config parameter name and the current setting here to aid with debuggability; e.g.,:
    
    "Finished $taskName (TID $taskID), but dropping result because the size ($resultSize bytes) is larger than the maximum result size ($maxResultSize).  Increase spark.driver.maxResultSize to allow larger task results."
    
    This is a longer message, but in general I think this makes it much easier for a user than if they have to google the error and track down the config option they may want to change.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19708474
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
    @@ -522,6 +526,24 @@ private[spark] class TaskSetManager(
       }
     
       /**
    +   * Check whether has enough quota to fetch the result with `size` bytes
    +   */
    +  def canFetchMoreResult(size: Long): Boolean = synchronized {
    --- End diff --
    
    Nit: call this canFetchMoreResults (with an s at the end)


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61178674
  
    Thanks for reviewing this, I'm working on addressing all the comments, but one quick question:
    
    > I think some small changes will help this commit avoid contributing to these problems!
    
    Which kinds of changes would help avoid these problems? Sorry, I didn't catch that.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61006272
  
    IMO we should have a limit by default, but just make it large (say 1 GB). Otherwise nobody will know to configure this.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19636248
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala ---
    @@ -47,9 +49,18 @@ private[spark] class TaskResultGetter(sparkEnv: SparkEnv, scheduler: TaskSchedul
         getTaskResultExecutor.execute(new Runnable {
           override def run(): Unit = Utils.logUncaughtExceptions {
             try {
    -          val result = serializer.get().deserialize[TaskResult[_]](serializedData) match {
    -            case directResult: DirectTaskResult[_] => directResult
    -            case IndirectTaskResult(blockId) =>
    +          val (result, size) = serializer.get().deserialize[TaskResult[_]](serializedData) match {
    +            case directResult: DirectTaskResult[_] =>
    +              if (!taskSetManager.canFetchMoreResult(serializedData.limit())) {
    +                throw new BigResultException
    --- End diff --
    
    Can you just abort the task set here, rather than throwing an exception? I think it would make the code easier to read. Also, exceptions have gotten us into trouble before where we don't catch them properly, especially when we change the code late.
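
    A sketch of the suggested shape (hypothetical; canFetchMoreResult would do the abort itself):
    ```
    case directResult: DirectTaskResult[_] =>
      if (!taskSetManager.canFetchMoreResult(serializedData.limit())) {
        // The task set was already aborted inside canFetchMoreResult,
        // so simply stop processing this result instead of throwing.
        return
      }
      (directResult, serializedData.limit())
    ```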


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19635836
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
    @@ -89,6 +92,7 @@ private[spark] class TaskSetManager(
       var stageId = taskSet.stageId
       var name = "TaskSet_" + taskSet.stageId.toString
       var parent: Pool = null
    +  var totalResultSize = 0L
    --- End diff --
    
    Ah I see this is just for one job.  But then this ignores the fact that many jobs can be running at once?


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61169382
  
    Sorry for the large number of comments -- but generally I've become concerned lately that the number of configuration parameters makes the scheduler code really hard to use / understand / troubleshoot, and also that many of the tests around this stuff are very hard to understand.  I think some small changes will help this commit avoid contributing to these problems!


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61179733
  
    The logging looks like this now:
    ```
    14/10/30 15:17:36 WARN Executor: Finished task 1.0 in stage 0.0 (TID 1). Result is larger than maxResultSize (4.5 MB > 1024.0 KB), drop it
    14/10/30 15:17:36 ERROR TaskSetManager: Total number of bytes of serialized results is bigger than maxResultSize: 4.5 MB > 1024.0 KB
    14/10/30 15:17:36 INFO TaskSchedulerImpl: Cancelling stage 0
    org.apache.spark.SparkException: Job aborted due to stage failure: Total size of results is bigger than maxResultSize, please increase spark.driver.maxResultSize or avoid collect() if possible
    	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1191)
    	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1180)
    	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1179)
    	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    ```
    
    Does it work for you?


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19637064
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala ---
    @@ -64,11 +75,15 @@ private[spark] class TaskResultGetter(sparkEnv: SparkEnv, scheduler: TaskSchedul
                   val deserializedResult = serializer.get().deserialize[DirectTaskResult[_]](
                     serializedTaskResult.get)
                   sparkEnv.blockManager.master.removeBlock(blockId)
    -              deserializedResult
    +              (deserializedResult, size)
               }
    -          result.metrics.resultSize = serializedData.limit()
    +
    +          result.metrics.resultSize = size
               scheduler.handleSuccessfulTask(taskSetManager, tid, result)
             } catch {
    +          case e: BigResultException =>
    +            taskSetManager.abort("Total size of result is bigger than maxResultSize, " +
    --- End diff --
    
    The abort() message is a little different than the logInfos, because that's the one that gets propagated directly back to the user (it eventually becomes the error message in the Spark exception that the user gets)


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61351042
  
      [Test build #499 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/499/consoleFull) for   PR 3003 at commit [`2c35773`](https://github.com/apache/spark/commit/2c35773a83968c43bf8385cc20d76dd9d744a317).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19708494
  
    --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
    @@ -210,25 +213,27 @@ private[spark] class Executor(
             val resultSize = serializedDirectResult.limit
     
             // directSend = sending directly back to the driver
    -        val (serializedResult, directSend) = {
    -          if (resultSize >= akkaFrameSize - AkkaUtils.reservedSizeBytes) {
    +        val serializedResult = {
    +          if (resultSize > maxResultSize) {
    +            logWarning(s"Finished $taskName (TID $taskId). Result is larger than maxResultSize " +
    +              s"(${Utils.bytesToString(resultSize)} > ${Utils.bytesToString(maxResultSize)}), " +
    +              s"drop it.")
    --- End diff --
    
    drop => dropping


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19620932
  
    --- Diff: docs/configuration.md ---
    @@ -112,6 +112,16 @@ of the most common options to set are:
       </td>
     </tr>
     <tr>
    +  <td><code>spark.driver.maxResultSize</code></td>
    +  <td>1g</td>
    +  <td>
    +    Limit of total size of serialized bytes of all partitions during collect, it should be at least 1M
    --- End diff --
    
    Instead of "during collect" say "for each Spark action (e.g. collect)"


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61152425
  
      [Test build #22544 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22544/consoleFull) for   PR 3003 at commit [`6e8a135`](https://github.com/apache/spark/commit/6e8a135525511d551c55d29eb649d2bceb161e49).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19635887
  
    --- Diff: docs/configuration.md ---
    @@ -112,6 +112,18 @@ of the most common options to set are:
       </td>
     </tr>
     <tr>
    +  <td><code>spark.driver.maxResultSize</code></td>
    +  <td>1g</td>
    +  <td>
    +    Limit of total size of serialized bytes of all partitions for each Spark action (e.g. collect),
    +    it should be at least 1M or 0 (means unlimited). The stage will be aborted if the total size
    +    go above this limit. 
    +    Having high limit may cause out-of-memory errors in driver (depends on spark.driver.memory
    +    and memory overhead of objects in JVM). Set a proper limit can protect driver from
    --- End diff --
    
    "Set a proper limit can protect driver" --> "Setting a proper limit can protect the driver"


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61393735
  
    Looks like this has a compile error after the change


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61190158
  
      [Test build #22566 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22566/consoleFull) for   PR 3003 at commit [`5d62303`](https://github.com/apache/spark/commit/5d62303e3ebdfddf95447e5ed8aea90c6c010084).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61291807
  
      [Test build #22613 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22613/consoleFull) for   PR 3003 at commit [`2c35773`](https://github.com/apache/spark/commit/2c35773a83968c43bf8385cc20d76dd9d744a317).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61048818
  
      [Test build #22523 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22523/consoleFull) for   PR 3003 at commit [`47b144f`](https://github.com/apache/spark/commit/47b144f66badf9484966d3f2c74ccdb594350751).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61152815
  
      [Test build #22544 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22544/consoleFull) for   PR 3003 at commit [`6e8a135`](https://github.com/apache/spark/commit/6e8a135525511d551c55d29eb649d2bceb161e49).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class RegressionMetrics(predictionAndObservations: RDD[(Double, Double)]) extends Logging `



[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19635706
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
    @@ -563,6 +563,28 @@ class TaskSetManagerSuite extends FunSuite with LocalSparkContext with Logging {
         assert(manager.emittedTaskSizeWarning)
       }
     
    +  test("abort the job if total size of results is too large") {
    +    val conf = new SparkConf().set("spark.driver.maxResultSize", "2m")
    +    sc = new SparkContext("local", "test", conf)
    +
    +    // make it serializable
    --- End diff --
    
    This comment is not very helpful.  Can you describe what this function actually does? What part of this is involved in making it serializable?
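
    For instance, a comment and shape along these lines would make the intent obvious (a hypothetical sketch, not the PR's exact test):
    ```
    // Returns a function that builds a byte array of the given size.
    // Using a standalone function value, rather than a lambda capturing
    // the enclosing suite, keeps the task closure serializable.
    def genBytes(size: Int): Int => Array[Byte] =
      (_: Int) => Array.fill[Byte](size)(0)

    // 10 partitions x 1 MB each: 10 MB total exceeds the 2 MB limit.
    intercept[SparkException] {
      sc.makeRDD(0 until 10, 10).map(genBytes(1 << 20)).collect()
    }
    ```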


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61395147
  
      [Test build #22745 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22745/consoleFull) for   PR 3003 at commit [`248ed5e`](https://github.com/apache/spark/commit/248ed5e62e8a262590baa8fb1613305907c46254).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61027764
  
    Good point, I will do it.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61396591
  
    Thanks, merged this in.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19635504
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
    @@ -89,6 +92,7 @@ private[spark] class TaskSetManager(
       var stageId = taskSet.stageId
       var name = "TaskSet_" + taskSet.stageId.toString
       var parent: Pool = null
    +  var totalResultSize = 0L
    --- End diff --
    
    This seems to only ever increase -- so eventually all jobs will fail because there's not enough space? Shouldn't this be decremented somewhere (and how will you decide when to do that)?


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19572924
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
    @@ -524,7 +529,14 @@ private[spark] class TaskSetManager(
       /**
        * Marks the task as successful and notifies the DAGScheduler that a task has ended.
        */
    -  def handleSuccessfulTask(tid: Long, result: DirectTaskResult[_]) = {
    +  def handleSuccessfulTask(tid: Long, result: DirectTaskResult[_]): Unit = {
    +    totalResultSize += result.metrics.resultSize
    +    if (MAX_RESULT_SIZE > 0 && totalResultSize > MAX_RESULT_SIZE) {
    +      abort(s"The size of results ${Utils.bytesToString(totalResultSize)}" +
    +            s" extends limit ${Utils.bytesToString(MAX_RESULT_SIZE)}, please increase" +
    +            s" spark.driver.maxResultSize or void collect() if it's possible")
    --- End diff --
    
    should say avoid


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61165620
  
      [Test build #493 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/493/consoleFull) for   PR 3003 at commit [`6e8a135`](https://github.com/apache/spark/commit/6e8a135525511d551c55d29eb649d2bceb161e49).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61155165
  
      [Test build #22547 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22547/consoleFull) for   PR 3003 at commit [`bc3c077`](https://github.com/apache/spark/commit/bc3c0770e38527f99c4e58525b0376b675166b94).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61165392
  
      [Test build #22547 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22547/consoleFull) for   PR 3003 at commit [`bc3c077`](https://github.com/apache/spark/commit/bc3c0770e38527f99c4e58525b0376b675166b94).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class RegressionMetrics(predictionAndObservations: RDD[(Double, Double)]) extends Logging `



[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19636864
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala ---
    @@ -64,11 +75,15 @@ private[spark] class TaskResultGetter(sparkEnv: SparkEnv, scheduler: TaskSchedul
                   val deserializedResult = serializer.get().deserialize[DirectTaskResult[_]](
                     serializedTaskResult.get)
                   sparkEnv.blockManager.master.removeBlock(blockId)
    -              deserializedResult
    +              (deserializedResult, size)
               }
    -          result.metrics.resultSize = serializedData.limit()
    +
    +          result.metrics.resultSize = size
               scheduler.handleSuccessfulTask(taskSetManager, tid, result)
             } catch {
    +          case e: BigResultException =>
    +            taskSetManager.abort("Total size of result is bigger than maxResultSize, " +
    --- End diff --
    
    good point!
    
    PS: before this abort(), there is an error log line telling what the result size and maxResultSize are.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61198814
  
    @mateiz No, maxResultSize is checked per stage.
    
    I will add the sizes to the abort message.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19637193
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
    @@ -89,6 +92,7 @@ private[spark] class TaskSetManager(
       var stageId = taskSet.stageId
       var name = "TaskSet_" + taskSet.stageId.toString
       var parent: Pool = null
    +  var totalResultSize = 0L
    --- End diff --
    
    Each TaskSetManager object is used for only one stage.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61015776
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22476/
    Test PASSed.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61393503
  
      [Test build #22738 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22738/consoleFull) for   PR 3003 at commit [`272522e`](https://github.com/apache/spark/commit/272522e65177364adae742bed31df93271dc3bf2).
     * This patch **fails to build**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61300123
  
      [Test build #22613 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22613/consoleFull) for   PR 3003 at commit [`2c35773`](https://github.com/apache/spark/commit/2c35773a83968c43bf8385cc20d76dd9d744a317).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61396480
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22745/
    Test PASSed.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19620876
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
    @@ -563,6 +563,31 @@ class TaskSetManagerSuite extends FunSuite with LocalSparkContext with Logging {
         assert(manager.emittedTaskSizeWarning)
       }
     
    +  test("abort the job if total size of results is too large") {
    +    val conf = new SparkConf().set("spark.driver.maxResultSize", "2m")
    +    sc = new SparkContext("local", "test", conf)
    +
    +    // multiple tiny result
    +    val r = sc.makeRDD(0 until (1 << 10), 10).collect()
    +    assert((1 << 10) === r.size )
    +
    +    // single large result
    +    try {
    +      sc.makeRDD(0 until (1<<20), 1).collect()
    +      throw new Exception("single large result should fail")
    +    } catch {
    +      case e: SparkException =>
    +    }
    +
    +    // multiple small result
    +    try {
    +      sc.makeRDD(0 until (1<<20), 10).collect()
    --- End diff --
    
    Nit: put spaces around `<<`


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61006299
  
      [Test build #22476 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22476/consoleFull) for   PR 3003 at commit [`ca8267d`](https://github.com/apache/spark/commit/ca8267d2b0fbbf7ebcf59e0f50862ed10433e9d4).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19621398
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
    @@ -563,6 +563,31 @@ class TaskSetManagerSuite extends FunSuite with LocalSparkContext with Logging {
         assert(manager.emittedTaskSizeWarning)
       }
     
    +  test("abort the job if total size of results is too large") {
    +    val conf = new SparkConf().set("spark.driver.maxResultSize", "2m")
    +    sc = new SparkContext("local", "test", conf)
    +
    +    // multiple tiny result
    +    val r = sc.makeRDD(0 until (1 << 10), 10).collect()
    --- End diff --
    
    It's not 100% clear what the size of this RDD will be in bytes; might be better to change the test to use byte arrays filled with random bytes (to avoid compression).
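    
    For illustration, a minimal sketch of such a test input (the sizes and names below are hypothetical, not the PR's actual test):
    
    ```
    // Each element is a 1 KB array of random bytes, so the serialized result
    // size is predictable and effectively incompressible.
    val data = sc.makeRDD(0 until 10, 10).map { _ =>
      val bytes = Array.ofDim[Byte](1024)
      scala.util.Random.nextBytes(bytes)
      bytes
    }
    val collected = data.collect()  // roughly 10 KB total, well under a 2m limit
    ```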


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61193359
  
    Wait, is maxResultSize across collects? That doesn't make sense; it should be enforced for each pending operation. I agree that it would be good to send this size to the user in the abort message; after all, we have it right there.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61183847
  
    Thanks for updating this!  I'd still like the error message returned to the user (the one in the abort() call) to include the size of the too-big result, as well as the configured maximum size.  I think there is little cost to adding this information, and great savings to a user who is trying to understand this functionality.  It looks like you're running from a Spark shell with the logging level set to info, but users in other environments will only see the SparkException and not the log message.
    
    Also, it looks like you didn't address the comment about multiple jobs/stages running at once.  Right now, the maximum limit only applies to one stage.  This seems like an issue because multiple concurrent stages or jobs that all collect results can together add up to more than the limit.  @mateiz do you think this is a non-issue?
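    
    As a hedged sketch, an abort message carrying both numbers could look like the following; `Utils.bytesToString` is Spark's existing byte-formatting helper, while the message wording itself is illustrative, not the PR's final text:
    
    ```
    // Inside TaskSetManager, once the accumulated result size crosses the limit.
    abort(s"Total size of serialized results (${Utils.bytesToString(totalResultSize)}) " +
      s"is bigger than spark.driver.maxResultSize (${Utils.bytesToString(maxResultSize)})")
    ```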


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61154911
  
      [Test build #493 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/493/consoleFull) for   PR 3003 at commit [`6e8a135`](https://github.com/apache/spark/commit/6e8a135525511d551c55d29eb649d2bceb161e49).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61018878
  
      [Test build #22479 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22479/consoleFull) for   PR 3003 at commit [`3d81af2`](https://github.com/apache/spark/commit/3d81af20a06e41b43cde23dce56b250acb88c835).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61048612
  
    @mateiz I have re-implemented it; it now checks the result size before sending from the executor and before fetching in the driver. Please review again.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19636523
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskResult.scala ---
    @@ -31,8 +31,8 @@ import org.apache.spark.util.Utils
     private[spark] sealed trait TaskResult[T]
     
     /** A reference to a DirectTaskResult that has been stored in the worker's BlockManager. */
    -private[spark]
    -case class IndirectTaskResult[T](blockId: BlockId) extends TaskResult[T] with Serializable
    +private[spark] case class IndirectTaskResult[T](blockId: BlockId, size: Int)
    --- End diff --
    
    There will be three lines after adding `size: Int`. For consistency, it would also look different from TaskResult.
    
    I think the basic idea behind breaking the line is that it cannot fit on one line.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19635053
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
    @@ -29,7 +29,7 @@ import scala.math.min
     import org.apache.spark._
     import org.apache.spark.TaskState.TaskState
     import org.apache.spark.executor.TaskMetrics
    -import org.apache.spark.util.{Clock, SystemClock}
    +import org.apache.spark.util.{Utils, Clock, SystemClock}
    --- End diff --
    
    nit: alphabetization here


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19642646
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala ---
    @@ -64,9 +77,10 @@ private[spark] class TaskResultGetter(sparkEnv: SparkEnv, scheduler: TaskSchedul
                   val deserializedResult = serializer.get().deserialize[DirectTaskResult[_]](
                     serializedTaskResult.get)
                   sparkEnv.blockManager.master.removeBlock(blockId)
    -              deserializedResult
    +              (deserializedResult, size)
               }
    -          result.metrics.resultSize = serializedData.limit()
    +
    +          result.metrics.resultSize = size
    --- End diff --
    
    Oh that's a good point about the original one being incorrect for indirect task results!


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61025739
  
    Hey Davies, one other thing: in Executor, you should just not send back a result if the single result exceeds maxResultSize. That way you avoid killing the driver with that single result in clusters where executors have more memory than the driver (which I think can happen). You could just serialize the result, and if it's too big, fail the task with a "result too big" exception.
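    
    A minimal sketch of that executor-side check, assuming the `TooLargeTaskResult` marker quoted in the diffs elsewhere in this thread (surrounding code abbreviated):
    
    ```
    val serializedDirectResult = ser.serialize(directResult)
    val resultSize = serializedDirectResult.limit
    val serializedResult =
      if (maxResultSize > 0 && resultSize > maxResultSize) {
        // Drop the oversized payload and send back only its size, so the
        // driver can fail the job without ever receiving the big result.
        ser.serialize(new TooLargeTaskResult(resultSize))
      } else {
        serializedDirectResult
      }
    ```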


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61015771
  
      [Test build #22476 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22476/consoleFull) for   PR 3003 at commit [`ca8267d`](https://github.com/apache/spark/commit/ca8267d2b0fbbf7ebcf59e0f50862ed10433e9d4).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61152820
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22544/
    Test FAILed.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61393504
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22738/
    Test FAILed.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61190165
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22566/
    Test PASSed.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19708506
  
    --- Diff: docs/configuration.md ---
    @@ -112,6 +112,18 @@ of the most common options to set are:
       </td>
     </tr>
     <tr>
    +  <td><code>spark.driver.maxResultSize</code></td>
    +  <td>1g</td>
    +  <td>
    +    Limit of total size of serialized bytes of all partitions for each Spark action (e.g. collect),
    +    it should be at least 1M or 0 (means unlimited). The stage will be aborted if the total size
    +    is above this limit. 
    +    Having high limit may cause out-of-memory errors in driver (depends on spark.driver.memory
    --- End diff --
    
    high limit => a high limit


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19708503
  
    --- Diff: docs/configuration.md ---
    @@ -112,6 +112,18 @@ of the most common options to set are:
       </td>
     </tr>
     <tr>
    +  <td><code>spark.driver.maxResultSize</code></td>
    +  <td>1g</td>
    +  <td>
    +    Limit of total size of serialized bytes of all partitions for each Spark action (e.g. collect),
    +    it should be at least 1M or 0 (means unlimited). The stage will be aborted if the total size
    +    is above this limit. 
    --- End diff --
    
    Small wording fix:
    "Limit of total size of serialized results of all partitions for each Spark action (e.g. collect). Should be at least 1M, or 0 for unlimited. Jobs will be aborted if the total size of results is above this limit."


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61300138
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22613/
    Test FAILed.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61181743
  
      [Test build #22566 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22566/consoleFull) for   PR 3003 at commit [`5d62303`](https://github.com/apache/spark/commit/5d62303e3ebdfddf95447e5ed8aea90c6c010084).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19635858
  
    --- Diff: docs/configuration.md ---
    @@ -112,6 +112,18 @@ of the most common options to set are:
       </td>
     </tr>
     <tr>
    +  <td><code>spark.driver.maxResultSize</code></td>
    +  <td>1g</td>
    +  <td>
    +    Limit of total size of serialized bytes of all partitions for each Spark action (e.g. collect),
    +    it should be at least 1M or 0 (means unlimited). The stage will be aborted if the total size
    +    go above this limit. 
    --- End diff --
    
    go --> is


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19635398
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskResult.scala ---
    @@ -31,8 +31,8 @@ import org.apache.spark.util.Utils
     private[spark] sealed trait TaskResult[T]
     
     /** A reference to a DirectTaskResult that has been stored in the worker's BlockManager. */
    -private[spark]
    -case class IndirectTaskResult[T](blockId: BlockId) extends TaskResult[T] with Serializable
    +private[spark] case class IndirectTaskResult[T](blockId: BlockId, size: Int)
    --- End diff --
    
    Why did you change this spacing?  Either spacing is fine but for consistency, this should be the same as the formatting for DirectTaskResult class.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19636342
  
    --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
    @@ -210,25 +213,26 @@ private[spark] class Executor(
             val resultSize = serializedDirectResult.limit
     
             // directSend = sending directly back to the driver
    -        val (serializedResult, directSend) = {
    -          if (resultSize >= akkaFrameSize - AkkaUtils.reservedSizeBytes) {
    +        val serializedResult = {
    +          if (resultSize > maxResultSize) {
    +            logInfo(s"Finished $taskName (TID $taskId). result is too large (${resultSize} bytes),"
    --- End diff --
    
    This logging happens on the executor, so it is not user-facing directly in cluster mode.
    
    The logging in the driver is more important: one error log and a message for abort(). I will improve those two, but keep this one simple.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61345768
  
      [Test build #499 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/499/consoleFull) for   PR 3003 at commit [`2c35773`](https://github.com/apache/spark/commit/2c35773a83968c43bf8385cc20d76dd9d744a317).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19642508
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala ---
    @@ -64,9 +77,10 @@ private[spark] class TaskResultGetter(sparkEnv: SparkEnv, scheduler: TaskSchedul
                   val deserializedResult = serializer.get().deserialize[DirectTaskResult[_]](
                     serializedTaskResult.get)
                   sparkEnv.blockManager.master.removeBlock(blockId)
    -              deserializedResult
    +              (deserializedResult, size)
               }
    -          result.metrics.resultSize = serializedData.limit()
    +
    +          result.metrics.resultSize = size
    --- End diff --
    
    The metrics may be used by others, such as the WebUI, so let's keep it.
    
    The original one is not correct for IndirectTaskResult(), I think.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19621262
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
    @@ -563,6 +563,31 @@ class TaskSetManagerSuite extends FunSuite with LocalSparkContext with Logging {
         assert(manager.emittedTaskSizeWarning)
       }
     
    +  test("abort the job if total size of results is too large") {
    +    val conf = new SparkConf().set("spark.driver.maxResultSize", "2m")
    +    sc = new SparkContext("local", "test", conf)
    +
    +    // multiple tiny result
    +    val r = sc.makeRDD(0 until (1 << 10), 10).collect()
    +    assert((1 << 10) === r.size )
    +
    +    // single large result
    +    try {
    +      sc.makeRDD(0 until (1<<20), 1).collect()
    +      throw new Exception("single large result should fail")
    +    } catch {
    +      case e: SparkException =>
    +    }
    --- End diff --
    
    Instead of this pattern, you can use
    ```
    intercept[SparkException] { sc.makeRDD(...).collect() }
    ```
    This is provided by ScalaTest and expects that exception.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19623260
  
    --- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
    @@ -210,25 +213,26 @@ private[spark] class Executor(
             val resultSize = serializedDirectResult.limit
     
             // directSend = sending directly back to the driver
    -        val (serializedResult, directSend) = {
    -          if (resultSize >= akkaFrameSize - AkkaUtils.reservedSizeBytes) {
    +        val serializedResult = {
    +          if (resultSize > maxResultSize) {
    +            logInfo(s"Finished $taskName (TID $taskId). result is too large (${resultSize} bytes),"
    +              + " drop it")
    +            ser.serialize(new TooLargeTaskResult(resultSize))
    --- End diff --
    
    At first, I tried the way you suggested and introduced another kind of failure, but then we need special handling for it, because this kind of task should not be retried.
    
    Also, if no single task's result is bigger than maxResultSize but the total is, all the tasks succeed, yet we still abort the job. So I think it's better to handle a single large task the same way.
    
    So the current approach seems better than introducing another type of failure.
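    
    A sketch of the driver-side accumulation this argument leads to, using the `totalResultSize` field added to TaskSetManager in this PR (the method name and message text are illustrative, not the PR's final code):
    
    ```
    // Called once per successful task, with the serialized result size,
    // before the driver fetches or deserializes the result.
    def canFetchMoreResults(size: Long): Boolean = synchronized {
      totalResultSize += size
      if (maxResultSize > 0 && totalResultSize > maxResultSize) {
        abort(s"Total size of serialized results ($totalResultSize bytes) is bigger " +
          s"than spark.driver.maxResultSize ($maxResultSize bytes)")
        false
      } else {
        true
      }
    }
    ```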



[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61018889
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22479/
    Test PASSed.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61052819
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22523/
    Test PASSed.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19635782
  
    --- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
    @@ -563,6 +563,28 @@ class TaskSetManagerSuite extends FunSuite with LocalSparkContext with Logging {
         assert(manager.emittedTaskSizeWarning)
       }
     
    +  test("abort the job if total size of results is too large") {
    +    val conf = new SparkConf().set("spark.driver.maxResultSize", "2m")
    +    sc = new SparkContext("local", "test", conf)
    +
    +    // make it serializable
    +    val genBytes = (size: Int) => { (x: Int) =>
    +      val bytes = Array.ofDim[Byte](size)
    +      scala.util.Random.nextBytes(bytes)
    +      bytes
    +    }
    +
    +    // multiple 1k result
    +    val r = sc.makeRDD(0 until 10, 10).map(genBytes(1024)).collect()
    +    assert(10 === r.size )
    +
    +    // single 10M result
    +    intercept[SparkException] {sc.makeRDD(genBytes(10 << 20)(0), 1).collect()}
    --- End diff --
    
    Can you also make sure this is the correct exception (the result too big one)?
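    
    One way to tighten that assertion, sketched here with an assumed fragment of the abort message:
    
    ```
    val e = intercept[SparkException] {
      sc.makeRDD(genBytes(10 << 20)(0), 1).collect()
    }
    // The exact wording is up to the PR; asserting on "maxResultSize" is one option.
    assert(e.getMessage.contains("maxResultSize"))
    ```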


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by mateiz <gi...@git.apache.org>.
Github user mateiz commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61387167
  
    Hey @davies, this looks good to me. Made a few comments on wording and the name of one method.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/3003#issuecomment-61009945
  
      [Test build #22479 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22479/consoleFull) for   PR 3003 at commit [`3d81af2`](https://github.com/apache/spark/commit/3d81af20a06e41b43cde23dce56b250acb88c835).
     * This patch merges cleanly.


[GitHub] spark pull request: [SPARK-3466] Limit size of results that a driv...

Posted by kayousterhout <gi...@git.apache.org>.
Github user kayousterhout commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3003#discussion_r19636070
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskResultGetter.scala ---
    @@ -64,11 +75,15 @@ private[spark] class TaskResultGetter(sparkEnv: SparkEnv, scheduler: TaskSchedul
                   val deserializedResult = serializer.get().deserialize[DirectTaskResult[_]](
                     serializedTaskResult.get)
                   sparkEnv.blockManager.master.removeBlock(blockId)
    -              deserializedResult
    +              (deserializedResult, size)
               }
    -          result.metrics.resultSize = serializedData.limit()
    +
    +          result.metrics.resultSize = size
               scheduler.handleSuccessfulTask(taskSetManager, tid, result)
             } catch {
    +          case e: BigResultException =>
    +            taskSetManager.abort("Total size of result is bigger than maxResultSize, " +
    --- End diff --
    
    again, can you say what the result size was and what maxResultSize is here? That can help a lot for debugging.

