Posted to reviews@spark.apache.org by davies <gi...@git.apache.org> on 2014/10/24 22:37:18 UTC

[GitHub] spark pull request: use broadcast for task only when task is large...

GitHub user davies opened a pull request:

    https://github.com/apache/spark/pull/2933

    use broadcast for task only when task is large enough

    Using broadcast for small tasks brings no benefit and can even cause regressions (several extra RPCs). There are also some stability issues with broadcast, so we should use broadcast for tasks only when the serialized tasks are large enough (larger than 8k by default; this may change in the future).
    
    In practice, most tasks are small, so this should improve stability for most use cases, especially for tests, which start and stop the context multiple times.
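
A minimal sketch of the size rule described above, not the code from this pull request. The object and method names (`TaskDispatchSketch`, `minSizeBytes`, `choose`) are hypothetical, and a plain `Map` stands in for the Spark configuration so that the example runs without Spark on the classpath. The only details taken from the thread are the configuration key `spark.scheduler.broadcastTaskMinSize` and its default of 8 (in KB).

```scala
// Illustrative sketch only; names are hypothetical, not Spark APIs.
object TaskDispatchSketch {
  // Default threshold from the PR description: 8 KB.
  val DefaultMinSizeKB = 8

  sealed trait Dispatch
  case object SendInline extends Dispatch   // ship the bytes with the task message
  case object UseBroadcast extends Dispatch // ship the bytes through broadcast

  /** Reads the threshold in KB from a plain map and converts it to bytes. */
  def minSizeBytes(conf: Map[String, String]): Int =
    conf.get("spark.scheduler.broadcastTaskMinSize")
      .map(_.toInt)
      .getOrElse(DefaultMinSizeKB) * 1024

  /** Broadcasts only when the serialized task is strictly larger than the threshold. */
  def choose(serializedTask: Array[Byte], conf: Map[String, String]): Dispatch =
    if (serializedTask.length > minSizeBytes(conf)) UseBroadcast else SendInline
}
```

With an empty configuration, a 100-byte task would be sent inline, while anything above 8192 bytes would go through broadcast.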

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/davies/spark broadcast

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2933.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2933
    
----
commit 4ca3caa3f56e56139b2e33151ff95688232c4fe3
Author: Davies Liu <da...@databricks.com>
Date:   2014-10-24T20:22:58Z

    use broadcast when task is large enough

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60531516
  
    > The motivation is not about performance, it's about stability. 
    
    We have been fighting failures during task deserialization for days, and they cannot be reproduced easily. I hope we can fix this before the 1.2 release.




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60494261
  
      [Test build #450 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/450/consoleFull) for   PR 2933 at commit [`7dfe41e`](https://github.com/apache/spark/commit/7dfe41e7449c1103840a389de87557a84f8d9d9d).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: use broadcast for task only when task is large...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60446743
  
      [Test build #22161 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22161/consoleFull) for   PR 2933 at commit [`4ca3caa`](https://github.com/apache/spark/commit/4ca3caa3f56e56139b2e33151ff95688232c4fe3).
     * This patch **fails Scala style tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class LogInfo(startTime: Long, endTime: Long, path: String)`





[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60529659
  
    Broadcast (especially TorrentBroadcast) is designed for large objects. Using it to send out small shared variables is like using a tank to shoot a mosquito; it was not a good approach to begin with, and it makes simple things complicated.
    
    The motivation for broadcasting tasks is to solve the performance problem for `BIG` closures. It should not bring any regression for other cases (small closures), which are more common and more important in daily usage. In order to fix the regression (performance or stability), we may need to introduce even more complicated logic (such as embedded broadcast or piggybacking small blocks).




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-62330965
  
    I agree with @pwendell. It seems like the right thing to do is just fix Broadcast ... and if we can't, then wouldn't you also want to turn off Broadcast even for big closures?




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60500716
  
    I've been thinking about this some more and I wonder about the motivation for this change: how much of a performance benefit does this buy us for typical workloads?  This patch (and the other TorrentBroadcast inlining patch) adds extra code paths and complexity, but do they buy us measurable performance benefits?  I'm concerned about adding extra branches to already-complicated code.




[GitHub] spark pull request: use broadcast for task only when task is large...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60457547
  
      [Test build #22169 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22169/consoleFull) for   PR 2933 at commit [`28a9409`](https://github.com/apache/spark/commit/28a9409325d8fe7299dd8e01598ef86e73f70fa9).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2933#discussion_r19369470
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -124,6 +123,10 @@ class DAGScheduler(
       /** If enabled, we may run certain actions like take() and first() locally. */
       private val localExecutionEnabled = sc.getConf.getBoolean("spark.localExecution.enabled", false)
     
    +  /** Broadcast the serialized tasks only when they are bigger than it */
    +  private val broadcastTaskMinSize =
    +    sc.getConf.getInt("spark.scheduler.broadcastTaskMinSize", 8) * 1024
    --- End diff --
    
    As discussed offline, users take on the risk if they change it to unreasonable values.




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60553953
  
    Can you point me to the commit that produced that stacktrace?




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-67107062
  
    Closing this now.




[GitHub] spark pull request: use broadcast for task only when task is large...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60446746
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22161/
    Test FAILed.




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2933#discussion_r19377630
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -124,6 +123,10 @@ class DAGScheduler(
       /** If enabled, we may run certain actions like take() and first() locally. */
       private val localExecutionEnabled = sc.getConf.getBoolean("spark.localExecution.enabled", false)
     
    +  /** Broadcast the serialized tasks only when they are bigger than it */
    +  private val broadcastTaskMinSize =
    +    sc.getConf.getInt("spark.scheduler.broadcastTaskMinSize", 8) * 1024
    --- End diff --
    
    I think it's better to keep this internal. It's a tradeoff between 1.0 and 1.1, and most users do not need to touch this.
    
    We could document it later if users really need it.




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60474693
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22196/
    Test FAILed.




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60491396
  
      [Test build #450 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/450/consoleFull) for   PR 2933 at commit [`7dfe41e`](https://github.com/apache/spark/commit/7dfe41e7449c1103840a389de87557a84f8d9d9d).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-61395215
  
      [Test build #22746 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22746/consoleFull) for   PR 2933 at commit [`aab61a8`](https://github.com/apache/spark/commit/aab61a855c69b995f25070014c85d2ce04f39d15).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-61396544
  
      [Test build #22746 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22746/consoleFull) for   PR 2933 at commit [`aab61a8`](https://github.com/apache/spark/commit/aab61a855c69b995f25070014c85d2ce04f39d15).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class DecimalType(DataType):`
      * `case class UnscaledValue(child: Expression) extends UnaryExpression `
      * `case class MakeDecimal(child: Expression, precision: Int, scale: Int) extends UnaryExpression `
      * `case class MutableLiteral(var value: Any, dataType: DataType, nullable: Boolean = true)`
      * `case class PrecisionInfo(precision: Int, scale: Int)`
      * `case class DecimalType(precisionInfo: Option[PrecisionInfo]) extends FractionalType `
      * `final class Decimal extends Ordered[Decimal] with Serializable `
      * `  trait DecimalIsConflicted extends Numeric[Decimal] `





[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60498227
  
      [Test build #22223 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22223/consoleFull) for   PR 2933 at commit [`f3e2081`](https://github.com/apache/spark/commit/f3e20814411b36140edb726ac7e6b1dec1a8f939).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2933#discussion_r19369647
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/Stage.scala ---
    @@ -69,6 +70,10 @@ private[spark] class Stage(
       var resultOfJob: Option[ActiveJob] = None
       var pendingTasks = new HashSet[Task[_]]
     
    +  /** This is used to track the life cycle of broadcast,
    --- End diff --
    
    Super-minor style nit, but I think our usual style is to not place comments on the first `/**` line.




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60554225
  
    @JoshRosen @pwendell  The test branch (internal) did not have that commit. 




[GitHub] spark pull request: use broadcast for task only when task is large...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60452771
  
      [Test build #418 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/418/consoleFull) for   PR 2933 at commit [`4ca3caa`](https://github.com/apache/spark/commit/4ca3caa3f56e56139b2e33151ff95688232c4fe3).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60530014
  
    I don't see fundamentally why the broadcast mechanism can't be done as efficiently as task launching itself. Do you have a reproducible workload where this caused a performance regression and we couldn't optimize the broadcast sufficiently?




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60472463
  
      [Test build #22196 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22196/consoleFull) for   PR 2933 at commit [`7dfe41e`](https://github.com/apache/spark/commit/7dfe41e7449c1103840a389de87557a84f8d9d9d).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60818616
  
      [Test build #486 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/486/consoleFull) for   PR 2933 at commit [`f3e2081`](https://github.com/apache/spark/commit/f3e20814411b36140edb726ac7e6b1dec1a8f939).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60525495
  
    I find this a little bit hacky. If the broadcast implementation has bugs or performance issues, we should just fix them, and it will stabilize over time like any other new feature we add. Having a mode where we might do one thing or might do another will make debugging and measuring things trickier. And we'll expose this configuration option, which it seems we will ultimately want to remove.
    
    IMO this would only be justified if we had a well documented performance issue that we felt we simply can't solve within the broadcast architecture, then you would give a latch here for people to avoid broadcasting.




[GitHub] spark pull request: use broadcast for task only when task is large...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60458041
  
    Do you mind opening a JIRA for this?  You can link it to the other broadcast optimization one.




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by aarondav <gi...@git.apache.org>.
Github user aarondav commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2933#discussion_r19377173
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -124,6 +123,10 @@ class DAGScheduler(
       /** If enabled, we may run certain actions like take() and first() locally. */
       private val localExecutionEnabled = sc.getConf.getBoolean("spark.localExecution.enabled", false)
     
    +  /** Broadcast the serialized tasks only when they are bigger than it */
    +  private val broadcastTaskMinSize =
    +    sc.getConf.getInt("spark.scheduler.broadcastTaskMinSize", 8) * 1024
    --- End diff --
    
    Perhaps call this broadcastTaskMinSizeKB? Should we document this flag? Either way, there should be some mention that your jobs will literally stop working silently if you change this to be similar to the Akka frame size. It is not obvious that this is sent via Akka.
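
One way the concern above could be guarded against is a startup check that compares the threshold with the frame size. The sketch below is an assumption, not something proposed in this thread: `ThresholdCheckSketch`, `validate`, and the `ReservedBytes` margin are hypothetical, and the frame size is passed in as a plain number of megabytes rather than read from any real Spark or Akka API.

```scala
// Illustrative sketch only; names and the margin value are assumptions.
object ThresholdCheckSketch {
  // Hypothetical headroom for message overhead around an inline task.
  val ReservedBytes: Long = 200L * 1024

  /**
   * Returns Some(message) when a task just under the threshold could no
   * longer fit in one frame, and None when the setting leaves enough room.
   */
  def validate(minSizeKB: Int, frameSizeMB: Int): Option[String] = {
    val thresholdBytes = minSizeKB.toLong * 1024
    val maxInlineBytes = frameSizeMB.toLong * 1024 * 1024 - ReservedBytes
    if (thresholdBytes > maxInlineBytes)
      Some(s"broadcastTaskMinSize ($minSizeKB KB) is too close to the frame size ($frameSizeMB MB)")
    else
      None
  }
}
```

For example, `validate(8, 10)` returns `None`, while a threshold of 10240 KB against a 10 MB frame returns a warning message instead of letting oversized inline tasks fail silently.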




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60553581
  
    @JoshRosen I think we still have it (in tonight's tests):
    ```
    [info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage 11.0 (TID 11, localhost): java.io.IOException: unexpected exception type
    [info]         java.io.ObjectStreamClass.throwMiscException(ObjectStreamClass.java:1538)
    [info]         java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1025)
    [info]         java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
    [info]         java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    [info]         java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    [info]         java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    [info]         java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    [info]         java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    [info]         java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    [info]         java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    [info]         org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
    [info]         org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
    [info]         org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:164)
    [info]         java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    [info]         java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    [info]         java.lang.Thread.run(Thread.java:745)
    [info] Driver stacktrace:
    [info]   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1192)
    [info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1181)
    [info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1180)
    [info]   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    [info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    [info]   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1180)
    [info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:695)
    [info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:695)
    [info]   at scala.Option.foreach(Option.scala:236)
    [info]   at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:695)
    [info]   at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1398)
    [info]   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    [info]   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    [info]   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    [info]   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    [info]   at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    [info]   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    [info]   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    [info]   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    [info]   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
    ```




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-61396546
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22746/
    Test PASSed.




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60534648
  
    > We're fighting with the problem of failure during deserialize a task for days (failed in TorrentBroadcast)
    
    I thought we had fixed this issue; can you point me to new occurrences of it?




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60499988
  
      [Test build #22223 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22223/consoleFull) for   PR 2933 at commit [`f3e2081`](https://github.com/apache/spark/commit/f3e20814411b36140edb726ac7e6b1dec1a8f939).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: use broadcast for task only when task is large...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2933#discussion_r19369080
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -124,6 +123,10 @@ class DAGScheduler(
       /** If enabled, we may run certain actions like take() and first() locally. */
       private val localExecutionEnabled = sc.getConf.getBoolean("spark.localExecution.enabled", false)
     
    +  /** Broadcast the serialized tasks only when they are bigger than it */
    +  private val broadcastTaskMinSize =
    +    sc.getConf.getInt("spark.scheduler.broadcastTaskMinSize", 8) * 1024
    --- End diff --
    
    I think that the serialized task ends up being sent in an Akka message, so there could be problems if a user configures this to be higher than the capacity of the Akka frame.
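    
    A minimal sketch of the kind of guard this concern suggests, not code from the patch. It assumes the threshold is configured in KB under the key from the diff, that the Akka frame size is read from `spark.akka.frameSize` in MB with a default of 10, and it uses a plain `Map` as a stand-in for the real configuration object. The object and method names are hypothetical.
    
    ```scala
    // Hypothetical stand-alone sketch; it does not use Spark's real SparkConf.
    object BroadcastThresholdCheck {
      // Key from the diff above (value in KB).
      val MinSizeKey = "spark.scheduler.broadcastTaskMinSize"
      // Assumed key for the Akka frame size (value in MB, assumed default 10).
      val FrameSizeKey = "spark.akka.frameSize"

      // Read an integer setting, falling back to a default when it is absent.
      def getInt(conf: Map[String, String], key: String, default: Int): Int =
        conf.get(key).map(_.toInt).getOrElse(default)

      /** Returns the threshold in bytes, or fails fast if it does not fit in one frame. */
      def validatedThresholdBytes(conf: Map[String, String]): Int = {
        val thresholdBytes = getInt(conf, MinSizeKey, 8) * 1024
        val frameBytes = getInt(conf, FrameSizeKey, 10) * 1024 * 1024
        require(
          thresholdBytes < frameBytes,
          s"$MinSizeKey ($thresholdBytes bytes) must be smaller than " +
            s"$FrameSizeKey ($frameBytes bytes)")
        thresholdBytes
      }
    }
    ```
    
    With an empty configuration the defaults apply and the check passes; a threshold of 20480 KB against a 10 MB frame would be rejected up front instead of failing silently at send time.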




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-67102974
  
    What's the status on this PR / JIRA?  As far as I know, it seems that TorrentBroadcast has been more stable lately, so if the only motivation here was stability then I think we might be able to close this.




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60807456
  
      [Test build #486 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/486/consoleFull) for   PR 2933 at commit [`f3e2081`](https://github.com/apache/spark/commit/f3e20814411b36140edb726ac7e6b1dec1a8f939).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60499991
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22223/
    Test PASSed.




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60553702
  
    This is really strange; I thought that the "unexpected exception type" would have been addressed by https://github.com/apache/spark/pull/2932




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2933#discussion_r19377652
  
    --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala ---
    @@ -124,6 +123,10 @@ class DAGScheduler(
       /** If enabled, we may run certain actions like take() and first() locally. */
       private val localExecutionEnabled = sc.getConf.getBoolean("spark.localExecution.enabled", false)
     
    +  /** Broadcast the serialized tasks only when they are bigger than it */
    +  private val broadcastTaskMinSize =
    +    sc.getConf.getInt("spark.scheduler.broadcastTaskMinSize", 8) * 1024
    --- End diff --
    
    If a user changes the Akka frame size to a small value, jobs will also stop working silently, even without this patch.
    
    I think we should have good default values for these, and assume that users know the risk if they want to change some configs. It's not easy to make sure that all the configs are consistent with each other across all their possible values.




[GitHub] spark pull request: use broadcast for task only when task is large...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60457554
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22169/
    Test FAILed.




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60472421
  
      [Test build #427 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/427/consoleFull) for   PR 2933 at commit [`28a9409`](https://github.com/apache/spark/commit/28a9409325d8fe7299dd8e01598ef86e73f70fa9).
     * This patch merges cleanly.




[GitHub] spark pull request: use broadcast for task only when task is large...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60453129
  
      [Test build #22169 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22169/consoleFull) for   PR 2933 at commit [`28a9409`](https://github.com/apache/spark/commit/28a9409325d8fe7299dd8e01598ef86e73f70fa9).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60473661
  
      [Test build #427 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/427/consoleFull) for   PR 2933 at commit [`28a9409`](https://github.com/apache/spark/commit/28a9409325d8fe7299dd8e01598ef86e73f70fa9).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60474692
  
    **[Test build #22196 timed out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22196/consoleFull)**     for PR 2933 at commit [`7dfe41e`](https://github.com/apache/spark/commit/7dfe41e7449c1103840a389de87557a84f8d9d9d)     after a configured wait of `120m`.




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by davies <gi...@git.apache.org>.
Github user davies closed the pull request at:

    https://github.com/apache/spark/pull/2933




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60501858
  
    @JoshRosen The motivation is not performance; it's stability. Sending tasks to executors is the critical part of Spark, so it should be as stable as possible. Using broadcast to send tasks adds a lot of runtime complexity, and it has actually introduced some problems for us (we did not have them in 1.0). The goal of this patch is to remove the complexity of broadcast in most cases and use it only when it can bring performance benefits (when the tasks are large enough). In the future, maybe we could increase broadcastTaskMinSizeKB to 100 or even more.
    
    This adds some complexity to the code (not much), but it actually simplifies the runtime behavior. It should also give some performance gain (no RPC or cache at all).
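    
    The decision described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the patch: the object, trait, and method names are hypothetical, the real scheduler works with broadcast variables rather than these stand-in case classes, and the 8 KB default mirrors the value in the diff.
    
    ```scala
    // Hypothetical sketch of "broadcast only when the task is large enough".
    object TaskDispatchSketch {
      sealed trait Dispatch
      // Small task: ship the serialized bytes directly with the task message.
      case class Inline(bytes: Array[Byte]) extends Dispatch
      // Large task: stand-in for handing the bytes to the broadcast mechanism.
      case class ViaBroadcast(sizeBytes: Int) extends Dispatch

      // Broadcast only when the serialized task is strictly larger than the threshold.
      def choose(serializedTask: Array[Byte], minSizeBytes: Int = 8 * 1024): Dispatch =
        if (serializedTask.length > minSizeBytes) ViaBroadcast(serializedTask.length)
        else Inline(serializedTask)
    }
    ```
    
    Under this sketch a task of exactly 8192 bytes is still sent inline, and only larger ones take the broadcast path.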




[GitHub] spark pull request: use broadcast for task only when task is large...

Posted by JoshRosen <gi...@git.apache.org>.
Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60450522
  
    @ScrapCodes It looks like the Scala style checks failed due to a line that contained 104 characters, but the scalastyle output didn't list the actual cause of the failure:
    
    ```
    Scalastyle checks failed at following occurrences:
    java.lang.RuntimeException: exists error
    	at scala.sys.package$.error(package.scala:27)
    	at scala.Predef$.error(Predef.scala:142)
    [error] (core/*:scalastyle) exists error
    ```
    
    Any idea why it's not displaying the cause of the failure?




[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60474042
  
    @JoshRosen 
    #2846 fixes the scalastyle bug.




[GitHub] spark pull request: use broadcast for task only when task is large...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60457447
  
      [Test build #418 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/418/consoleFull) for   PR 2933 at commit [`4ca3caa`](https://github.com/apache/spark/commit/4ca3caa3f56e56139b2e33151ff95688232c4fe3).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `  case class ReconnectWorker(masterUrl: String) extends DeployMessage`
      * `          throw new SparkException("Failed to load class to register with Kryo", e)`
      * `class RankingMetrics[T: ClassTag](predictionAndLabels: RDD[(Array[T], Array[T])])`
      * `            raise TypeError("Cannot convert a Row class into dict")`
      * `class ShimFileSinkDesc(var dir: String, var tableInfo: TableDesc, var compressed: Boolean)`
      * `class ShimFileSinkDesc(var dir: String, var tableInfo: TableDesc, var compressed: Boolean)`
      * `  case class LogInfo(startTime: Long, endTime: Long, path: String)`





[GitHub] spark pull request: use broadcast for task only when task is large...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2933#issuecomment-60446605
  
      [Test build #22161 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22161/consoleFull) for   PR 2933 at commit [`4ca3caa`](https://github.com/apache/spark/commit/4ca3caa3f56e56139b2e33151ff95688232c4fe3).
     * This patch merges cleanly.

