You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by andrewor14 <gi...@git.apache.org> on 2014/08/14 00:37:59 UTC

[GitHub] spark pull request: [SPARK-3015] Block on cleaning tasks to preven...

GitHub user andrewor14 opened a pull request:

    https://github.com/apache/spark/pull/1931

    [SPARK-3015] Block on cleaning tasks to prevent Akka timeouts

    More detail on the issue is described in [SPARK-3015](https://issues.apache.org/jira/browse/SPARK-3015), but the TLDR is if we send too many blocking Akka messages that are dependent on each other in quick successions, then we end up causing a few of these messages to time out and ultimately kill the executors. As of #1498, we broadcast each RDD whether or not it is persisted. This means if we create many RDDs (each of which becomes a broadcast) and the driver performs a GC that cleans up all of these broadcast blocks, then we end up sending many `RemoveBroadcast` messages in parallel and trigger the chain of blocking messages at high frequencies.
    
    We do not know of the Akka-level root cause yet, so this is intended to be a temporary solution until we identify the real issue. I have done some preliminary testing of enabling blocking and observed that the queue length remains quite low (< 1000) even under very intensive workloads. This PR also logs an error message whenever the queue length exceeds a certain capacity, though I believe this is an unlikely case. Note that this is not a hard cap that limits the number of items in our reference queue but simply a soft threshold.
    
    In the long run, we should do something more sophisticated to allow a limited degree of parallelism through batching clean up tasks or processing them in a sliding window. In the longer run, we should clean up the whole `BlockManager*` message passing interface to avoid unnecessarily awaiting on futures created from Akka asks.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/andrewor14/spark reference-blocking

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1931.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1931
    
----
commit 0b7e7685c5c1c7dda73c64a9b5ed929ff3484a8d
Author: Andrew Or <an...@gmail.com>
Date:   2014-08-13T22:19:42Z

    Block on cleaning tasks by default + log error on queue full

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3015] Block on cleaning tasks to preven...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1931#discussion_r16222847
  
    --- Diff: core/src/main/scala/org/apache/spark/ContextCleaner.scala ---
    @@ -171,12 +203,44 @@ private[spark] class ContextCleaner(sc: SparkContext) extends Logging {
         }
       }
     
    +  /**
    +   * Log the length of the reference queue through reflection.
    +   * This is an expensive operation and should be called sparingly.
    +   */
    +  private def logQueueLength(): Unit = {
    +    try {
    +      queueLengthAccessor.foreach { field =>
    +        val length = field.getLong(referenceQueue)
    +        logDebug("Reference queue size is " + length)
    +        if (length > queueCapacity) {
    +          logQueueFullErrorMessage()
    +        }
    +      }
    +    } catch {
    +      case e: Exception =>
    +        logDebug("Failed to access reference queue's length through reflection: " + e)
    +    }
    +  }
    +
    +  /**
    +   * Log an error message to indicate that the queue has exceeded its capacity. Do this only once.
    +   */
    +  private def logQueueFullErrorMessage(): Unit = {
    +    if (!queueFullErrorMessageLogged) {
    +      queueFullErrorMessageLogged = true
    +      logError(s"Reference queue size in ContextCleaner has exceeded $queueCapacity! " +
    --- End diff --
    
    I am not sure whether this should be logError. Its not like the system is immediately tipping over because of it reached this capacity. I think it should be a logWarning.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3015] Block on cleaning tasks to preven...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1931#issuecomment-52126569
  
    QA results for PR 1931:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18489/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3015] Block on cleaning tasks to preven...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1931#issuecomment-52378622
  
      [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18652/consoleFull) for   PR 1931 at commit [`d0f7195`](https://github.com/apache/spark/commit/d0f7195b0835c5f2ea8d5374ccee8daf465b3805).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3015] Block on cleaning tasks to preven...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/1931#issuecomment-52269810
  
    Yeah, sounds good. I guess we'll use a `ReferenceQueueWithSize` or something instead


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3015] Block on cleaning tasks to preven...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1931#discussion_r16222794
  
    --- Diff: core/src/main/scala/org/apache/spark/ContextCleaner.scala ---
    @@ -171,12 +203,44 @@ private[spark] class ContextCleaner(sc: SparkContext) extends Logging {
         }
       }
     
    +  /**
    +   * Log the length of the reference queue through reflection.
    +   * This is an expensive operation and should be called sparingly.
    +   */
    +  private def logQueueLength(): Unit = {
    +    try {
    +      queueLengthAccessor.foreach { field =>
    +        val length = field.getLong(referenceQueue)
    +        logDebug("Reference queue size is " + length)
    +        if (length > queueCapacity) {
    +          logQueueFullErrorMessage()
    +        }
    +      }
    +    } catch {
    +      case e: Exception =>
    +        logDebug("Failed to access reference queue's length through reflection: " + e)
    --- End diff --
    
    Add a note on why this is logDebug and not logWarning/logError.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3015] Block on cleaning tasks to preven...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1931#issuecomment-52132300
  
    QA results for PR 1931:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18496/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3015] Block on cleaning tasks to preven...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1931#issuecomment-52121759
  
    QA tests have started for PR 1931. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18489/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3015] Block on cleaning tasks to preven...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1931#issuecomment-52128885
  
    QA tests have started for PR 1931. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18496/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3015] Block on cleaning tasks to preven...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1931#issuecomment-52136264
  
    QA results for PR 1931:<br>- This patch PASSES unit tests.<br>- This patch merges cleanly<br>- This patch adds no public classes<br><br>For more information see test ouptut:<br>https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18507/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3015] Block on cleaning tasks to preven...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/1931


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3015] Block on cleaning tasks to preven...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/1931#issuecomment-52378492
  
    I have removed the logic of logging queue length as a warning. This significantly simplifies the PR and fulfills its original purpose as a bug fix. We can add back some notion of warning later on if there is interest.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3015] Block on cleaning tasks to preven...

Posted by pwendell <gi...@git.apache.org>.
Github user pwendell commented on the pull request:

    https://github.com/apache/spark/pull/1931#issuecomment-52384958
  
    Great - I like this version better!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3015] Block on cleaning tasks to preven...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1931#discussion_r16223115
  
    --- Diff: core/src/main/scala/org/apache/spark/ContextCleaner.scala ---
    @@ -65,11 +66,37 @@ private[spark] class ContextCleaner(sc: SparkContext) extends Logging {
       private val cleaningThread = new Thread() { override def run() { keepCleaning() }}
     
       /**
    +   * Keep track of the reference queue length and log an error if this exceeds a certain capacity.
    +   * Unfortunately, Java's ReferenceQueue exposes neither the queue length nor the enqueue method,
    +   * so we have to do this through reflection. This is expensive, however, so we should access
    +   * this field only once in a while.
    +   */
    +  private val queueCapacity = 10000
    --- End diff --
    
    Well, this is not the capacity. It is just a warning threshold. it should be named accordingly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3015] Block on cleaning tasks to preven...

Posted by witgo <gi...@git.apache.org>.
Github user witgo commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1931#discussion_r16464399
  
    --- Diff: core/src/main/scala/org/apache/spark/ContextCleaner.scala ---
    @@ -66,10 +66,15 @@ private[spark] class ContextCleaner(sc: SparkContext) extends Logging {
     
       /**
        * Whether the cleaning thread will block on cleanup tasks.
    -   * This is set to true only for tests.
    +   *
    +   * Due to SPARK-3015, this is set to true by default. This is intended to be only a temporary
    +   * workaround for the issue, which is ultimately caused by the way the BlockManager actors
    +   * issue inter-dependent blocking Akka messages to each other at high frequencies. This happens,
    +   * for instance, when the driver performs a GC and cleans up all broadcast blocks that are no
    +   * longer in scope.
        */
       private val blockOnCleanupTasks = sc.conf.getBoolean(
    -    "spark.cleaner.referenceTracking.blocking", false)
    --- End diff --
    
    The changes will not solve the problem here. see.
    [BlockManagerMasterActor.scala#L165](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala#L165)
    ```scala
      private def removeShuffle(shuffleId: Int): Future[Seq[Boolean]] = {
        // Nothing to do in the BlockManagerMasterActor data structures
        import context.dispatcher
        val removeMsg = RemoveShuffle(shuffleId)
        Future.sequence(
          blockManagerInfo.values.map { bm =>
            // Here has set the akkaTimeout
            bm.slaveActor.ask(removeMsg)(akkaTimeout).mapTo[Boolean]
          }.toSeq
        )
      }
    ````



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3015] Block on cleaning tasks to preven...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1931#discussion_r16222813
  
    --- Diff: core/src/main/scala/org/apache/spark/ContextCleaner.scala ---
    @@ -65,11 +66,37 @@ private[spark] class ContextCleaner(sc: SparkContext) extends Logging {
       private val cleaningThread = new Thread() { override def run() { keepCleaning() }}
     
       /**
    +   * Keep track of the reference queue length and log an error if this exceeds a certain capacity.
    +   * Unfortunately, Java's ReferenceQueue exposes neither the queue length nor the enqueue method,
    +   * so we have to do this through reflection. This is expensive, however, so we should access
    +   * this field only once in a while.
    +   */
    +  private val queueCapacity = 10000
    +  private var queueFullErrorMessageLogged = false
    +  private val queueLengthAccessor: Option[Field] = {
    +    try {
    +      val f = classOf[ReferenceQueue[AnyRef]].getDeclaredField("queueLength")
    +      f.setAccessible(true)
    +      Some(f)
    +    } catch {
    +      case e: Exception =>
    +        logDebug("Failed to expose java.lang.ref.ReferenceQueue's queueLength field: " + e)
    --- End diff --
    
    Similar to the comment below, add a note.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3015] Block on cleaning tasks to preven...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1931#issuecomment-52379961
  
      [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18652/consoleFull) for   PR 1931 at commit [`d0f7195`](https://github.com/apache/spark/commit/d0f7195b0835c5f2ea8d5374ccee8daf465b3805).
     * This patch **passes** unit tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3015] Block on cleaning tasks to preven...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/1931#issuecomment-52133221
  
    QA tests have started for PR 1931. This patch merges cleanly. <br>View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18507/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3015] Block on cleaning tasks to preven...

Posted by tdas <gi...@git.apache.org>.
Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/1931#issuecomment-52149487
  
    Its a little ugly that the ContextCleaner class is being polluted with so many parameters, and all the temporary queue length code. Wouldnt it be much cleaner if we make a custom ReferenceQueue, which has the field length(), that does this reflection on itself to find the queue length. All the iteration counter, queue length checking and error message printing code can go inside that ReferenceQueue implementation, which is cleanly separated from the main context cleaner logic.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org