You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by uncleGen <gi...@git.apache.org> on 2017/01/16 09:17:14 UTC

[GitHub] spark pull request #16601: [SPARK-19182][DStream] Optimize the lock in Strea...

GitHub user uncleGen opened a pull request:

    https://github.com/apache/spark/pull/16601

    [SPARK-19182][DStream] Optimize the lock in StreamingJobProgressListener to not block UI when generating Streaming jobs

    ## What changes were proposed in this pull request?
    
    When DStreamGraph is generating a job, it will hold a lock and block other APIs. Because StreamingJobProgressListener (numInactiveReceivers, streamName(streamId: Int), streamIds) needs to call DStreamGraph's methods to access some information, the UI may hang if generating a job is very slow (e.g., talking to the slow Kafka cluster to fetch metadata).
    It's better to optimize the locks in DStreamGraph and StreamingJobProgressListener to make the UI not block by job generation.
    
    ## How was this patch tested?
    existing ut
    
    cc @zsxwing 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/uncleGen/spark SPARK-19182

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16601.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16601
    
----
commit 46036bf683632e03f970de20f7bcd17b5369d5dc
Author: uncleGen <hu...@gmail.com>
Date:   2017-01-16T09:14:32Z

    Optimize the lock in StreamingJobProgressListener to not block UI when generating Streaming jobs

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16601: [SPARK-19182][DStream] Optimize the lock in StreamingJob...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16601
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71492/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16601: [SPARK-19182][DStream] Optimize the lock in Strea...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16601#discussion_r96550938
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala ---
    @@ -197,17 +197,17 @@ private[spark] class StreamingJobProgressListener(ssc: StreamingContext)
       }
     
       def retainedCompletedBatches: Seq[BatchUIData] = synchronized {
    -    completedBatchUIData.toSeq
    +    completedBatchUIData
    --- End diff --
    
    nit: could you change `toSeq` to `toIndexedSeq` to avoid exposing a mutable collection.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16601: [SPARK-19182][DStream] Optimize the lock in StreamingJob...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16601
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16601: [SPARK-19182][DStream] Optimize the lock in Strea...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16601#discussion_r96550577
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala ---
    @@ -31,12 +31,15 @@ final private[streaming] class DStreamGraph extends Serializable with Logging {
       private val inputStreams = new ArrayBuffer[InputDStream[_]]()
       private val outputStreams = new ArrayBuffer[DStream[_]]()
     
    +  val inputStreamNameAndID = new ArrayBuffer[(String, Int)]()
    +
       var rememberDuration: Duration = null
       var checkpointInProgress = false
     
       var zeroTime: Time = null
       var startTime: Time = null
       var batchDuration: Duration = null
    +  var numReceivers: Int = 0
    --- End diff --
    
    nit: add `@volatile private` 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16601: [SPARK-19182][DStream] Optimize the lock in Strea...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16601#discussion_r96550717
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala ---
    @@ -31,12 +31,15 @@ final private[streaming] class DStreamGraph extends Serializable with Logging {
       private val inputStreams = new ArrayBuffer[InputDStream[_]]()
       private val outputStreams = new ArrayBuffer[DStream[_]]()
     
    +  val inputStreamNameAndID = new ArrayBuffer[(String, Int)]()
    --- End diff --
    
    nit: change it to `@volatile private var inputStreamNameAndID: Seq[(String, Int)] = Nil` and just set it in `start`. Don't expose a mutable ArrayBuffer to the caller.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16601: [SPARK-19182][DStream] Optimize the lock in StreamingJob...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16601
  
    **[Test build #71441 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71441/testReport)** for PR 16601 at commit [`46036bf`](https://github.com/apache/spark/commit/46036bf683632e03f970de20f7bcd17b5369d5dc).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16601: [SPARK-19182][DStream] Optimize the lock in StreamingJob...

Posted by uncleGen <gi...@git.apache.org>.
Github user uncleGen commented on the issue:

    https://github.com/apache/spark/pull/16601
  
    @zsxwing Take a review please!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16601: [SPARK-19182][DStream] Optimize the lock in StreamingJob...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16601
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/71553/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16601: [SPARK-19182][DStream] Optimize the lock in Strea...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16601#discussion_r96550866
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala ---
    @@ -106,9 +111,9 @@ final private[streaming] class DStreamGraph extends Serializable with Logging {
           .toArray
       }
     
    -  def getInputStreamName(streamId: Int): Option[String] = synchronized {
    -    inputStreams.find(_.id == streamId).map(_.name)
    -  }
    +  def getReceiverNumber: Int = numReceivers
    --- End diff --
    
    nit `getNumReceivers` for consistence.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16601: [SPARK-19182][DStream] Optimize the lock in StreamingJob...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/16601
  
    Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16601: [SPARK-19182][DStream] Optimize the lock in StreamingJob...

Posted by uncleGen <gi...@git.apache.org>.
Github user uncleGen commented on the issue:

    https://github.com/apache/spark/pull/16601
  
    also cc @tdas


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16601: [SPARK-19182][DStream] Optimize the lock in StreamingJob...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16601
  
    **[Test build #71492 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71492/testReport)** for PR 16601 at commit [`eaa7b15`](https://github.com/apache/spark/commit/eaa7b15f19711b27e628cfe366fa819a46d0e450).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16601: [SPARK-19182][DStream] Optimize the lock in Strea...

Posted by uncleGen <gi...@git.apache.org>.
Github user uncleGen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16601#discussion_r96358523
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala ---
    @@ -112,12 +112,10 @@ final private[streaming] class DStreamGraph extends Serializable with Logging {
     
       def generateJobs(time: Time): Seq[Job] = {
         logDebug("Generating jobs for time " + time)
    -    val jobs = this.synchronized {
    -      outputStreams.flatMap { outputStream =>
    -        val jobOption = outputStream.generateJob(time)
    -        jobOption.foreach(_.setCallSite(outputStream.creationSite))
    -        jobOption
    -      }
    +    val jobs = getOutputStreams().flatMap { outputStream =>
    --- End diff --
    
    Yes, I have put the question to be too simple


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16601: [SPARK-19182][DStream] Optimize the lock in StreamingJob...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on the issue:

    https://github.com/apache/spark/pull/16601
  
    LGTM. Merging to master. Thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16601: [SPARK-19182][DStream] Optimize the lock in StreamingJob...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16601
  
    **[Test build #71553 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71553/testReport)** for PR 16601 at commit [`e51623c`](https://github.com/apache/spark/commit/e51623c007b9faf2ba4fe7c92ad138b0c9c2a8c1).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16601: [SPARK-19182][DStream] Optimize the lock in Strea...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/16601


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16601: [SPARK-19182][DStream] Optimize the lock in StreamingJob...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16601
  
    **[Test build #71441 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71441/testReport)** for PR 16601 at commit [`46036bf`](https://github.com/apache/spark/commit/46036bf683632e03f970de20f7bcd17b5369d5dc).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16601: [SPARK-19182][DStream] Optimize the lock in StreamingJob...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16601
  
    **[Test build #71492 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71492/testReport)** for PR 16601 at commit [`eaa7b15`](https://github.com/apache/spark/commit/eaa7b15f19711b27e628cfe366fa819a46d0e450).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request #16601: [SPARK-19182][DStream] Optimize the lock in Strea...

Posted by zsxwing <gi...@git.apache.org>.
Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16601#discussion_r96354491
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala ---
    @@ -112,12 +112,10 @@ final private[streaming] class DStreamGraph extends Serializable with Logging {
     
       def generateJobs(time: Time): Seq[Job] = {
         logDebug("Generating jobs for time " + time)
    -    val jobs = this.synchronized {
    -      outputStreams.flatMap { outputStream =>
    -        val jobOption = outputStream.generateJob(time)
    -        jobOption.foreach(_.setCallSite(outputStream.creationSite))
    -        jobOption
    -      }
    +    val jobs = getOutputStreams().flatMap { outputStream =>
    --- End diff --
    
    `synchronized` is to make sure `writeObject` never write some intermediate states of `DStreamGraph`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #16601: [SPARK-19182][DStream] Optimize the lock in StreamingJob...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/16601
  
    **[Test build #71553 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/71553/testReport)** for PR 16601 at commit [`e51623c`](https://github.com/apache/spark/commit/e51623c007b9faf2ba4fe7c92ad138b0c9c2a8c1).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org