
Posted to reviews@spark.apache.org by brkyvz <gi...@git.apache.org> on 2015/10/30 07:19:48 UTC

[GitHub] spark pull request: [SPARK-11419][STREAMING] Par recovery

GitHub user brkyvz opened a pull request:

    https://github.com/apache/spark/pull/9373

    [SPARK-11419][STREAMING] Par recovery

    The support for closing WriteAheadLog files after writes was just merged in. Closing every file after a write is a very expensive operation as it creates many small files on S3. It's not necessary to enable it on HDFS anyway.
    
    However, when you have many small files on S3, recovery takes a very long time. In addition, files pile up quickly and deletes may not be able to keep up, so deletes can also be parallelized.
    
    This PR adds support for the two parallelization steps mentioned above, in addition to fixes for a couple of failures I encountered during recovery.
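
As a rough illustration of the delete side of this (not the code in this PR), the sketch below deletes a batch of files through a fixed-size thread pool. It uses `java.nio.file` on local paths as a stand-in for the Hadoop `FileSystem` calls that a write-ahead log on HDFS or S3 would go through. `ParallelDeleteSketch`, `deleteAll`, and the one-minute timeout are hypothetical names and values chosen for the example.

```scala
import java.nio.file.{Files, Path}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ParallelDeleteSketch {
  /** Deletes `paths` with at most `numThreads` concurrent workers; returns how many files were removed. */
  def deleteAll(paths: Seq[Path], numThreads: Int): Int = {
    val pool = Executors.newFixedThreadPool(numThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      // One Future per file; the fixed pool caps how many deletes run at once.
      val results = paths.map(p => Future(Files.deleteIfExists(p)))
      // deleteIfExists returns true only when a file was actually removed.
      Await.result(Future.sequence(results), 1.minute).count(identity)
    } finally {
      pool.shutdown()
    }
  }
}
```

The pool size plays the same role as the thread count in the parallel recovery path: it bounds how many storage calls are in flight at once.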

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/brkyvz/spark par-recovery

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9373.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9373
    
----
commit 573b657bca5a77297cafbde489ba380b3372c81c
Author: Burak Yavuz <br...@gmail.com>
Date:   2015-10-29T20:54:02Z

    progress

commit 655f4bff61f1ffa21565c73eb0fe732c8ffada3e
Author: Burak Yavuz <br...@gmail.com>
Date:   2015-10-29T23:53:23Z

    ready for PR

commit 06da0d1389f658a1eb3e9aaaf1750fd7ad85567a
Author: Burak Yavuz <br...@gmail.com>
Date:   2015-10-30T06:12:41Z

    ready for PR

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-155976711
  
    Merged build started.



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by zsxwing <gi...@git.apache.org>.

Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9373#discussion_r44596014
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/util/FileBasedWriteAheadLog.scala ---
    @@ -251,4 +261,23 @@ private[streaming] object FileBasedWriteAheadLog {
           }
         }.sortBy { _.startTime }
       }
    +
    +  /**
    +   * This creates an iterator from a parallel collection, by keeping at most `n` objects in memory
    +   * at any given time, where `n` is the size of the thread pool. This is crucial for use cases
    +   * where we create `FileBasedWriteAheadLogReader`s during parallel recovery. We don't want to
    +   * open up `k` streams altogether where `k` is the size of the Seq that we want to parallelize.
    +   */
    +  def seqToParIterator[I, O](
    +      tpool: ThreadPoolExecutor,
    +      source: Seq[I],
    +      handler: I => Iterator[O]): Iterator[O] = {
    +    val taskSupport = new ThreadPoolTaskSupport(tpool)
    +    val groupSize = math.max(math.max(tpool.getCorePoolSize, tpool.getPoolSize), 8)
    --- End diff --
    
    And you need to use `ThreadUtils.newDaemonCachedThreadPool(...,maxThreadNumber)` to set it. Actually, I think you can just set `maxThreadNumber` to 8 and use it here.
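
The idea described in the doc comment of the quoted diff (never keep more than `n` handler results open at once) can be sketched without parallel collections. This is a simplified illustration, not the PR's implementation: the diff uses `ThreadPoolTaskSupport`, whereas the sketch below swaps in plain `Future`s. `BoundedParIteratorSketch`, `boundedParIterator`, the explicit `groupSize` parameter, and the one-minute timeout are all hypothetical.

```scala
import java.util.concurrent.ExecutorService
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BoundedParIteratorSketch {
  /** Applies `handler` to `source` in parallel, one group of `groupSize` items at a time. */
  def boundedParIterator[I, O](
      pool: ExecutorService,
      groupSize: Int,
      source: Seq[I],
      handler: I => Iterator[O]): Iterator[O] = {
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    // Iterator.flatMap is lazy: the next group is only submitted to the pool
    // once the results of the previous group have been fully consumed.
    source.iterator.grouped(groupSize).flatMap { group =>
      val opened = group.map(item => Future(handler(item)))
      // Future.sequence keeps the input order, so output order matches `source`.
      Await.result(Future.sequence(opened), 1.minute).iterator.flatMap(it => it)
    }
  }
}
```

With this shape, at most `groupSize` results of `handler` exist at any time, regardless of how long `source` is.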



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-155692990
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/45614/
    Test FAILed.



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-155269577
  
    **[Test build #45467 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45467/consoleFull)** for PR 9373 at commit [`c2cafe1`](https://github.com/apache/spark/commit/c2cafe1e948568bfc0e25657d21db0aebc3f32e4).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-155273239
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/45456/
    Test FAILed.



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by brkyvz <gi...@git.apache.org>.

Github user brkyvz commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-155252359
  
    @harishreedharan I couldn't test this on HDFS properly. Instead, I enabled the parallelization only when `closeFileAfterWrite` is enabled, which is when you actually need it. Does that sound okay to you?



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-155556798
  
    **[Test build #45534 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45534/consoleFull)** for PR 9373 at commit [`83aa28e`](https://github.com/apache/spark/commit/83aa28e05de4874eebc89be11dce0a0b8213007e).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-155253537
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by zsxwing <gi...@git.apache.org>.

Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9373#discussion_r44601858
  
    --- Diff: streaming/src/test/scala/org/apache/spark/streaming/util/WriteAheadLogSuite.scala ---
    @@ -198,6 +197,45 @@ class FileBasedWriteAheadLogSuite
     
       import WriteAheadLogSuite._
     
    +  test("FileBasedWriteAheadLog - seqToParIterator") {
    +    /*
    +     If the setting `closeFileAfterWrite` is enabled, we start generating a very large number of
    +     files. This causes recovery to take a very long time. In order to make it quicker, we
    +     parallelized the reading of these files. This test makes sure that we limit the number of
    +     open files to the size of the number of threads in our thread pool rather than the size of
    +     the list of files.
    +     */
    +    val numThreads = 8
    +    val tpool = ThreadUtils.newDaemonFixedThreadPool(numThreads, "wal-test-thread-pool")
    +    class GetMaxCounter {
    +      private val value = new AtomicInteger()
    +      @volatile private var max: Int = 0
    +      def increment(): Unit = synchronized {
    +        val atInstant = value.incrementAndGet()
    +        if (atInstant > max) max = atInstant
    +      }
    +      def decrement(): Unit = synchronized { value.decrementAndGet() }
    +      def get(): Int = synchronized { value.get() }
    +      def getMax(): Int = synchronized { max }
    +    }
    +    try {
    +      val testSeq = 1 to 64
    +      val counter = new GetMaxCounter()
    +      def handle(value: Int): Iterator[Int] = {
    +        new CompletionIterator[Int, Iterator[Int]](Iterator(value)) {
    +          counter.increment()
    +          override def completion() { counter.decrement() }
    +        }
    +      }
    +      val iterator = FileBasedWriteAheadLog.seqToParIterator[Int, Int](tpool, testSeq, handle)
    +      assert(iterator.toSeq === testSeq)
    +      assert(counter.getMax() > 1) // make sure we are doing a parallel computation!
    --- End diff --
    
    Here is the fix we discussed:
    ```Scala
        try {
          val latch = new CountDownLatch(1)
          val testSeq = 1 to 1000
          val counter = new GetMaxCounter()
          def handle(value: Int): Iterator[Int] = {
            new CompletionIterator[Int, Iterator[Int]](Iterator(value)) {
              counter.increment()
              latch.await(10, TimeUnit.SECONDS)
              override def completion() { counter.decrement() }
            }
          }
          @volatile var collected: Seq[Int] = Nil
          val t = new Thread() {
            override def run() {
              val iterator = FileBasedWriteAheadLog.seqToParIterator[Int, Int](tpool, testSeq, handle)
              collected = iterator.toSeq
            }
          }
          t.start()
          eventually(Eventually.timeout(10.seconds)) {
            // make sure we are doing a parallel computation!
            assert(counter.getMax() > 1)
          }
          latch.countDown()
          t.join(10000)
          assert(collected === testSeq)
          // make sure we didn't open too many Iterators
          assert(counter.getMax() <= numThreads)
        } finally {
          tpool.shutdownNow()
        }
    ```
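
For readers who want to try the counting part in isolation, here is a minimal sketch of a counter that tracks the highest number of simultaneously open items. It is an alternative formulation rather than the `GetMaxCounter` from the diff: it relies on two `AtomicInteger`s instead of `synchronized`. `MaxCounterSketch` and `MaxCounter` are hypothetical names.

```scala
import java.util.concurrent.atomic.AtomicInteger
import java.util.function.IntBinaryOperator

object MaxCounterSketch {
  final class MaxCounter {
    private val current = new AtomicInteger(0)
    private val highWater = new AtomicInteger(0)
    // Explicit anonymous class so the sketch does not depend on SAM conversion.
    private val maxOp = new IntBinaryOperator {
      override def applyAsInt(a: Int, b: Int): Int = math.max(a, b)
    }

    /** Registers one more open item and raises the high-water mark if needed. */
    def increment(): Unit = {
      highWater.accumulateAndGet(current.incrementAndGet(), maxOp)
      ()
    }

    /** Registers that one item was closed. */
    def decrement(): Unit = {
      current.decrementAndGet()
      ()
    }

    def get: Int = current.get()
    def getMax: Int = highWater.get()
  }
}
```

A test can then assert `getMax > 1` to show that work overlapped and `getMax <= numThreads` to show that the bound held, as in the snippet above.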



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-155845851
  
    Merged build started.



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by brkyvz <gi...@git.apache.org>.

Github user brkyvz commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9373#discussion_r43475406
  
    --- Diff: streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobScheduler.scala ---
    @@ -88,8 +88,10 @@ class JobScheduler(val ssc: StreamingContext) extends Logging {
         if (eventLoop == null) return // scheduler has already been stopped
         logDebug("Stopping JobScheduler")
     
    -    // First, stop receiving
    -    receiverTracker.stop(processAllReceivedData)
    +    if (receiverTracker != null) {
    +      // First, stop receiving
    +      receiverTracker.stop(processAllReceivedData)
    --- End diff --
    
    An NPE is thrown when the streaming context is stopped before recovery is complete.
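
The failure mode and the guard can be reproduced in a few lines. The sketch below is a toy model, not Spark code: `Tracker` and `Scheduler` are hypothetical stand-ins for a component that is only created during start-up and an owner whose `stop()` may run before that happens. It uses `Option(...)` instead of the explicit `!= null` check in the diff; both have the same effect.

```scala
object NullSafeStopSketch {
  // Hypothetical stand-in for a component that is created during start-up.
  final class Tracker {
    @volatile var stopped = false
    def stop(): Unit = { stopped = true }
  }

  final class Scheduler {
    // Stays null until start() has run, which is the window the guard protects.
    @volatile private var tracker: Tracker = null

    def start(): Unit = { tracker = new Tracker }

    /** Stops the tracker if it exists; returns whether there was anything to stop. */
    def stop(): Boolean = Option(tracker) match {
      case Some(t) =>
        t.stop()
        true
      case None =>
        false // stop() was called before start() finished: nothing to do
    }
  }
}
```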



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-155905167
  
    Merged build finished. Test PASSed.



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-155904930
  
    **[Test build #45668 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45668/consoleFull)** for PR 9373 at commit [`ccf7f5b`](https://github.com/apache/spark/commit/ccf7f5b56f822cc41345e29321106afbe5670b7f).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-155253553
  
    Merged build started.



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-155254915
  
    **[Test build #45457 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45457/consoleFull)** for PR 9373 at commit [`0b7279f`](https://github.com/apache/spark/commit/0b7279fdda081e8f4557cc0fc0366331380e79e0).



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-156270843
  
    **[Test build #45781 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45781/consoleFull)** for PR 9373 at commit [`79e9b03`](https://github.com/apache/spark/commit/79e9b03e55382d64607ee39ac1a66a102574409a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by zsxwing <gi...@git.apache.org>.

Github user zsxwing commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9373#discussion_r44595545
  
    --- Diff: streaming/src/test/scala/org/apache/spark/streaming/util/WriteAheadLogSuite.scala ---
    @@ -582,6 +620,9 @@ object WriteAheadLogSuite {
           allowBatching: Boolean): Seq[String] = {
         val wal = createWriteAheadLog(logDirectory, closeFileAfterWrite, allowBatching)
         val data = wal.readAll().asScala.map(byteBufferToString).toSeq
    +    // The thread pool for parallel recovery gets killed with wal.close(). Therefore we need to
    +    // eagerly compute data, otherwise the lazy computation will fail.
    +    data.length
    --- End diff --
    
    Could you just change `toSeq` to `toArray`? `toArray` will drain the Iterator at once.
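
A small sketch of why draining matters here: if the iterator is backed by a resource that gets closed, every element has to be read before the close. `EagerDrainSketch`, `readAllEagerly`, and the `open`/`close` parameters are hypothetical and only model the situation described in the diff comment; they are not part of the write-ahead log API.

```scala
object EagerDrainSketch {
  /**
   * Reads every element while the backing resource is still open.
   * `toArray` consumes the whole iterator before `close` runs in `finally`.
   */
  def readAllEagerly(open: () => Iterator[String], close: () => Unit): Array[String] = {
    try open().toArray
    finally close()
  }
}
```

Returning a lazy view of the iterator from the same method would defer the reads until after `close()` has run, which is the failure the comment in the diff works around.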



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-156013324
  
    **[Test build #45712 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45712/consoleFull)** for PR 9373 at commit [`dbb31e3`](https://github.com/apache/spark/commit/dbb31e372d178c70bb3a6f8b18c931ce0867d4b2).



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-155976702
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-156020249
  
    Merged build finished. Test FAILed.



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-155516461
  
    Merged build started.



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/9373



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by brkyvz <gi...@git.apache.org>.

Github user brkyvz commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-152867985
  
    @harishreedharan Here are some benchmark results:
    For reference, the driver was a r3.2xlarge EC2 instance.
    
    ![image](https://cloud.githubusercontent.com/assets/5243515/10871515/54c14846-809e-11e5-91e6-2ac3605d98b7.png)
    
    | Num Threads | Rate (ms / file) | Speed-up    |
    |-------------|------------------|-------------|
    | 50          | 5.556101934      | 9.004997951 |
    | 25          | 5.99898194       | 8.340196225 |
    | 8           | 8.692144733      | 5.756080699 |
    | 4           | 14.1162362       | 3.544336169 |
    | 1           | 50.03268653      | 1           |
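
The speed-up column is the single-thread rate divided by the rate at each thread count. The sketch below recomputes it from the rates in the table and also derives a per-thread efficiency (speed-up divided by thread count), which is not in the original comment and is added here only to make the diminishing returns easier to see. `SpeedUpSketch` and its members are hypothetical names.

```scala
object SpeedUpSketch {
  // (threads, ms per file), copied from the table above.
  val rates: Seq[(Int, Double)] = Seq(
    50 -> 5.556101934,
    25 -> 5.99898194,
    8 -> 8.692144733,
    4 -> 14.1162362,
    1 -> 50.03268653)

  private val byThreads: Map[Int, Double] = rates.toMap
  private val baseline: Double = byThreads(1)

  /** Single-thread rate divided by the rate at `threads`. */
  def speedUp(threads: Int): Double = baseline / byThreads(threads)

  /** Speed-up per thread; 1.0 would be perfect linear scaling. */
  def efficiency(threads: Int): Double = speedUp(threads) / threads

  def main(args: Array[String]): Unit =
    rates.foreach { case (n, _) =>
      println(f"$n%2d threads: speed-up ${speedUp(n)}%.3f, efficiency ${efficiency(n)}%.2f")
    }
}
```

By this measure, 8 threads run at roughly 0.72 efficiency while 50 threads run at roughly 0.18, so most of the gain is already captured at a small pool size.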




[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-155690177
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-155516430
  
     Merged build triggered.



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-155677200
  
    Merged build started.



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-152469638
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44666/
    Test PASSed.



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by tdas <gi...@git.apache.org>.

Github user tdas commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-155981153
  
    @brkyvz A few more comments, plus one pending comment from before about adding more unit tests.
    @zsxwing please take a look once again.




[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-155848316
  
    **[Test build #45648 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45648/consoleFull)** for PR 9373 at commit [`1ba8340`](https://github.com/apache/spark/commit/1ba834000b40f0d4cf39be5972cb585f9bbb9006).



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-155266954
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/45457/



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-155677122
  
    Merged build triggered.



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by tdas <gi...@git.apache.org>.

Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9373#discussion_r43949361
  
    --- Diff: streaming/src/test/scala/org/apache/spark/streaming/ReceivedBlockTrackerSuite.scala ---
    @@ -207,6 +207,87 @@ class ReceivedBlockTrackerSuite
         tracker1.isWriteAheadLogEnabled should be (false)
       }
     
    +  test("parallel file deletion in FileBasedWriteAheadLog is robust to deletion error") {
    +    val manualClock = new ManualClock
    +    conf.set("spark.streaming.driver.writeAheadLog.rollingIntervalSecs", "1")
    +    require(WriteAheadLogUtils.getRollingIntervalSecs(conf, isDriver = true) === 1)
    +    val tracker = createTracker(clock = manualClock)
    +
    +    val addBlocks = generateBlockInfos()
    +    val batch1 = addBlocks.slice(0, 1)
    +    val batch2 = addBlocks.slice(1, 3)
    +    val batch3 = addBlocks.slice(3, 6)
    +
    +    def advanceTime(): Unit = manualClock.advance(1000)
    +
    +    assert(getWriteAheadLogFiles().length === 0)
    +
    --- End diff --
    
    Can you add inline comments to explain each step, so that the reader can understand what's going on?



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-155690192
  
    Merged build started.



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-155266838
  
    **[Test build #45457 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45457/consoleFull)** for PR 9373 at commit [`0b7279f`](https://github.com/apache/spark/commit/0b7279fdda081e8f4557cc0fc0366331380e79e0).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-155987716
  
    **[Test build #45700 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45700/consoleFull)** for PR 9373 at commit [`7e1829b`](https://github.com/apache/spark/commit/7e1829b87e4809a2096b969cdb8de73f03afd616).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-11419][STREAMING] Parallel recovery for...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9373#issuecomment-152469252
  
    **[Test build #44666 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44666/consoleFull)** for PR 9373 at commit [`7f8cfe3`](https://github.com/apache/spark/commit/7f8cfe340010e867ab73c40fc0ba39b0d0144695).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.

