You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by marsishandsome <gi...@git.apache.org> on 2015/02/11 07:52:57 UTC

[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

GitHub user marsishandsome opened a pull request:

    https://github.com/apache/spark/pull/4525

    [SPARK-5522] Accelerate the Histroty Server start

    When starting the history server, all the log files will be fetched and parsed in order to get the applications' meta data e.g. App Name, Start Time, Duration, etc. In our production cluster, there exist 2600 log files (160G) in HDFS and it costs 3 hours to restart the history server, which is a little bit too long for us.
    
    It would be better, if the history server can show logs with missing information during start-up and fill the missing information after fetching and parsing a log file.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/marsishandsome/spark Spark5522

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4525.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4525
    
----
commit be5670c937d163b3c8238248c80bcf472333678f
Author: guliangliang <gu...@qiyi.com>
Date:   2015-02-11T06:45:01Z

    [SPARK-5522] Accelerate the Histroty Server start

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-74048395
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27343/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-74048388
  
      [Test build #27343 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27343/consoleFull) for   PR 4525 at commit [`c1637e3`](https://github.com/apache/spark/commit/c1637e3335620ed8aac39dfeb3f9fe1252abfadd).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-74001562
  
    BTW, if following my suggestion, take a look at [this executor](http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/util/concurrent/MoreExecutors.html#sameThreadExecutor()) for tests, since the tests expect loading of new apps to be a synchronous operation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by marsishandsome <gi...@git.apache.org>.
Github user marsishandsome commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-76505657
  
    @andrewor14 please check


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-74056900
  
      [Test build #27349 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27349/consoleFull) for   PR 4525 at commit [`07e45d0`](https://github.com/apache/spark/commit/07e45d03c010ac4e34e96826220576e1d118b7dd).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r25013764
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -187,47 +201,78 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
             }
             .flatMap { entry =>
               try {
    -            Some(replay(entry, new ReplayListenerBus()))
    +            Some((getModificationTime(entry).get, entry))
               } catch {
                 case e: Exception =>
                   logError(s"Failed to load application log data from $entry.", e)
                   None
               }
             }
    -        .sortWith(compareAppInfo)
    +        .sortWith(_._1 >= _._1)
    +        .map(_._2)
    +
    +      val itor = logInfos.iterator
    +      while (itor.hasNext) {
    +        val batchSize = 20
    +        val batch = itor.take(batchSize)
    +        replayExecutor.submit(new Runnable {
    +          override def run(): Unit = replay(batch)
    +        })
    +        for (i <- 1 to batchSize) {
    +          if(itor.hasNext) {
    +            itor.next()
    +          }
    +        }
    +      }
     
           lastModifiedTime = newLastModifiedTime
    +    } catch {
    +      case e: Exception => logError("Exception in checking for event log updates", e)
    +    }
    +  }
     
    -      // When there are new logs, merge the new list with the existing one, maintaining
    -      // the expected ordering (descending end time). Maintaining the order is important
    -      // to avoid having to sort the list every time there is a request for the log list.
    -      if (!logInfos.isEmpty) {
    -        val newApps = new mutable.LinkedHashMap[String, FsApplicationHistoryInfo]()
    -        def addIfAbsent(info: FsApplicationHistoryInfo) = {
    -          if (!newApps.contains(info.id) ||
    -              newApps(info.id).logPath.endsWith(EventLoggingListener.IN_PROGRESS) &&
    -              !info.logPath.endsWith(EventLoggingListener.IN_PROGRESS)) {
    -            newApps += (info.id -> info)
    -          }
    -        }
    +  /**
    +   * Fetch and Parse the log files
    +   */
    +  private def replay(logs: Iterator[FileStatus]) {
    +    def addIfAbsent(newApps: mutable.LinkedHashMap[String, FsApplicationHistoryInfo],
    +                    info: FsApplicationHistoryInfo) {
    --- End diff --
    
    style should be
    ```
    def addIfAbsent(
        newApps: ...,
        info: ...): Unit = {
      ...
    }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-74229516
  
      [Test build #27435 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27435/consoleFull) for   PR 4525 at commit [`d5fac60`](https://github.com/apache/spark/commit/d5fac60389f0c197282a65b4a0c4d00d42e13f85).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-76377174
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28064/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by ryan-williams <gi...@git.apache.org>.
Github user ryan-williams commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r24613574
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -119,14 +138,79 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
         if (!conf.contains("spark.testing")) {
           logCheckingThread.setDaemon(true)
           logCheckingThread.start()
    +      logLazyReplayThread.setDaemon(true)
    +      logLazyReplayThread.start()
    +    } else {
    +      logLazyReplay()
         }
       }
     
    -  override def getListing() = applications.values
    +  /**
    +   * Fetch and Parse the log files
    +   */
    +  private[history] def logLazyReplay() {
    +    if(lazyApplications.isEmpty) return
    +
    +    logDebug("start doLazyReplay")
    --- End diff --
    
    `s/doLazyReplay/logLazyReplay/`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-74198732
  
    HI @marsishandsome ,
    
    I'm suggesting pretty much the same approach, but without using a shared data structure like you're currently doing. Basically, you'd do something like this (in pseudo-scalaish-code):
    
    val replayExecutor = Executors.newSingleThreadExecutor()
        
        def replay(logs: Seq[FileStatus]): Unit = {
          val newApps = logs.foreach { /* parse log into app info */ }
          this.apps merge(this.apps, newApps)
        }
        
        def checkForLogs(): Unit = {
          val logs = fs.listStatus(path)./* sort, filter, etc */
          while (!logs.isEmpty) {
            val batch = logs.take(20)
            replayExecutor.submit { () => replay(batch) }
          }
        }
    
    You don't need a shared data structure to hold the intermediate data, because the execution queue implicitly holds the data. Single it's a single-threaded executor, logs will be parsed in the order defined by `checkForLogs`, so everything should work like currently.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r24547490
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -113,15 +113,55 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
             "Logging directory specified is not a directory: %s".format(logDir))
         }
     
    -    checkForLogs()
    +    checkForLogs(_lazy = true)
     
    -    // Disable the background thread during tests.
    -    if (!conf.contains("spark.testing")) {
    -      logCheckingThread.setDaemon(true)
    -      logCheckingThread.start()
    +    doLazyReplay {
    --- End diff --
    
    I don't understand why you're calling `doLazyReplay` here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-76357565
  
      [Test build #28057 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28057/consoleFull) for   PR 4525 at commit [`8475136`](https://github.com/apache/spark/commit/8475136fb326d4084bb992774d9684c04d1e3c90).
     * This patch **fails Scala style tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r24632592
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -31,6 +31,8 @@ import org.apache.spark.scheduler._
     import org.apache.spark.ui.SparkUI
     import org.apache.spark.util.Utils
     
    +import scala.collection.mutable.ArrayBuffer
    --- End diff --
    
    nit: this should be grouped with other scala imports.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r24634640
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -119,14 +138,79 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
         if (!conf.contains("spark.testing")) {
           logCheckingThread.setDaemon(true)
           logCheckingThread.start()
    +      logLazyReplayThread.setDaemon(true)
    +      logLazyReplayThread.start()
    +    } else {
    +      logLazyReplay()
         }
       }
     
    -  override def getListing() = applications.values
    +  /**
    +   * Fetch and Parse the log files
    +   */
    +  private[history] def logLazyReplay() {
    +    if(lazyApplications.isEmpty) return
    +
    +    logDebug("start doLazyReplay")
    +    val mergeSize = 20
    +    val bufferedApps = new ArrayBuffer[FsApplicationHistoryInfo](mergeSize)
    +
    +    def addIfAbsent(newApps: mutable.LinkedHashMap[String, FsApplicationHistoryInfo],
    +                    info: FsApplicationHistoryInfo) {
    +      if (!newApps.contains(info.id) ||
    +        newApps(info.id).logPath.endsWith(EventLoggingListener.IN_PROGRESS) &&
    +          !info.logPath.endsWith(EventLoggingListener.IN_PROGRESS)) {
    +        newApps += (info.id -> info)
    +      }
    +    }
    +
    +    def mergeApps(): mutable.LinkedHashMap[String, FsApplicationHistoryInfo] = {
    +      val newApps = new mutable.LinkedHashMap[String, FsApplicationHistoryInfo]()
    +      bufferedApps.sortWith(compareAppInfo)
    +
    +      val newIterator = bufferedApps.iterator.buffered
    +      val oldIterator = applications.values.iterator.buffered
    +      while (newIterator.hasNext && oldIterator.hasNext) {
    +        if (compareAppInfo(newIterator.head, oldIterator.head)) {
    +          addIfAbsent(newApps, newIterator.next())
    +        } else {
    +          addIfAbsent(newApps, oldIterator.next())
    +        }
    +      }
    +      newIterator.foreach(addIfAbsent(newApps, _))
    +      oldIterator.foreach(addIfAbsent(newApps, _))
    +
    +      newApps
    +    }
    +
    +    val bus = new ReplayListenerBus()
    +    while(lazyApplications.nonEmpty){
    --- End diff --
    
    So, this feels a little racy, in that this thread might miss things added by the log checking thread. I'd suggest the following:
    
    - Create a single-threaded executor for running the replay tasks
    - Create a list of app infos to parse in the log checking thread, break it down into batches.
    - Submit each batch to the executor
    
    Basically, instead of having `logLazyReplay`, you'd have something like `replay(apps: Seq[LazyAppInfo])`. You don't need `lazyApplications` because that becomes part of the task being submitted to the executor, so you solve another source of contention in the code. And since it's a single-threaded executor, you know there's only a single thread touching `apps`, so it should all be thread-safe.
    
    For testing, you can use Guava's `sameThreadExecutor()` as I mentioned, instead of the single-threaded executor.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-76372131
  
      [Test build #28059 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28059/consoleFull) for   PR 4525 at commit [`9244bb6`](https://github.com/apache/spark/commit/9244bb66b92e681e3ec147a225a11c7d95beda00).
     * This patch **fails Spark unit tests**.
     * This patch **does not merge cleanly**.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r25014474
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -187,47 +201,78 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
             }
             .flatMap { entry =>
               try {
    -            Some(replay(entry, new ReplayListenerBus()))
    +            Some((getModificationTime(entry).get, entry))
               } catch {
                 case e: Exception =>
                   logError(s"Failed to load application log data from $entry.", e)
                   None
               }
             }
    -        .sortWith(compareAppInfo)
    +        .sortWith(_._1 >= _._1)
    +        .map(_._2)
    --- End diff --
    
    also, not your code but can you add the return type `Seq[FileStatus]` to `logInfos` in L187 of the new code?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-76348298
  
    Hey @marsishandsome I think the latest changes look reasonable. Can you rebase to master and address the latest set of comments? I would like to get this merged soon. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-76357567
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28057/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/4525


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-76391561
  
      [Test build #28069 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28069/consoleFull) for   PR 4525 at commit [`4340c2b`](https://github.com/apache/spark/commit/4340c2b6197d16b0670aa3ce8a49bb39b0439ec1).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r25014159
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -187,47 +201,78 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
             }
             .flatMap { entry =>
               try {
    -            Some(replay(entry, new ReplayListenerBus()))
    +            Some((getModificationTime(entry).get, entry))
               } catch {
                 case e: Exception =>
                   logError(s"Failed to load application log data from $entry.", e)
                   None
               }
             }
    -        .sortWith(compareAppInfo)
    +        .sortWith(_._1 >= _._1)
    +        .map(_._2)
    +
    +      val itor = logInfos.iterator
    +      while (itor.hasNext) {
    +        val batchSize = 20
    +        val batch = itor.take(batchSize)
    +        replayExecutor.submit(new Runnable {
    +          override def run(): Unit = replay(batch)
    +        })
    +        for (i <- 1 to batchSize) {
    +          if(itor.hasNext) {
    +            itor.next()
    +          }
    +        }
    +      }
     
           lastModifiedTime = newLastModifiedTime
    +    } catch {
    +      case e: Exception => logError("Exception in checking for event log updates", e)
    +    }
    +  }
     
    -      // When there are new logs, merge the new list with the existing one, maintaining
    -      // the expected ordering (descending end time). Maintaining the order is important
    -      // to avoid having to sort the list every time there is a request for the log list.
    -      if (!logInfos.isEmpty) {
    -        val newApps = new mutable.LinkedHashMap[String, FsApplicationHistoryInfo]()
    -        def addIfAbsent(info: FsApplicationHistoryInfo) = {
    -          if (!newApps.contains(info.id) ||
    -              newApps(info.id).logPath.endsWith(EventLoggingListener.IN_PROGRESS) &&
    -              !info.logPath.endsWith(EventLoggingListener.IN_PROGRESS)) {
    -            newApps += (info.id -> info)
    -          }
    -        }
    +  /**
    +   * Fetch and Parse the log files
    +   */
    +  private def replay(logs: Iterator[FileStatus]) {
    +    def addIfAbsent(newApps: mutable.LinkedHashMap[String, FsApplicationHistoryInfo],
    +                    info: FsApplicationHistoryInfo) {
    +      if (!newApps.contains(info.id) ||
    +        newApps(info.id).logPath.endsWith(EventLoggingListener.IN_PROGRESS) &&
    +          !info.logPath.endsWith(EventLoggingListener.IN_PROGRESS)) {
    +        newApps += (info.id -> info)
    +      }
    +    }
     
    -        val newIterator = logInfos.iterator.buffered
    -        val oldIterator = applications.values.iterator.buffered
    -        while (newIterator.hasNext && oldIterator.hasNext) {
    -          if (compareAppInfo(newIterator.head, oldIterator.head)) {
    -            addIfAbsent(newIterator.next)
    -          } else {
    -            addIfAbsent(oldIterator.next)
    -          }
    -        }
    -        newIterator.foreach(addIfAbsent)
    -        oldIterator.foreach(addIfAbsent)
    +    def mergeApps(list: Seq[FsApplicationHistoryInfo]) {
    +      val newApps = new mutable.LinkedHashMap[String, FsApplicationHistoryInfo]()
     
    -        applications = newApps
    +      val newIterator = list.iterator.buffered
    +      val oldIterator = applications.values.iterator.buffered
    +      while (newIterator.hasNext && oldIterator.hasNext) {
    +        if (compareAppInfo(newIterator.head, oldIterator.head)) {
    +          addIfAbsent(newApps, newIterator.next())
    +        } else {
    +          addIfAbsent(newApps, oldIterator.next())
    +        }
           }
    -    } catch {
    -      case e: Exception => logError("Exception in checking for event log updates", e)
    +      newIterator.foreach(addIfAbsent(newApps, _))
    +      oldIterator.foreach(addIfAbsent(newApps, _))
    +
    +      applications = newApps
         }
    +
    +    val bus = new ReplayListenerBus()
    +    val newApps = logs.flatMap { fileStatus =>
    +      try {
    +        val res = replay(fileStatus, bus)
    +        logInfo("load log " + res.logPath + " successfully")
    --- End diff --
    
    `logInfo(s"Application log ${res.logPath} loaded successfully.")`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by ryan-williams <gi...@git.apache.org>.
Github user ryan-williams commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-74157026
  
    my command was `mvn '-Dsuites=*FsHistory*' test` after a `build/sbt -Pyarn -Phadoop-2.3 clean assembly/assembly`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-76416431
  
      [Test build #28071 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28071/consoleFull) for   PR 4525 at commit [`a865c11`](https://github.com/apache/spark/commit/a865c119a445e3d566ab7b9e1ce34cbad32a5eb2).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-76398152
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28069/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-74041124
  
      [Test build #27343 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27343/consoleFull) for   PR 4525 at commit [`c1637e3`](https://github.com/apache/spark/commit/c1637e3335620ed8aac39dfeb3f9fe1252abfadd).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r25013864
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -187,47 +201,78 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
             }
             .flatMap { entry =>
               try {
    -            Some(replay(entry, new ReplayListenerBus()))
    +            Some((getModificationTime(entry).get, entry))
               } catch {
                 case e: Exception =>
                   logError(s"Failed to load application log data from $entry.", e)
                   None
               }
             }
    -        .sortWith(compareAppInfo)
    +        .sortWith(_._1 >= _._1)
    +        .map(_._2)
    +
    +      val itor = logInfos.iterator
    +      while (itor.hasNext) {
    +        val batchSize = 20
    +        val batch = itor.take(batchSize)
    +        replayExecutor.submit(new Runnable {
    +          override def run(): Unit = replay(batch)
    +        })
    +        for (i <- 1 to batchSize) {
    +          if(itor.hasNext) {
    +            itor.next()
    +          }
    +        }
    +      }
     
           lastModifiedTime = newLastModifiedTime
    +    } catch {
    +      case e: Exception => logError("Exception in checking for event log updates", e)
    +    }
    +  }
     
    -      // When there are new logs, merge the new list with the existing one, maintaining
    -      // the expected ordering (descending end time). Maintaining the order is important
    -      // to avoid having to sort the list every time there is a request for the log list.
    -      if (!logInfos.isEmpty) {
    -        val newApps = new mutable.LinkedHashMap[String, FsApplicationHistoryInfo]()
    -        def addIfAbsent(info: FsApplicationHistoryInfo) = {
    -          if (!newApps.contains(info.id) ||
    -              newApps(info.id).logPath.endsWith(EventLoggingListener.IN_PROGRESS) &&
    -              !info.logPath.endsWith(EventLoggingListener.IN_PROGRESS)) {
    -            newApps += (info.id -> info)
    -          }
    -        }
    +  /**
    +   * Fetch and Parse the log files
    +   */
    +  private def replay(logs: Iterator[FileStatus]) {
    +    def addIfAbsent(newApps: mutable.LinkedHashMap[String, FsApplicationHistoryInfo],
    +                    info: FsApplicationHistoryInfo) {
    +      if (!newApps.contains(info.id) ||
    +        newApps(info.id).logPath.endsWith(EventLoggingListener.IN_PROGRESS) &&
    +          !info.logPath.endsWith(EventLoggingListener.IN_PROGRESS)) {
    +        newApps += (info.id -> info)
    +      }
    +    }
     
    -        val newIterator = logInfos.iterator.buffered
    -        val oldIterator = applications.values.iterator.buffered
    -        while (newIterator.hasNext && oldIterator.hasNext) {
    -          if (compareAppInfo(newIterator.head, oldIterator.head)) {
    -            addIfAbsent(newIterator.next)
    -          } else {
    -            addIfAbsent(oldIterator.next)
    -          }
    -        }
    -        newIterator.foreach(addIfAbsent)
    -        oldIterator.foreach(addIfAbsent)
    +    def mergeApps(list: Seq[FsApplicationHistoryInfo]) {
    --- End diff --
    
    need `: Unit` return type here and other places, see https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r25013384
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -96,6 +99,11 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
         }
       }
     
    +  /**
    +   * An Executor to fetch and parse log files.
    +   */
    +  private var replayExecutor: ExecutorService = _
    --- End diff --
    
    this doesn't need to be a var. You can just initialize it here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-76377165
  
      [Test build #28064 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28064/consoleFull) for   PR 4525 at commit [`af92a5a`](https://github.com/apache/spark/commit/af92a5a640278e4861a9f3ae3adbde7cefb52943).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-76372143
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28059/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-76850970
  
    Ok I'm merging this in master thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-74237983
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27435/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r25490359
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -187,47 +200,74 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
             }
             .flatMap { entry =>
               try {
    -            Some(replay(entry, new ReplayListenerBus()))
    +            Some((getModificationTime(entry).get, entry))
               } catch {
                 case e: Exception =>
                   logError(s"Failed to load application log data from $entry.", e)
    --- End diff --
    
    this exception message is out of date. We're not replaying the logs here anymore


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-75350662
  
      [Test build #27802 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27802/consoleFull) for   PR 4525 at commit [`10d85b4`](https://github.com/apache/spark/commit/10d85b41d51a4a09075b510fbebbd618a5e2a120).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by marsishandsome <gi...@git.apache.org>.
Github user marsishandsome commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-74046962
  
    I've updated the implementation.
    
    Two background threads are used to load the log files:
    1. one thread to check the file list
    2. another to fetch and parse the log files
    
    There my be some race condition problems if a thread pool is used to fetch and parse the log files.
    The following problems must be taken care:
    1. The threads in the pool share a common unparsed file list, which is produced by another thread
    2. The threads in the pool update a common parsed file list
    3. The unparsed file list is sorted by file update time
    4. The parsed file list is sorted by application finish time
    5. The UI thread can at the same time get the content of both unparsed file list and parsed file list
    
    Other reasons why I choose the two-thread implementation are: 
    1. If a thread pool is used, the network will be the next bottleneck.
    2. It's ok for users, at least for me, if the missing meta information will be finished loading in 3 hours. At least they can visit the job detail webpage.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-73840730
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-76359067
  
      [Test build #28059 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28059/consoleFull) for   PR 4525 at commit [`9244bb6`](https://github.com/apache/spark/commit/9244bb66b92e681e3ec147a225a11c7d95beda00).
     * This patch **does not merge cleanly**.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r25014079
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -187,47 +201,78 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
             }
             .flatMap { entry =>
               try {
    -            Some(replay(entry, new ReplayListenerBus()))
    +            Some((getModificationTime(entry).get, entry))
               } catch {
                 case e: Exception =>
                   logError(s"Failed to load application log data from $entry.", e)
                   None
               }
             }
    -        .sortWith(compareAppInfo)
    +        .sortWith(_._1 >= _._1)
    +        .map(_._2)
    +
    +      val itor = logInfos.iterator
    +      while (itor.hasNext) {
    +        val batchSize = 20
    +        val batch = itor.take(batchSize)
    +        replayExecutor.submit(new Runnable {
    +          override def run(): Unit = replay(batch)
    +        })
    +        for (i <- 1 to batchSize) {
    +          if(itor.hasNext) {
    +            itor.next()
    +          }
    +        }
    +      }
     
           lastModifiedTime = newLastModifiedTime
    +    } catch {
    +      case e: Exception => logError("Exception in checking for event log updates", e)
    +    }
    +  }
     
    -      // When there are new logs, merge the new list with the existing one, maintaining
    -      // the expected ordering (descending end time). Maintaining the order is important
    -      // to avoid having to sort the list every time there is a request for the log list.
    -      if (!logInfos.isEmpty) {
    -        val newApps = new mutable.LinkedHashMap[String, FsApplicationHistoryInfo]()
    -        def addIfAbsent(info: FsApplicationHistoryInfo) = {
    -          if (!newApps.contains(info.id) ||
    -              newApps(info.id).logPath.endsWith(EventLoggingListener.IN_PROGRESS) &&
    -              !info.logPath.endsWith(EventLoggingListener.IN_PROGRESS)) {
    -            newApps += (info.id -> info)
    -          }
    -        }
    +  /**
    +   * Fetch and Parse the log files
    +   */
    +  private def replay(logs: Iterator[FileStatus]) {
    +    def addIfAbsent(newApps: mutable.LinkedHashMap[String, FsApplicationHistoryInfo],
    +                    info: FsApplicationHistoryInfo) {
    +      if (!newApps.contains(info.id) ||
    +        newApps(info.id).logPath.endsWith(EventLoggingListener.IN_PROGRESS) &&
    +          !info.logPath.endsWith(EventLoggingListener.IN_PROGRESS)) {
    +        newApps += (info.id -> info)
    +      }
    +    }
     
    -        val newIterator = logInfos.iterator.buffered
    -        val oldIterator = applications.values.iterator.buffered
    -        while (newIterator.hasNext && oldIterator.hasNext) {
    -          if (compareAppInfo(newIterator.head, oldIterator.head)) {
    -            addIfAbsent(newIterator.next)
    -          } else {
    -            addIfAbsent(oldIterator.next)
    -          }
    -        }
    -        newIterator.foreach(addIfAbsent)
    -        oldIterator.foreach(addIfAbsent)
    +    def mergeApps(list: Seq[FsApplicationHistoryInfo]) {
    --- End diff --
    
    also, can you add some in-line comments on what this is doing? I know it's not your code but right now it's a little hard for people who are not familiar with the code to follow


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r24707205
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -187,47 +201,78 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
             }
             .flatMap { entry =>
               try {
    -            Some(replay(entry, new ReplayListenerBus()))
    +            Some((getModificationTime(entry).get, entry))
               } catch {
                 case e: Exception =>
                   logError(s"Failed to load application log data from $entry.", e)
                   None
               }
             }
    -        .sortWith(compareAppInfo)
    +        .sortWith(_._1 >= _._1)
    +        .map(_._2)
    +
    +      val itor = logInfos.iterator
    +      while (itor.hasNext) {
    +        val batchSize = 20
    +        val batch = itor.take(batchSize)
    +        replayExecutor.submit(new Runnable {
    +          override def run(): Unit = replay(batch)
    +        })
    +        for (i <- 1 to batchSize) {
    +          if(itor.hasNext) {
    +            itor.next()
    +          }
    +        }
    +      }
     
           lastModifiedTime = newLastModifiedTime
    +    } catch {
    +      case e: Exception => logError("Exception in checking for event log updates", e)
    +    }
    +  }
     
    -      // When there are new logs, merge the new list with the existing one, maintaining
    -      // the expected ordering (descending end time). Maintaining the order is important
    -      // to avoid having to sort the list every time there is a request for the log list.
    -      if (!logInfos.isEmpty) {
    -        val newApps = new mutable.LinkedHashMap[String, FsApplicationHistoryInfo]()
    -        def addIfAbsent(info: FsApplicationHistoryInfo) = {
    -          if (!newApps.contains(info.id) ||
    -              newApps(info.id).logPath.endsWith(EventLoggingListener.IN_PROGRESS) &&
    -              !info.logPath.endsWith(EventLoggingListener.IN_PROGRESS)) {
    -            newApps += (info.id -> info)
    -          }
    -        }
    +  /**
    +   * Fetch and Parse the log files
    +   */
    +  private def replay(logs: Iterator[FileStatus]) {
    --- End diff --
    
    Also, just for clarity, I'd call this method `parseApplicationInfos` or something, since it's doing more than just replaying the logs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-76357398
  
      [Test build #28057 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28057/consoleFull) for   PR 4525 at commit [`8475136`](https://github.com/apache/spark/commit/8475136fb326d4084bb992774d9684c04d1e3c90).
     * This patch **does not merge cleanly**.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r24634243
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -66,6 +68,11 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
       @volatile private var applications: mutable.LinkedHashMap[String, FsApplicationHistoryInfo]
         = new mutable.LinkedHashMap()
     
    +  // The logCheckingThread appends elements at the end of lazyApplications, while
    +  // the logLazyReplayThread removes elements from the begin of lazyApplications.
    +  private val lazyApplications = new mutable.LinkedHashMap[String, LazyFsApplicationHistoryInfo]()
    +    with mutable.SynchronizedMap[String, LazyFsApplicationHistoryInfo]
    --- End diff --
    
    I've been told that using SynchronizedMap is generally a bad idea.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by marsishandsome <gi...@git.apache.org>.
Github user marsishandsome commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-74020260
  
    @vanzin Thanks for your advice. I will improve the implementation.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r24635360
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -66,6 +68,11 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
       @volatile private var applications: mutable.LinkedHashMap[String, FsApplicationHistoryInfo]
         = new mutable.LinkedHashMap()
     
    +  // The logCheckingThread appends elements at the end of lazyApplications, while
    +  // the logLazyReplayThread removes elements from the begin of lazyApplications.
    +  private val lazyApplications = new mutable.LinkedHashMap[String, LazyFsApplicationHistoryInfo]()
    +    with mutable.SynchronizedMap[String, LazyFsApplicationHistoryInfo]
    --- End diff --
    
    though here since we need to maintain the order, we might have to use `ConcurrentSkipListMap` instead


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r24685986
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -187,47 +201,78 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
             }
             .flatMap { entry =>
               try {
    -            Some(replay(entry, new ReplayListenerBus()))
    +            Some((getModificationTime(entry).get, entry))
               } catch {
                 case e: Exception =>
                   logError(s"Failed to load application log data from $entry.", e)
                   None
               }
             }
    -        .sortWith(compareAppInfo)
    +        .sortWith(_._1 >= _._1)
    +        .map(_._2)
    +
    +      val itor = logInfos.iterator
    +      while (itor.hasNext) {
    +        val batchSize = 20
    +        val batch = itor.take(batchSize)
    +        replayExecutor.submit(new Runnable {
    +          override def run(): Unit = replay(batch)
    +        })
    +        for (i <- 1 to batchSize) {
    +          if(itor.hasNext) {
    --- End diff --
    
    nit: `if (`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r25013894
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -187,47 +201,78 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
             }
             .flatMap { entry =>
               try {
    -            Some(replay(entry, new ReplayListenerBus()))
    +            Some((getModificationTime(entry).get, entry))
               } catch {
                 case e: Exception =>
                   logError(s"Failed to load application log data from $entry.", e)
                   None
               }
             }
    -        .sortWith(compareAppInfo)
    +        .sortWith(_._1 >= _._1)
    +        .map(_._2)
    +
    +      val itor = logInfos.iterator
    +      while (itor.hasNext) {
    +        val batchSize = 20
    +        val batch = itor.take(batchSize)
    +        replayExecutor.submit(new Runnable {
    +          override def run(): Unit = replay(batch)
    +        })
    +        for (i <- 1 to batchSize) {
    +          if(itor.hasNext) {
    +            itor.next()
    +          }
    +        }
    +      }
     
           lastModifiedTime = newLastModifiedTime
    +    } catch {
    +      case e: Exception => logError("Exception in checking for event log updates", e)
    +    }
    +  }
     
    -      // When there are new logs, merge the new list with the existing one, maintaining
    -      // the expected ordering (descending end time). Maintaining the order is important
    -      // to avoid having to sort the list every time there is a request for the log list.
    -      if (!logInfos.isEmpty) {
    -        val newApps = new mutable.LinkedHashMap[String, FsApplicationHistoryInfo]()
    -        def addIfAbsent(info: FsApplicationHistoryInfo) = {
    -          if (!newApps.contains(info.id) ||
    -              newApps(info.id).logPath.endsWith(EventLoggingListener.IN_PROGRESS) &&
    -              !info.logPath.endsWith(EventLoggingListener.IN_PROGRESS)) {
    -            newApps += (info.id -> info)
    -          }
    -        }
    +  /**
    +   * Fetch and Parse the log files
    --- End diff --
    
    is this true? Didn't we already fetch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r24685884
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -18,6 +18,9 @@
     package org.apache.spark.deploy.history
     
     import java.io.{BufferedInputStream, FileNotFoundException, InputStream}
    +import java.util.concurrent.{ExecutorService, Executors}
    +
    +import com.google.common.util.concurrent.MoreExecutors
    --- End diff --
    
    nit: this should go after `scala.*` imports


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r24717604
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -113,6 +121,12 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
             "Logging directory specified is not a directory: %s".format(logDir))
         }
     
    +    if (!conf.contains("spark.testing")) {
    +      replayExecutor = Executors.newSingleThreadExecutor()
    --- End diff --
    
    Ah, please use `Utils.namedThreadFactory` here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-75223672
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27773/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-74047341
  
      [Test build #27342 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27342/consoleFull) for   PR 4525 at commit [`2e59eb7`](https://github.com/apache/spark/commit/2e59eb73086ad2c9243f6d1f340707841190d2d3).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-73910687
  
    ok to test. Thanks for working on this!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-73994082
  
    nit: typo in the PR title


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-73993809
  
    So, I'm all for the feature, but I'm not sold on the approach here. It makes the code a little confusing, since you're trying to keep the code working in two different modes: "lazy" for the startup check, and "synchronous" for the subsequent checks.
    
    Instead, why not always do it lazily? Have a thread pool with a few worker threads, and have `checkForLogs` feed requests for parsing logs to that pool. `checkForLogs` will only list the files, regardless of when it's executed (startup vs. not).
    
    That looks like it would be easier to understand, at least to me, would provide performance improvements for all subsequent checks (not just the initial one), and would simplify the code a lot (not having to deal with different types for lazy vs. not lazy app info, for one).
    
    What do you think?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-76431586
  
      [Test build #28071 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28071/consoleFull) for   PR 4525 at commit [`a865c11`](https://github.com/apache/spark/commit/a865c119a445e3d566ab7b9e1ce34cbad32a5eb2).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-75215693
  
      [Test build #27773 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27773/consoleFull) for   PR 4525 at commit [`783097a`](https://github.com/apache/spark/commit/783097aea566d1c3c9e8fd751c04fa9dcb89ab3d).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r25490648
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -187,47 +200,74 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
             }
             .flatMap { entry =>
               try {
    -            Some(replay(entry, new ReplayListenerBus()))
    +            Some((getModificationTime(entry).get, entry))
               } catch {
                 case e: Exception =>
                   logError(s"Failed to load application log data from $entry.", e)
                   None
               }
             }
    -        .sortWith(compareAppInfo)
    +        .sortWith(_._1 >= _._1)
    +        .map { case (_, file) => file }
    +
    +      logInfos.sliding(20, 20).foreach { batch =>
    +        replayExecutor.submit(new Runnable {
    +          override def run(): Unit = mergeApplicationListing(batch)
    +        })
    +      }
     
           lastModifiedTime = newLastModifiedTime
    +    } catch {
    +      case e: Exception => logError("Exception in checking for event log updates", e)
    +    }
    +  }
     
    -      // When there are new logs, merge the new list with the existing one, maintaining
    -      // the expected ordering (descending end time). Maintaining the order is important
    -      // to avoid having to sort the list every time there is a request for the log list.
    --- End diff --
    
    please add back this comment


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by marsishandsome <gi...@git.apache.org>.
Github user marsishandsome commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-76342809
  
    Hi @andrewor14, is there anything I can do for this PR?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-74039909
  
      [Test build #27342 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27342/consoleFull) for   PR 4525 at commit [`2e59eb7`](https://github.com/apache/spark/commit/2e59eb73086ad2c9243f6d1f340707841190d2d3).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r25014623
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -187,47 +201,78 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
             }
             .flatMap { entry =>
               try {
    -            Some(replay(entry, new ReplayListenerBus()))
    +            Some((getModificationTime(entry).get, entry))
               } catch {
                 case e: Exception =>
                   logError(s"Failed to load application log data from $entry.", e)
                   None
               }
             }
    -        .sortWith(compareAppInfo)
    +        .sortWith(_._1 >= _._1)
    +        .map(_._2)
    +
    +      val itor = logInfos.iterator
    +      while (itor.hasNext) {
    +        val batchSize = 20
    +        val batch = itor.take(batchSize)
    +        replayExecutor.submit(new Runnable {
    +          override def run(): Unit = replay(batch)
    +        })
    +        for (i <- 1 to batchSize) {
    +          if(itor.hasNext) {
    +            itor.next()
    +          }
    +        }
    +      }
     
           lastModifiedTime = newLastModifiedTime
    +    } catch {
    +      case e: Exception => logError("Exception in checking for event log updates", e)
    +    }
    +  }
     
    -      // When there are new logs, merge the new list with the existing one, maintaining
    -      // the expected ordering (descending end time). Maintaining the order is important
    -      // to avoid having to sort the list every time there is a request for the log list.
    -      if (!logInfos.isEmpty) {
    -        val newApps = new mutable.LinkedHashMap[String, FsApplicationHistoryInfo]()
    -        def addIfAbsent(info: FsApplicationHistoryInfo) = {
    -          if (!newApps.contains(info.id) ||
    -              newApps(info.id).logPath.endsWith(EventLoggingListener.IN_PROGRESS) &&
    -              !info.logPath.endsWith(EventLoggingListener.IN_PROGRESS)) {
    -            newApps += (info.id -> info)
    -          }
    -        }
    +  /**
    +   * Fetch and Parse the log files
    +   */
    +  private def replay(logs: Iterator[FileStatus]) {
    +    def addIfAbsent(newApps: mutable.LinkedHashMap[String, FsApplicationHistoryInfo],
    +                    info: FsApplicationHistoryInfo) {
    +      if (!newApps.contains(info.id) ||
    +        newApps(info.id).logPath.endsWith(EventLoggingListener.IN_PROGRESS) &&
    +          !info.logPath.endsWith(EventLoggingListener.IN_PROGRESS)) {
    +        newApps += (info.id -> info)
    +      }
    +    }
     
    -        val newIterator = logInfos.iterator.buffered
    -        val oldIterator = applications.values.iterator.buffered
    -        while (newIterator.hasNext && oldIterator.hasNext) {
    -          if (compareAppInfo(newIterator.head, oldIterator.head)) {
    -            addIfAbsent(newIterator.next)
    -          } else {
    -            addIfAbsent(oldIterator.next)
    -          }
    -        }
    -        newIterator.foreach(addIfAbsent)
    -        oldIterator.foreach(addIfAbsent)
    +    def mergeApps(list: Seq[FsApplicationHistoryInfo]) {
    +      val newApps = new mutable.LinkedHashMap[String, FsApplicationHistoryInfo]()
     
    -        applications = newApps
    +      val newIterator = list.iterator.buffered
    +      val oldIterator = applications.values.iterator.buffered
    +      while (newIterator.hasNext && oldIterator.hasNext) {
    +        if (compareAppInfo(newIterator.head, oldIterator.head)) {
    +          addIfAbsent(newApps, newIterator.next())
    +        } else {
    +          addIfAbsent(newApps, oldIterator.next())
    +        }
           }
    -    } catch {
    -      case e: Exception => logError("Exception in checking for event log updates", e)
    +      newIterator.foreach(addIfAbsent(newApps, _))
    +      oldIterator.foreach(addIfAbsent(newApps, _))
    +
    +      applications = newApps
         }
    +
    +    val bus = new ReplayListenerBus()
    +    val newApps = logs.flatMap { fileStatus =>
    +      try {
    +        val res = replay(fileStatus, bus)
    +        logInfo("load log " + res.logPath + " successfully")
    +        Some(res)
    +      } catch {
    +        case e: Exception => None
    --- End diff --
    
    should we log some error here? Otherwise we silently fail.
    ```
    logError(s"Exception encountered when attempting to load application log ${fileStatus.getPath}")
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by marsishandsome <gi...@git.apache.org>.
Github user marsishandsome commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-74078446
  
    Hi @vanzin, the failed test passed on my local environment. I have no idea why it failed. Would please check it for me?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r24547591
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -365,3 +424,16 @@ private class FsApplicationHistoryInfo(
         sparkUser: String,
         completed: Boolean = true)
       extends ApplicationHistoryInfo(id, name, startTime, endTime, lastUpdated, sparkUser, completed)
    +
    +private class LazyFsApplicationHistoryInfo(
    +                                            val eventLog: FileStatus,
    --- End diff --
    
    Indentation here is all wrong. tabs vs. spaces?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-74070536
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27349/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-74047349
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27342/
    Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-74179242
  
    Hi, before I comment on the code itself, there's a bunch of style things that you'll need to fix here:
    
    - `while (foo)`, `if (foo)`, *not* `while(foo)`, `if(foo)`
    - `foreach { x =>`, not `foreach(x=>`
    - `def foo(): Unit = {`, not `def foo() {`
    
    There are probably others. Please follow the style guide posted on the [Spark Wiki](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by markhamstra <gi...@git.apache.org>.
Github user markhamstra commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r24634767
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -66,6 +68,11 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
       @volatile private var applications: mutable.LinkedHashMap[String, FsApplicationHistoryInfo]
         = new mutable.LinkedHashMap()
     
    +  // The logCheckingThread appends elements at the end of lazyApplications, while
    +  // the logLazyReplayThread removes elements from the begin of lazyApplications.
    +  private val lazyApplications = new mutable.LinkedHashMap[String, LazyFsApplicationHistoryInfo]()
    +    with mutable.SynchronizedMap[String, LazyFsApplicationHistoryInfo]
    --- End diff --
    
    That's true.  SynchronizedMap is deprecated in Scala 2.11: "Synchronization via traits is deprecated as it is inherently unreliable. Consider java.util.concurrent.ConcurrentHashMap as an alternative."


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-74305962
  
    @marsishandsome this looks a lot cleaner, thanks! Left a couple of comments, but other than that LGTM.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-75353533
  
      [Test build #27802 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27802/consoleFull) for   PR 4525 at commit [`10d85b4`](https://github.com/apache/spark/commit/10d85b41d51a4a09075b510fbebbd618a5e2a120).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-74070528
  
    **[Test build #27349 timed out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27349/consoleFull)**     for PR 4525 at commit [`07e45d0`](https://github.com/apache/spark/commit/07e45d03c010ac4e34e96826220576e1d118b7dd)     after a configured wait of `120m`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-75223664
  
      [Test build #27773 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27773/consoleFull) for   PR 4525 at commit [`783097a`](https://github.com/apache/spark/commit/783097aea566d1c3c9e8fd751c04fa9dcb89ab3d).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-75353536
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27802/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r25490429
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -187,47 +200,74 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
             }
             .flatMap { entry =>
               try {
    -            Some(replay(entry, new ReplayListenerBus()))
    +            Some((getModificationTime(entry).get, entry))
               } catch {
                 case e: Exception =>
                   logError(s"Failed to load application log data from $entry.", e)
                   None
               }
             }
    -        .sortWith(compareAppInfo)
    +        .sortWith(_._1 >= _._1)
    +        .map { case (_, file) => file }
    --- End diff --
    
    I think you can replace these three cases (the flatMap, sortWith, and map) with a single one.
    ```
    val logInfos: Seq[FileStatus] = statusList
      .filter { ... }
      .sortWith { case (entry1, entry2) =>
        val mod1 = getModificationTime(entry1).getOrElse(-1)
        val mod2 = getModificationTime(entry1).getOrElse(-1)
        mod1 >= mod2
      }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-74237979
  
      [Test build #27435 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27435/consoleFull) for   PR 4525 at commit [`d5fac60`](https://github.com/apache/spark/commit/d5fac60389f0c197282a65b4a0c4d00d42e13f85).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by ryan-williams <gi...@git.apache.org>.
Github user ryan-williams commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r24613618
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -119,14 +138,79 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
         if (!conf.contains("spark.testing")) {
           logCheckingThread.setDaemon(true)
           logCheckingThread.start()
    +      logLazyReplayThread.setDaemon(true)
    +      logLazyReplayThread.start()
    +    } else {
    +      logLazyReplay()
         }
       }
     
    -  override def getListing() = applications.values
    +  /**
    +   * Fetch and Parse the log files
    +   */
    +  private[history] def logLazyReplay() {
    +    if(lazyApplications.isEmpty) return
    +
    +    logDebug("start doLazyReplay")
    +    val mergeSize = 20
    +    val bufferedApps = new ArrayBuffer[FsApplicationHistoryInfo](mergeSize)
    +
    +    def addIfAbsent(newApps: mutable.LinkedHashMap[String, FsApplicationHistoryInfo],
    +                    info: FsApplicationHistoryInfo) {
    +      if (!newApps.contains(info.id) ||
    +        newApps(info.id).logPath.endsWith(EventLoggingListener.IN_PROGRESS) &&
    +          !info.logPath.endsWith(EventLoggingListener.IN_PROGRESS)) {
    +        newApps += (info.id -> info)
    +      }
    +    }
    +
    +    def mergeApps(): mutable.LinkedHashMap[String, FsApplicationHistoryInfo] = {
    +      val newApps = new mutable.LinkedHashMap[String, FsApplicationHistoryInfo]()
    +      bufferedApps.sortWith(compareAppInfo)
    +
    +      val newIterator = bufferedApps.iterator.buffered
    +      val oldIterator = applications.values.iterator.buffered
    +      while (newIterator.hasNext && oldIterator.hasNext) {
    +        if (compareAppInfo(newIterator.head, oldIterator.head)) {
    +          addIfAbsent(newApps, newIterator.next())
    +        } else {
    +          addIfAbsent(newApps, oldIterator.next())
    +        }
    +      }
    +      newIterator.foreach(addIfAbsent(newApps, _))
    +      oldIterator.foreach(addIfAbsent(newApps, _))
    +
    +      newApps
    +    }
    +
    +    val bus = new ReplayListenerBus()
    +    while(lazyApplications.nonEmpty){
    +      lazyApplications.iterator.take(mergeSize).foreach(keyValue => {
    +        try{
    +          val lazyInfo = keyValue._2
    +          val info = replay(lazyInfo.eventLog, bus)
    +          bufferedApps += info
    +          logDebug("replay application " + lazyInfo.id + " successfully")
    +        } catch {
    +          case e: Exception =>
    +        }
    +      })
    +      applications = mergeApps()
    +      for(i <- 1 to bufferedApps.size) lazyApplications.remove(lazyApplications.head._1)
    +      bufferedApps.clear()
    +    }
    +    logDebug("finish doLazyReplay")
    --- End diff --
    
    ditto


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by ryan-williams <gi...@git.apache.org>.
Github user ryan-williams commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-74156847
  
    fwiw @marsishandsome I ran `FsHistoryProviderSuite` on this PR and got a similar failure:
    
    ```
    FsHistoryProviderSuite:
    - Parse new and old application logs *** FAILED ***
      ApplicationHistoryInfo(old1,old-app-complete,2,3,1423776491000,test,true) was not equal to ApplicationHistoryInfo(new1,new-app-complete,1,4,1423776491000,test,true) (FsHistoryProviderSuite.scala:97)
    ```
    
    The failure you referenced above was:
    
    ```
    [info] - Parse new and old application logs *** FAILED *** (32 milliseconds)
    [info]   ApplicationHistoryInfo(old2,old-app-incomplete,2,-1,1423744927000,test,false) was not equal to ApplicationHistoryInfo(old1,old-app-complete,2,3,1423744927000,test,true) (FsHistoryProviderSuite.scala:99)
    [info]   org.scalatest.exceptions.TestFailedException:
    [info]   at org.scalatest.MatchersHelper$.newTestFailedException(MatchersHelper.scala:160)
    [info]   at org.scalatest.Matchers$ShouldMethodHelper$.shouldMatcher(Matchers.scala:6231)
    [info]   at org.scalatest.Matchers$AnyShouldWrapper.should(Matchers.scala:6265)
    [info]   at org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$3.apply$mcV$sp(FsHistoryProviderSuite.scala:99)
    [info]   at org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$3.apply(FsHistoryProviderSuite.scala:48)
    [info]   at org.apache.spark.deploy.history.FsHistoryProviderSuite$$anonfun$3.apply(FsHistoryProviderSuite.scala:48)
    [info]   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
    [info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
    [info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
    [info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
    [info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
    [info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
    [info]   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
    [info]   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
    [info]   at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
    [info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
    [info]   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scaAttempting to post to Github...
    ```
    
    Could the applications be getting read in a nondeterministic order somehow?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r25490585
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -187,47 +200,74 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
             }
             .flatMap { entry =>
               try {
    -            Some(replay(entry, new ReplayListenerBus()))
    +            Some((getModificationTime(entry).get, entry))
               } catch {
                 case e: Exception =>
                   logError(s"Failed to load application log data from $entry.", e)
                   None
               }
             }
    -        .sortWith(compareAppInfo)
    +        .sortWith(_._1 >= _._1)
    +        .map { case (_, file) => file }
    +
    +      logInfos.sliding(20, 20).foreach { batch =>
    +        replayExecutor.submit(new Runnable {
    +          override def run(): Unit = mergeApplicationListing(batch)
    +        })
    +      }
     
           lastModifiedTime = newLastModifiedTime
    +    } catch {
    +      case e: Exception => logError("Exception in checking for event log updates", e)
    +    }
    +  }
     
    -      // When there are new logs, merge the new list with the existing one, maintaining
    -      // the expected ordering (descending end time). Maintaining the order is important
    -      // to avoid having to sort the list every time there is a request for the log list.
    -      if (!logInfos.isEmpty) {
    -        val newApps = new mutable.LinkedHashMap[String, FsApplicationHistoryInfo]()
    -        def addIfAbsent(info: FsApplicationHistoryInfo) = {
    -          if (!newApps.contains(info.id) ||
    -              newApps(info.id).logPath.endsWith(EventLoggingListener.IN_PROGRESS) &&
    -              !info.logPath.endsWith(EventLoggingListener.IN_PROGRESS)) {
    -            newApps += (info.id -> info)
    -          }
    -        }
    +  /**
    +   * Replay the log files in the list and merge the list of old applications with new ones
    +   */
    +  private def mergeApplicationListing(logs: Seq[FileStatus]): Unit = {
    +    def addIfAbsent(
    +        newApps: mutable.LinkedHashMap[String, FsApplicationHistoryInfo],
    +        info: FsApplicationHistoryInfo): Unit = {
    +      if (!newApps.contains(info.id) ||
    +        newApps(info.id).logPath.endsWith(EventLoggingListener.IN_PROGRESS) &&
    +          !info.logPath.endsWith(EventLoggingListener.IN_PROGRESS)) {
    --- End diff --
    
    I personally find this `a || b && c` a little hard to read. I believe it always evaluates to `a || (b && c)`, so maybe it makes sense to rewrite this in a way that makes this more obvious:
    ```
    val appJustFinished = newApps(info.id).logPath.endsWith(...) && !info.logPath.endsWith(...)
    if (!newApps.contains(info.id) || appJustFinished) {
      ...
    }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by marsishandsome <gi...@git.apache.org>.
Github user marsishandsome commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-74197007
  
    @vanzin To my opinion, it's a typical Producer-Consumer Problem. 
    
    I'm a little confused with your approach. Would you please explain it in detail for me? Will a shared Container be used in your approach? If not, how to pass data from Producer to Consumer? If yes, what's the differences between your approach and mine?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r25649771
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -189,48 +201,69 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
                   false
               }
             }
    -        .flatMap { entry =>
    -          try {
    -            Some(replay(entry, new ReplayListenerBus()))
    -          } catch {
    -            case e: Exception =>
    -              logError(s"Failed to load application log data from $entry.", e)
    -              None
    -          }
    -        }
    -        .sortWith(compareAppInfo)
    +        .flatMap { entry => Some(entry) }
    +        .sortWith { case (entry1, entry2) =>
    +          val mod1 = getModificationTime(entry1).getOrElse(-1L)
    +          val mod2 = getModificationTime(entry2).getOrElse(-1L)
    +          mod1 >= mod2
    +      }
    --- End diff --
    
    indent. I will fix when I merge


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r25014413
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -187,47 +201,78 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
             }
             .flatMap { entry =>
               try {
    -            Some(replay(entry, new ReplayListenerBus()))
    +            Some((getModificationTime(entry).get, entry))
               } catch {
                 case e: Exception =>
                   logError(s"Failed to load application log data from $entry.", e)
                   None
               }
             }
    -        .sortWith(compareAppInfo)
    +        .sortWith(_._1 >= _._1)
    +        .map(_._2)
    --- End diff --
    
    Can you make this more readable
    ```
    .map { case (_, file) => file }
    ```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r24686318
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -187,47 +201,78 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
             }
             .flatMap { entry =>
               try {
    -            Some(replay(entry, new ReplayListenerBus()))
    +            Some((getModificationTime(entry).get, entry))
               } catch {
                 case e: Exception =>
                   logError(s"Failed to load application log data from $entry.", e)
                   None
               }
             }
    -        .sortWith(compareAppInfo)
    +        .sortWith(_._1 >= _._1)
    +        .map(_._2)
    +
    +      val itor = logInfos.iterator
    --- End diff --
    
    This is probably more cleany expressed as:
    
        logInfos.sliding(20, 20).foreach {
          replayExecutor.submit(...)
        }
    
        


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-73926814
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27290/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-73926806
  
      [Test build #27290 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27290/consoleFull) for   PR 4525 at commit [`69a2cf2`](https://github.com/apache/spark/commit/69a2cf2fce02ad14397e72e62df6fb32affed1ea).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by marsishandsome <gi...@git.apache.org>.
Github user marsishandsome commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-74231723
  
    I've updated my implementation according @vanzin 's advice.
    Thanks @vanzin 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-76368660
  
      [Test build #28064 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28064/consoleFull) for   PR 4525 at commit [`af92a5a`](https://github.com/apache/spark/commit/af92a5a640278e4861a9f3ae3adbde7cefb52943).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-73911610
  
      [Test build #27290 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27290/consoleFull) for   PR 4525 at commit [`69a2cf2`](https://github.com/apache/spark/commit/69a2cf2fce02ad14397e72e62df6fb32affed1ea).
     * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-76431601
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28071/
    Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4525#issuecomment-76398141
  
      [Test build #28069 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/28069/consoleFull) for   PR 4525 at commit [`4340c2b`](https://github.com/apache/spark/commit/4340c2b6197d16b0670aa3ce8a49bb39b0439ec1).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5522] Accelerate the Histroty Server st...

Posted by andrewor14 <gi...@git.apache.org>.
Github user andrewor14 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4525#discussion_r25490619
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -187,47 +200,74 @@ private[history] class FsHistoryProvider(conf: SparkConf) extends ApplicationHis
             }
             .flatMap { entry =>
               try {
    -            Some(replay(entry, new ReplayListenerBus()))
    +            Some((getModificationTime(entry).get, entry))
               } catch {
                 case e: Exception =>
                   logError(s"Failed to load application log data from $entry.", e)
                   None
               }
             }
    -        .sortWith(compareAppInfo)
    +        .sortWith(_._1 >= _._1)
    +        .map { case (_, file) => file }
    +
    +      logInfos.sliding(20, 20).foreach { batch =>
    +        replayExecutor.submit(new Runnable {
    +          override def run(): Unit = mergeApplicationListing(batch)
    +        })
    +      }
     
           lastModifiedTime = newLastModifiedTime
    +    } catch {
    +      case e: Exception => logError("Exception in checking for event log updates", e)
    +    }
    +  }
     
    -      // When there are new logs, merge the new list with the existing one, maintaining
    -      // the expected ordering (descending end time). Maintaining the order is important
    -      // to avoid having to sort the list every time there is a request for the log list.
    -      if (!logInfos.isEmpty) {
    -        val newApps = new mutable.LinkedHashMap[String, FsApplicationHistoryInfo]()
    -        def addIfAbsent(info: FsApplicationHistoryInfo) = {
    -          if (!newApps.contains(info.id) ||
    -              newApps(info.id).logPath.endsWith(EventLoggingListener.IN_PROGRESS) &&
    -              !info.logPath.endsWith(EventLoggingListener.IN_PROGRESS)) {
    -            newApps += (info.id -> info)
    -          }
    -        }
    +  /**
    +   * Replay the log files in the list and merge the list of old applications with new ones
    +   */
    +  private def mergeApplicationListing(logs: Seq[FileStatus]): Unit = {
    +    def addIfAbsent(
    +        newApps: mutable.LinkedHashMap[String, FsApplicationHistoryInfo],
    +        info: FsApplicationHistoryInfo): Unit = {
    +      if (!newApps.contains(info.id) ||
    +        newApps(info.id).logPath.endsWith(EventLoggingListener.IN_PROGRESS) &&
    +          !info.logPath.endsWith(EventLoggingListener.IN_PROGRESS)) {
    +        newApps += (info.id -> info)
    +      }
    +    }
     
    -        val newIterator = logInfos.iterator.buffered
    -        val oldIterator = applications.values.iterator.buffered
    -        while (newIterator.hasNext && oldIterator.hasNext) {
    -          if (compareAppInfo(newIterator.head, oldIterator.head)) {
    -            addIfAbsent(newIterator.next)
    -          } else {
    -            addIfAbsent(oldIterator.next)
    -          }
    -        }
    -        newIterator.foreach(addIfAbsent)
    -        oldIterator.foreach(addIfAbsent)
    +    def mergeApps(list: Seq[FsApplicationHistoryInfo]): Unit = {
    --- End diff --
    
    there's no reason for this to be in its own sub-method, since it's only called once at the end of `mergeApplicationListing`. I would just extract the content of this method out.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org