Posted to reviews@spark.apache.org by jianjianjiao <gi...@git.apache.org> on 2018/09/17 22:09:30 UTC

[GitHub] spark pull request #22444: implement incremental loading and add a flag to l...

GitHub user jianjianjiao opened a pull request:

    https://github.com/apache/spark/pull/22444

    implement incremental loading and add a flag to load incomplete or not

    ## What changes were proposed in this pull request?
    
    1.  Instead of loading all event logs on every scan, load only a limited number of them. If there are tens of thousands of event logs, loading all of them takes a long time.
    2.  When Spark runs on YARN, information about running jobs can be obtained from the YARN application master, so there is no need to load incomplete applications; add a flag to skip them.
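    A configuration sketch of what the two proposals might look like; the key names below are illustrative placeholders, not the names used in the patch:

```
# Hypothetical keys, for illustration only:
# cap on how many new event logs one scan handles
spark.history.fs.maxFilesPerScan            3000
# skip .inprogress logs (their status is available from the
# YARN application master when running on YARN)
spark.history.fs.loadIncompleteApplications false
```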
    
    ## How was this patch tested?
    This was tested manually in our production cluster.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jianjianjiao/spark speedUpSparkHistoryLoading

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22444.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22444
    
----
commit 1190ffcb109025bd62c909059b0cf16e6a748de9
Author: Rong Tang <ro...@...>
Date:   2018-09-17T22:00:23Z

    implement incremental loading and add a flag to load incomplete or not

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #22444: [SPARK-25409][Core]Speed up Spark History loading via in...

Posted by jianjianjiao <gi...@git.apache.org>.
Github user jianjianjiao commented on the issue:

    https://github.com/apache/spark/pull/22444
  
    @squito  Yes, you are correct. I was trying to make applications that run during the scan be picked up more quickly. It turns out that SPARK-6951 has done a great job of achieving this.
    



---



[GitHub] spark pull request #22444: [SPARK-25409][Core]Speed up Spark History loading...

Posted by jianjianjiao <gi...@git.apache.org>.
Github user jianjianjiao commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22444#discussion_r218292773
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -465,20 +475,31 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
                 }
               } catch {
                 case _: NoSuchElementException =>
    -              // If the file is currently not being tracked by the SHS, add an entry for it and try
    -              // to parse it. This will allow the cleaner code to detect the file as stale later on
    -              // if it was not possible to parse it.
    -              listing.write(LogInfo(entry.getPath().toString(), newLastScanTime, None, None,
    -                entry.getLen()))
    --- End diff --
    
    Hi, @squito  thanks for looking into this PR.
    
    When the Spark history server starts, it scans the event-log folder, handling files with multiple threads, and it will not start the next scan before the first one finishes. That is the problem: in our cluster there are about 20K event-log files (often bigger than 1 GB), including about 1K .inprogress files, and the first scan takes about two and a half hours. During those 2.5 hours, if a user submits a Spark application and it finishes, the user cannot find it via the Spark history UI and has to wait for the next scan.
    
    That is why I added a limit on how many files to scan each time, e.g. 3K. No matter how many log files are in the event-logs folder, the first scan handles only the first 3K. Suppose that during the first scan 5 applications are scanned and another 10 applications are updated; the second scan will then handle these 15 applications plus another 2885 files (from 3001 to 5885) in the event folder.
    
    checkForLogs scans the event-log folders and only handles files that have been updated or not yet handled.
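    The batching described above could be sketched as follows. This is an illustrative model only, not the actual FsHistoryProvider code; the names (`LogEntry`, `selectBatch`, `maxFilesPerScan`) are hypothetical:

```scala
// Illustrative sketch of per-scan batching (hypothetical names, not the
// actual FsHistoryProvider API): each scan handles files that were updated
// since the last scan, plus at most `maxFilesPerScan` previously unseen files.
case class LogEntry(path: String, mtime: Long)

object IncrementalScan {
  def selectBatch(
      all: Seq[LogEntry],
      lastSeen: Map[String, Long], // path -> mtime recorded at the last scan
      maxFilesPerScan: Int): Seq[LogEntry] = {
    // Already-tracked files whose logs grew or changed must be re-handled.
    val updated = all.filter(e => lastSeen.get(e.path).exists(_ < e.mtime))
    // Never-before-seen files, capped so a single scan stays bounded.
    val fresh = all.filterNot(e => lastSeen.contains(e.path)).take(maxFilesPerScan)
    updated ++ fresh
  }
}
```

    With a cap of 3K, a folder of 20K logs is worked through over several scans, while updated applications are still picked up on every pass.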



---



[GitHub] spark issue #22444: [SPARK-25409][Core]Speed up Spark History loading via in...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on the issue:

    https://github.com/apache/spark/pull/22444
  
    > history server startup needs to go through all these logs before being usable, so any server restart results in hours of downtime, just from scanning.
    
    I don't think this is true. The first scan may take a long time, but I think the SHS is usable even during that time. As soon as a scan makes it through a file, that file is added to the listing.
    
    But if I understand correctly, the advantage here is that as more applications are run during that 2.5 hour scan, you will pick those up more quickly.
    
    > 1. would it make sense for the initial scans to go for the most recent logs first, because that 2.5 hour time to scan all files is still there.
    > 2. would you want the UI and rest api to indicate that the scan was still in progress, and not to worry if the listing was incomplete?
    
    I think both of these already happen.
    
    @jianjianjiao again, it's been a while since I've looked at this code -- does that sound correct?


---



[GitHub] spark issue #22444: [SPARK-25409][Core]Speed up Spark History loading via in...

Posted by steveloughran <gi...@git.apache.org>.
Github user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/22444
  
    I see the reasoning here
    
    * @jianjianjiao has a very large cluster with many thousands of history files of past (successful) jobs.
    * history server startup needs to go through all these logs before being usable, so any server restart results in hours of downtime, just from scanning.
    * this patch breaks things up to be incremental.
    
    I don't have any opinions on the patch itself; I've not looked at that code for so long my reviews are probably dangerous.
    
    Two thoughts: 
    
    1. would it make sense for the initial scans to go for the most recent logs first, because that 2.5 hour time to scan all files is still there. 
    1. would you want the UI and rest api to indicate that the scan was still in progress, and not to worry if the listing was incomplete?


---



[GitHub] spark issue #22444: [SPARK-25409][Core]Speed up Spark History loading via in...

Posted by jianjianjiao <gi...@git.apache.org>.
Github user jianjianjiao commented on the issue:

    https://github.com/apache/spark/pull/22444
  
    Adding @vanzin @steveloughran @squito, who made changes to the related code.


---



[GitHub] spark issue #22444: implement incremental loading and add a flag to load inc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22444
  
    Can one of the admins verify this patch?


---




[GitHub] spark pull request #22444: [SPARK-25409][Core]Speed up Spark History loading...

Posted by squito <gi...@git.apache.org>.
Github user squito commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22444#discussion_r218279175
  
    --- Diff: core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
    @@ -465,20 +475,31 @@ private[history] class FsHistoryProvider(conf: SparkConf, clock: Clock)
                 }
               } catch {
                 case _: NoSuchElementException =>
    -              // If the file is currently not being tracked by the SHS, add an entry for it and try
    -              // to parse it. This will allow the cleaner code to detect the file as stale later on
    -              // if it was not possible to parse it.
    -              listing.write(LogInfo(entry.getPath().toString(), newLastScanTime, None, None,
    -                entry.getLen()))
    --- End diff --
    
    If you don't do this here for all entries, I think the cleaning around line 522 isn't going to work.


---



[GitHub] spark issue #22444: [SPARK-25409][Core]Speed up Spark History loading via in...

Posted by jianjianjiao <gi...@git.apache.org>.
Github user jianjianjiao commented on the issue:

    https://github.com/apache/spark/pull/22444
  
    @vanzin   Thanks very much for your suggestions. Loading event logs became much faster: from more than 2.5 hours down to 19 minutes for 17K event logs, some of them larger than 10 GB.
    
    1. Enabled SHS V2 disk caching. We are using Windows, where there is a small "posix.permissions not supported in windows" issue; I created a new PR at https://github.com/apache/spark/pull/22520 , could you please take a look? This change doesn't speed up loading very much, but it improves other parts. 
    
    2. Tried 2.4, and also tried applying SPARK-6951 to 2.3; this is the critical part in improving the speed.
    
    I will close this PR, as it is no longer needed. Thanks again.



---



[GitHub] spark issue #22444: implement incremental loading and add a flag to load inc...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22444
  
    Can one of the admins verify this patch?


---



[GitHub] spark pull request #22444: [SPARK-25409][Core]Speed up Spark History loading...

Posted by jianjianjiao <gi...@git.apache.org>.
Github user jianjianjiao closed the pull request at:

    https://github.com/apache/spark/pull/22444


---



[GitHub] spark issue #22444: [SPARK-25409][Core]Speed up Spark History loading via in...

Posted by vanzin <gi...@git.apache.org>.
Github user vanzin commented on the issue:

    https://github.com/apache/spark/pull/22444
  
    > so any server restart results in hours of downtime, just from scanning.
    
    Well, that's why 2.3 supports caching things on disk. Also, 2.4 has SPARK-6951 which should make this a lot faster even without disk caching. @jianjianjiao have you tried out 2.4?
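    For reference, the on-disk caching mentioned here is configured in 2.3+ via `spark.history.store.path`; the directory below is just an example value:

```
# Persist application listing and UI data on local disk, so a
# restarted history server does not reparse every event log.
spark.history.store.path /var/spark/shs-cache
```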


---
