Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2016/11/14 11:15:58 UTC

[jira] [Updated] (SPARK-18010) Remove unneeded heavy work performed by FsHistoryProvider for building up the application listing UI page

     [ https://issues.apache.org/jira/browse/SPARK-18010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-18010:
------------------------------
    Fix Version/s: 2.0.3

> Remove unneeded heavy work performed by FsHistoryProvider for building up the application listing UI page
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18010
>                 URL: https://issues.apache.org/jira/browse/SPARK-18010
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, Web UI
>    Affects Versions: 1.6.2, 2.0.1, 2.1.0
>            Reporter: Vinayak Joshi
>            Assignee: Vinayak Joshi
>             Fix For: 2.0.3, 2.1.0
>
>
> There are known complaints about the History Server's application list not updating quickly enough when the event log files that need replay are huge. Currently, the FsHistoryProvider design causes the entire event log file to be replayed when building the initial application listing (see the method mergeApplicationListing(fileStatus: FileStatus)). The replay process involves:
>  - reading each line in the event log as a string,
>  - parsing the string into a JSON structure,
>  - converting the JSON to the corresponding Scala classes with nested structures.
> The parsing of each string to JSON and then to Scala classes is particularly expensive. Tests show that the majority of the time spent in replay goes to this work.
> When the replay is performed for building the application listing, the only two events that the code really cares about are "SparkListenerApplicationStart" and "SparkListenerApplicationEnd", since the only listener attached to the ReplayListenerBus at that point is the ApplicationEventListener. This means that when processing an event log file with a huge number of events (hundreds of thousands, possibly more), the work done to deserialize all of these events and then replay them is not needed. We are interested in only two events, so when replay is performed for the purpose of building the application list, we can make the effort to replay just these two events and skip the others.
> My tests show that this drastically improves application list load time. For a 150MB event log from a user, with over 100,000 events, the load time (measured locally on my Mac) comes down from about 16 seconds to under 1 second using this approach. For customers that typically execute applications with large event logs, and thus have multiple large event logs present, this can considerably speed up how soon the history server UI lists the apps.
> I will follow up with a pull request that attempts to fix this.
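
The filtering idea in the description above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the FsHistoryProvider or ReplayListenerBus code. It assumes that each event log line is a JSON object whose "Event" field holds the listener event name. The function names and the sample lines are hypothetical. The point is only that a cheap substring test on the raw line lets the expensive parse be skipped for every event the listing does not need.

```python
import json

# Assumption: the listing only needs these two event types, as the issue states.
WANTED_EVENTS = ("SparkListenerApplicationStart", "SparkListenerApplicationEnd")

# Hypothetical sample lines; a real event log carries many more fields.
SAMPLE_LOG = [
    '{"Event":"SparkListenerApplicationStart","App Name":"demo"}',
    '{"Event":"SparkListenerTaskEnd","Task Info":{"Task ID":0}}',
    '{"Event":"SparkListenerOther","Note":"SparkListenerApplicationEnd"}',
    '{"Event":"SparkListenerApplicationEnd","Timestamp":2}',
]


def wanted(line):
    """Cheap substring test on the raw line, done before any JSON parsing."""
    return any(name in line for name in WANTED_EVENTS)


def replay_for_listing(lines):
    """Parse only the lines that pass the filter and return the matching events."""
    events = []
    for line in lines:
        if not wanted(line):
            continue  # skip the expensive parse for all other events
        event = json.loads(line)
        # The substring test can match text inside another event's payload,
        # so the event name is confirmed after parsing.
        if event.get("Event") in WANTED_EVENTS:
            events.append(event)
    return events
```

Calling replay_for_listing(SAMPLE_LOG) parses three of the four sample lines and keeps two events. The third line passes the substring test only because the name appears in its payload, which is why the event name is checked again after parsing.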


