Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2020/12/18 13:08:00 UTC

[jira] [Commented] (SPARK-33841) Jobs disappear intermittently from the SHS under high load

    [ https://issues.apache.org/jira/browse/SPARK-33841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17251750#comment-17251750 ] 

Apache Spark commented on SPARK-33841:
--------------------------------------

User 'vladhlinsky' has created a pull request for this issue:
https://github.com/apache/spark/pull/30842

> Jobs disappear intermittently from the SHS under high load
> ----------------------------------------------------------
>
>                 Key: SPARK-33841
>                 URL: https://issues.apache.org/jira/browse/SPARK-33841
>             Project: Spark
>          Issue Type: Task
>          Components: Spark Core
>    Affects Versions: 3.0.0, 3.0.1
>         Environment: SHS is running locally on Ubuntu 19.04
>  
>            Reporter: Vladislav Glinskiy
>            Priority: Major
>
> Ran into an issue where a particular job was displayed in the SHS, disappeared after some time, and then showed up again several minutes later.
> The issue is caused by SPARK-29043, which was intended to improve the concurrent performance of the History Server. The [change|https://github.com/apache/spark/pull/25797/files#] breaks the ["app deletion" logic|https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R563] because it lacks proper synchronization for {{processing}} event log entries. Since the SHS now [filters out|https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R462] all {{processing}} event log entries, such entries never get [updated with the new {{lastProcessed}}|https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R472] time during that scan. As a result, any entry that completes processing right after the [filtering|https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R462] and before [the check for stale entries|https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R560] is identified as stale and is deleted from the UI until the next {{checkForLogs}} run. This happens because the [updated {{lastProcessed}} time is used as the staleness criterion|https://github.com/apache/spark/pull/25797/files#diff-128a6af0d78f4a6180774faedb335d6168dfc4defff58f5aa3021fc1bd767bc0R557], and event log entries that were not updated with the new time match it (see the sketch after the description below).
> The issue can be reproduced by generating a large number of event logs and uploading them to the SHS event log directory on S3. Around 800 copies (82.6 MB in total) of an event log file were created using the [shs-monitor|https://github.com/vladhlinsky/shs-monitor] script. The SHS then showed strange behavior when counting the total number of applications: at first the number increased as expected, but on the next page refresh the total decreased. No errors were logged by the SHS.
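
The following is a minimal, self-contained Scala sketch of the race described above. It is NOT Spark's actual FsHistoryProvider code; the names (LogInfo, processing, listing, checkForLogs, raceWindow) only loosely mirror the real fields and methods. It shows how an entry that finishes processing between the filter and the stale-entry check can be dropped from the listing even though its event log is valid.

    // Sketch only: simplified model of the checkForLogs() race after SPARK-29043.
    import java.util.concurrent.ConcurrentHashMap
    import scala.collection.JavaConverters._

    object StaleEntryRaceSketch {

      // Simplified stand-in for the listing entry kept by the History Server.
      case class LogInfo(logPath: String, lastProcessed: Long)

      // Paths whose event logs are currently being replayed by background tasks.
      private val processing = ConcurrentHashMap.newKeySet[String]()

      // Simplified listing "database": path -> LogInfo.
      private val listing = new ConcurrentHashMap[String, LogInfo]()

      private def isProcessing(path: String): Boolean = processing.contains(path)

      // Roughly the shape of the scan after SPARK-29043. `raceWindow` stands in for
      // the time that passes between the filter and the stale-entry check.
      def checkForLogs(scannedPaths: Seq[String], newLastScanTime: Long,
                       raceWindow: () => Unit = () => ()): Unit = {
        // 1. Entries still being processed are filtered out, so their
        //    lastProcessed time is NOT refreshed during this scan.
        scannedPaths.filterNot(isProcessing).foreach { p =>
          listing.put(p, LogInfo(p, lastProcessed = newLastScanTime))
        }

        raceWindow() // 2. a background task may finish processing a log right here

        // 3. Stale check: entries whose lastProcessed is older than this scan and
        //    that are no longer marked as processing are dropped from the listing
        //    until the next run refreshes them.
        listing.values().asScala
          .filter(info => info.lastProcessed < newLastScanTime && !isProcessing(info.logPath))
          .foreach(info => listing.remove(info.logPath))
      }

      def main(args: Array[String]): Unit = {
        val path = "eventlog-app-1"
        listing.put(path, LogInfo(path, lastProcessed = 1L))
        processing.add(path) // replay of this log is still in flight when the scan starts

        // Processing completes inside the race window: the entry was already skipped
        // by the filter, so it is wrongly classified as stale and removed.
        checkForLogs(Seq(path), newLastScanTime = 2L,
          raceWindow = () => processing.remove(path))

        println(s"still listed: ${listing.containsKey(path)}") // prints "still listed: false"
      }
    }

Running the sketch prints "still listed: false": the entry is removed from the listing during the scan and only reappears once a later checkForLogs() run refreshes its lastProcessed time, which matches the intermittent disappear/reappear behavior observed in the UI.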



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org