Posted to issues@spark.apache.org by "Dmitry Buzolin (JIRA)" <ji...@apache.org> on 2016/12/05 14:46:58 UTC

[jira] [Comment Edited] (SPARK-18085) Better History Server scalability for many / large applications

    [ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15722425#comment-15722425 ] 

Dmitry Buzolin edited comment on SPARK-18085 at 12/5/16 2:45 PM:
-----------------------------------------------------------------

I would like to add my observations after working with the Spark History Server (SHS):

1. The JSON format used to store logs is inefficient and redundant: about 70% of the information in the logs is repeated key names. Relying on JSON is a dead end for a distributed architecture such as Spark (compression may alleviate this to some extent), and it would be great if this changed to normal OS-style logging or to storing logs in a database.

2. The amount of logging in Spark is directly proportional to the number of tasks. I've seen 50+ GB log files sitting in HDFS. The design has to be more intelligent and not produce such logs, as they slow down the UI, degrade the performance of the REST API, and can occupy a lot of space in HDFS.

3. The Spark REST API should be consistent with regard to log availability and the information it conveys. Just two examples:
- Many times, when a Spark application finishes and both YARN and Spark report it as completed via calls to the top-level endpoint, the log file is still not available via the Spark REST API, which returns a "no such app" message when one queries executor or job details. This leaves one guessing and waiting before querying the status of the application.
- While a Spark app is running, one can clearly see vCores and allocatedMemory for the running application. However, once the application completes, these parameters are reset to -1. Why? Perhaps to indicate that the application is no longer running and occupying any cluster resources. But there are already flags telling us this: "state" and "finalStatus". So why make it more difficult to find out how many resources were used by apps that have already completed?
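
To make point 1 easier to check against a real event log, here is a rough Python sketch that measures how much of a single JSON log line is spent on key names. The function name key_overhead and the sample line are made up for illustration; the sample is a hand-written, heavily shortened event, not copied from a real log, so the number it prints says nothing about the 70% figure above.

```python
import json


def key_overhead(line: str) -> float:
    """Return the fraction of characters in one JSON line spent on key names."""
    key_chars = 0

    def walk(node):
        nonlocal key_chars
        if isinstance(node, dict):
            for key, value in node.items():
                # Count the key as it is serialized, quotes included.
                key_chars += len(json.dumps(key))
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)

    walk(json.loads(line))
    return key_chars / len(line)


# Hand-written sample for illustration only (assumption, not real log output).
sample = (
    '{"Event":"SparkListenerTaskEnd","Stage ID":0,'
    '"Task Info":{"Task ID":7,"Executor ID":"1","Launch Time":1480949218000}}'
)
print(f"{key_overhead(sample):.0%} of this line is key names")
```

Running the same function over every line of an actual event log file and averaging the result would give a figure that can be compared with the estimate in point 1.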

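The "guessing and waiting" in the first example of point 3 currently has to be handled on the client side. A minimal sketch of such a workaround is below, assuming the History Server exposes /api/v1/applications/<app-id>/jobs and answers with a non-200 status while the log is not yet loaded. The names wait_for_app and http_status, the retry count, and the delay are all hypothetical choices, not anything Spark provides.

```python
import time
import urllib.error
import urllib.request


def http_status(url: str) -> int:
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Error responses such as 404 arrive as exceptions; report their code.
        return err.code


def wait_for_app(base_url, app_id, fetch_status=http_status,
                 attempts=30, delay_s=10.0, sleep=time.sleep) -> bool:
    """Poll the jobs endpoint until it answers 200 or attempts run out.

    Connection failures (urllib.error.URLError) are not caught here and
    propagate to the caller.
    """
    url = f"{base_url}/api/v1/applications/{app_id}/jobs"
    for attempt in range(attempts):
        if fetch_status(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # back off before asking again
    return False
```

A call would look like wait_for_app("http://historyserver:18080", "<app-id>"), where the host name is a placeholder. The fetch_status and sleep parameters exist only so the loop can be exercised without a live server.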


> Better History Server scalability for many / large applications
> ---------------------------------------------------------------
>
>                 Key: SPARK-18085
>                 URL: https://issues.apache.org/jira/browse/SPARK-18085
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Spark Core, Web UI
>    Affects Versions: 2.0.0
>            Reporter: Marcelo Vanzin
>         Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. I'll be attaching a document shortly describing the issues and suggesting a path to solving them.


