Posted to issues@spark.apache.org by "Dongjoon Hyun (Jira)" <ji...@apache.org> on 2020/03/10 21:13:00 UTC

[jira] [Closed] (SPARK-22783) event log directory(spark-history) filled by large .inprogress files for spark streaming applications

     [ https://issues.apache.org/jira/browse/SPARK-22783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun closed SPARK-22783.
---------------------------------

> event log directory(spark-history) filled by large .inprogress files for spark streaming applications
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22783
>                 URL: https://issues.apache.org/jira/browse/SPARK-22783
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 1.6.0, 2.1.0
>         Environment: Linux(Generic)
>            Reporter: omkar kankalapati
>            Priority: Major
>
> When running long-running streaming applications, HDFS storage fills up with large *.inprogress files in the hdfs://spark-history/ directory.
> For example:
>  hadoop fs -du -h /spark-history
> 234     /spark-history/<Application_1_ID>.inprogress
> 46.6 G  /spark-history/<Application_2_ID>.inprogress
> Instead of continuing to write to a very large (multi-GB) .inprogress file, Spark should rotate the current log file when it reaches a given size (for example, 100 MB) or after a given time interval,
> and perhaps expose a configuration parameter for the size/interval.
> This is also mentioned in SPARK-12140 as a concern.
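
To make the size-based rotation request concrete, here is a minimal Python sketch. It is not Spark's implementation: it writes to local files rather than HDFS, and the class name RollingEventLog, the numbered file naming scheme, and the event payload are all hypothetical choices made for illustration.

```python
import os
import tempfile


class RollingEventLog:
    """Hypothetical append-only log that starts a new file once the
    current one would exceed max_bytes."""

    def __init__(self, directory, app_id, max_bytes):
        self.directory = directory
        self.app_id = app_id
        self.max_bytes = max_bytes
        self.index = 0
        self.written = 0
        self._open_next()

    def _path(self, index):
        # Assumed naming scheme: <app_id>.<index>.inprogress
        return os.path.join(self.directory, f"{self.app_id}.{index}.inprogress")

    def _open_next(self):
        self.index += 1
        self.written = 0
        self.handle = open(self._path(self.index), "ab")

    def _finish_current(self):
        # Close the file and drop the .inprogress suffix, so the finished
        # segment is no longer held open and can be moved or deleted.
        self.handle.close()
        path = self._path(self.index)
        os.rename(path, path[: -len(".inprogress")])

    def write_event(self, line):
        data = line.encode("utf-8") + b"\n"
        # Roll over before writing if this event would exceed the limit.
        if self.written > 0 and self.written + len(data) > self.max_bytes:
            self._finish_current()
            self._open_next()
        self.handle.write(data)
        self.written += len(data)

    def close(self):
        self._finish_current()


def demo():
    """Write ten small events with a 64-byte limit; return (name, size) pairs."""
    with tempfile.TemporaryDirectory() as tmp:
        log = RollingEventLog(tmp, "app_1", max_bytes=64)
        for i in range(10):
            log.write_event(f'{{"Event":"hypothetical","id":{i}}}')
        log.close()
        return sorted(
            (name, os.path.getsize(os.path.join(tmp, name)))
            for name in os.listdir(tmp)
        )


segments = demo()
```

Rolling before a write (rather than after) keeps every finished segment at or below the limit, at the cost of the last segment possibly being small. An interval-based variant would compare a timestamp instead of a byte count at the same point in write_event.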
> It is important to support rotating the log files because users may have a limited HDFS quota, and these large files consume it.
> Users also have no viable workaround:
> 1) The files cannot be moved to another location, because moving a file causes event logging to stop.
> 2) Copying the .inprogress file to another location and then truncating the original fails, because the file is still held open for writing by EventLoggingListener:
> hdfs dfs -truncate -w 0 /spark-history/<application_id>.inprogress
> truncate: Failed to TRUNCATE_FILE /spark-history/<application_id>.inprogress for DFSClient_NONMAPREDUCE_<#ID>on <IP> because this file lease is currently owned by DFSClient_NONMAPREDUCE_<#ID> on <IP>
> The only available workaround is to disable event logging for streaming applications by setting "spark.eventLog.enabled" to false.
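
As a rough illustration of that workaround, the sketch below assembles a spark-submit command line that passes the setting through the --conf flag. The helper name build_submit_command and the application path streaming_app.py are hypothetical; only the property name comes from the report.

```python
import shlex


def build_submit_command(app_path, event_log_enabled):
    """Return a spark-submit argument list with event logging switched on or off."""
    value = "true" if event_log_enabled else "false"
    return ["spark-submit", "--conf", f"spark.eventLog.enabled={value}", app_path]


# streaming_app.py is a placeholder for the real streaming application.
cmd = build_submit_command("streaming_app.py", event_log_enabled=False)
print(shlex.join(cmd))
# prints: spark-submit --conf spark.eventLog.enabled=false streaming_app.py
```

Note that disabling event logging this way means the application leaves no event log behind for the history server to read.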


