You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Stavros Kontopoulos (JIRA)" <ji...@apache.org> on 2019/04/15 19:41:00 UTC

[jira] [Commented] (SPARK-24717) Split out min retain version of state for memory in HDFSBackedStateStoreProvider

    [ https://issues.apache.org/jira/browse/SPARK-24717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16818326#comment-16818326 ] 

Stavros Kontopoulos commented on SPARK-24717:
---------------------------------------------

[~tdas] What is the point of having `spark.sql.streaming.minBatchesToRetain` set to 100 by default? 

Wouldnt that create problems with large states when it comes to external storage?

 

> Split out min retain version of state for memory in HDFSBackedStateStoreProvider
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-24717
>                 URL: https://issues.apache.org/jira/browse/SPARK-24717
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.4.0
>            Reporter: Jungtaek Lim
>            Assignee: Jungtaek Lim
>            Priority: Major
>             Fix For: 2.4.0
>
>
> HDFSBackedStateStoreProvider has only one configuration for minimum versions to retain of state which applies to both memory cache and files. As default version of "spark.sql.streaming.minBatchesToRetain" is set to high (100), which doesn't require strictly 100x of memory, but I'm seeing 10x ~ 80x of memory consumption for various workloads. In addition, in some cases, requiring 2x of memory is even unacceptable, so we should split out configuration for memory and let users adjust to trade-off memory usage vs cache miss.
> In normal case, default value '2' would cover both cases: success and restoring failure with less than or around 2x of memory usage, and '1' would only cover success case but no longer require more than 1x of memory. In extreme case, user can set the value to '0' to completely disable the map cache to maximize executor memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org