Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2021/11/06 07:39:00 UTC

[jira] [Assigned] (SPARK-37224) Optimize write path on RocksDB state store provider

     [ https://issues.apache.org/jira/browse/SPARK-37224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37224:
------------------------------------

    Assignee: Apache Spark

> Optimize write path on RocksDB state store provider
> ---------------------------------------------------
>
>                 Key: SPARK-37224
>                 URL: https://issues.apache.org/jira/browse/SPARK-37224
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 3.3.0
>            Reporter: Jungtaek Lim
>            Assignee: Apache Spark
>            Priority: Major
>
> We found that the RocksDB class performs an additional lookup on the key for every write operation (put/delete) in order to track the number of rows. The lookup is required to populate the state store's row-count metric, but benchmarking shows that the performance hit is significant.
> We could not find a good alternative that retains the number of rows without the additional lookup, so we propose a new config that controls whether the number of rows is tracked.
>  * *config name: spark.sql.streaming.stateStore.rocksdb.trackTotalNumberOfRows*
>  * *default value: true* (since we already serve the number and we want to avoid a breaking change)
> *We will give "0" for the number of keys in the state store metric when the config is turned off.* The ideal value would be a negative one, but SQL metrics currently do not allow negative values, apparently for technical/historical reasons.
> *We will also handle the case where the config is flipped during restart.* This lets end users enjoy the performance benefit without losing the ability to find out the number of state rows: they can turn the flag off to maximize performance, and turn it on (restart required) when they want to see the actual number of keys (for observability, debugging, etc.).
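
The trade-off described above can be illustrated with a minimal sketch. It is not Spark's RocksDB class: a HashMap stands in for the store, and the class name CountingStoreSketch and its methods are hypothetical names chosen for this example. It only shows why counting rows forces a read before each write, and how a flag could skip that read while reporting 0.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for a key-value store; not Spark's RocksDB class. */
final class CountingStoreSketch {
    private final Map<String, String> db = new HashMap<>();
    private final boolean trackTotalNumberOfRows;
    private long numKeys = 0;       // maintained only when tracking is enabled
    private long extraLookups = 0;  // reads issued purely to keep numKeys accurate

    CountingStoreSketch(boolean trackTotalNumberOfRows) {
        this.trackTotalNumberOfRows = trackTotalNumberOfRows;
    }

    void put(String key, String value) {
        if (trackTotalNumberOfRows) {
            // The write alone cannot tell an insert from an update,
            // so a lookup is needed to decide whether the count grows.
            extraLookups++;
            if (!db.containsKey(key)) {
                numKeys++;
            }
        }
        db.put(key, value);
    }

    void remove(String key) {
        if (trackTotalNumberOfRows) {
            // Likewise, a delete only shrinks the count if the key existed.
            extraLookups++;
            if (db.containsKey(key)) {
                numKeys--;
            }
        }
        db.remove(key);
    }

    /** Reports 0 when tracking is off, mirroring the behavior proposed in the issue. */
    long reportedNumKeys() {
        return trackTotalNumberOfRows ? numKeys : 0;
    }

    long extraLookups() {
        return extraLookups;
    }

    public static void main(String[] args) {
        for (boolean track : new boolean[] {true, false}) {
            CountingStoreSketch store = new CountingStoreSketch(track);
            store.put("a", "1");   // insert
            store.put("a", "2");   // update, count unchanged
            store.put("b", "3");   // insert
            store.remove("a");     // delete of an existing key
            System.out.println("track=" + track
                + " numKeys=" + store.reportedNumKeys()
                + " extraLookups=" + store.extraLookups());
        }
        // prints:
        // track=true numKeys=1 extraLookups=4
        // track=false numKeys=0 extraLookups=0
    }
}
```

With tracking on, every write pays for one extra read; with tracking off, the writes go straight through and the reported count is simply 0.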

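For reference, a configuration fragment in spark-defaults.conf style that would turn the proposed flag off might look like the following. This is a sketch that assumes the config lands under the name proposed in the issue and that the query already uses the RocksDB state store provider; per the description, changing the flag takes effect only after the query is restarted.

```
spark.sql.streaming.stateStore.providerClass                    org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider
spark.sql.streaming.stateStore.rocksdb.trackTotalNumberOfRows   false
```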