Posted to issues@spark.apache.org by "Paul Staab (Jira)" <ji...@apache.org> on 2022/08/20 14:13:00 UTC
[jira] [Created] (SPARK-40155) Optionally use a serialized storage level for DataFrame.localCheckpoint()
Paul Staab created SPARK-40155:
----------------------------------
Summary: Optionally use a serialized storage level for DataFrame.localCheckpoint()
Key: SPARK-40155
URL: https://issues.apache.org/jira/browse/SPARK-40155
Project: Spark
Issue Type: New Feature
Components: Spark Core
Affects Versions: 3.3.0
Reporter: Paul Staab
In PySpark 3.3.0, `DataFrame.localCheckpoint()` stores the RDD checkpoints using the "Disk Memory *Deserialized* 1x Replicated" storage level. Looking through the Python code and the documentation, I haven't found any way to change this.
As serialized RDDs are often much smaller than deserialized ones - I have seen examples where a 40GB deserialized RDD shrank to 200MB when serialized - I would usually prefer local checkpoints that are stored in serialized rather than deserialized form.
To make this possible, we could, for example, add an optional `storage_level` argument to `DataFrame.localCheckpoint()` similar to `DataFrame.persist()`, or add a global configuration option similar to `spark.checkpoint.compress`.
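A rough sketch of what the first option might look like - note that the `storage_level` argument is hypothetical and does not exist in 3.3.0; the name and semantics simply mirror `DataFrame.persist()`:

```python
from pyspark import StorageLevel

# Hypothetical API sketch -- the storage_level argument is the proposed
# addition and is NOT available in PySpark 3.3.0. PySpark's
# StorageLevel.MEMORY_AND_DISK has deserialized=False, so the checkpoint
# data would be stored in serialized form.
df = df.localCheckpoint(eager=True, storage_level=StorageLevel.MEMORY_AND_DISK)
```

The second option would instead be a session-wide configuration flag (name hypothetical), analogous to how `spark.checkpoint.compress` applies to reliable checkpoints, e.g. `spark.conf.set("spark.checkpoint.localStorageLevel", "MEMORY_AND_DISK")`.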
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org