Posted to issues@spark.apache.org by "Paul Staab (Jira)" <ji...@apache.org> on 2022/08/20 14:13:00 UTC

[jira] [Created] (SPARK-40155) Optionally use a serialized storage level for DataFrame.localCheckpoint()

Paul Staab created SPARK-40155:
----------------------------------

             Summary: Optionally use a serialized storage level for DataFrame.localCheckpoint()
                 Key: SPARK-40155
                 URL: https://issues.apache.org/jira/browse/SPARK-40155
             Project: Spark
          Issue Type: New Feature
          Components: Spark Core
    Affects Versions: 3.3.0
            Reporter: Paul Staab


In PySpark 3.3.0, `DataFrame.localCheckpoint()` stores the checkpointed RDD using the "Disk Memory *Deserialized* 1x Replicated" storage level. Looking through the Python code and the documentation, I haven't found any way to change this.

Serialized RDDs are often much smaller than deserialized ones (I have seen a 40 GB deserialized RDD shrink to 200 MB when serialized), so I would usually prefer local checkpoints stored in serialized rather than deserialized format.

To make this possible, we could, for example, add an optional `storage_level` argument to `DataFrame.localCheckpoint()` similar to `DataFrame.persist()`, or add a global configuration option similar to `spark.checkpoint.compress`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org