Posted to user@spark.apache.org by grp <gp...@villanova.edu> on 2019/09/16 00:07:19 UTC

Conflicting PySpark Storage Level Defaults?

Hi There Spark Users,

Curious what is going on here.  Not sure if this is a bug or if I am missing something.  Extra eyes are much appreciated.

The Spark UI (Python API, 2.4.3) by default reports persisted DataFrames as deserialized MEMORY_AND_DISK, yet I always thought they were serialized by default for Python, according to the official documentation.
However, when explicitly setting the supposedly default storage level, e.g. df.persist(StorageLevel.MEMORY_AND_DISK), the Spark UI shows the expected serialized DataFrame under the Storage tab, but not when just calling df.cache().

Do we have to explicitly set StorageLevel.MEMORY_AND_DISK to get the serialization benefit in Python (which I thought was automatic)?  Or is the Spark UI incorrect?

SO post with specific example/details => https://stackoverflow.com/questions/56926337/conflicting-pyspark-storage-level-defaults
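
For reference, the Python-side constant itself already describes this level as serialized.  A quick check (a sketch assuming only a plain pyspark 2.4.x shell; the expected output is in the comments):

from pyspark import StorageLevel

# The PySpark constant carries the flags (useDisk, useMemory, useOffHeap, deserialized, replication)
print(repr(StorageLevel.MEMORY_AND_DISK))  # StorageLevel(True, True, False, False, 1)
print(StorageLevel.MEMORY_AND_DISK)        # Disk Memory Serialized 1x Replicated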

Thank you for your time and research!
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: [EXTERNAL] Re: Conflicting PySpark Storage Level Defaults?

Posted by grp <gp...@villanova.edu>.
Running a simple test - here is the Stack Overflow code snippet, using .count() as the action.  You can see the differences between the storage levels.

from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()  # the snippet assumes an active 2.4.3 session
print(spark.version)
# 2.4.3

# id 3 => default storage level for the DataFrame (MEMORY_AND_DISK); unsure why it is not
# reported as serialized, since I am using PySpark
df = spark.range(10)
print(type(df))
df.cache().count()
print(df.storageLevel)

# id 15 => default storage level for the RDD (MEMORY_ONLY); makes sense that it is serialized
rdd = df.rdd
print(type(rdd))
rdd.cache().collect()

# id 19 => manually persisting with MEMORY_AND_DISK, which makes the reported storage level serialized
df2 = spark.range(100)
print(type(df2))
df2.persist(StorageLevel.MEMORY_AND_DISK).count()
print(df2.storageLevel)

<class 'pyspark.sql.dataframe.DataFrame'>   <- type(df)
Disk Memory Deserialized 1x Replicated      <- df.storageLevel after df.cache()
<class 'pyspark.rdd.RDD'>                   <- type(rdd)
<class 'pyspark.sql.dataframe.DataFrame'>   <- type(df2)
Disk Memory Serialized 1x Replicated        <- df2.storageLevel after explicit persist()
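
If it helps, printing the repr of the two reported levels makes the differing deserialized flag explicit (a sketch continuing the session above, so df and df2 are the cached/persisted DataFrames; the expected values follow from the output already shown):

print(repr(df.storageLevel))   # StorageLevel(True, True, False, True, 1)  -> deserialized=True after df.cache()
print(repr(df2.storageLevel))  # StorageLevel(True, True, False, False, 1) -> deserialized=False after explicit persist()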

> On Sep 16, 2019, at 2:02 AM, Jörn Franke <jo...@gmail.com> wrote:
> 
> I don’t know your full source code, but you may be missing an action, so it may not actually be persisted yet.


Re: Conflicting PySpark Storage Level Defaults?

Posted by Jörn Franke <jo...@gmail.com>.
I don’t know your full source code, but you may be missing an action, so it may not actually be persisted yet.
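
A minimal version of that check (a sketch, assuming an active SparkSession named spark): cache() is lazy, so nothing shows up under the Storage tab until an action materializes the data.

df = spark.range(10)
df.cache()              # lazy: the cache is only registered, nothing is stored yet
df.count()              # action: materializes the cached data so it appears in the Storage tab
print(df.storageLevel)  # the level reported once the data is materialized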

