Posted to user@spark.apache.org by Andrew Davidson <ae...@ucsc.edu.INVALID> on 2022/01/05 23:26:59 UTC

Newbie pyspark memory mgmt question

Hi

I am running into OOM problems. My cluster should be much bigger than I need, so I wonder if it has to do with the way I am writing my code. Below are three style cases. I wonder if any of them leak memory?

Case 1 :

df1 = spark.read.csv(csv_file)
df1 = df1.someTransform()
df1 = df1.someOtherTransform()
df1.write.csv(csv_out_file)



I assume lazy evaluation: the first action is the write, so this should not leak memory.
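A quick way to confirm that is df1.explain(), which prints the lazy query plan without running a job. A minimal sketch, using select() as a stand-in for the placeholder transforms (csv_file and csv_out_file are illustrative):

df1 = spark.read.csv(csv_file)    # nothing executes yet
df1 = df1.select("a", "b")        # still lazy
df1.explain()                     # prints the plan; does not trigger a job
df1.write.csv(csv_out_file)       # first action: the whole pipeline runs here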



Case 2.

I added actions to make it easier to debug



df1 = spark.read.csv(csv_file)
print( df1.count() )

df1 = df1.someTransform()
print( df1.count() )

df1 = df1.someOtherTransform()
print( df1.count() )

df1.write.csv(csv_out_file)

Does this leak memory?

Case 3.
If you remove the debug actions, you have the original version of my code.

for i, f in enumerate(listOfFiles):
    df1 = spark.read.csv(f)
    df1 = df1.select(["a", "b"])
    print( df1.count() )
    df1.createOrReplaceTempView( "df1" )

    # select list elided
    sqlStmt = 'select ... \n\
        from \n\
            retDF as rc, \n\
            sample \n\
        where \n\
            rc.Name == df1.Name \n'.format("a")

    if i == 0:
        retDF = df1
    else:
        retDF = spark.sql( sqlStmt )

    print( retDF.count() )
    retDF.createOrReplaceTempView( "retDF" )


Does this leak memory? Is there some sort of destroy() or delete() function I should be calling?
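For reference, a minimal sketch of the closest explicit cleanup calls PySpark offers (only relevant if something was actually cached):

df1.unpersist()                        # releases cached blocks, if cache()/persist() was ever called
spark.catalog.dropTempView("df1")      # removes the temp view registration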

I wonder if I would be better off using the DataFrame version of join()?
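For comparison, a minimal sketch of the same loop using the DataFrame join API instead of temp views and SQL (assuming Name is the join key, as in the SQL above, and adding it to the select; in practice the per-file columns would need renaming to avoid duplicate column names):

retDF = None
for f in listOfFiles:
    df1 = spark.read.csv(f).select("Name", "a", "b")
    if retDF is None:
        retDF = df1
    else:
        retDF = retDF.join(df1, on="Name")    # inner join by default
print( retDF.count() )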

Kind regards

Andy


Re: Newbie pyspark memory mgmt question

Posted by Andrew Davidson <ae...@ucsc.edu.INVALID>.
Thanks Sean

Andy

From: Sean Owen <sr...@gmail.com>
Date: Wednesday, January 5, 2022 at 3:38 PM
To: Andrew Davidson <ae...@ucsc.edu>, Nicholas Gustafson <nj...@gmail.com>
Cc: "user @spark" <us...@spark.apache.org>
Subject: Re: Newbie pyspark memory mgmt question

There is no memory leak, no. You can .cache() or .persist() DataFrames, and that can use memory until you .unpersist(), but you're not doing that and they are garbage collected anyway.
Hard to say what's running out of memory without knowing more about your data size, partitions, cluster size, etc



Re: Newbie pyspark memory mgmt question

Posted by Sean Owen <sr...@gmail.com>.
There is no memory leak, no. You can .cache() or .persist() DataFrames, and
that can use memory until you .unpersist(), but you're not doing that and
they are garbage collected anyway.
Hard to say what's running out of memory without knowing more about your
data size, partitions, cluster size, etc
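
For reference, a minimal sketch of the cache/unpersist pattern described above (path and names are illustrative):

df = spark.read.csv("input.csv", header=True)
df = df.cache()         # marks df for caching; memory is used once an action runs
print(df.count())       # first action: computes df and fills the cache
print(df.count())       # served from the cache
df.unpersist()          # releases the cached blocks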
