Posted to user@spark.apache.org by Andrew Davidson <ae...@ucsc.edu.INVALID> on 2021/12/26 18:43:58 UTC
PySpark garbage collection and cache management best practices
Hi
Below is typical pseudocode I find myself writing over and over again. There is only a single action at the very end of the program. The early narrow transformations potentially hold on to a lot of needless data. I have a for loop of joins (i.e. wide transformations), followed by a bunch more narrow transformations. Will setting my lists of DataFrames to None improve performance?
What are the best practices?
Kind regards
Andy
def run():
    # spark is an existing SparkSession; listOfFiles is a list of input paths
    listOfDF = []
    for filePath in listOfFiles:
        df = spark.read.load(filePath, ...)
        listOfDF.append(df)

    list2OfDF = []
    for df in listOfDF:
        df2 = df.select(...)
        list2OfDF.append(df2)

    # will setting the list to None free cache?
    # or just driver memory?
    listOfDF = None

    df3 = list2OfDF[0]
    for i in range(1, len(list2OfDF)):
        df = list2OfDF[i]
        df3 = df3.join(df, ...)

    # will setting the list to None free cache?
    # or just driver memory?
    list2OfDF = None

    # ... lots of narrow transformations on df3 ...
    return df3

def main():
    df = run()
    df.write.save(...)
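
If the answer is explicit cache management rather than just dropping references, below is a minimal sketch of what I imagine that would look like, using DataFrame.cache() and unpersist(). The parquet format, the "id"/"value" column names, and the runWithExplicitCache/outputPath names are placeholders I made up for illustration:

def runWithExplicitCache(spark, listOfFiles, outputPath):
    # build one DataFrame per file, renaming "value" so the join
    # result has no duplicate column names
    joined = None
    for i, filePath in enumerate(listOfFiles):
        df = (spark.read.parquet(filePath)
                   .select("id", "value")
                   .withColumnRenamed("value", "value_{}".format(i)))
        joined = df if joined is None else joined.join(df, "id")

    # cache() only marks the DataFrame; the first action that touches
    # it materializes and stores the blocks on the executors
    joined.cache()

    # ... narrow transformations on joined would go here ...

    joined.write.mode("overwrite").parquet(outputPath)

    # unpersist() releases the cached blocks deterministically, rather
    # than waiting for the driver-side Python reference to be GC'd
    joined.unpersist()

Even then I am not sure cache() buys anything when there is only the single write action at the end, which is really the heart of my question.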