Posted to user@spark.apache.org by Paul Wais <pa...@gmail.com> on 2019/10/20 10:18:20 UTC

pyspark - memory leak leading to OOM after submitting 100 jobs?

Dear List,

I've observed some sort of memory leak when using pyspark to run ~100
jobs in local mode.  Each job is essentially a create RDD -> create DF
-> write DF sort of flow.  The RDD and DFs go out of scope after each
job completes, hence I call this issue a "memory leak."  Here's
pseudocode:

```
from pyspark.sql import SparkSession

# Local-mode session, as described below.
spark = SparkSession.builder.master('local[*]').getOrCreate()

# Build the ~100 small RDDs up front (in the real app, rows are ~1MB each).
row_rdds = []
for i in range(100):
  row_rdd = spark.sparkContext.parallelize([{'a': i} for i in range(1000)])
  row_rdds.append(row_rdd)

# One Spark "job" per iteration: RDD -> DataFrame -> write.
for row_rdd in row_rdds:
  df = spark.createDataFrame(row_rdd)
  df.persist()
  print(df.count())
  df.write.save(...)  # Save parquet (output path elided)
  df.unpersist()

  # Does not help:
  # del df
  # del row_rdd
```

In my real application:
 * rows are much larger, perhaps 1MB each
 * row_rdds are sized to fit available RAM

I observe that after 100 or so iterations of the second loop (each of
which creates a "job" in the Spark WebUI), the following happens:
 * pyspark workers have fairly stable resident and virtual RAM usage
 * java process eventually approaches resident RAM cap (8GB standard)
but virtual RAM usage keeps ballooning.

Eventually the machine runs out of RAM and the linux OOM killer kills
the java process, resulting in an "IndexError: pop from an empty
deque" error from py4j/java_gateway.py .


Does anybody have any ideas about what's going on?  Note that this is
local mode.  I have personally run standalone masters and submitted a
ton of jobs and never seen something like this over time.  Those were
very different jobs, but perhaps this issue is specific to local mode?

Emphasis: I did try to del the pyspark objects and run python GC.
That didn't help at all.

pyspark 2.4.4 on java 1.8 on ubuntu bionic (tensorflow docker image)

12-core i7 with 16GB of ram and 22GB swap file (swap is *on*).

Cheers,
-Paul



Re: pyspark - memory leak leading to OOM after submitting 100 jobs?

Posted by Paul Wais <pa...@gmail.com>.
Well, dumb question:

Given the workflow outlined above, should Local Mode keep running?  Or
is the leak a known issue?  I just wanted to check because I can't
recall seeing this issue with a non-local master, though it's possible
there were task failures that hid the issue.

If this issue looks new, what's the easiest way to record memory dumps
or do profiling?  Can I put something in my spark-defaults.conf ?
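
I'm guessing something like the following in spark-defaults.conf might at
least capture some of that (a sketch only: it assumes a HotSpot JVM and
that /tmp/spark-heap-dumps exists; in local mode the driver options cover
the single java process):

```
# sketch: dump the heap if the JVM itself throws OOM, log GC activity,
# and enable native-memory tracking (relevant since virtual RAM balloons)
spark.driver.extraJavaOptions  -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/spark-heap-dumps -verbose:gc -XX:NativeMemoryTracking=summary
```

If I understand right, the native-memory numbers can then be read back
with jcmd's VM.native_memory command.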

The code is open source and the run is reproducible, although the
specific test currently requires a rather large (but public) dataset.

On Sun, Oct 20, 2019 at 6:24 PM Jungtaek Lim
<ka...@gmail.com> wrote:
>
> Honestly, I'd recommend you spend your time looking into the issue
> yourself, e.g. by taking a memory dump at some interval and comparing the
> differences (and at least share these dump files with the community,
> redacted if necessary). Otherwise someone has to try to reproduce it
> without a reproducer, and may fail to reproduce it even after spending
> their time. A memory leak is not easy to reproduce unless it leaks
> objects unconditionally.
>
> - Jungtaek Lim (HeartSaVioR)



Re: pyspark - memory leak leading to OOM after submitting 100 jobs?

Posted by Jungtaek Lim <ka...@gmail.com>.
Honestly, I'd recommend you spend your time looking into the issue
yourself, e.g. by taking a memory dump at some interval and comparing the
differences (and at least share these dump files with the community,
redacted if necessary). Otherwise someone has to try to reproduce it
without a reproducer, and may fail to reproduce it even after spending
their time. A memory leak is not easy to reproduce unless it leaks
objects unconditionally.
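
For example, a rough sketch of what periodic dumps could look like (it
assumes a JDK's jmap is on the PATH, and `spark_jvm_pid` below is a
placeholder for the actual Spark java process id); successive dumps can
then be diffed in a heap analyzer such as Eclipse MAT:

```
import subprocess, time

spark_jvm_pid = 12345  # placeholder: the Spark driver's java process id

while True:
    dump_path = '/tmp/spark-heap-%d.hprof' % int(time.time())
    # jmap ships with the JDK; 'live' forces a GC before dumping.
    subprocess.check_call(
        ['jmap', '-dump:live,format=b,file=' + dump_path, str(spark_jvm_pid)])
    time.sleep(600)  # every 10 minutes
```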

- Jungtaek Lim (HeartSaVioR)


Re: pyspark - memory leak leading to OOM after submitting 100 jobs?

Posted by Holden Karau <ho...@pigscanfly.ca>.
On Thu, Oct 31, 2019 at 10:04 PM Nicolas Paris <ni...@riseup.net>
wrote:

> Have you deactivated the spark.ui?
> I have read several threads explaining that the UI can lead to OOM because
> it stores 1000 DAGs by default.
>
>
> On Sun, Oct 20, 2019 at 03:18:20AM -0700, Paul Wais wrote:
> > Dear List,
> >
> > I've observed some sort of memory leak when using pyspark to run ~100
> > jobs in local mode.  Each job is essentially a create RDD -> create DF
> > -> write DF sort of flow.  The RDD and DFs go out of scope after each
> > job completes, hence I call this issue a "memory leak."  Here's
> > pseudocode:
> >
> > ```
> > row_rdds = []
> > for i in range(100):
> >   row_rdd = spark.sparkContext.parallelize([{'a': i} for i in range(1000)])
> >   row_rdds.append(row_rdd)
> >
> > for row_rdd in row_rdds:
> >   df = spark.createDataFrame(row_rdd)
> >   df.persist()
> >   print(df.count())
> >   df.write.save(...) # Save parquet
> >   df.unpersist()
> >
> >   # Does not help:
> >   # del df
> >   # del row_rdd
> > ```

The connection between Python GC/del and JVM GC is perhaps a bit weaker
than we might like. There certainly could be a problem here, but it still
shouldn’t be getting to the OOM state.
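
One way to at least rule out lazy cleanup on either side (a sketch only;
note that `_jvm` is pyspark's internal py4j handle, not a stable API)
would be to force both collectors between jobs:

```
import gc

# Inside the per-job loop from the pseudocode above:
del df, row_rdd                       # drop the Python-side references
gc.collect()                          # let py4j release its Java proxies
spark.sparkContext._jvm.System.gc()   # request a JVM collection (internal API)
```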

>
> >
> > In my real application:
> >  * rows are much larger, perhaps 1MB each
> >  * row_rdds are sized to fit available RAM
> >
> > I observe that after 100 or so iterations of the second loop (each of
> > which creates a "job" in the Spark WebUI), the following happens:
> >  * pyspark workers have fairly stable resident and virtual RAM usage
> >  * java process eventually approaches resident RAM cap (8GB standard)
> > but virtual RAM usage keeps ballooning.
> >

Can you share what flags the JVM is launching with? Also which JVM(s) are
ballooning?
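
They can be pulled out of the running session itself, e.g. (a sketch;
again `_jvm` is pyspark's internal py4j handle):

```
# Print the arguments the local-mode JVM was actually launched with,
# plus whatever extra driver options Spark thinks are set.
mx = spark.sparkContext._jvm.java.lang.management.ManagementFactory
print(mx.getRuntimeMXBean().getInputArguments())
print(spark.sparkContext.getConf().get('spark.driver.extraJavaOptions'))
```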

>
> > Eventually the machine runs out of RAM and the linux OOM killer kills
> > the java process, resulting in an "IndexError: pop from an empty
> > deque" error from py4j/java_gateway.py .
> >
> >
> > Does anybody have any ideas about what's going on?  Note that this is
> > local mode.  I have personally run standalone masters and submitted a
> > ton of jobs and never seen something like this over time.  Those were
> > very different jobs, but perhaps this issue is bespoke to local mode?
> >
> > Emphasis: I did try to del the pyspark objects and run python GC.
> > That didn't help at all.
> >
> > pyspark 2.4.4 on java 1.8 on ubuntu bionic (tensorflow docker image)
> >
> > 12-core i7 with 16GB of ram and 22GB swap file (swap is *on*).
> >
> > Cheers,
> > -Paul
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> >
>
> --
> nicolas
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau

Re: pyspark - memory leak leading to OOM after submitting 100 jobs?

Posted by Nicolas Paris <ni...@riseup.net>.
Have you deactivated the spark.ui?
I have read several threads explaining that the UI can lead to OOM because
it stores 1000 DAGs by default.
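
If the UI turns out to be the culprit, a sketch of what could be tuned
when building the session (these are standard Spark properties; the
defaults I mention come from the Spark configuration docs):

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master('local[*]')
    .config('spark.ui.enabled', 'false')   # turn the web UI off entirely
    # ...or keep the UI but retain less history (each defaults to 1000):
    # .config('spark.ui.retainedJobs', '50')
    # .config('spark.ui.retainedStages', '100')
    # .config('spark.sql.ui.retainedExecutions', '50')
    .getOrCreate()
)
```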



-- 
nicolas
