Posted to issues@spark.apache.org by "Raj Raj (Jira)" <ji...@apache.org> on 2021/05/07 13:54:00 UTC
[jira] [Updated] (SPARK-35336) Pyspark - Using importlib + filter + named function + take causes pyspark to restart continuously until machine runs out of memory
[ https://issues.apache.org/jira/browse/SPARK-35336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Raj Raj updated SPARK-35336:
----------------------------
Description:
Repo to reproduce the issue:
[https://github.com/CanaryWharf/pyspark-mem-importlib-bug-reproduction]
Expected behaviour:
The program runs and exits cleanly.
Actual behaviour:
The program runs forever, eating up all the memory on the machine.
Steps to reproduce:
```
pip install -r requirements.txt
python run.py
```
The problem only occurs if you run the code via `importlib`; it does not occur when running `sparky.py` directly (a sketch of the `importlib` entry point follows).
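For illustration, here is a minimal sketch of what the `importlib`-based entry point might look like. The module name `sparky` and the `main` entry function are assumptions based on the description, not verified against the repo:
```
import importlib

# Hypothetical run.py: load the Spark job module dynamically via importlib
# instead of a static `import sparky`, which is the condition that
# reportedly triggers the bug.
sparky = importlib.import_module("sparky")
sparky.main()
```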
The problem also occurs if you replace `filter` with `map` or `flatMap` (anything that takes a function argument).
It only occurs when you pass a named function (i.e., one defined with `def func`), whether directly or wrapped in a lambda.
So these break:
```
def func(stuff):
    return True

dataset.filter(func)
```
```
def func(stuff):
    return True

dataset.filter(lambda s: func(s))
```
The problem does *NOT* occur if you do this:
```
dataset.filter(lambda x: True)
```
```
dataset.filter(lambda x: x == 'stuff')
```
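Putting it together, a self-contained sketch of the failing pattern. This is a hypothetical `sparky.py`; the session setup and sample data are placeholders for illustration, not the repo's actual code:
```
from pyspark.sql import SparkSession

def func(stuff):
    # Named top-level function: the failing case described above.
    return True

def main():
    spark = SparkSession.builder.master("local[*]").getOrCreate()
    dataset = spark.sparkContext.parallelize(["a", "b", "c"])
    # Per the report, .take() causes pyspark to restart continuously when
    # this module is loaded via importlib; an inline lambda such as
    # `lambda x: True` completes fine.
    print(dataset.filter(func).take(1))
    spark.stop()

if __name__ == "__main__":
    main()
```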
> Pyspark - Using importlib + filter + named function + take causes pyspark to restart continuously until machine runs out of memory
> ----------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-35336
> URL: https://issues.apache.org/jira/browse/SPARK-35336
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.0.0, 3.1.1
> Reporter: Raj Raj
> Priority: Major
>