Posted to issues@spark.apache.org by "Raj Raj (Jira)" <ji...@apache.org> on 2021/05/07 13:54:00 UTC
[jira] [Updated] (SPARK-35336) Pyspark - Using importlib + filter + named function + take causes pyspark to restart continuously until machine runs out of memory
[ https://issues.apache.org/jira/browse/SPARK-35336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Raj Raj updated SPARK-35336:
----------------------------
Description:
Repo to reproduce the issue:
[https://github.com/CanaryWharf/pyspark-mem-importlib-bug-reproduction]
Expected behaviour:
The program runs and exits cleanly.
Actual behaviour:
The program runs forever, eating up all the memory on the machine.
Steps to reproduce:
```
pip install -r requirements.txt
python run.py
```
The problem only occurs if you run the code via `importlib`; it does not occur when running `sparky.py` directly (a sketch of the `importlib` entry point follows).
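For illustration, here is a minimal sketch of what the `importlib`-based entry point might look like. The module name `sparky` and the `main` entry function are assumptions based on the description, not verified against the repo:
```
import importlib

# Hypothetical run.py: load the Spark job module dynamically via importlib
# instead of a static `import sparky`, which is the condition that
# reportedly triggers the bug.
sparky = importlib.import_module("sparky")
sparky.main()
```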
The problem also occurs if you replace `filter` with `map` or `flatMap` (anything that takes a function argument).
It only occurs when you pass a named function (i.e., one defined with `def func`), whether directly or wrapped in a lambda.
So these break:
```
def func(stuff):
    return True

dataset.filter(func)
```
```
def func(stuff):
    return True

dataset.filter(lambda s: func(s))
```
The problem does *NOT* occur if you do this:
```
dataset.filter(lambda x: True)
```
```
dataset.filter(lambda x: x == 'stuff')
```
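Putting it together, a self-contained sketch of the failing pattern. This is a hypothetical `sparky.py`; the session setup and sample data are placeholders for illustration, not the repo's actual code:
```
from pyspark.sql import SparkSession

def func(stuff):
    # Named top-level function: the failing case described above.
    return True

def main():
    spark = SparkSession.builder.master("local[*]").getOrCreate()
    dataset = spark.sparkContext.parallelize(["a", "b", "c"])
    # Per the report, .take() causes pyspark to restart continuously when
    # this module is loaded via importlib; an inline lambda such as
    # `lambda x: True` completes fine.
    print(dataset.filter(func).take(1))
    spark.stop()

if __name__ == "__main__":
    main()
```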
> Pyspark - Using importlib + filter + named function + take causes pyspark to restart continuously until machine runs out of memory
> ----------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-35336
> URL: https://issues.apache.org/jira/browse/SPARK-35336
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.0.0, 3.1.1
> Reporter: Raj Raj
> Priority: Major
>