Posted to issues@spark.apache.org by "Ivan Sadikov (Jira)" <ji...@apache.org> on 2022/05/02 05:45:00 UTC

[jira] [Updated] (SPARK-39084) df.rdd.isEmpty() results in unexpected executor failure and JVM crash

     [ https://issues.apache.org/jira/browse/SPARK-39084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Sadikov updated SPARK-39084:
---------------------------------
    Description: 
It was discovered that a particular data distribution in a DataFrame with a groupBy clause could result in a JVM crash when calling {{df.rdd.isEmpty()}}.

For example,
{code:python}
import uuid

# Build a skewed key distribution: one key with 99 rows, 9 keys with 75 rows,
# and progressively more keys with fewer rows each (down to 10 rows per key).
data = []
for t in range(0, 10000):
    id = str(uuid.uuid4())
    if t == 0:
        for i in range(0, 99):
            data.append((id,))
    elif t < 10:
        for i in range(0, 75):
            data.append((id,))
    elif t < 100:
        for i in range(0, 50):
            data.append((id,))
    elif t < 1000:
        for i in range(0, 25):
            data.append((id,))
    else:
        for i in range(0, 10):
            data.append((id,))

# self.spark and tmpPath come from the PySpark test harness
# (an active SparkSession and a temporary output directory).
df = self.spark.createDataFrame(data, ["col"])
df.coalesce(1).write.parquet(tmpPath)

res = self.spark.read.parquet(tmpPath).groupBy("col").count()
print(res.rdd.isEmpty())  # crashes JVM{code}
The crash is 100% reproducible on this dataset.
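
Why {{isEmpty}} in particular is affected: in PySpark, {{RDD.isEmpty}} only pulls the first row and then abandons the rest of the iterator, so the task completes while the Java-side iterator over the aggregated data is still only partially consumed. A rough sketch of the relevant logic (paraphrased from {{pyspark/rdd.py}}; the exact code may differ between versions):
{code:python}
# Paraphrased sketch of RDD.isEmpty(); not the verbatim PySpark source.
def isEmpty(self):
    # Only a single row is requested. The rest of the iterator over the
    # grouped data is abandoned, which is what leaves the Python and Java
    # iterators out of sync when the task is marked as completed.
    return self.getNumPartitions() == 0 or len(self.take(1)) == 0
{code}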

This ticket is related to (and can be thought of as a follow-up to) https://issues.apache.org/jira/browse/SPARK-33277. We need to patch one more place to make sure the Python iterator stays in sync with the Java iterator and is terminated whenever the task is marked as completed.
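
For illustration only: the actual change is on the Scala side, but conceptually the missing guard is an iterator wrapper of the kind SPARK-33277 introduced (a context-aware iterator) that stops handing out rows once the task is completed. A minimal Python sketch of the idea, with {{task_context}} standing in for Spark's {{TaskContext}}:
{code:python}
# Conceptual sketch only; the real fix lives in Spark's Scala execution code,
# mirroring the context-aware iterator wrapper introduced by SPARK-33277.
class ContextAwareIterator:
    """Stops iteration once the owning task is marked as completed, so the
    consumer never reads past the point where the task released its resources."""

    def __init__(self, task_context, delegate):
        self.task_context = task_context  # assumed to expose is_completed()
        self.delegate = delegate          # the underlying row iterator

    def __iter__(self):
        return self

    def __next__(self):
        # Terminate eagerly instead of touching a delegate whose buffers
        # may already have been freed by task completion.
        if self.task_context.is_completed():
            raise StopIteration
        return next(self.delegate)
{code}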

Note that all other operations appear to work fine: {{count}}, {{collect}}.
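
For comparison, on the same aggregated DataFrame:
{code:python}
# Same dataset as above; only the isEmpty path triggers the crash.
res = self.spark.read.parquet(tmpPath).groupBy("col").count()
res.count()        # works
res.collect()      # works
res.rdd.isEmpty()  # crashes the executor JVM
{code}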

> df.rdd.isEmpty() results in unexpected executor failure and JVM crash
> ---------------------------------------------------------------------
>
>                 Key: SPARK-39084
>                 URL: https://issues.apache.org/jira/browse/SPARK-39084
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.0, 3.2.1
>            Reporter: Ivan Sadikov
>            Priority: Major
>             Fix For: 3.3.0
>



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org