Posted to issues@spark.apache.org by "Stephan Reiling (JIRA)" <ji...@apache.org> on 2017/08/01 15:31:00 UTC

[jira] [Updated] (SPARK-21595) introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 breaks existing workflow

     [ https://issues.apache.org/jira/browse/SPARK-21595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephan Reiling updated SPARK-21595:
------------------------------------
    Description: 
My pyspark code has the following statement:


{code:python}
from pyspark.sql import functions as sqlf
from pyspark.sql.window import Window

# assign a row key for tracking
df = df.withColumn(
        'association_idx',
        sqlf.row_number().over(
            Window.orderBy('uid1', 'uid2')
        )
    )
{code}


where df is a long, skinny dataframe (450M rows, 10 columns). So this creates one large window over which the whole dataframe is sorted.
In Spark 2.1 this works without problems; in Spark 2.2 it fails with either an out-of-memory exception or a "too many open files" exception, depending on the memory settings (which is what I tried adjusting first to fix this).
Monitoring the blockmgr directory, I see that Spark 2.1 creates 152 files, while Spark 2.2 creates >110,000 files.
In the log I see the following messages (110,000 of these):

{noformat}
17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of spilledRecords crossed the threshold 4096
17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of 64.1 MB to disk (0  time so far)
17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of spilledRecords crossed the threshold 4096
17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of 64.1 MB to disk (1  time so far)
{noformat}

So I started hunting for clues in UnsafeExternalSorter, without luck. What I had missed was this one message:

{noformat}
17/08/01 08:55:37 INFO ExternalAppendOnlyUnsafeRowArray: Reached spill threshold of 4096 rows, switching to org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
{noformat}

This message allowed me to track down the issue.
By changing the configuration to include:

{code:java}
spark.sql.windowExec.buffer.spill.threshold	2097152
{code}

I got it to work again, with the same performance as Spark 2.1.
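For reference, a minimal sketch of one way to set this from PySpark at session startup (assuming a SparkSession-based job; the app name below is just a placeholder, and the property can equally go into spark-defaults.conf or be passed via --conf on spark-submit):

{code:python}
from pyspark.sql import SparkSession

# Raise the window operator's in-memory row buffer well above the 4096-row
# default so it no longer spills to disk after every 4096 rows.
spark = (
    SparkSession.builder
    .appName('window-spill-threshold-example')  # placeholder name
    .config('spark.sql.windowExec.buffer.spill.threshold', '2097152')
    .getOrCreate()
)
{code}

An existing session should also accept it at runtime via spark.conf.set('spark.sql.windowExec.buffer.spill.threshold', '2097152') before the window query is executed.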
I have other workflows that use window functions and do not fail, but they took a performance hit due to the excessive spilling with the default of 4096.
To make it easier to track down issues like this, I think this config variable should be included in the configuration documentation.
Maybe 4096 is too small a default value?


> introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 breaks existing workflow
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-21595
>                 URL: https://issues.apache.org/jira/browse/SPARK-21595
>             Project: Spark
>          Issue Type: Bug
>          Components: Documentation, PySpark
>    Affects Versions: 2.2.0
>         Environment: pyspark on linux
>            Reporter: Stephan Reiling
>            Priority: Minor
>              Labels: documentation, regression



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org