Posted to issues@spark.apache.org by "Brady Bickel (Jira)" <ji...@apache.org> on 2023/08/22 14:17:00 UTC

[jira] [Created] (SPARK-44912) Spark 3.4 multi-column sum slows with many columns

Brady Bickel created SPARK-44912:
------------------------------------

             Summary: Spark 3.4 multi-column sum slows with many columns
                 Key: SPARK-44912
                 URL: https://issues.apache.org/jira/browse/SPARK-44912
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 3.4.1, 3.4.0
            Reporter: Brady Bickel


The code below is a minimal reproducible example of an issue I discovered with PySpark 3.4.x. I want to sum the values of multiple columns and put that per-row sum into a new column. This code works and returns in a reasonable amount of time in PySpark 3.3.x, but becomes extremely slow in PySpark 3.4.x as the number of columns grows. See the execution timing summary below as N varies.
{code:python}
import pyspark.sql.functions as F
import random
import string
from functools import reduce
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# generate a DataFrame of N columns by M rows with random 8-character
# column names and random integers in [-5, 10]
N = 30
M = 100
columns = [''.join(random.choices(string.ascii_uppercase +
                                  string.digits, k=8))
           for _ in range(N)]
data = [tuple([random.randint(-5,10) for _ in range(N)])
        for _ in range(M)]

df = spark.sparkContext.parallelize(data).toDF(columns)
# 3 ways to add a sum column, all of them slow for high N in spark 3.4
df = df.withColumn("col_sum1", sum(df[col] for col in columns))
df = df.withColumn("col_sum2", reduce(add, [F.col(col) for col in columns]))
df = df.withColumn("col_sum3", F.expr("+".join(columns))) {code}
Timing results for Spark 3.3:
||N||Execution time (s)||
|5|0.514|
|10|0.248|
|15|0.327|
|20|0.403|
|25|0.279|
|30|0.322|
|50|0.430|

Timing results for Spark 3.4:
||N||Execution time (s)||
|5|0.379|
|10|0.318|
|15|0.405|
|20|1.32|
|25|28.8|
|30|448|
|50|>10000 (did not finish)|


