Posted to issues@spark.apache.org by "Maziyar PANAHI (Jira)" <ji...@apache.org> on 2021/04/14 07:13:00 UTC

[jira] [Created] (SPARK-35066) Spark 3.1.1 is slower than 3.0.2 by 4-5 times

Maziyar PANAHI created SPARK-35066:
--------------------------------------

             Summary: Spark 3.1.1 is slower than 3.0.2 by 4-5 times
                 Key: SPARK-35066
                 URL: https://issues.apache.org/jira/browse/SPARK-35066
             Project: Spark
          Issue Type: Bug
          Components: ML, SQL
    Affects Versions: 3.1.1
         Environment: Spark/PySpark: 3.1.1

Language: Python 3.7.x / Scala 2.12

OS: macOS, Linux, and Windows

Cloud: Databricks Runtime 7.3 (Spark 3.0.1) and Runtime 8 (Spark 3.1.1)
            Reporter: Maziyar PANAHI


Hi,

The following code snippet runs 4-5 times slower in Apache Spark/PySpark 3.1.1 compared to Apache Spark/PySpark 3.0.2:

 
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, count, col
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover

spark = SparkSession.builder \
        .master("local[*]") \
        .config("spark.driver.memory", "16G") \
        .config("spark.driver.maxResultSize", "0") \
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
        .config("spark.kryoserializer.buffer.max", "2000m") \
        .getOrCreate()

Toys = spark.read.parquet('./toys-cleaned').repartition(12)

# tokenize the text
regexTokenizer = RegexTokenizer(inputCol="reviewText",
                                outputCol="all_words", pattern="\\W")
toys_with_words = regexTokenizer.transform(Toys)

# remove stop words
remover = StopWordsRemover(inputCol="all_words", outputCol="words")
toys_with_tokens = remover.transform(toys_with_words).drop("all_words")

all_words = toys_with_tokens.select(explode("words").alias("word"))

# group by word, count, sort descending, and keep the top 50k
top50k = all_words.groupBy("word") \
        .agg(count("*").alias("total")) \
        .sort(col("total").desc()) \
        .limit(50000)

top50k.show()
{code}
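
For anyone trying to narrow this down, a quick way to confirm the partitioning and to compare the physical plans between the two versions (a minimal sketch reusing the DataFrames from the snippet above):

{code:python}
# Confirm the input really has 12 partitions after repartition(12)
print(Toys.rdd.getNumPartitions())

# Print the extended physical plan of the aggregation; diffing this output
# between 3.0.2 and 3.1.1 shows where the exchange/partitioning differs
top50k.explain(True)
{code}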
 

Some debugging on my side revealed that in Spark/PySpark 3.0.2 the 12 partitions are respected: all 12 tasks run in parallel and each processes a share of the data. In Spark/PySpark 3.1.1, however, even though there are still 12 tasks, 10 of them finish almost immediately and only 2 actually process data. (I've tried disabling a couple of configs that looked related, but none of them changed this behaviour.)
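
For completeness, the toggles I mean are along these lines; these particular flags are only examples of settings that affect post-shuffle parallelism, not a confirmed culprit:

{code:python}
# Candidate knobs that influence how many tasks actually do work after a shuffle.
# These are guesses at the relevant settings, not a confirmed fix.
spark.conf.set("spark.sql.adaptive.enabled", "false")                     # adaptive query execution
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "false")  # AQE partition coalescing
spark.conf.set("spark.sql.shuffle.partitions", "12")                      # post-shuffle parallelism

# Re-run the aggregation after changing the configs and watch the task
# distribution in the Spark UI again
top50k.show()
{code}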

Screenshot of Spark 3.1.1 tasks: [Spark UI 3.1.1|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009725-af969e00-9863-11eb-8e5b-07ce53e8f5f3.png]

Screenshot of Spark 3.0.2 tasks: [Spark UI 3.0.2|http://apache-spark-user-list.1001560.n3.nabble.com/file/t8277/114009712-ac9bad80-9863-11eb-9e55-c797833bdbba.png]

For a longer discussion: [Spark User List|http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-Spark-3-0-x-faster-than-Spark-3-1-x-td39979.html]

 

You can reproduce this large performance difference between Spark 3.1.1 and Spark 3.0.2 by running the shared code against any dataset large enough for the job to take longer than a minute. I'm not sure whether this is related to SQL, to a Spark config that exists in 3.0.x but only takes effect in 3.1.1, or to .transform in Spark ML.
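
A simple way to quantify the difference is to wrap the final action in a timer and run the exact same code on both versions (a rough sketch reusing the snippet above; any action that forces full execution will do):

{code:python}
import time

start = time.time()

# same aggregation as above; count() forces the whole pipeline to run
top50k = all_words.groupBy("word") \
        .agg(count("*").alias("total")) \
        .sort(col("total").desc()) \
        .limit(50000)
top50k.count()

print("Spark %s took %.1f seconds" % (spark.version, time.time() - start))
{code}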



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
