Posted to issues@spark.apache.org by "Luca Canali (JIRA)" <ji...@apache.org> on 2018/06/07 14:35:00 UTC

[jira] [Created] (SPARK-24486) Slow performance reading ArrayType columns

Luca Canali created SPARK-24486:
-----------------------------------

             Summary: Slow performance reading ArrayType columns
                 Key: SPARK-24486
                 URL: https://issues.apache.org/jira/browse/SPARK-24486
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, SQL
    Affects Versions: 2.3.0
            Reporter: Luca Canali


We have found an issue of slow performance in one of our applications when running on Spark 2.3.0 (the same workload does not show the problem on Spark 2.2.1). We suspect a regression in the handling of ArrayType columns. I have built a simplified test case that reproduces the issue and should help with troubleshooting:
{code:java}
// prepare test data
val stringListValues = Range(1, 30000).mkString(",")
sql(s"select 1 as myid, Array($stringListValues) as myarray from range(20000)").repartition(1).write.parquet("file:///tmp/deleteme1")

// run test
spark.read.parquet("file:///tmp/deleteme1").limit(1).show()
{code}
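For reference, the layout of the test data written above can be double-checked with a quick inspection; this is just a sanity-check sketch, the expected result being 20000 rows with an integer id column and an array column of roughly 30000 integers:
{code:java}
// sanity check of the generated test data
val df = spark.read.parquet("file:///tmp/deleteme1")
df.printSchema()    // expected: myid: int, myarray: array<int>
println(df.count()) // expected: 20000
{code}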

Performance measurements:

On a desktop-size test system, the test runs in about 2 seconds using Spark 2.2.1 (the runtime drops to sub-second on subsequent runs) and takes close to 20 seconds on Spark 2.3.0.
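As a note on how the timings can be reproduced, the elapsed time is easy to measure directly from the spark-shell with the built-in spark.time helper; a minimal sketch, to be run on both Spark versions:
{code:java}
// prints "Time taken: ... ms" for the test query
spark.time {
  spark.read.parquet("file:///tmp/deleteme1").limit(1).show()
}
{code}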

 

Additional drill-down using Spark task metrics data shows that in Spark 2.2.1 only 2 records are read by this workload, while on Spark 2.3.0 all the rows in the file are read, which appears anomalous.

Example:
{code:java}
// launch spark-shell with the sparkmeasure package (shell command)
bin/spark-shell --master local[*] --driver-memory 2g --packages ch.cern.sparkmeasure:spark-measure_2.11:0.11

// then, from the Scala prompt, measure the stage metrics of the test query
val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
stageMetrics.runAndMeasure(spark.read.parquet("file:///tmp/deleteme1").limit(1).show())
{code}
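In case the external package is not convenient, a rough equivalent of the measurement above can be obtained with a plain SparkListener that sums the task input metrics. A minimal sketch (not what I used for the numbers below):
{code:java}
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// accumulate records and bytes read across all completed tasks
val recordsRead = new AtomicLong(0)
val bytesRead = new AtomicLong(0)
spark.sparkContext.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    if (taskEnd.taskMetrics != null) {
      recordsRead.addAndGet(taskEnd.taskMetrics.inputMetrics.recordsRead)
      bytesRead.addAndGet(taskEnd.taskMetrics.inputMetrics.bytesRead)
    }
  }
})

spark.read.parquet("file:///tmp/deleteme1").limit(1).show()

// note: the listener bus is asynchronous, so the counters may lag by a moment
println(s"recordsRead = ${recordsRead.get}, bytesRead = ${bytesRead.get}")
{code}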

Selected metrics from the Spark 2.3.0 run:

 
{noformat}
elapsedTime => 17849 (18 s)
sum(numTasks) => 11
sum(recordsRead) => 20000
sum(bytesRead) => 1136448171 (1083.0 MB)
{noformat}

Selected metrics from the Spark 2.2.1 run:

 
{noformat}
elapsedTime => 1329 (1 s)
sum(numTasks) => 2
sum(recordsRead) => 2
sum(bytesRead) => 269162610 (256.0 MB)
{noformat}
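One more piece of drill-down that may help locate the regression is comparing the physical plans produced by the two versions for the same query, e.g.:
{code:java}
// compare the physical plan between Spark 2.2.1 and Spark 2.3.0,
// in particular how the limit(1) interacts with the Parquet scan
spark.read.parquet("file:///tmp/deleteme1").limit(1).explain()
{code}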
 

Note: Spark built from master (as I write this, June 7th 2018) shows the same behavior as Spark 2.3.0.

 


