Posted to issues@spark.apache.org by "Yuming Wang (JIRA)" <ji...@apache.org> on 2018/09/23 08:39:00 UTC
[jira] [Comment Edited] (SPARK-24486) Slow performance reading ArrayType columns
[ https://issues.apache.org/jira/browse/SPARK-24486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16624723#comment-16624723 ]
Yuming Wang edited comment on SPARK-24486 at 9/23/18 8:38 AM:
--------------------------------------------------------------
[~lucacanali] This may be caused by [SPARK-23023|https://issues.apache.org/jira/browse/SPARK-23023]. Could you use {{collect()}} to test your case? Below is my benchmark:
code:
{code:java}
// Benchmark is Spark's internal micro-benchmark utility
// (org.apache.spark.util.Benchmark in the Spark 2.3.x source tree)
val benchmark = new Benchmark("read parquet", 1, minNumIters = 10)
// Case 1: show() -- the reported slow path
benchmark.addCase("array", 5) { _ =>
  spark.read.parquet("file:///tmp/deleteme1").limit(1).show()
}
// Case 2: collect() -- same read, without show()
benchmark.addCase("array", 5) { _ =>
  spark.read.parquet("file:///tmp/deleteme1").limit(1).collect()
}
benchmark.run()
{code}
{noformat}
Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
read parquet:                            Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
array                                        14305 / 15775          0.0  14304996947.0       1.0X
array                                          269 /   364          0.0    269060518.0      53.2X
{noformat}
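If {{show()}} is indeed the trigger, one possible workaround until a fix lands is to bypass {{show()}} entirely: collect the limited rows and format them yourself. A minimal sketch (assuming the slow path is specific to {{show()}}, as the benchmark above suggests; {{file:///tmp/deleteme1}} is the test data from this issue):
{code:java}
// Hypothetical workaround: collect() takes the fast path per the benchmark,
// so fetch the single row and print a truncated rendering manually.
val rows = spark.read.parquet("file:///tmp/deleteme1").limit(1).collect()
rows.foreach(row => println(row.mkString(", ").take(200))) // truncate the long array for display
{code}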
was (Author: q79969786):
I can reproduce this issue. I'm working on it.
> Slow performance reading ArrayType columns
> ------------------------------------------
>
> Key: SPARK-24486
> URL: https://issues.apache.org/jira/browse/SPARK-24486
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 2.3.0
> Reporter: Luca Canali
> Priority: Minor
>
> We have found a performance issue in one of our applications when running on Spark 2.3.0 (the same workload shows no performance issue on Spark 2.2.1). We suspect a regression in the handling of ArrayType columns. I have built a simplified test case that reproduces the issue to help with troubleshooting:
>
>
> {code:java}
> // prepare test data
> val stringListValues = Range(1, 30000).mkString(",")
> sql(s"select 1 as myid, Array($stringListValues) as myarray from range(20000)").repartition(1).write.parquet("file:///tmp/deleteme1")
> // run test
> spark.read.parquet("file:///tmp/deleteme1").limit(1).show()
> {code}
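> A quick way to time the test on each Spark version (a minimal sketch; {{spark.time}} simply prints the wall-clock time of the enclosed block):
>
> {code:java}
> // Print the elapsed wall-clock time for the test query
> spark.time { spark.read.parquet("file:///tmp/deleteme1").limit(1).show() }
> {code}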
> Performance measurements:
>
> On a desktop-size test system, the test runs in about 2 seconds on Spark 2.2.1 (dropping to subsecond on subsequent runs) and takes close to 20 seconds on Spark 2.3.0.
>
> Additional drill-down using Spark task metrics data shows that in Spark 2.2.1 only 2 records are read by this workload, while on Spark 2.3.0 all rows in the file are read, which appears anomalous.
> Example:
> {code:java}
> bin/spark-shell --master local[*] --driver-memory 2g --packages ch.cern.sparkmeasure:spark-measure_2.11:0.11
> val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark)
> stageMetrics.runAndMeasure(spark.read.parquet("file:///tmp/deleteme1").limit(1).show())
> {code}
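> For finer-grained control, sparkMeasure also allows bracketing the workload manually and then printing the aggregated report (a sketch based on the sparkMeasure API; {{begin()}}, {{end()}}, and {{printReport()}} are assumed from its documentation):
>
> {code:java}
> // Alternative to runAndMeasure: bracket the workload explicitly,
> // then print the aggregated stage metrics report
> stageMetrics.begin()
> spark.read.parquet("file:///tmp/deleteme1").limit(1).show()
> stageMetrics.end()
> stageMetrics.printReport()
> {code}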
>
>
> Selected metrics from Spark 2.3.0 run:
>
> {noformat}
> elapsedTime => 17849 (18 s)
> sum(numTasks) => 11
> sum(recordsRead) => 20000
> sum(bytesRead) => 1136448171 (1083.0 MB)
> {noformat}
>
>
> From Spark 2.2.1 run:
>
> {noformat}
> elapsedTime => 1329 (1 s)
> sum(numTasks) => 2
> sum(recordsRead) => 2
> sum(bytesRead) => 269162610 (256.0 MB)
> {noformat}
>
> Note: Spark built from master (as of this writing, June 7th, 2018) shows the same behavior as Spark 2.3.0.
>