Posted to issues@spark.apache.org by "Sanjeev Mishra (Jira)" <ji...@apache.org> on 2020/06/29 15:54:00 UTC
[jira] [Created] (SPARK-32130) Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
Sanjeev Mishra created SPARK-32130:
--------------------------------------
Summary: Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4
Key: SPARK-32130
URL: https://issues.apache.org/jira/browse/SPARK-32130
Project: Spark
Issue Type: Bug
Components: Input/Output
Affects Versions: 3.0.0
Environment: 20/06/29 07:52:19 WARN Utils: Your hostname, sanjeevs-MacBook-Pro-2.local resolves to a loopback address: 127.0.0.1; using 10.0.0.8 instead (on interface en0)
20/06/29 07:52:19 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
20/06/29 07:52:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/06/29 07:52:26 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Spark context Web UI available at http://10.0.0.8:4041
Spark context available as 'sc' (master = local[*], app id = local-1593442346864).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0
      /_/
Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_251)
Type in expressions to have them evaluated.
Type :help for more information.
Reporter: Sanjeev Mishra
We are planning to move to Spark 3, but the read performance of our JSON files is unacceptable. Below are the performance numbers compared to Spark 2.4:
Spark 2.4
scala> spark.time(spark.read.json("/data/20200528"))
Time taken: 19691 ms
res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]
scala> spark.time(res61.count())
Time taken: 7113 ms
res64: Long = 2605349
Spark 3.0
scala> spark.time(spark.read.json("/data/20200528"))
20/06/29 08:06:53 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Time taken: 849652 ms
res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]
scala> spark.time(res0.count())
Time taken: 8201 ms
res2: Long = 2605349
I am attaching a sample dataset (please delete it once you are able to reproduce the issue) that is much smaller than the actual data, but the performance difference can still be verified with it.
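One observation that may help narrow this down: the count() times are comparable between the two versions (7113 ms vs 8201 ms), so the regression appears to be in the initial read, i.e. schema inference, rather than in scanning the data. If that is the case, supplying an explicit schema should bypass inference entirely. The sketch below is hypothetical: it only spells out the two fields visible in the DataFrame printout above ([created: bigint, id: string ... 5 more fields]), so the remaining five fields would need to be filled in from the real data:

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val schema = new StructType()
     |   .add("created", LongType)
     |   .add("id", StringType)
     |   // ... the remaining 5 fields of the actual dataset go here

scala> spark.time(spark.read.schema(schema).json("/data/20200528"))

With the schema provided up front, Spark does not need to scan the files to infer field types, so the read in both 2.4 and 3.0 should return almost immediately, which would confirm whether the slowdown is confined to inference.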
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org