Posted to issues@spark.apache.org by "Jarred Li (Jira)" <ji...@apache.org> on 2020/08/10 13:55:00 UTC
[jira] [Created] (SPARK-32582) Spark SQL Infer Schema Performance
Jarred Li created SPARK-32582:
---------------------------------
Summary: Spark SQL Infer Schema Performance
Key: SPARK-32582
URL: https://issues.apache.org/jira/browse/SPARK-32582
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.0.0, 2.4.6
Reporter: Jarred Li
When schema inference is enabled, Spark lists all the files in the table, but only one of the files is used to read the schema information. Listing every file hurts performance when the number of partitions is large.
See the code at "[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#88]": all the files in the table are passed in as input, but only one file's schema is used to infer the schema.
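
The idea behind the requested improvement can be sketched outside Spark. The following is a minimal Python illustration, not Spark's implementation: it walks a partitioned directory tree lazily and stops at the first file that yields a schema, so later partitions are never listed. The names iter_files, infer_schema_lazily and read_first_line_schema are hypothetical, and treating the first line of a text file as its "schema" is an assumption standing in for reading an ORC footer.

```python
import os
from typing import Callable, Iterator, Optional


def iter_files(root: str) -> Iterator[str]:
    """Lazily yield file paths under root, one directory at a time."""
    # os.walk is a generator: a directory is only scanned when the
    # traversal reaches it, so stopping early avoids listing the rest.
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def read_first_line_schema(path: str) -> Optional[str]:
    """Hypothetical schema reader: the first line is the schema.

    Returns None for an empty file, mimicking a file that carries
    no usable schema.
    """
    with open(path, "r", encoding="utf-8") as f:
        line = f.readline().strip()
    return line or None


def infer_schema_lazily(
    root: str,
    read_schema: Callable[[str], Optional[str]] = read_first_line_schema,
) -> Optional[str]:
    """Return the schema of the first file that has one, then stop."""
    for path in iter_files(root):
        schema = read_schema(path)
        if schema is not None:
            return schema  # remaining partitions are never listed
    return None
```

With a table laid out as dt=0/part-0.txt, dt=1/part-0.txt and so on, calling infer_schema_lazily on the table root opens only the first file that is not empty, regardless of how many partitions follow. Whether Spark can safely stop at the first file depends on options such as schema merging, which this sketch does not model.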
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org