Posted to issues@spark.apache.org by "Jarred Li (Jira)" <ji...@apache.org> on 2020/08/10 13:55:00 UTC

[jira] [Created] (SPARK-32582) Spark SQL Infer Schema Performance

Jarred Li created SPARK-32582:
---------------------------------

             Summary: Spark SQL Infer Schema Performance
                 Key: SPARK-32582
                 URL: https://issues.apache.org/jira/browse/SPARK-32582
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0, 2.4.6
            Reporter: Jarred Li


When schema inference is enabled, Spark lists all the files in the table, but only one of the files is used to read the schema information. Listing every file hurts performance when the number of partitions is large.

 

See the code at "[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#88]": all the files in the table are passed in, but the schema of only one file is used to infer the table schema.
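
To make the cost concrete, here is a rough, Spark-free sketch in Python. It is not the code in OrcUtils.scala; the toy directory layout and every function name (make_table, iter_files, pick_eager, pick_lazy, demo) are made up for illustration. It only counts how many partition directories get listed when the full file list is built first, versus when the listing stops at the first file found.

```python
import os
import tempfile


def make_table(root, num_partitions):
    """Create a toy partitioned table: root/part=<i>/data.orc (empty placeholder files)."""
    for i in range(num_partitions):
        part_dir = os.path.join(root, f"part={i:04d}")
        os.makedirs(part_dir)
        open(os.path.join(part_dir, "data.orc"), "w").close()


def iter_files(root, stats):
    """Lazily yield data files, one partition directory at a time.

    stats["dirs_listed"] counts how many partition directories were listed.
    """
    for part in sorted(os.listdir(root)):
        part_dir = os.path.join(root, part)
        stats["dirs_listed"] += 1
        for name in sorted(os.listdir(part_dir)):
            yield os.path.join(part_dir, name)


def pick_eager(root):
    """List every file first, then use only the first one (the pattern described above)."""
    stats = {"dirs_listed": 0}
    files = list(iter_files(root, stats))  # forces a listing of all partitions
    return files[0], stats["dirs_listed"]


def pick_lazy(root):
    """Stop listing as soon as one candidate file has been found."""
    stats = {"dirs_listed": 0}
    first = next(iter_files(root, stats))  # lists only the first partition
    return first, stats["dirs_listed"]


def demo(num_partitions=50):
    """Return (same_file_chosen, dirs_listed_eager, dirs_listed_lazy)."""
    with tempfile.TemporaryDirectory() as root:
        make_table(root, num_partitions)
        eager_file, eager_listed = pick_eager(root)
        lazy_file, lazy_listed = pick_lazy(root)
        return eager_file == lazy_file, eager_listed, lazy_listed


if __name__ == "__main__":
    print(demo(50))
```

Both variants end up with the same file, but the eager one lists every partition directory while the lazy one lists a single directory. A real fix would also have to deal with files that have no readable schema, which this sketch ignores.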

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
