Posted to issues@spark.apache.org by "Jarred Li (Jira)" <ji...@apache.org> on 2020/08/11 06:59:00 UTC
[jira] [Updated] (SPARK-32582) Spark SQL Infer Schema Performance

     [ https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jarred Li updated SPARK-32582:
------------------------------
    Description: 
When schema inference is enabled, Spark lists all the files in the table, but only one of those files is used to read the schema information. Listing every file degrades performance when the number of partitions is large.

 

See the code at "[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L88|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#88]": all the files in the table are passed in as input, but only one file's schema is used to infer the table schema.
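
To make the cost concrete, here is a minimal sketch in Python. It is not Spark's implementation; list_files, infer_schema, the partition layout, and fake_read are all hypothetical names invented for illustration. It assumes that any single readable file is enough to determine the schema, and it contrasts listing every partition up front with listing partitions lazily and stopping at the first schema found.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional


def list_files(partitions: Dict[str, List[str]], listed: List[str]) -> Iterator[str]:
    """Yield file paths one partition at a time, recording each partition listed."""
    for partition, names in partitions.items():
        listed.append(partition)  # stands in for one (expensive) directory listing
        for name in names:
            yield f"{partition}/{name}"


def infer_schema(files: Iterable[str],
                 read_schema: Callable[[str], Optional[str]]) -> Optional[str]:
    """Return the first schema that can be read, stopping as soon as one is found."""
    for path in files:
        schema = read_schema(path)
        if schema is not None:
            return schema
    return None


# Hypothetical table layout: three partitions with one file each.
partitions = {"dt=1": ["a.orc"], "dt=2": ["b.orc"], "dt=3": ["c.orc"]}


def fake_read(path: str) -> Optional[str]:
    """Hypothetical reader: every file yields the same schema string."""
    return "struct<id:int>"


# Eager: materialize the full file list first, then read a single file.
eager_listed: List[str] = []
eager_schema = infer_schema(list(list_files(partitions, eager_listed)), fake_read)

# Lazy: pass the generator through, so listing stops once a schema is found.
lazy_listed: List[str] = []
lazy_schema = infer_schema(list_files(partitions, lazy_listed), fake_read)
```

Both calls return the same schema, but the eager variant records a listing for every partition, while the lazy variant records only the first. A short-circuit like this would not apply when schemas from multiple files have to be merged.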

 

  was:
When schema inference is enabled, Spark lists all the files in the table, but only one of those files is used to read the schema information. Listing every file degrades performance when the number of partitions is large.

 

See the code at "[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#88]": all the files in the table are passed in as input, but only one file's schema is used to infer the table schema.

 


> Spark SQL Infer Schema Performance
> ----------------------------------
>
>                 Key: SPARK-32582
>                 URL: https://issues.apache.org/jira/browse/SPARK-32582
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.6, 3.0.0
>            Reporter: Jarred Li
>            Priority: Major
>
> When schema inference is enabled, Spark lists all the files in the table, but only one of those files is used to read the schema information. Listing every file degrades performance when the number of partitions is large.
>  
> See the code at "[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L88|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#88]": all the files in the table are passed in as input, but only one file's schema is used to infer the table schema.
>  


