Posted to issues@spark.apache.org by "Jarred Li (Jira)" <ji...@apache.org> on 2020/08/11 07:00:02 UTC

[jira] [Comment Edited] (SPARK-32582) Spark SQL Infer Schema Performance

    [ https://issues.apache.org/jira/browse/SPARK-32582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175165#comment-17175165 ] 

Jarred Li edited comment on SPARK-32582 at 8/11/20, 6:59 AM:
-------------------------------------------------------------

The performance problem I am describing is not in reading the files but in listing them (the "LIST" operation). For example, if a table has 1000 partitions, the files in all 1000 partitions are listed first, yet only one file is read for schema inference. The "LIST" operation is time-consuming, especially on object stores such as S3.
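
To make the cost difference concrete, here is a minimal sketch in plain Python. It is not Spark code: `CountingStore`, `first_file_eager`, `first_file_lazy`, and the `dt=...` partition names are hypothetical, and each `list_files` call merely stands in for one remote "LIST" request. It assumes that any single file is enough to infer the schema, which may not hold for every table.

```python
from typing import Dict, List, Optional


class CountingStore:
    """Hypothetical in-memory stand-in for an object store; counts LIST calls."""

    def __init__(self, partitions: Dict[str, List[str]]) -> None:
        self._partitions = partitions
        self.list_calls = 0

    def list_files(self, partition: str) -> List[str]:
        self.list_calls += 1  # each call models one remote LIST request
        return self._partitions[partition]


def first_file_eager(store: CountingStore, partitions: List[str]) -> Optional[str]:
    # List every partition up front, then use only the first file found.
    all_files = [f for p in partitions for f in store.list_files(p)]
    return all_files[0] if all_files else None


def first_file_lazy(store: CountingStore, partitions: List[str]) -> Optional[str]:
    # Stop listing as soon as one partition yields a file.
    for p in partitions:
        files = store.list_files(p)
        if files:
            return files[0]
    return None


# A table with 1000 partitions, one ORC file each (made-up names).
partitions = {f"dt={i:04d}": [f"dt={i:04d}/part-00000.orc"] for i in range(1000)}

eager_store = CountingStore(partitions)
eager_file = first_file_eager(eager_store, list(partitions))

lazy_store = CountingStore(partitions)
lazy_file = first_file_lazy(lazy_store, list(partitions))

print(eager_store.list_calls, lazy_store.list_calls)  # prints "1000 1"
```

Both functions pick the same file, but the eager version issues one LIST request per partition while the lazy version stops after the first non-empty partition.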

 

See the list files code: [https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L300|https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#300]

 

 


was (Author: leejianwei):
The performance problem I am describing is not in reading the files but in listing them (the "LIST" operation). For example, if a table has 1000 partitions, the files in all 1000 partitions are listed first, yet only one file is read for schema inference. The "LIST" operation is time-consuming, especially on object stores such as S3.

 

See the list files code: [https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#300]

 

 

> Spark SQL Infer Schema Performance
> ----------------------------------
>
>                 Key: SPARK-32582
>                 URL: https://issues.apache.org/jira/browse/SPARK-32582
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.6, 3.0.0
>            Reporter: Jarred Li
>            Priority: Major
>
> When schema inference is enabled, Spark lists all the files in the table, but only one of those files is used to read the schema information. Listing every file hurts performance when the table has a large number of partitions.
>  
> See the code in "[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L88|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#88]": all the files in the table are passed in, but only one file's schema is used for inference.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org