Posted to issues@spark.apache.org by "lithiumlee-_- (Jira)" <ji...@apache.org> on 2020/05/26 09:47:00 UTC

[jira] [Updated] (SPARK-31822) Cost too much resources when read orc hive table for infer schema

     [ https://issues.apache.org/jira/browse/SPARK-31822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lithiumlee-_- updated SPARK-31822:
----------------------------------
    Summary: Cost too much resources when read orc hive table for infer schema  (was: Cost too much resources when read orc hive table to infer schema)

> Cost too much resources when read orc hive table for infer schema
> -----------------------------------------------------------------
>
>                 Key: SPARK-31822
>                 URL: https://issues.apache.org/jira/browse/SPARK-31822
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output, SQL
>    Affects Versions: 2.4.3
>            Reporter: lithiumlee-_-
>            Priority: Major
>
> When reading a partitioned Hive ORC table that has no Spark schema properties, Spark reads all partitions and all files to infer the schema.
> Other settings: native ORC mode; _convertMetastoreOrc = true_.
>  
> I think this can be improved by passing *_partitionFilters_* to *_fileIndex.listFiles_*.
> {code:java}
> // org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238
> // listFiles(Nil, Nil) passes no partition or data filters, so every file in every partition is listed.
> val inferredSchema = fileFormat
>   .inferSchema(
>     sparkSession,
>     options,
>     fileIndex.listFiles(Nil, Nil).flatMap(_.files))
>   .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))
> {code}
>  
>  
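The proposal in the description can be illustrated with a small standalone sketch. This is a toy model rather than Spark code: `PartitionDir`, `ToyFileIndex`, and the `dt` partition column are hypothetical names introduced only for illustration, and the filters are plain Scala predicates instead of Catalyst expressions. It shows only how passing partition filters to a `listFiles`-style call shrinks the set of files handed to schema inference. Whether inferring from a subset is safe presumably depends on all partitions sharing a compatible schema, which the issue does not discuss.

```scala
// Toy model, NOT Spark's API: every type and name below is hypothetical.
final case class PartitionDir(values: Map[String, String], files: Seq[String])

final class ToyFileIndex(partitions: Seq[PartitionDir]) {
  // An empty filter list keeps every partition, which mirrors the effect
  // of calling listFiles with no partition filters.
  def listFiles(partitionFilters: Seq[Map[String, String] => Boolean]): Seq[PartitionDir] =
    partitions.filter(p => partitionFilters.forall(f => f(p.values)))
}

object InferSchemaListingSketch {
  val index = new ToyFileIndex(Seq(
    PartitionDir(Map("dt" -> "2020-05-24"),
      Seq("dt=2020-05-24/part-0.orc", "dt=2020-05-24/part-1.orc")),
    PartitionDir(Map("dt" -> "2020-05-25"), Seq("dt=2020-05-25/part-0.orc")),
    PartitionDir(Map("dt" -> "2020-05-26"), Seq("dt=2020-05-26/part-0.orc"))
  ))

  // No filters: every file in every partition becomes input to inference.
  val allFiles: Seq[String] = index.listFiles(Nil).flatMap(_.files)

  // With a partition filter: only files in the selected partition are listed.
  val onlyLatest: Seq[Map[String, String] => Boolean] =
    Seq((values: Map[String, String]) => values.get("dt").contains("2020-05-26"))
  val prunedFiles: Seq[String] = index.listFiles(onlyLatest).flatMap(_.files)

  def main(args: Array[String]): Unit = {
    println(s"unfiltered listing: ${allFiles.size} files")
    println(s"pruned listing: ${prunedFiles.size} files")
  }
}
```

In this model the cost of inference scales with the number of listed files, so pruning before listing is where the saving would come from.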


