Posted to issues@spark.apache.org by "lithiumlee-_- (Jira)" <ji...@apache.org> on 2020/05/26 09:47:00 UTC
[jira] [Updated] (SPARK-31822) Cost too much resources when read orc hive table for infer schema
[ https://issues.apache.org/jira/browse/SPARK-31822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
lithiumlee-_- updated SPARK-31822:
----------------------------------
Summary: Cost too much resources when read orc hive table for infer schema (was: Cost too much resources when read orc hive table to infer schema)
> Cost too much resources when read orc hive table for infer schema
> -----------------------------------------------------------------
>
> Key: SPARK-31822
> URL: https://issues.apache.org/jira/browse/SPARK-31822
> Project: Spark
> Issue Type: Improvement
> Components: Input/Output, SQL
> Affects Versions: 2.4.3
> Reporter: lithiumlee-_-
> Priority: Major
>
> When reading a partitioned Hive ORC table that has no Spark schema properties, Spark reads all partitions and all files to infer the schema.
> Other settings: native ORC mode; _convertMetastoreOrc = true_.
>
> I think this can be improved by passing *_partitionFilters_* to *_fileIndex.listFiles_*.
> {code:java}
> // org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238
> val inferredSchema = fileFormat
>   .inferSchema(
>     sparkSession,
>     options,
>     // Nil, Nil = no partition filters and no data filters,
>     // so the files of every partition are listed.
>     fileIndex.listFiles(Nil, Nil).flatMap(_.files))
>   .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))
> {code}
>
>
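The effect of the proposal can be illustrated with a small, self-contained model. This is only a sketch, not Spark code: `ToyFileIndex`, `PartitionDir`, `DataFile`, `filesForInference` and `onlyLatest` are hypothetical names, and the toy `listFiles` takes plain Scala predicates instead of Spark's Catalyst expressions. It shows that an empty filter list selects the files of every partition, while a partition filter limits the files handed to schema inference.

```scala
object InferSchemaListing {
  // Hypothetical stand-ins for illustration; these are not Spark classes.
  final case class DataFile(path: String)
  final case class PartitionDir(values: Map[String, String], files: Seq[DataFile])

  // A partition filter here is just a predicate over the partition values.
  type PartitionFilter = Map[String, String] => Boolean

  final class ToyFileIndex(partitions: Seq[PartitionDir]) {
    // An empty filter list matches every partition, like listFiles(Nil, Nil) in the issue.
    def listFiles(partitionFilters: Seq[PartitionFilter]): Seq[PartitionDir] =
      partitions.filter(p => partitionFilters.forall(f => f(p.values)))
  }

  val index = new ToyFileIndex(Seq(
    PartitionDir(Map("dt" -> "2020-05-24"),
      Seq(DataFile("dt=2020-05-24/part-0.orc"), DataFile("dt=2020-05-24/part-1.orc"))),
    PartitionDir(Map("dt" -> "2020-05-25"),
      Seq(DataFile("dt=2020-05-25/part-0.orc"), DataFile("dt=2020-05-25/part-1.orc"))),
    PartitionDir(Map("dt" -> "2020-05-26"),
      Seq(DataFile("dt=2020-05-26/part-0.orc")))
  ))

  // Files that would be passed on to schema inference for a given set of filters.
  def filesForInference(filters: Seq[PartitionFilter]): Seq[DataFile] =
    index.listFiles(filters).flatMap(_.files)

  // Example filter: keep only the partition the query actually touches.
  val onlyLatest: PartitionFilter = values => values.get("dt").contains("2020-05-26")

  def main(args: Array[String]): Unit = {
    val all = filesForInference(Nil)
    val pruned = filesForInference(Seq(onlyLatest))
    println(s"unfiltered: ${all.size} files, pruned: ${pruned.size} files")
  }
}
```

In this model the unfiltered call lists the files of all three partitions, while the filtered call lists only the single file under `dt=2020-05-26`.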