
Posted to issues@spark.apache.org by "lithiumlee-_- (Jira)" <ji...@apache.org> on 2020/05/26 09:24:00 UTC

[jira] [Created] (SPARK-31822) Cost too much resources when read orc hive table to infer schema

lithiumlee-_- created SPARK-31822:
-------------------------------------

             Summary: Cost too much resources when read orc hive table to infer schema
                 Key: SPARK-31822
                 URL: https://issues.apache.org/jira/browse/SPARK-31822
             Project: Spark
          Issue Type: Improvement
          Components: Input/Output
    Affects Versions: 2.4.3
            Reporter: lithiumlee-_-


When reading a partitioned Hive ORC table that has no Spark schema properties, Spark reads all partitions and all files to infer the schema.

Other settings: native ORC mode; _convertMetastoreOrc = true_.

 

I think this can be improved by passing *_partitionFilters_* to *_fileIndex.listFiles_*.
{code:java}
// org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:238
val inferredSchema = fileFormat
  .inferSchema(
    sparkSession,
    options,
    fileIndex.listFiles(Nil, Nil).flatMap(_.files))
  .map(mergeWithMetastoreSchema(relation.tableMeta.dataSchema, _))

{code}
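To illustrate the idea, here is a minimal, self-contained sketch of why pruning partitions before schema inference reduces the number of files touched. It is a toy model, not Spark code: `PartitionDir`, `ToyFileIndex`, `filesToInspect`, and the `dt` partition column are hypothetical stand-ins, and the filters are plain Scala predicates, whereas Spark's `FileIndex.listFiles` takes partition filters and data filters as Catalyst expressions. How the real `partitionFilters` would be made available at the call site above is not covered here.

```scala
// Toy model only: every type and name here is a hypothetical stand-in, not a Spark class.
object PrunedSchemaInferenceSketch {
  // One partition directory: its partition column values and the files it contains.
  final case class PartitionDir(values: Map[String, String], files: Seq[String])

  // Stand-in for a file index over a partitioned table.
  final class ToyFileIndex(partitions: Seq[PartitionDir]) {
    // An empty filter list keeps every partition, which mirrors listFiles(Nil, Nil).
    def listFiles(partitionFilters: Seq[Map[String, String] => Boolean]): Seq[PartitionDir] =
      partitions.filter(p => partitionFilters.forall(f => f(p.values)))
  }

  // The files that schema inference would have to open for the given filters.
  def filesToInspect(
      index: ToyFileIndex,
      partitionFilters: Seq[Map[String, String] => Boolean]): Seq[String] =
    index.listFiles(partitionFilters).flatMap(_.files)

  // Three daily partitions holding five ORC files in total (made-up data).
  val index = new ToyFileIndex(Seq(
    PartitionDir(Map("dt" -> "2020-05-24"), Seq("part-0.orc", "part-1.orc")),
    PartitionDir(Map("dt" -> "2020-05-25"), Seq("part-2.orc")),
    PartitionDir(Map("dt" -> "2020-05-26"), Seq("part-3.orc", "part-4.orc"))
  ))

  // Hypothetical partition filter: keep only the dt = 2020-05-26 partition.
  val onlyLatest: Map[String, String] => Boolean = _.get("dt").contains("2020-05-26")

  def main(args: Array[String]): Unit = {
    println(filesToInspect(index, Nil).size)             // prints 5: every file is listed
    println(filesToInspect(index, Seq(onlyLatest)).size) // prints 2: only the matching partition
  }
}
```

With no filters the sketch lists every file in every partition, which is the behaviour described in this report. With a partition filter, only the files of the matching partition would be handed to schema inference.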
I think 


--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org