Posted to commits@hudi.apache.org by "Gary Li (Jira)" <ji...@apache.org> on 2020/11/25 15:52:00 UTC
[jira] [Resolved] (HUDI-1392) lose partition info when using spark parameter "basePath"
[ https://issues.apache.org/jira/browse/HUDI-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gary Li resolved HUDI-1392.
---------------------------
Resolution: Fixed
> lose partition info when using spark parameter "basePath"
> ----------------------------------------------------------
>
> Key: HUDI-1392
> URL: https://issues.apache.org/jira/browse/HUDI-1392
> Project: Apache Hudi
> Issue Type: Bug
> Components: Spark Integration
> Reporter: steven zhang
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.6.1
>
>
> Steps to reproduce the issue:
> set hoodie.datasource.write.hive_style_partitioning to true
> spark.read().format("org.apache.hudi").option("mergeSchema", true).option("basePath",tablePath).load(tablePath + (nonPartitionedTable ? "/*" : "/*")).createOrReplaceTempView(hudiTable);
> spark.sql("select * from hudiTable where date>'20200807'").explain();
> The plan prints PartitionFilters: [], so the date predicate is not used as a partition filter.
> The reason is:
> step 1. spark reads the datasource (https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala L317)
>
> case (dataSource: RelationProvider, None) => dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions) //caseInsensitiveOptions CaseInsensitiveMap type
>
> step 2. hudi creates the relation
> org.apache.hudi.DefaultSource#createRelation(sqlContext: SQLContext,optParams: Map[String, String],schema: StructType): BaseRelation = {
>
> // the type of optParams is CaseInsensitiveMap, but the Map ++ below converts parameters to a plain Map
> val parameters = Map(QUERY_TYPE_OPT_KEY -> DEFAULT_QUERY_TYPE_OPT_VAL) ++ translateViewTypesToQueryTypes(optParams)
>
> step 3. hudi falls back to a parquet relation when we query the data of a COW-type table
> it then calls getBaseFileOnlyView(sqlContext, parameters, schema, readPaths, isBootstrappedTable, globPaths, metaClient)
>
> which creates a new DataSource and relation instance with: DataSource.apply(sparkSession = sqlContext.sparkSession,paths = extraReadPaths,userSpecifiedSchema = Option(schema),className = "parquet",options = optParams).resolveRelation()
>
> step 4. spark fetches basePath to infer the partition info (https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala L196)
> // the parameters come from DataSource#options (a plain Map type)
> parameters.get(BASE_PATH_PARAM)
> So parameters.get(BASE_PATH_PARAM) calls Map#get, not CaseInsensitiveMap#get. parameters stores the key as "basepath", so get("basePath") returns None.
> This is a Spark bug (fixed in version 3.0.1, https://issues.apache.org/jira/browse/SPARK-32368); Hudi currently uses Spark v2.4.4.
> To avoid this Spark issue, a simple solution is to not convert the type of the input optParams (Spark has already made it a CaseInsensitiveMap) in org.apache.hudi.DefaultSource#createRelation(sqlContext: SQLContext,optParams: Map[String, String]…
>
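The lookup failure described in steps 2 to 4 can be reproduced without Spark. The sketch below is only an illustration under stated assumptions: CaseInsensitiveOptions is a hypothetical stand-in for Spark's CaseInsensitiveMap (assumed to expose lower-cased keys when iterated), and the key "example.query.type" and the path "/tmp/hudi_table" are placeholders, not values taken from Hudi or from this issue.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

public class BasePathLookupSketch {

    /** Hypothetical stand-in for a case-insensitive options map: keys are stored lower-cased. */
    static final class CaseInsensitiveOptions {
        private final Map<String, String> lowerCased = new HashMap<>();

        CaseInsensitiveOptions(Map<String, String> original) {
            original.forEach((k, v) -> lowerCased.put(k.toLowerCase(Locale.ROOT), v));
        }

        /** Lookup lower-cases the requested key, so "basePath" and "basepath" both match. */
        Optional<String> get(String key) {
            return Optional.ofNullable(lowerCased.get(key.toLowerCase(Locale.ROOT)));
        }

        /** Iteration exposes the lower-cased keys; this is what a copy picks up. */
        Map<String, String> entries() {
            return lowerCased;
        }
    }

    /** Mimics step 2: merging into a plain map drops the case-insensitive lookup. */
    static Map<String, String> copyToPlainMap(CaseInsensitiveOptions options) {
        Map<String, String> plain = new HashMap<>();
        plain.put("example.query.type", "snapshot"); // placeholder default entry
        plain.putAll(options.entries());
        return plain;
    }

    public static void main(String[] args) {
        Map<String, String> userOptions = new HashMap<>();
        userOptions.put("basePath", "/tmp/hudi_table"); // placeholder path

        CaseInsensitiveOptions options = new CaseInsensitiveOptions(userOptions);
        Map<String, String> plain = copyToPlainMap(options);

        // The case-insensitive wrapper still finds the option.
        System.out.println("wrapper lookup found: " + options.get("basePath").isPresent());
        // The plain copy only holds "basepath", so an exact-case lookup misses it.
        System.out.println("plain lookup found:   " + plain.containsKey("basePath"));
    }
}
```

Passing the wrapper through unchanged, as the proposed solution suggests, keeps the first kind of lookup available to the code that later reads the base path.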
--
This message was sent by Atlassian Jira
(v8.3.4#803005)