Posted to commits@hudi.apache.org by "Gary Li (Jira)" <ji...@apache.org> on 2020/11/25 15:52:00 UTC
[jira] [Resolved] (HUDI-1392) lose partition info when using spark parameter "basePath"

     [ https://issues.apache.org/jira/browse/HUDI-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Li resolved HUDI-1392.
---------------------------
    Resolution: Fixed

> lose partition info when using spark parameter "basePath" 
> ----------------------------------------------------------
>
>                 Key: HUDI-1392
>                 URL: https://issues.apache.org/jira/browse/HUDI-1392
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: Spark Integration
>            Reporter: steven zhang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.6.1
>
>
> Reproduce the issue with the steps below:
>         set hoodie.datasource.write.hive_style_partitioning to true
>         spark.read().format("org.apache.hudi").option("mergeSchema", true).option("basePath",tablePath).load(tablePath + (nonPartitionedTable ? "/*" : "/*")).createOrReplaceTempView(hudiTable);
>         spark.sql("select * from hudiTable where date>'20200807'").explain();
>         The printed plan shows an empty filter list: PartitionFilters: []
>  The reason is:
> Step 1. Spark reads the data source (https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala L317)
>  
>           case (dataSource: RelationProvider, None) => dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)  //caseInsensitiveOptions CaseInsensitiveMap type
>  
> Step 2. Hudi creates the relation
>          org.apache.hudi.DefaultSource#createRelation(sqlContext: SQLContext,optParams: Map[String, String],schema: StructType): BaseRelation = {
>  
>          // optParams is a CaseInsensitiveMap, but the result of Map ++ (parameters) is a plain Map
>          val parameters = Map(QUERY_TYPE_OPT_KEY -> DEFAULT_QUERY_TYPE_OPT_VAL) ++ translateViewTypesToQueryTypes(optParams)
>  
> Step 3. Hudi transforms to a parquet relation when we query data from a COW table
>          It then calls getBaseFileOnlyView(sqlContext, parameters, schema, readPaths, isBootstrappedTable, globPaths, metaClient),
>  
> which creates a new DataSource and relation instance with: DataSource.apply(sparkSession = sqlContext.sparkSession,paths = extraReadPaths,userSpecifiedSchema = Option(schema),className = "parquet",options = optParams).resolveRelation()
>  
> Step 4. Spark fetches basePath to infer partition info (https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala L196)
>            // the parameters come from DataSource#options (a plain Map)
>           parameters.get(BASE_PATH_PARAM)
>           So parameters.get(BASE_PATH_PARAM) calls Map#get, not CaseInsensitiveMap#get. parameters stores the key "basepath", so looking up "basePath" returns None.
> This is a Spark bug (fixed in version 3.0.1, https://issues.apache.org/jira/browse/SPARK-32368). Hudi currently uses Spark v2.4.4.
> To avoid this Spark issue, a simple solution is to not convert the type of the input optParams (Spark already makes it a CaseInsensitiveMap) in org.apache.hudi.DefaultSource#createRelation(sqlContext: SQLContext,optParams: Map[String, String]…
>   
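
The lookup failure described in steps 2 and 4 can be sketched without Spark or Hudi on the classpath. The snippet below is only an illustration under stated assumptions: CaseInsensitiveOptions is a hypothetical, simplified stand-in for Spark's CaseInsensitiveMap (it keeps lower-cased keys and exposes them when its entries are copied), and the key "hypothetical.query.type" and the path "/tmp/hudi_table" are made up for the example.

```scala
import java.util.Locale

// Hypothetical, simplified stand-in for Spark's CaseInsensitiveMap (not the real class).
final class CaseInsensitiveOptions(original: Map[String, String]) {
  // Keys are stored lower-cased so that lookups can ignore case.
  private val lowerCased: Map[String, String] =
    original.map { case (k, v) => k.toLowerCase(Locale.ROOT) -> v }

  // Case-insensitive lookup: the requested key is lower-cased first.
  def get(key: String): Option[String] =
    lowerCased.get(key.toLowerCase(Locale.ROOT))

  // What a consumer sees when it copies the entries, e.g. via `Map(...) ++ options`.
  def entries: Map[String, String] = lowerCased
}

object BasePathLookupDemo {
  // Stand-in for Spark's BASE_PATH_PARAM constant.
  val BasePathParam = "basePath"

  // Options as the reader hands them over: wrapped for case-insensitive access.
  val optParams = new CaseInsensitiveOptions(Map(BasePathParam -> "/tmp/hudi_table"))

  // Mirrors the conversion in step 2: the result is an ordinary, case-sensitive Map.
  val parameters: Map[String, String] =
    Map("hypothetical.query.type" -> "snapshot") ++ optParams.entries

  def main(args: Array[String]): Unit = {
    println(optParams.get(BasePathParam))  // Some(/tmp/hudi_table)
    println(parameters.get(BasePathParam)) // None: the plain Map only holds "basepath"
    println(parameters.get("basepath"))    // Some(/tmp/hudi_table)
  }
}
```

In this model, passing optParams through unchanged (the simple solution proposed in the issue) keeps the case-insensitive get, so the "basePath" lookup succeeds.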



--
This message was sent by Atlassian Jira
(v8.3.4#803005)