Posted to commits@hudi.apache.org by "steven zhang (Jira)" <ji...@apache.org> on 2020/11/11 07:54:00 UTC
[jira] [Updated] (HUDI-1392) lose partition info when using spark parameter "basePath"

     [ https://issues.apache.org/jira/browse/HUDI-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

steven zhang updated HUDI-1392:
-------------------------------
    Description: 
Steps to reproduce:

        Set hoodie.datasource.write.hive_style_partitioning to true.

        spark.read().format("org.apache.hudi").option("mergeSchema", true).option("basePath",tablePath).load(tablePath + (nonPartitionedTable ? "/*" : "/*")).createOrReplaceTempView(hudiTable);

        spark.sql("select * from hudiTable where date>'20200807'").explain();

        The physical plan shows an empty partition filter list: PartitionFilters: []

The cause is that org.apache.hudi.DefaultSource#createRelation is called through dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions) ([https://github.com/apache/spark/blob/954cd9feaa1a3d4ad9a235811ae58e02a63e8386/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala] L355).

The input optParams is a CaseInsensitiveMap. Hudi attaches additional parameters to it, for example:

val parameters = Map(QUERY_TYPE_OPT_KEY -> DEFAULT_QUERY_TYPE_OPT_VAL) ++ translateViewTypesToQueryTypes(optParams)

After this, the type of parameters is a plain Map, not a CaseInsensitiveMap.

When the Parquet data source infers partition information, it fetches the basePath value through parameters.get(BASE_PATH_PARAM) ([https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala] L196). Because parameters is now a plain Map, this get does not go through CaseInsensitiveMap#get; it calls Map#get("basePath") and returns None.

As a result, no partition information is inferred.
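
The failed lookup can be modeled without Spark. The following is a minimal sketch, not Spark or Hudi code: plain java.util maps stand in for the Scala maps, and it assumes that copying entries out of the case-insensitive options yields lower-cased keys (this appears to match CaseInsensitiveMap in Spark 2.4, but treat it as an assumption). The class name, method names, default key, and path are all hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical model of the lookup described above; not Spark or Hudi code.
public class BasePathLookupSketch {

    // Assumption: entries copied out of the case-insensitive options carry lower-cased keys.
    static Map<String, String> copyWithLowerCasedKeys(Map<String, String> userOptions) {
        Map<String, String> copy = new HashMap<>();
        for (Map.Entry<String, String> e : userOptions.entrySet()) {
            copy.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        return copy;
    }

    // Models "defaults ++ options": the result is an ordinary, case-sensitive map.
    static Map<String, String> buildPlainParameters(Map<String, String> userOptions) {
        Map<String, String> parameters = new HashMap<>();
        parameters.put("example.default.key", "example-default-value"); // placeholder default
        parameters.putAll(copyWithLowerCasedKeys(userOptions));
        return parameters;
    }

    public static void main(String[] args) {
        Map<String, String> userOptions = new HashMap<>();
        userOptions.put("basePath", "/tmp/hudi_table"); // placeholder path

        Map<String, String> parameters = buildPlainParameters(userOptions);

        // Exact-case lookup misses, so partition inference would see no basePath.
        System.out.println("basePath -> " + parameters.get("basePath"));
        // The value is still in the map, but only under the lower-cased key.
        System.out.println("basepath -> " + parameters.get("basepath"));
    }
}
```

In this model the value is never lost; it only becomes unreachable under the exact key that the partition inference code asks for.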

 

I also found that Spark 2.4.7 and above ( https://issues.apache.org/jira/browse/SPARK-32364 ) uses a CaseInsensitiveMap to fetch basePath, although that change was made for a different reason than this Hudi issue. Lower Spark versions are still affected.

So we need to use:

val parameters = translateViewTypesToQueryTypes(optParams) ++ Map(QUERY_TYPE_OPT_KEY -> DEFAULT_QUERY_TYPE_OPT_VAL)
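
Why the operand order matters can be sketched in plain Java. This is a hypothetical model, not Spark or Hudi code: a TreeMap with String.CASE_INSENSITIVE_ORDER stands in for CaseInsensitiveMap, and it assumes the merged result keeps the map type of the left-hand operand, which is what the proposed change relies on. All class names, method names, keys, and values are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of the two merge orders; not Spark or Hudi code.
public class MergeOrderSketch {

    // Stand-in for a case-insensitive options map.
    static Map<String, String> caseInsensitive(Map<String, String> source) {
        Map<String, String> m = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        m.putAll(source);
        return m;
    }

    // Current order: start from a plain map of defaults, then copy the options in.
    static Map<String, String> defaultsThenOptions(Map<String, String> defaults,
                                                   Map<String, String> options) {
        Map<String, String> merged = new HashMap<>(defaults);
        merged.putAll(options);
        return merged; // plain, case-sensitive map
    }

    // Proposed order: start from the case-insensitive options, then add the defaults.
    static Map<String, String> optionsThenDefaults(Map<String, String> options,
                                                   Map<String, String> defaults) {
        Map<String, String> merged = caseInsensitive(options);
        // putIfAbsent keeps any value the caller already supplied for a default key.
        defaults.forEach(merged::putIfAbsent);
        return merged; // still case-insensitive
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        options.put("basepath", "/tmp/hudi_table"); // key as stored, lower-cased
        Map<String, String> defaults = new HashMap<>();
        defaults.put("example.query.type", "snapshot"); // placeholder default

        System.out.println(defaultsThenOptions(defaults, options).get("basePath"));
        System.out.println(optionsThenDefaults(options, defaults).get("basePath"));
    }
}
```

One design point to keep in mind: with Scala's ++, entries from the right-hand operand win on key collisions, so moving the default to the right-hand side could overwrite a query type the caller supplied. The sketch uses putIfAbsent to avoid that; whether the real change needs an equivalent guard is not stated in this report.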

> lose partition info when using spark parameter "basePath" 
> ----------------------------------------------------------
>
>                 Key: HUDI-1392
>                 URL: https://issues.apache.org/jira/browse/HUDI-1392
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: steven zhang
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)