Posted to commits@hudi.apache.org by "steven zhang (Jira)" <ji...@apache.org> on 2020/11/11 07:54:00 UTC
[jira] [Updated] (HUDI-1392) lose partition info when using spark parameter "basePath"
[ https://issues.apache.org/jira/browse/HUDI-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
steven zhang updated HUDI-1392:
-------------------------------
Description:
Steps to reproduce:
1. Set hoodie.datasource.write.hive_style_partitioning to true.
2. Read the table with the Spark "basePath" option and register a temp view:
spark.read().format("org.apache.hudi").option("mergeSchema", true).option("basePath",tablePath).load(tablePath + (nonPartitionedTable ? "/*" : "/*")).createOrReplaceTempView(hudiTable);
3. Explain a query that filters on date:
spark.sql("select * from hudiTable where date>'20200807'").explain();
The plan prints PartitionFilters: [], so the date predicate is not recognized as a partition filter.
The cause of this issue: org.apache.hudi.DefaultSource#createRelation is called by dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions) ([https://github.com/apache/spark/blob/954cd9feaa1a3d4ad9a235811ae58e02a63e8386/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala] L355),
so the input optParams is of type CaseInsensitiveMap. Hudi then attaches additional parameters, such as
val parameters = Map(QUERY_TYPE_OPT_KEY -> DEFAULT_QUERY_TYPE_OPT_VAL) ++ translateViewTypesToQueryTypes(optParams)
Because the left-hand side of ++ is a plain Map, the resulting parameters is a plain Map, not a CaseInsensitiveMap.
When the parquet datasource infers partition info, it fetches the basePath value through parameters.get(BASE_PATH_PARAM) ( [https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala] L196). That get no longer goes through CaseInsensitiveMap#get; it is a plain Map#get("basePath"), which returns None, apparently because the keys copied out of the CaseInsensitiveMap are lower-cased.
As a result, no partition info is inferred.
I also found that Spark 2.4.7 and above ( https://issues.apache.org/jira/browse/SPARK-32364 ) use a CaseInsensitiveMap to fetch basePath, although the intention of that change is not the same as this Hudi issue. Lower Spark versions still have this issue.
So we need to use
val parameters = translateViewTypesToQueryTypes(optParams) ++ Map(QUERY_TYPE_OPT_KEY -> DEFAULT_QUERY_TYPE_OPT_VAL)
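The effect of the operand order can be sketched without Spark on the classpath. CiMap below is a hypothetical, simplified stand-in for Spark's CaseInsensitiveMap (it is not the real class): it assumes that lookups lower-case the key and that iteration yields lower-cased keys. The option key, default value and table path are placeholder values chosen for illustration.

```scala
// Hypothetical, simplified stand-in for Spark's CaseInsensitiveMap.
// Assumption: lookups lower-case the key, and iteration exposes lower-cased keys.
final class CiMap(val original: Map[String, String]) {
  private val lowerCased: Map[String, String] =
    original.map { case (k, v) => k.toLowerCase -> v }

  // Case-insensitive lookup.
  def get(key: String): Option[String] = lowerCased.get(key.toLowerCase)

  // What a caller sees when it copies the entries into another collection.
  def entries: Iterable[(String, String)] = lowerCased

  // Appending to the case-insensitive map keeps it case-insensitive.
  def ++(other: Map[String, String]): CiMap = new CiMap(original ++ other)
}

object BasePathLookupDemo {
  // Placeholder values, for illustration only.
  val userOptions = new CiMap(Map("basePath" -> "/tmp/hudi_table"))
  val defaults = Map("hoodie.datasource.query.type" -> "snapshot")

  // Plain Map on the left: the result is a plain Map holding the key "basepath".
  val plainFirst: Map[String, String] = defaults ++ userOptions.entries

  // Case-insensitive map on the left: the result stays case-insensitive.
  val ciFirst: CiMap = userOptions ++ defaults

  def main(args: Array[String]): Unit = {
    println(plainFirst.get("basePath")) // None
    println(ciFirst.get("basePath"))    // Some(/tmp/hudi_table)
  }
}
```

With the plain Map on the left, the exact-match lookup for "basePath" misses the lower-cased key; with the case-insensitive map on the left, the lookup still succeeds.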
> lose partition info when using spark parameter "basePath"
> ----------------------------------------------------------
>
> Key: HUDI-1392
> URL: https://issues.apache.org/jira/browse/HUDI-1392
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: steven zhang
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)