Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/06 23:49:05 UTC

[GitHub] [spark] khalidmammadov opened a new pull request #35409: [SPARK-38120][SQL] Fix HiveExternalCatalog.listPartitions when partition column name is upper case and dot in partition value

khalidmammadov opened a new pull request #35409:
URL: https://github.com/apache/spark/pull/35409


   ### What changes were proposed in this pull request?
   
   
   `HiveExternalCatalog.listPartitions` fails when a partition column name contains upper-case characters and the partition value contains a dot. The failure is related to this change: https://github.com/apache/spark/commit/f18b905f6cace7686ef169fda7de474079d0af23
   
   The test case in that PR does not reproduce the issue because its partition column name is lower case.
   
   This change lowercases the partition column name during the comparison so that it produces the expected result. This is in line with the actual spec transformation, which makes the name lower case for Hive, and it uses the same function.
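
A minimal, self-contained sketch of the mismatch, not the actual Spark code: `PartitionSpecSketch`, `lowerCaseKeys` and `isPartialSpec` are hypothetical names used only for illustration. It assumes the requested spec has already been lower-cased while the stored spec still carries the original column names, which is what the `key not found: partcol2` error in the reproduction suggests.

```scala
import java.util.Locale
import scala.util.Try

object PartitionSpecSketch {
  type Spec = Map[String, String]

  // Hypothetical stand-in for the Hive-side normalization: lower-case the column names.
  def lowerCaseKeys(spec: Spec): Spec =
    spec.map { case (col, value) => col.toLowerCase(Locale.ROOT) -> value }

  // True when every (column, value) pair of `partial` is present in `full`.
  // Map.apply throws NoSuchElementException when the column is missing from `full`.
  def isPartialSpec(partial: Spec, full: Spec): Boolean =
    partial.forall { case (col, value) => full(col) == value }

  def main(args: Array[String]): Unit = {
    val requested = lowerCaseKeys(Map("partCol2" -> "i.j")) // keys are now lower case
    val stored = Map("partCol1" -> "CA", "partCol2" -> "i.j") // keys keep their original case

    // Before: only one side is lower-cased, so the lookup of "partcol2" throws.
    println(Try(isPartialSpec(requested, stored)).isFailure)

    // After: both sides go through the same normalization, so the lookup succeeds.
    println(isPartialSpec(requested, lowerCaseKeys(stored)))
  }
}
```

With an all-lower-case column name the normalization is a no-op on both sides, which would explain why the earlier test case did not hit the problem.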
   
    
   
   Steps to reproduce the issue:
   ```
   Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_312)
   Type in expressions to have them evaluated.
   Type :help for more information.
   
   scala> import org.apache.spark.sql.catalyst.TableIdentifier
   import org.apache.spark.sql.catalyst.TableIdentifier
   
   scala> spark.sql("CREATE TABLE customer(id INT, name STRING) PARTITIONED BY (partCol1 STRING, partCol2 STRING)")
   22/02/06 21:10:45 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.
   res0: org.apache.spark.sql.DataFrame = []
   
   scala> spark.sql("INSERT INTO customer PARTITION (partCol1 = 'CA', partCol2 = 'i.j') VALUES (100, 'John')")
   res1: org.apache.spark.sql.DataFrame = []                                       
   
   scala> spark.sessionState.catalog.listPartitions(TableIdentifier("customer"), Some(Map("partCol2" -> "i.j"))).foreach(println)
   java.util.NoSuchElementException: key not found: partcol2
     at scala.collection.immutable.Map$Map2.apply(Map.scala:227)
     at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$isPartialPartitionSpec$1(ExternalCatalogUtils.scala:205)
     at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$isPartialPartitionSpec$1$adapted(ExternalCatalogUtils.scala:202)
     at scala.collection.immutable.Map$Map1.forall(Map.scala:196)
     at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.isPartialPartitionSpec(ExternalCatalogUtils.scala:202)
     at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$6(HiveExternalCatalog.scala:1312)
     at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$6$adapted(HiveExternalCatalog.scala:1312)
     at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:304)
     at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
     at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
     at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
     at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
     at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
     at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
     at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
     at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
     at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$1(HiveExternalCatalog.scala:1312)
     at org.apache.spark.sql.hive.HiveExternalCatalog.withClientWrappingException(HiveExternalCatalog.scala:114)
     at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:103)
     at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitions(HiveExternalCatalog.scala:1296)
     at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitions(ExternalCatalogWithListener.scala:254)
     at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitions(SessionCatalog.scala:1251)
     ... 47 elided
   
   
   *******AFTER FIX*********
   
   scala> import org.apache.spark.sql.catalyst.TableIdentifier
   import org.apache.spark.sql.catalyst.TableIdentifier
   
   scala> spark.sql("CREATE TABLE customer(id INT, name STRING) PARTITIONED BY (partCol1 STRING, partCol2 STRING)")
   22/02/06 22:08:11 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.
   res1: org.apache.spark.sql.DataFrame = []
   
   scala> spark.sql("INSERT INTO customer PARTITION (partCol1 = 'CA', partCol2 = 'i.j') VALUES (100, 'John')")
   res2: org.apache.spark.sql.DataFrame = []                                       
   
   scala> spark.sessionState.catalog.listPartitions(TableIdentifier("customer"), Some(Map("partCol2" -> "i.j"))).foreach(println)
   CatalogPartition(
   	Partition Values: [partCol1=CA, partCol2=i.j]
   	Location: file:/home/khalid/dev/oss/test/spark-warehouse/customer/partcol1=CA/partcol2=i.j
   	Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
   	InputFormat: org.apache.hadoop.mapred.TextInputFormat
   	OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
   	Storage Properties: [serialization.format=1]
   	Partition Parameters: {rawDataSize=0, numFiles=1, transient_lastDdlTime=1644185314, totalSize=9, COLUMN_STATS_ACCURATE={"BASIC_STATS":"true"}, numRows=0}
   	Created Time: Sun Feb 06 22:08:34 GMT 2022
   	Last Access: UNKNOWN
   	Partition Statistics: 9 bytes)
   
   ```
   
   
   ### Why are the changes needed?
   It fixes the `NoSuchElementException` that `listPartitions` throws in the reproduction above.
   
   
   ### Does this PR introduce _any_ user-facing change?
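   Yes. In the case shown above, `listPartitions` now returns the matching partition instead of throwing `NoSuchElementException`.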
   
   
   ### How was this patch tested?
   
   `build/sbt -v -d "test:testOnly *CatalogSuite"`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #35409: [SPARK-38120][SQL] Fix HiveExternalCatalog.listPartitions when partition column name is upper case and dot in partition value

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #35409:
URL: https://github.com/apache/spark/pull/35409#issuecomment-1034027690


   According to the affected version, I backported this to branch-3.1 too.




[GitHub] [spark] cloud-fan commented on pull request #35409: [SPARK-38120][SQL] Fix HiveExternalCatalog.listPartitions when partition column name is upper case and dot in partition value

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #35409:
URL: https://github.com/apache/spark/pull/35409#issuecomment-1031565898


   The fix LGTM. Can we add a test in `ExternalCatalogSuite`?




[GitHub] [spark] AmplabJenkins commented on pull request #35409: [SPARK-38120][SQL] Fix HiveExternalCatalog.listPartitions when partition column name is upper case and dot in partition value

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #35409:
URL: https://github.com/apache/spark/pull/35409#issuecomment-1030996776


   Can one of the admins verify this patch?




[GitHub] [spark] khalidmammadov commented on pull request #35409: [SPARK-38120][SQL] Fix HiveExternalCatalog.listPartitions when partition column name is upper case and dot in partition value

Posted by GitBox <gi...@apache.org>.
khalidmammadov commented on pull request #35409:
URL: https://github.com/apache/spark/pull/35409#issuecomment-1032873689


   Added




[GitHub] [spark] cloud-fan closed pull request #35409: [SPARK-38120][SQL] Fix HiveExternalCatalog.listPartitions when partition column name is upper case and dot in partition value

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #35409:
URL: https://github.com/apache/spark/pull/35409


   




[GitHub] [spark] cloud-fan commented on pull request #35409: [SPARK-38120][SQL] Fix HiveExternalCatalog.listPartitions when partition column name is upper case and dot in partition value

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #35409:
URL: https://github.com/apache/spark/pull/35409#issuecomment-1033511106


   thanks, merging to master/3.2!

