Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2021/08/08 13:28:00 UTC

[jira] [Commented] (SPARK-36452) Add the support in Spark for having group by map datatype column for the scenario that works in Hive

    [ https://issues.apache.org/jira/browse/SPARK-36452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17395523#comment-17395523 ] 

Apache Spark commented on SPARK-36452:
--------------------------------------

User 'SaurabhChawla100' has created a pull request for this issue:
https://github.com/apache/spark/pull/33679

> Add the support in Spark for having group by map datatype column for the scenario that works in Hive
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-36452
>                 URL: https://issues.apache.org/jira/browse/SPARK-36452
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.3, 3.1.2, 3.2.0
>            Reporter: Saurabh Chawla
>            Priority: Major
>
> Add support in Spark for grouping by a map datatype column in the scenario that works in Hive.
> In Hive the scenario below works:
> {code:java}
> describe extended complex2;
> OK
> id                  string 
> c1                  map<int, string>   
> Detailed Table Information Table(tableName:complex2, dbName:default, owner:abc, createTime:1627994412, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:string, comment:null), FieldSchema(name:c1, type:map<int,string>, comment:null)], location:/user/hive/warehouse/complex2, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat,serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1})
> select * from complex2;
> OK
> 1 {1:"u"}
> 2 {1:"u",2:"uo"}
> 1 {1:"u",2:"uo"}
> Time taken: 0.363 seconds, Fetched: 3 row(s)
> select id, c1, count(*) from complex2 group by id, c1;
> OK
> 1 {1:"u"} 1
> 1 {1:"u",2:"uo"} 1
> 2 {1:"u",2:"uo"} 1
> Time taken: 1.621 seconds, Fetched: 3 row(s)
> -- fails when a map type is present in an aggregated expression
> select id, max(c1), count(*) from complex2 group by id, c1; 
> FAILED: UDFArgumentTypeException Cannot support comparison of map<> type or complex type containing map<>.
> {code}
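> For reference, a minimal setup to reproduce an equivalent table from spark-shell (the table definition and data here are assumptions modelled on the Hive output above, not taken from the original environment):
> {code:java}
> // Hypothetical reproduction of the complex2 table used in this issue.
> spark.sql("CREATE TABLE complex2 (id STRING, c1 MAP<INT, STRING>) USING hive")
> spark.sql("INSERT INTO complex2 SELECT '1', map(1, 'u')")
> spark.sql("INSERT INTO complex2 SELECT '2', map(1, 'u', 2, 'uo')")
> spark.sql("INSERT INTO complex2 SELECT '1', map(1, 'u', 2, 'uo')")
> {code}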
> But in Spark this scenario fails: grouping by the map column is rejected even though the map column only appears in the select list without any aggregation.
> {code:java}
> scala> spark.sql("select id,c1, count(*) from complex2 group by id, c1").show
> org.apache.spark.sql.AnalysisException: expression spark_catalog.default.complex2.`c1` cannot be used as a grouping expression because its data type map<int,string> is not an orderable data type.;
> Aggregate [id#1, c1#2], [id#1, c1#2, count(1) AS count(1)#3L]
> +- SubqueryAlias spark_catalog.default.complex2
>  +- HiveTableRelation [`default`.`complex2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#1, c1#2], Partition Cols: []]
> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50)
> {code}
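> The error comes from the analyzer's validation of grouping expressions. A simplified, illustrative sketch of that check (not the actual Spark source; the method name and the way the error is raised are assumptions for illustration) looks roughly like this:
> {code:java}
> import org.apache.spark.sql.catalyst.expressions.{Expression, RowOrdering}
>
> // Grouping expressions whose data type is not orderable are rejected during analysis.
> // RowOrdering.isOrderable returns false for map types, so grouping by c1 fails above.
> def checkValidGroupingExpr(expr: Expression): Unit = {
>   if (!RowOrdering.isOrderable(expr.dataType)) {
>     // In Spark this surfaces as an AnalysisException via CheckAnalysis.failAnalysis.
>     sys.error(s"expression ${expr.sql} cannot be used as a grouping expression " +
>       s"because its data type ${expr.dataType.catalogString} is not an orderable data type.")
>   }
> }
> {code}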
> There is a need to support this scenario, where a grouping expression can be of map type as long as the aggregated expressions do not reference that map type. This helps users migrating from Hive to Spark.
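> A rough sketch of one possible relaxation of the check (for illustration only; this is not necessarily how the linked pull request implements it): a map-typed grouping expression could be accepted as long as it is not referenced inside any aggregate function.
> {code:java}
> import org.apache.spark.sql.catalyst.expressions.{Expression, RowOrdering}
> import org.apache.spark.sql.types.MapType
>
> // Hypothetical condition: allow a non-orderable grouping expression only when it is
> // a map type and is not used inside any aggregate function of the query.
> def groupingExprAllowed(expr: Expression, usedInsideAggregateFunction: Boolean): Boolean = {
>   RowOrdering.isOrderable(expr.dataType) ||
>     (expr.dataType.isInstanceOf[MapType] && !usedInsideAggregateFunction)
> }
> {code}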
> After the code change:
> {code:java}
> scala> spark.sql("select id,c1, count(*) from complex2 group by id, c1").show
> +---+-----------------+--------+                                                
> | id|               c1|count(1)|
> +---+-----------------+--------+
> |  1|         {1 -> u}|       1|
> |  2|{1 -> u, 2 -> uo}|       1|
> |  1|{1 -> u, 2 -> uo}|       1|
> +---+-----------------+--------+
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
