Posted to issues@spark.apache.org by "Saurabh Chawla (Jira)" <ji...@apache.org> on 2021/08/08 13:12:00 UTC

[jira] [Created] (SPARK-36452) Add the support in Spark for having group by map datatype column for the scenario that works in Hive

Saurabh Chawla created SPARK-36452:
--------------------------------------

             Summary: Add the support in Spark for having group by map datatype column for the scenario that works in Hive
                 Key: SPARK-36452
                 URL: https://issues.apache.org/jira/browse/SPARK-36452
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.1.2, 3.0.3, 3.2.0
            Reporter: Saurabh Chawla


Add support in Spark for grouping by a map-datatype column in the scenario that already works in Hive.

In Hive, the scenario below works:

{code:java}
describe extended complex2;
OK
id                  string 
c1                  map<int, string>   
Detailed Table Information Table(tableName:complex2, dbName:default, owner:abc, createTime:1627994412, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:string, comment:null), FieldSchema(name:c1, type:map<int,string>, comment:null)], location:/user/hive/warehouse/complex2, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1})
select * from complex2;
OK
1 {1:"u"}
2 {1:"u",2:"uo"}
1 {1:"u",2:"uo"}
Time taken: 0.363 seconds, Fetched: 3 row(s)
select id, c1, count(*) from complex2 group by id, c1;
OK
1 {1:"u"} 1
1 {1:"u",2:"uo"} 1
2 {1:"u",2:"uo"} 1
Time taken: 1.621 seconds, Fetched: 3 row(s)
-- fails only when a map type is present in an aggregate expression
select id, max(c1), count(*) from complex2 group by id, c1; 
FAILED: UDFArgumentTypeException Cannot support comparison of map<> type or complex type containing map<>.
{code}

But in Spark this scenario fails: grouping by the map column is rejected even when the map column only appears in the SELECT list without any aggregation applied to it.

{code:java}
scala> spark.sql("select id,c1, count(*) from complex2 group by id, c1").show
org.apache.spark.sql.AnalysisException: expression spark_catalog.default.complex2.`c1` cannot be used as a grouping expression because its data type map<int,string> is not an orderable data type.;
Aggregate [id#1, c1#2], [id#1, c1#2, count(1) AS count(1)#3L]
+- SubqueryAlias spark_catalog.default.complex2
 +- HiveTableRelation [`default`.`complex2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#1, c1#2], Partition Cols: []]
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:50)
{code}
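The error comes from the analyzer's orderability check on grouping expressions. A simplified, Spark-free model of that check (all names here are invented for illustration; the real logic lives in Spark's CheckAnalysis and type utilities) might look like this:

```scala
// Hypothetical, simplified model of the analyzer's orderability check.
// Not Spark's actual code -- just a sketch of the idea.
object OrderableCheck {
  sealed trait DataType
  case object IntType extends DataType
  case object StringType extends DataType
  case class MapType(key: DataType, value: DataType) extends DataType
  case class StructType(fields: Seq[DataType]) extends DataType

  // Maps are not orderable, and the property propagates through
  // complex types that contain a map.
  def isOrderable(dt: DataType): Boolean = dt match {
    case MapType(_, _)      => false
    case StructType(fields) => fields.forall(isOrderable)
    case _                  => true
  }

  def main(args: Array[String]): Unit = {
    // c1 is map<int,string>, so grouping by it is rejected today.
    println(isOrderable(MapType(IntType, StringType))) // false
  }
}
```

Under this model, `group by c1` fails purely because `c1`'s data type is not orderable, regardless of whether any aggregate actually needs to compare map values.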
This scenario should be supported: a grouping expression should be allowed to have a map type as long as no aggregate expression references that map column. Supporting it helps users migrate from Hive to Spark.
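Why this is feasible: hash-based grouping needs only equality and hash codes on the map values, not an ordering. A minimal, Spark-free sketch of the idea (illustrative names only, not Spark's API), using the same three rows as the Hive example above:

```scala
// Sketch: grouping by a map column via hashing, not ordering.
// Scala's groupBy relies on equals/hashCode; Map equality ignores entry
// order, which is exactly what hash aggregation needs.
object HashGroupByMap {
  case class Row(id: String, c1: Map[Int, String])

  val rows = Seq(
    Row("1", Map(1 -> "u")),
    Row("2", Map(1 -> "u", 2 -> "uo")),
    Row("1", Map(1 -> "u", 2 -> "uo"))
  )

  // select id, c1, count(*) ... group by id, c1
  val counts: Map[(String, Map[Int, String]), Int] =
    rows.groupBy(r => (r.id, r.c1)).map { case (k, g) => k -> g.size }

  def main(args: Array[String]): Unit =
    counts.foreach(println)
}
```

This yields the same three groups, each with count 1, as the Hive output above; ordering of map values is never needed unless an aggregate like `max(c1)` asks for it.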

After the code change:

{code:java}
scala> spark.sql("select id,c1, count(*) from complex2 group by id, c1").show
+---+-----------------+--------+                                                
| id|               c1|count(1)|
+---+-----------------+--------+
|  1|         {1 -> u}|       1|
|  2|{1 -> u, 2 -> uo}|       1|
|  1|{1 -> u, 2 -> uo}|       1|
+---+-----------------+--------+
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org