Posted to issues@spark.apache.org by "Michail Giannakopoulos (Jira)" <ji...@apache.org> on 2020/03/05 23:40:00 UTC

[jira] [Resolved] (SPARK-31059) Spark's SQL "group by" local processing operator is broken.

     [ https://issues.apache.org/jira/browse/SPARK-31059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michail Giannakopoulos resolved SPARK-31059.
--------------------------------------------
    Resolution: Invalid

Resolving as Invalid: there is no duplicate "Product line" / "Order method type" key in the presented results.
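For the record, one way to double-check that no grouping key appears more than once is to re-aggregate the grouped result and look for repeated keys. A sketch for spark-shell, assuming the `gosales` temp view from the report below (the inner query is the one from the report):

```scala
// Count how often each (`Product line`, `Order method type`) pair occurs in
// the grouped output. An empty result means there are no duplicate keys,
// i.e. GROUP BY produced exactly one row per distinct key, as required.
spark.sql("""
  SELECT `Product line`, `Order method type`, count(*) AS occurrences
  FROM (
    SELECT `Product line`, `Order method type`, sum(`Revenue`)
    FROM `gosales`
    GROUP BY `Product line`, `Order method type`
  ) AS grouped
  GROUP BY `Product line`, `Order method type`
  HAVING count(*) > 1
""").show()
```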

> Spark's SQL "group by" local processing operator is broken.
> -----------------------------------------------------------
>
>                 Key: SPARK-31059
>                 URL: https://issues.apache.org/jira/browse/SPARK-31059
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.3, 2.4.5
>         Environment: Windows 10.
>            Reporter: Michail Giannakopoulos
>            Priority: Blocker
>         Attachments: SampleFile_GOSales.csv
>
>
> When applying the "GROUP BY" operator (without an "ORDER BY" clause), I expect all rows that share the same grouping-column values to land in the same bucket, with each group's rows appearing contiguously in the output. However, this is not the case.
> Steps to reproduce:
>  1. Start spark-shell as follows:
>  bin\spark-shell.cmd --master local[4] --conf spark.sql.catalogImplementation=in-memory
>  2. Load the attached csv file:
>  val gosales = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv")
>  3. Create a temp view:
>  gosales.createOrReplaceTempView("gosales")
>  4. Execute the following sql statement:
>  spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM `gosales` GROUP BY `Product line`, `Order method type`").show()
> Output: 
> +--------------------+-----------------+----------------------------+
> |        Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
> +--------------------+-----------------+----------------------------+
> |      Golf Equipment|           E-mail|                       92.25|
> |   Camping Equipment|             Mail|                         0.0|
> |   Camping Equipment|              Fax|                        null|
> |      Golf Equipment|        Telephone|                       123.0|
> |   Camping Equipment|          Special|                        null|
> |  Outdoor Protection|        Telephone|                    34218.19|
> |Mountaineering Eq...|             Mail|                         0.0|
> |   Camping Equipment|              Web|                    32469.03|
> |Personal Accessories|              Fax|                      3318.7|
> |      Golf Equipment|      Sales visit|                       143.5|
> |Mountaineering Eq...|        Telephone|                        null|
> |Mountaineering Eq...|           E-mail|                        null|
> |  Outdoor Protection|      Sales visit|                    20522.42|
> |  Outdoor Protection|              Fax|                     5857.54|
> |Personal Accessories|           E-mail|          26679.640000000003|
> |Mountaineering Eq...|              Fax|                        null|
> |  Outdoor Protection|              Web|          340836.85000000003|
> |      Golf Equipment|          Special|                         0.0|
> |  Outdoor Protection|           E-mail|                    28505.93|
> |      Golf Equipment|              Web|                      3034.0|
> +--------------------+-----------------+----------------------------+
> Expected output:
> +--------------------+-----------------+----------------------------+
> |        Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
> +--------------------+-----------------+----------------------------+
> |      Golf Equipment|           E-mail|                       92.25|
> |      Golf Equipment|              Fax|                        null|
> |      Golf Equipment|             Mail|                         0.0|
> |      Golf Equipment|      Sales visit|                       143.5|
> |      Golf Equipment|          Special|                         0.0|
> |      Golf Equipment|        Telephone|                       123.0|
> |      Golf Equipment|              Web|                      3034.0|
> |   Camping Equipment|           E-mail|          1303.3999999999999|
> |   Camping Equipment|              Fax|                        null|
> |   Camping Equipment|      Sales visit|                     4754.87|
> |   Camping Equipment|             Mail|                         0.0|
> |   Camping Equipment|          Special|                        null|
> |   Camping Equipment|        Telephone|                     5169.65|
> |   Camping Equipment|              Web|                    32469.03|
> |Mountaineering Eq...|           E-mail|                        null|
> |Mountaineering Eq...|              Fax|                        null|
> |Mountaineering Eq...|             Mail|                         0.0|
> |Mountaineering Eq...|          Special|                        null|
> |Mountaineering Eq...|      Sales visit|                        null|
> |Mountaineering Eq...|        Telephone|                        null|
> +--------------------+-----------------+----------------------------+
> Notice how in the expected output all rows sharing the same grouping-column values are bucketed together contiguously (though not necessarily sorted), which is not the case in the output that Spark produces.
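A note on the expectation itself: neither the SQL standard nor Spark guarantees any row ordering for a GROUP BY query without an explicit ORDER BY clause; grouping only guarantees one output row per distinct key. If contiguous, deterministic output is wanted, the query from step 4 can be extended with an ORDER BY on the grouping columns, sketched here under the same `gosales` view as above:

```scala
// Same query as step 4, plus an explicit ORDER BY so rows sharing a
// `Product line` (and, within it, an `Order method type`) come out
// contiguously and in a deterministic order.
spark.sql("""
  SELECT `Product line`, `Order method type`, sum(`Revenue`)
  FROM `gosales`
  GROUP BY `Product line`, `Order method type`
  ORDER BY `Product line`, `Order method type`
""").show()
```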



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org