You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Michail Giannakopoulos (Jira)" <ji...@apache.org> on 2020/03/05 21:11:00 UTC

[jira] [Updated] (SPARK-31059) Spark's SQL "group by" local processing operator is broken.

     [ https://issues.apache.org/jira/browse/SPARK-31059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michail Giannakopoulos updated SPARK-31059:
-------------------------------------------
    Description: 
When applying "GROUP BY" processing operator (without an "ORDER BY" clause), I expect to see all the grouping columns being grouped together to the same buckets. However, this is not the case.

Steps to reproduce:
 1. Start spark-shell as follows:
 bin\spark-shell.cmd --master local[4] --conf spark.sql.catalogImplementation=in-memory
 2. Load the attached csv file:
 val gosales = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv")
 3. Create a temp view:
 gosales.createOrReplaceTempView("gosales")
 4. Execute the following sql statement:
 spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM `gosales` GROUP BY `Product line`, `Order method type`").show()

Output: 
 +---------------------+----------------++----------------------------
|Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|

+---------------------+----------------++----------------------------
|Golf Equipment|E-mail|92.25|
|Camping Equipment|Mail|0.0|
|Camping Equipment|Fax|null|
|Golf Equipment|Telephone|123.0|
|Camping Equipment|Special|null|
|Outdoor Protection|Telephone|34218.19|
|Mountaineering Eq...|Mail|0.0|
|Camping Equipment|Web|32469.03|
|Personal Accessories|Fax|3318.7|
|Golf Equipment|Sales visit|143.5|
|Mountaineering Eq...|Telephone|null|
|Mountaineering Eq...|E-mail|null|
|Outdoor Protection|Sales visit|20522.42|
|Outdoor Protection|Fax|5857.54|
|Personal Accessories|E-mail|26679.640000000003|
|Mountaineering Eq...|Fax|null|
|Outdoor Protection|Web|340836.85000000003|
|Golf Equipment|Special|0.0|
|Outdoor Protection|E-mail|28505.93|
|Golf Equipment|Web|3034.0|

+---------------------+----------------++----------------------------

Expected output:
 +---------------------+----------------++----------------------------
|Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|

+---------------------+----------------++----------------------------
|Golf Equipment|E-mail|92.25|
|Golf Equipment|Fax|null|
|Golf Equipment|Mail|0.0|
|Golf Equipment|Sales visit|143.5|
|Golf Equipment|Special|0.0|
|Golf Equipment|Telephone|123.0|
|Golf Equipment|Web|3034.0|
|Camping Equipment|E-mail|1303.3999999999999|
|Camping Equipment|Fax|null|
|Camping Equipment|Sales visit|4754.87|
|Camping Equipment|Mail|0.0|
|Camping Equipment|Special|null|
|Camping Equipment|Telephone|5169.65|
|Camping Equipment|Web|32469.03|
|Mountaineering Eq...|E-mail|null|
|Mountaineering Eq...|Fax|null|
|Mountaineering Eq...|Mail|0.0|
|Mountaineering Eq...|Special|null|
|Mountaineering Eq...|Sales visit|null|
|Mountaineering Eq...|Telephone|null|

+---------------------+----------------++----------------------------

Notice how in the expected output all the grouping columns should be bucketed together without necessary being in order, which is not the case with output that spark produces.

  was:
When applying "GROUP BY" processing operator (without an "ORDER BY" clause), I expect to see all the grouping columns being grouped together to the same buckets. However, this is not the case.

Steps to reproduce:
1. Start spark-shell as follows:
bin\spark-shell.cmd --master local[4] --conf spark.sql.catalogImplementation=in-memory
2. Load the attached csv file:
val gosales = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv")
3. Create a temp view:
gosales.createOrReplaceTempView("gosales")
4. Execute the following sql statement:
spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM `gosales` GROUP BY `Product line`, `Order method type`").show()

Output: 
+--------------------+-----------------+----------------------------+
| Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
+--------------------+-----------------+----------------------------+
| Golf Equipment| E-mail| 92.25|
| Camping Equipment| Mail| 0.0|
| Camping Equipment| Fax| null|
| Golf Equipment| Telephone| 123.0|
| Camping Equipment| Special| null|
| Outdoor Protection| Telephone| 34218.19|
|Mountaineering Eq...| Mail| 0.0|
| Camping Equipment| Web| 32469.03|
|Personal Accessories| Fax| 3318.7|
| Golf Equipment| Sales visit| 143.5|
|Mountaineering Eq...| Telephone| null|
|Mountaineering Eq...| E-mail| null|
| Outdoor Protection| Sales visit| 20522.42|
| Outdoor Protection| Fax| 5857.54|
|Personal Accessories| E-mail| 26679.640000000003|
|Mountaineering Eq...| Fax| null|
| Outdoor Protection| Web| 340836.85000000003|
| Golf Equipment| Special| 0.0|
| Outdoor Protection| E-mail| 28505.93|
| Golf Equipment| Web| 3034.0|
+--------------------+-----------------+----------------------------+

Expected output:
+--------------------+-----------------+----------------------------+
| Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
+--------------------+-----------------+----------------------------+
| Golf Equipment| E-mail| 92.25|
| Golf Equipment| Fax| null|
| Golf Equipment| Mail| 0.0|
| Golf Equipment| Sales visit| 143.5|
| Golf Equipment| Special| 0.0|
| Golf Equipment| Telephone| 123.0|
| Golf Equipment| Web| 3034.0|
| Camping Equipment| E-mail| 1303.3999999999999|
| Camping Equipment| Fax| null|
| Camping Equipment| Sales visit| 4754.87|
| Camping Equipment| Mail| 0.0|
| Camping Equipment| Special| null|
| Camping Equipment| Telephone| 5169.65|
| Camping Equipment| Web| 32469.03|
|Mountaineering Eq...| E-mail| null|
|Mountaineering Eq...| Fax| null|
|Mountaineering Eq...| Mail| 0.0|
|Mountaineering Eq...| Special| null|
|Mountaineering Eq...| Sales visit| null|
|Mountaineering Eq...| Telephone| null|
+--------------------+-----------------+----------------------------+

Notice how all the grouping columns should be bucketed together without being in order.


> Spark's SQL "group by" local processing operator is broken.
> -----------------------------------------------------------
>
>                 Key: SPARK-31059
>                 URL: https://issues.apache.org/jira/browse/SPARK-31059
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.3, 2.4.5
>         Environment: Windows 10.
>            Reporter: Michail Giannakopoulos
>            Priority: Blocker
>
> When applying "GROUP BY" processing operator (without an "ORDER BY" clause), I expect to see all the grouping columns being grouped together to the same buckets. However, this is not the case.
> Steps to reproduce:
>  1. Start spark-shell as follows:
>  bin\spark-shell.cmd --master local[4] --conf spark.sql.catalogImplementation=in-memory
>  2. Load the attached csv file:
>  val gosales = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv")
>  3. Create a temp view:
>  gosales.createOrReplaceTempView("gosales")
>  4. Execute the following sql statement:
>  spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM `gosales` GROUP BY `Product line`, `Order method type`").show()
> Output: 
>  +---------------------+----------------++----------------------------
> |Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
> +---------------------+----------------++----------------------------
> |Golf Equipment|E-mail|92.25|
> |Camping Equipment|Mail|0.0|
> |Camping Equipment|Fax|null|
> |Golf Equipment|Telephone|123.0|
> |Camping Equipment|Special|null|
> |Outdoor Protection|Telephone|34218.19|
> |Mountaineering Eq...|Mail|0.0|
> |Camping Equipment|Web|32469.03|
> |Personal Accessories|Fax|3318.7|
> |Golf Equipment|Sales visit|143.5|
> |Mountaineering Eq...|Telephone|null|
> |Mountaineering Eq...|E-mail|null|
> |Outdoor Protection|Sales visit|20522.42|
> |Outdoor Protection|Fax|5857.54|
> |Personal Accessories|E-mail|26679.640000000003|
> |Mountaineering Eq...|Fax|null|
> |Outdoor Protection|Web|340836.85000000003|
> |Golf Equipment|Special|0.0|
> |Outdoor Protection|E-mail|28505.93|
> |Golf Equipment|Web|3034.0|
> +---------------------+----------------++----------------------------
> Expected output:
>  +---------------------+----------------++----------------------------
> |Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
> +---------------------+----------------++----------------------------
> |Golf Equipment|E-mail|92.25|
> |Golf Equipment|Fax|null|
> |Golf Equipment|Mail|0.0|
> |Golf Equipment|Sales visit|143.5|
> |Golf Equipment|Special|0.0|
> |Golf Equipment|Telephone|123.0|
> |Golf Equipment|Web|3034.0|
> |Camping Equipment|E-mail|1303.3999999999999|
> |Camping Equipment|Fax|null|
> |Camping Equipment|Sales visit|4754.87|
> |Camping Equipment|Mail|0.0|
> |Camping Equipment|Special|null|
> |Camping Equipment|Telephone|5169.65|
> |Camping Equipment|Web|32469.03|
> |Mountaineering Eq...|E-mail|null|
> |Mountaineering Eq...|Fax|null|
> |Mountaineering Eq...|Mail|0.0|
> |Mountaineering Eq...|Special|null|
> |Mountaineering Eq...|Sales visit|null|
> |Mountaineering Eq...|Telephone|null|
> +---------------------+----------------++----------------------------
> Notice how in the expected output all the grouping columns should be bucketed together without necessary being in order, which is not the case with output that spark produces.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org