You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Michail Giannakopoulos (Jira)" <ji...@apache.org> on 2020/03/05 21:11:00 UTC
[jira] [Updated] (SPARK-31059) Spark's SQL "group by" local
processing operator is broken.
[ https://issues.apache.org/jira/browse/SPARK-31059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michail Giannakopoulos updated SPARK-31059:
-------------------------------------------
Description:
When applying "GROUP BY" processing operator (without an "ORDER BY" clause), I expect to see all the grouping columns being grouped together to the same buckets. However, this is not the case.
Steps to reproduce:
1. Start spark-shell as follows:
bin\spark-shell.cmd --master local[4] --conf spark.sql.catalogImplementation=in-memory
2. Load the attached csv file:
val gosales = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv")
3. Create a temp view:
gosales.createOrReplaceTempView("gosales")
4. Execute the following sql statement:
spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM `gosales` GROUP BY `Product line`, `Order method type`").show()
Output:
+---------------------+----------------++----------------------------
|Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
+---------------------+----------------++----------------------------
|Golf Equipment|E-mail|92.25|
|Camping Equipment|Mail|0.0|
|Camping Equipment|Fax|null|
|Golf Equipment|Telephone|123.0|
|Camping Equipment|Special|null|
|Outdoor Protection|Telephone|34218.19|
|Mountaineering Eq...|Mail|0.0|
|Camping Equipment|Web|32469.03|
|Personal Accessories|Fax|3318.7|
|Golf Equipment|Sales visit|143.5|
|Mountaineering Eq...|Telephone|null|
|Mountaineering Eq...|E-mail|null|
|Outdoor Protection|Sales visit|20522.42|
|Outdoor Protection|Fax|5857.54|
|Personal Accessories|E-mail|26679.640000000003|
|Mountaineering Eq...|Fax|null|
|Outdoor Protection|Web|340836.85000000003|
|Golf Equipment|Special|0.0|
|Outdoor Protection|E-mail|28505.93|
|Golf Equipment|Web|3034.0|
+---------------------+----------------++----------------------------
Expected output:
+---------------------+----------------++----------------------------
|Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
+---------------------+----------------++----------------------------
|Golf Equipment|E-mail|92.25|
|Golf Equipment|Fax|null|
|Golf Equipment|Mail|0.0|
|Golf Equipment|Sales visit|143.5|
|Golf Equipment|Special|0.0|
|Golf Equipment|Telephone|123.0|
|Golf Equipment|Web|3034.0|
|Camping Equipment|E-mail|1303.3999999999999|
|Camping Equipment|Fax|null|
|Camping Equipment|Sales visit|4754.87|
|Camping Equipment|Mail|0.0|
|Camping Equipment|Special|null|
|Camping Equipment|Telephone|5169.65|
|Camping Equipment|Web|32469.03|
|Mountaineering Eq...|E-mail|null|
|Mountaineering Eq...|Fax|null|
|Mountaineering Eq...|Mail|0.0|
|Mountaineering Eq...|Special|null|
|Mountaineering Eq...|Sales visit|null|
|Mountaineering Eq...|Telephone|null|
+---------------------+----------------++----------------------------
Notice how in the expected output all the grouping columns should be bucketed together without necessary being in order, which is not the case with output that spark produces.
was:
When applying "GROUP BY" processing operator (without an "ORDER BY" clause), I expect to see all the grouping columns being grouped together to the same buckets. However, this is not the case.
Steps to reproduce:
1. Start spark-shell as follows:
bin\spark-shell.cmd --master local[4] --conf spark.sql.catalogImplementation=in-memory
2. Load the attached csv file:
val gosales = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv")
3. Create a temp view:
gosales.createOrReplaceTempView("gosales")
4. Execute the following sql statement:
spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM `gosales` GROUP BY `Product line`, `Order method type`").show()
Output:
+--------------------+-----------------+----------------------------+
| Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
+--------------------+-----------------+----------------------------+
| Golf Equipment| E-mail| 92.25|
| Camping Equipment| Mail| 0.0|
| Camping Equipment| Fax| null|
| Golf Equipment| Telephone| 123.0|
| Camping Equipment| Special| null|
| Outdoor Protection| Telephone| 34218.19|
|Mountaineering Eq...| Mail| 0.0|
| Camping Equipment| Web| 32469.03|
|Personal Accessories| Fax| 3318.7|
| Golf Equipment| Sales visit| 143.5|
|Mountaineering Eq...| Telephone| null|
|Mountaineering Eq...| E-mail| null|
| Outdoor Protection| Sales visit| 20522.42|
| Outdoor Protection| Fax| 5857.54|
|Personal Accessories| E-mail| 26679.640000000003|
|Mountaineering Eq...| Fax| null|
| Outdoor Protection| Web| 340836.85000000003|
| Golf Equipment| Special| 0.0|
| Outdoor Protection| E-mail| 28505.93|
| Golf Equipment| Web| 3034.0|
+--------------------+-----------------+----------------------------+
Expected output:
+--------------------+-----------------+----------------------------+
| Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
+--------------------+-----------------+----------------------------+
| Golf Equipment| E-mail| 92.25|
| Golf Equipment| Fax| null|
| Golf Equipment| Mail| 0.0|
| Golf Equipment| Sales visit| 143.5|
| Golf Equipment| Special| 0.0|
| Golf Equipment| Telephone| 123.0|
| Golf Equipment| Web| 3034.0|
| Camping Equipment| E-mail| 1303.3999999999999|
| Camping Equipment| Fax| null|
| Camping Equipment| Sales visit| 4754.87|
| Camping Equipment| Mail| 0.0|
| Camping Equipment| Special| null|
| Camping Equipment| Telephone| 5169.65|
| Camping Equipment| Web| 32469.03|
|Mountaineering Eq...| E-mail| null|
|Mountaineering Eq...| Fax| null|
|Mountaineering Eq...| Mail| 0.0|
|Mountaineering Eq...| Special| null|
|Mountaineering Eq...| Sales visit| null|
|Mountaineering Eq...| Telephone| null|
+--------------------+-----------------+----------------------------+
Notice how all the grouping columns should be bucketed together without being in order.
> Spark's SQL "group by" local processing operator is broken.
> -----------------------------------------------------------
>
> Key: SPARK-31059
> URL: https://issues.apache.org/jira/browse/SPARK-31059
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.3, 2.4.5
> Environment: Windows 10.
> Reporter: Michail Giannakopoulos
> Priority: Blocker
>
> When applying "GROUP BY" processing operator (without an "ORDER BY" clause), I expect to see all the grouping columns being grouped together to the same buckets. However, this is not the case.
> Steps to reproduce:
> 1. Start spark-shell as follows:
> bin\spark-shell.cmd --master local[4] --conf spark.sql.catalogImplementation=in-memory
> 2. Load the attached csv file:
> val gosales = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("c:/Users/MichaelGiannakopoulo/Downloads/SampleFile_GOSales.csv")
> 3. Create a temp view:
> gosales.createOrReplaceTempView("gosales")
> 4. Execute the following sql statement:
> spark.sql("SELECT `Product line`, `Order method type`, sum(`Revenue`) FROM `gosales` GROUP BY `Product line`, `Order method type`").show()
> Output:
> +---------------------+----------------++----------------------------
> |Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
> +---------------------+----------------++----------------------------
> |Golf Equipment|E-mail|92.25|
> |Camping Equipment|Mail|0.0|
> |Camping Equipment|Fax|null|
> |Golf Equipment|Telephone|123.0|
> |Camping Equipment|Special|null|
> |Outdoor Protection|Telephone|34218.19|
> |Mountaineering Eq...|Mail|0.0|
> |Camping Equipment|Web|32469.03|
> |Personal Accessories|Fax|3318.7|
> |Golf Equipment|Sales visit|143.5|
> |Mountaineering Eq...|Telephone|null|
> |Mountaineering Eq...|E-mail|null|
> |Outdoor Protection|Sales visit|20522.42|
> |Outdoor Protection|Fax|5857.54|
> |Personal Accessories|E-mail|26679.640000000003|
> |Mountaineering Eq...|Fax|null|
> |Outdoor Protection|Web|340836.85000000003|
> |Golf Equipment|Special|0.0|
> |Outdoor Protection|E-mail|28505.93|
> |Golf Equipment|Web|3034.0|
> +---------------------+----------------++----------------------------
> Expected output:
> +---------------------+----------------++----------------------------
> |Product line|Order method type|sum(CAST(Revenue AS DOUBLE))|
> +---------------------+----------------++----------------------------
> |Golf Equipment|E-mail|92.25|
> |Golf Equipment|Fax|null|
> |Golf Equipment|Mail|0.0|
> |Golf Equipment|Sales visit|143.5|
> |Golf Equipment|Special|0.0|
> |Golf Equipment|Telephone|123.0|
> |Golf Equipment|Web|3034.0|
> |Camping Equipment|E-mail|1303.3999999999999|
> |Camping Equipment|Fax|null|
> |Camping Equipment|Sales visit|4754.87|
> |Camping Equipment|Mail|0.0|
> |Camping Equipment|Special|null|
> |Camping Equipment|Telephone|5169.65|
> |Camping Equipment|Web|32469.03|
> |Mountaineering Eq...|E-mail|null|
> |Mountaineering Eq...|Fax|null|
> |Mountaineering Eq...|Mail|0.0|
> |Mountaineering Eq...|Special|null|
> |Mountaineering Eq...|Sales visit|null|
> |Mountaineering Eq...|Telephone|null|
> +---------------------+----------------++----------------------------
> Notice how in the expected output all the grouping columns should be bucketed together without necessary being in order, which is not the case with output that spark produces.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org