Posted to issues@spark.apache.org by "Bin Wu (JIRA)" <ji...@apache.org> on 2017/03/31 06:10:41 UTC
[jira] [Updated] (SPARK-20169) Groupby Bug with Sparksql
[ https://issues.apache.org/jira/browse/SPARK-20169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bin Wu updated SPARK-20169:
---------------------------
Description:
We have found a potential bug in the Catalyst optimizer: it does not correctly process a "groupBy" after multiple joins.
You can reproduce the bug with the following simple example:
=========================
from pyspark.sql.functions import *
# Edge list; the commented-out line builds the same data in memory:
# e = sc.parallelize([(1,2),(1,3),(1,4),(2,1),(3,1),(4,1)]).toDF(["src","dst"])
# (note: with no schema or inferSchema, the CSV columns are read as strings)
e = spark.read.csv("graph.csv", header=True)
# Vertex list
r = sc.parallelize([(1,),(2,),(3,),(4,)]).toDF(['src'])
# Count edges per destination, then rename dst -> src to join back onto e
r1 = e.join(r, 'src').groupBy('dst').count().withColumnRenamed('dst','src')
jr = e.join(r1, 'src')
jr.show()
# This final groupBy returns ungrouped results (see output below)
r2 = jr.groupBy('dst').count()
r2.show()
=========================
FYI, "graph.csv" contains the same data with the commented line.
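(For reference, since header=True is passed, graph.csv would presumably look like this:)
=========================
src,dst
1,2
1,3
1,4
2,1
3,1
4,1
=========================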
You can see that jr is:
+---+---+-----+
|src|dst|count|
+---+---+-----+
|  3|  1|    1|
|  1|  4|    3|
|  1|  3|    3|
|  1|  2|    3|
|  4|  1|    1|
|  2|  1|    1|
+---+---+-----+
But after the last groupBy, the three rows with dst = 1 are not grouped together:
+---+-----+
|dst|count|
+---+-----+
|  1|    1|
|  4|    1|
|  3|    1|
|  2|    1|
|  1|    1|
|  1|    1|
+---+-----+
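For reference, grouping the jr rows shown above by hand, the expected result would be (row order from show() is arbitrary):
+---+-----+
|dst|count|
+---+-----+
|  1|    3|
|  4|    1|
|  3|    1|
|  2|    1|
+---+-----+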
If we build jr directly from the raw data (e.g., using the commented-out parallelize line instead of reading the CSV), the error does not show up. So we suspect there is a bug in the Catalyst optimizer when multiple joins and groupBys are optimized together.
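For comparison, here is a minimal sketch of that raw-data variant, which per the above does not exhibit the error (assuming the same pyspark shell, with spark and sc predefined):
=========================
from pyspark.sql.functions import *
# Same edge data as graph.csv, built in memory instead of read from a file
e = sc.parallelize([(1,2),(1,3),(1,4),(2,1),(3,1),(4,1)]).toDF(["src","dst"])
r = sc.parallelize([(1,),(2,),(3,),(4,)]).toDF(['src'])
r1 = e.join(r, 'src').groupBy('dst').count().withColumnRenamed('dst','src')
jr = e.join(r1, 'src')
# Here the final groupBy groups all dst = 1 rows together as expected
jr.groupBy('dst').count().show()
=========================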
> Groupby Bug with Sparksql
> -------------------------
>
> Key: SPARK-20169
> URL: https://issues.apache.org/jira/browse/SPARK-20169
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0, 2.1.0
> Reporter: Bin Wu
>