Posted to issues@spark.apache.org by "Bin Wu (JIRA)" <ji...@apache.org> on 2017/03/31 06:06:41 UTC

[jira] [Created] (SPARK-20169) Groupby Bug with Sparksql

Bin Wu created SPARK-20169:
------------------------------

             Summary: Groupby Bug with Sparksql
                 Key: SPARK-20169
                 URL: https://issues.apache.org/jira/browse/SPARK-20169
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.1.0, 2.0.0
            Reporter: Bin Wu


We found a potential bug in the Catalyst optimizer: it cannot correctly process a "groupBy" operation in the following case.

=========================
from pyspark.sql.functions import *

e = spark.read.csv("graph.csv", header=True)
r = sc.parallelize([(1,),(2,),(3,),(4,)]).toDF(['src'])
r1 = e.join(r, 'src').groupBy('dst').count().withColumnRenamed('dst','src')
jr = e.join(r1, 'src')
jr.show()
r2 = jr.groupBy('dst').count()
r2.show()
=========================
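The graph.csv file is not attached. For reference, an edge list that is consistent with the jr output shown below would be the following (an inferred reconstruction from the shown rows, not the reporter's actual file):

```csv
src,dst
3,1
1,4
1,3
1,2
4,1
2,1
```

With these edges, every src in {1,2,3,4} survives the first join, and r1 holds each node's in-degree, which matches the count column in the jr table below.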

As shown, jr is:
+---+---+-----+
|src|dst|count|
+---+---+-----+
|  3|  1|    1|
|  1|  4|    3|
|  1|  3|    3|
|  1|  2|    3|
|  4|  1|    1|
|  2|  1|    1|
+---+---+-----+

But, after the last groupBy, the 3 rows with dst = 1 are not grouped together:

+---+-----+
|dst|count|
+---+-----+
|  1|    1|
|  4|    1|
|  3|    1|
|  2|    1|
|  1|    1|
|  1|    1|
+---+-----+

If we build jr directly from raw data instead of through the joins above, this error does not show up. So we suspect a bug in the Catalyst optimizer when multiple joins and groupBys are optimized together.
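For reference, the result the last groupBy should produce can be checked in plain Python over the rows of jr shown above (a sketch with the row values transcribed from the jr.show() output; this is the expected behavior, not what Spark returns in the buggy case):

```python
from collections import Counter

# (src, dst) pairs transcribed from the jr.show() output above; the
# existing 'count' column does not affect jr.groupBy('dst').count().
jr_rows = [(3, 1), (1, 4), (1, 3), (1, 2), (4, 1), (2, 1)]

# groupBy('dst').count() should return the number of rows per distinct
# dst value, so the three rows with dst = 1 must collapse into one row.
expected = Counter(dst for _src, dst in jr_rows)
print(sorted(expected.items()))  # [(1, 3), (2, 1), (3, 1), (4, 1)]
```

The correct output therefore has four rows, with dst = 1 having count 3, whereas the r2.show() above incorrectly reports three separate rows with dst = 1 and count 1.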



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
