Posted to issues@spark.apache.org by "Saif Addin Ellafi (JIRA)" <ji...@apache.org> on 2016/01/18 20:02:39 UTC

[jira] [Updated] (SPARK-12880) Different results on groupBy after window function

     [ https://issues.apache.org/jira/browse/SPARK-12880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saif Addin Ellafi updated SPARK-12880:
--------------------------------------
    Description: 
scala> import org.apache.spark.sql.expressions.Window
scala> import org.apache.spark.sql.functions.{asc, lag, sum}

scala> val overVint = Window.partitionBy("product", "bnd", "age").orderBy(asc("yyyymm"))

scala> val df_data2 = df_data.withColumn("result", lag("baleom", 1).over(overVint))

scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and yyyymm = 200509").groupBy("yyyymm", "closed", "ever_closed").agg(sum("result").as("result")).show
+------+------+-----------+--------------------+
|yyyymm|closed|ever_closed|              result|
+------+------+-----------+--------------------+
|200509|     1|          1|1.2672666129980398E7|
|200509|     0|          0|2.7104834668856387E9|
|200509|     0|          1| 1.151339011298214E8|
+------+------+-----------+--------------------+


scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and yyyymm = 200509").groupBy("yyyymm", "closed", "ever_closed").agg(sum("result").as("result")).show
+------+------+-----------+--------------------+
|yyyymm|closed|ever_closed|              result|
+------+------+-----------+--------------------+
|200509|     1|          1|1.2357681589980595E7|
|200509|     0|          0| 2.709930867575646E9|
|200509|     0|          1|1.1595048973981345E8|
+------+------+-----------+--------------------+

Does NOT happen when aggregating columns that are not produced by the window function.
Happens both in cluster mode and in local mode.
Before the groupBy operation, the data looks correct and is consistent across runs.
Happens only when the data is large (this case has 1.4 billion rows; it does not happen with limit 100000). One possible check for the cause is sketched below.
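
One possible explanation (not confirmed in this report) is that orderBy(asc("yyyymm")) is not a total order within each (product, bnd, age) partition, so lag("baleom", 1) can pick a different preceding row on each evaluation of the plan. A minimal sketch of a check, assuming a hypothetical column "acct_id" that uniquely identifies rows within a partition (such a column is not part of the original report):

scala> import org.apache.spark.sql.expressions.Window
scala> import org.apache.spark.sql.functions.{asc, lag, sum}

scala> // "acct_id" is an assumed unique tiebreaker; adding it makes the window ordering total
scala> val overVintTotal = Window.partitionBy("product", "bnd", "age").orderBy(asc("yyyymm"), asc("acct_id"))

scala> val df_check = df_data.withColumn("result", lag("baleom", 1).over(overVintTotal))

scala> df_check.filter("product = 'MAIN' and bnd = 'High' and yyyymm = 200509").groupBy("yyyymm", "closed", "ever_closed").agg(sum("result").as("result")).show

If the sums stop changing across repeated show calls under the total ordering, the variation comes from ties in the window sort rather than from the aggregation itself. Calling df_data2.persist() before the two aggregations is another way to check whether recomputation of the window output is involved.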


  was:
scala> val overVint = Window.partitionBy("product", "bnd", "age").orderBy(asc("yyyymm"))

scala> val df_data2 = df_data.withColumn("result", lag("baleom", 1).over(overVint))

scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and yyyymm = 200509").groupBy("yyyymm", "closed", "ever_closed").agg(sum("result").as("result")).show
+------+------+-----------+--------------------+
|yyyymm|closed|ever_closed|              result|
+------+------+-----------+--------------------+
|200509|     1|          1|1.2672666129980398E7|
|200509|     0|          0|2.7104834668856387E9|
|200509|     0|          1| 1.151339011298214E8|
+------+------+-----------+--------------------+


scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and yyyymm = 200509").groupBy("yyyymm", "closed", "ever_closed").agg(sum("result").as("result")).show
+------+------+-----------+--------------------+
|yyyymm|closed|ever_closed|              result|
+------+------+-----------+--------------------+
|200509|     1|          1|1.2357681589980595E7|
|200509|     0|          0| 2.709930867575646E9|
|200509|     0|          1|1.1595048973981345E8|
+------+------+-----------+--------------------+

Does NOT happen when aggregating columns that are not produced by the window function.
Happens both in cluster mode and in local mode.
Before the groupBy operation, the data looks correct and is consistent across runs.



> Different results on groupBy after window function
> --------------------------------------------------
>
>                 Key: SPARK-12880
>                 URL: https://issues.apache.org/jira/browse/SPARK-12880
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Saif Addin Ellafi
>            Priority: Critical
>
> scala> import org.apache.spark.sql.expressions.Window
> scala> import org.apache.spark.sql.functions.{asc, lag, sum}
> scala> val overVint = Window.partitionBy("product", "bnd", "age").orderBy(asc("yyyymm"))
> scala> val df_data2 = df_data.withColumn("result", lag("baleom", 1).over(overVint))
> scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and yyyymm = 200509").groupBy("yyyymm", "closed", "ever_closed").agg(sum("result").as("result")).show
> +------+------+-----------+--------------------+
> |yyyymm|closed|ever_closed|              result|
> +------+------+-----------+--------------------+
> |200509|     1|          1|1.2672666129980398E7|
> |200509|     0|          0|2.7104834668856387E9|
> |200509|     0|          1| 1.151339011298214E8|
> +------+------+-----------+--------------------+
> scala> df_data2.filter("product = 'MAIN' and bnd = 'High' and yyyymm = 200509").groupBy("yyyymm", "closed", "ever_closed").agg(sum("result").as("result")).show
> +------+------+-----------+--------------------+
> |yyyymm|closed|ever_closed|              result|
> +------+------+-----------+--------------------+
> |200509|     1|          1|1.2357681589980595E7|
> |200509|     0|          0| 2.709930867575646E9|
> |200509|     0|          1|1.1595048973981345E8|
> +------+------+-----------+--------------------+
> Does NOT happen when aggregating columns that are not produced by the window function.
> Happens both in cluster mode and in local mode.
> Before the groupBy operation, the data looks correct and is consistent across runs.
> Happens only when the data is large (this case has 1.4 billion rows; it does not happen with limit 100000).


