Posted to issues@spark.apache.org by "Prasad Chalasani (JIRA)" <ji...@apache.org> on 2015/10/08 20:46:26 UTC

[jira] [Created] (SPARK-11008) Spark window function returns inconsistent/wrong results

Prasad Chalasani created SPARK-11008:
----------------------------------------

             Summary: Spark window function returns inconsistent/wrong results
                 Key: SPARK-11008
                 URL: https://issues.apache.org/jira/browse/SPARK-11008
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, SQL
    Affects Versions: 1.5.0, 1.4.0
         Environment: Amazon Linux AMI (Amazon Linux version 2015.09)

            Reporter: Prasad Chalasani
            Priority: Blocker


Summary: applying a window function to a DataFrame and then calling count() produces widely varying results across repeated runs; none of the results exceeds the correct value, but of course all except one are wrong. On large datasets I sometimes get as little as HALF of the correct value.

A minimal reproducible example is here: 
(1) start spark-shell
(2) run these:
    // build a tiny (id, time) table and write it out as Parquet
    val data = 1.to(100).map(x => (x, 1))
    import sqlContext.implicits._
    val tbl = sc.parallelize(data).toDF("id", "time")
    tbl.write.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")

(3) exit the shell (this is important)
(4) start spark-shell again
(5) run these:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank  // needed outside spark-shell, where functions._ may not be in scope
import sqlContext.implicits._                // for the $"..." column syntax

val df = sqlContext.read.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
val win = Window.partitionBy("id").orderBy("time")

// rank the rows within each "id" partition by "time" and keep only rank 1;
// this retains exactly one row per id, so the count should equal the number of distinct ids
df.select($"id", rank().over(win).alias("rnk")).filter("rnk=1").select("id").count()

I get 98, but the correct result is 100: rnk=1 keeps exactly one row per "id" partition, and there are 100 distinct ids.
If I re-run the count, the result gets "fixed" and I always get 100.
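
As a sanity check, the expected value can be computed without the window function (a minimal sketch, assuming df is the DataFrame read in step 5):

// one row per distinct id is expected, so this should always print 100
println(df.select("id").distinct.count())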

Note that this is only a minimal example to reproduce the error. In my real application the data is much larger and the window function is not as trivial as the one above (i.e. there are multiple "time" values per "id", etc.), and I sometimes see results as small as HALF of the correct value (e.g. 120,000 when the correct value is 250,000). So this is a serious problem.
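
For illustration only, here is a sketch of what the non-trivial case looks like; the sizes and the Parquet path below are made up, not taken from my real application:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank
import sqlContext.implicits._

// hypothetical larger data set: 1000 ids, each with 5 distinct "time" values
val bigData = for (id <- 1 to 1000; t <- 1 to 5) yield (id, t)
sc.parallelize(bigData).toDF("id", "time").write.parquet("s3n://path/to/mybucket/id-time-big.pqt")

// after restarting the shell: keep the first-ranked row per id;
// the count should be 1000, but on affected versions it can come back smaller
val bigDf = sqlContext.read.parquet("s3n://path/to/mybucket/id-time-big.pqt")
val bigWin = Window.partitionBy("id").orderBy("time")
bigDf.select($"id", rank().over(bigWin).alias("rnk")).filter("rnk=1").select("id").count()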




