Posted to issues@spark.apache.org by "Prasad Chalasani (JIRA)" <ji...@apache.org> on 2015/10/08 20:50:26 UTC

[jira] [Updated] (SPARK-11008) Spark window function returns inconsistent/wrong results

     [ https://issues.apache.org/jira/browse/SPARK-11008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prasad Chalasani updated SPARK-11008:
-------------------------------------
    Summary: Spark window function returns inconsistent/wrong results  (was: Spark window function return inconsistent/wrong results)

> Spark window function returns inconsistent/wrong results
> --------------------------------------------------------
>
>                 Key: SPARK-11008
>                 URL: https://issues.apache.org/jira/browse/SPARK-11008
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core, SQL
>    Affects Versions: 1.4.0, 1.5.0
>         Environment: Amazon Linux AMI (Amazon Linux version 2015.09)
>            Reporter: Prasad Chalasani
>            Priority: Blocker
>
> Summary: applying a window function to a data frame and then calling count() gives widely varying results across repeated runs: none exceed the correct value, but all but one are wrong. On large data sets I sometimes get as little as HALF of the correct value.
> A minimal reproducible example follows:
> (1) start spark-shell
> (2) run these:
>     val data = 1.to(100).map(x => (x,1))    
>     import sqlContext.implicits._
>     val tbl = sc.parallelize(data).toDF("id", "time")
>     tbl.write.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
> (3) exit the shell (this is important)
> (4) start spark-shell again
> (5) run these:
>     import org.apache.spark.sql.expressions.Window
>     val df = sqlContext.read.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
>     val win = Window.partitionBy("id").orderBy("time")
>     df.select($"id", rank().over(win).alias("rnk")).filter("rnk=1").select("id").count()
> I get 98, but the correct result is 100. 
> If I re-run the above, then the result gets "fixed" and I always get 100.
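> Since the toy data has exactly one row per id, the correct count equals the number of distinct ids. A cross-check that avoids the window function entirely (just a sketch, reading the same parquet path) is:
>     val df = sqlContext.read.parquet("s3n://path/to/mybucket/id-time-tiny.pqt")
>     // baseline without rank(): count the distinct ids, which should be 100
>     df.select("id").distinct.count()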
> Note that this is only a minimal example to reproduce the error. In my real application the data is much larger and the window function is not as trivial as above (i.e. there are multiple "time" values per "id", etc.), and I sometimes see results as small as HALF of the correct value (e.g. 120,000 when the correct value is 250,000). So this is a serious problem.
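> To give a sense of the shape of the real query (a sketch only; the data and column names here are made up, not those of my real application), the same pattern with multiple "time" values per "id" looks like:
>     import sqlContext.implicits._
>     import org.apache.spark.sql.expressions.Window
>     import org.apache.spark.sql.functions.rank
>     // illustrative data only: several "time" values per "id"
>     val events = sc.parallelize(for (id <- 1 to 1000; t <- 1 to 5) yield (id, t)).toDF("id", "time")
>     val win = Window.partitionBy("id").orderBy("time")
>     // keep the first-ranked row per id; with distinct times per id there are no ties,
>     // so the count should equal the number of distinct ids (1000 here)
>     events.select($"id", rank().over(win).alias("rnk")).filter("rnk = 1").select("id").count()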


