Posted to issues@spark.apache.org by "Thomas Graves (JIRA)" <ji...@apache.org> on 2018/10/01 14:55:00 UTC

[jira] [Updated] (SPARK-25538) incorrect row counts after distinct()

     [ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Graves updated SPARK-25538:
----------------------------------
    Priority: Blocker  (was: Major)

> incorrect row counts after distinct()
> -------------------------------------
>
>                 Key: SPARK-25538
>                 URL: https://issues.apache.org/jira/browse/SPARK-25538
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>         Environment: Reproduced on a Centos7 VM and from source in Intellij on OS X.
>            Reporter: Steven Rand
>            Priority: Blocker
>              Labels: correctness
>         Attachments: SPARK-25538-repro.tgz
>
>
> It appears that {{df.distinct.count}} can return incorrect values after SPARK-23713. It's possible that other operations are affected as well; {{distinct}} just happens to be the one that we noticed. I believe that this issue was introduced by SPARK-23713 because I can't reproduce it before that commit, and I've been able to reproduce it after that commit as well as with {{tags/v2.4.0-rc1}}.
> Below are example spark-shell sessions to illustrate the problem. Unfortunately the data used in these examples can't be uploaded to this Jira ticket. I'll try to create test data which also reproduces the issue, and will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = [<redacted>]
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns inconsistent counts for the same data:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = [<redacted>]
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}
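
For context, the invariant the sessions above show being violated — that a distinct count does not depend on row order or column names — can be sketched with plain Scala collections (no Spark required; the data here is made up purely for illustration):

```scala
// Hypothetical illustration of the expected invariant: distinct counts
// should be unaffected by sorting or by relabeling columns, which is the
// property the 2.4.0-rc1 sessions above show broken.
val rows = Seq((1, "a"), (1, "a"), (2, "b"), (2, "c"))

val base    = rows.distinct.size                                  // 3 distinct rows
val sorted  = rows.sortBy(_._1).distinct.size                     // ordering must not matter
val renamed = rows.map { case (x, y) => (y, x) }.distinct.size    // relabeling must not matter

println((base, sorted, renamed)) // all three should be equal
```

In the 2.4.0-rc1 sessions, the analogous three counts (116, 123, 115) disagree with each other and with the 2.3.1 result, which is why this is tagged as a correctness bug.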



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
