Posted to user@spark.apache.org by Mohamed Nadjib Mami <mo...@gmail.com> on 2017/04/06 17:29:26 UTC

df.count() returns one more count than SELECT COUNT()

I pasted this straight from the Spark shell (Spark 2.1.0):


scala> spark.sql("SELECT count(distinct col) FROM Table").show()
+-------------------+
|count(DISTINCT col)|
+-------------------+
|               4697|
+-------------------+

scala> spark.sql("SELECT distinct col FROM Table").count()
res8: Long = 4698

That is, `dataframe.count()` is returning one more count than the
in-query `COUNT()` function.

Any explanations?

Cheers,
Mohamed

Re: df.count() returns one more count than SELECT COUNT()

Posted by Mohamed Nadjib MAMI <mo...@gmail.com>.
That was the case. Thanks for the quick and clean answer, Hemanth.

Regards, Grüße, Cordialement, Recuerdos, Saluti, προσρήσεις, 问候, تحياتي.
Mohamed Nadjib Mami
Research Associate @ Fraunhofer IAIS - PhD Student @ Bonn University
About me! <http://www.strikingly.com/mohamed-nadjib-mami>
LinkedIn <http://fr.linkedin.com/in/mohamednadjibmami/>

On Thu, Apr 6, 2017 at 7:33 PM, Hemanth Gudela <he...@qvantel.com>
wrote:

> Nulls are excluded with spark.sql("SELECT count(distinct col) FROM
> Table").show()
>
> I think it is ANSI SQL behaviour.
>
> scala> spark.sql("select distinct count(null)").show(false)
> +-----------+
> |count(NULL)|
> +-----------+
> |0          |
> +-----------+
>
> scala> spark.sql("select distinct null").count
> res1: Long = 1
>
> Regards,
> Hemanth
>
> From: Mohamed Nadjib Mami <mo...@gmail.com>
> Date: Thursday, 6 April 2017 at 20.29
> To: "user@spark.apache.org" <us...@spark.apache.org>
> Subject: df.count() returns one more count than SELECT COUNT()
>
> spark.sql("SELECT count(distinct col) FROM Table").show()

Re: df.count() returns one more count than SELECT COUNT()

Posted by Hemanth Gudela <he...@qvantel.com>.
Nulls are excluded with spark.sql("SELECT count(distinct col) FROM Table").show()
I think it is ANSI SQL behaviour.

scala> spark.sql("select distinct count(null)").show(false)
+-----------+
|count(NULL)|
+-----------+
|0          |
+-----------+

scala> spark.sql("select distinct null").count
res1: Long = 1

Regards,
Hemanth

From: Mohamed Nadjib Mami <mo...@gmail.com>
Date: Thursday, 6 April 2017 at 20.29
To: "user@spark.apache.org" <us...@spark.apache.org>
Subject: df.count() returns one more count than SELECT COUNT()

spark.sql("SELECT count(distinct col) FROM Table").show()