Posted to user@spark.apache.org by Rishi Shah <ri...@gmail.com> on 2019/06/14 11:05:19 UTC

[pyspark 2.3+] CountDistinct

Hi All,

Recently we noticed that countDistinct on a large dataframe doesn't always
return the same value. Any idea why? If the count isn't exact anyway, what is
the difference between countDistinct & approx_count_distinct?
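For context on the second question: countDistinct computes the exact number of distinct values, while approx_count_distinct estimates it from a HyperLogLog++ sketch with a configurable relative error, trading accuracy for memory and shuffle cost. As a toy illustration of sketch-style estimation (not Spark's actual algorithm), here is a minimal KMV ("k minimum values") estimator in plain Python; the function names and parameters are my own:

```python
import hashlib
import uuid

def exact_distinct(values):
    # The countDistinct analogue: exact, but must track every distinct value.
    return len(set(values))

def kmv_estimate(values, k=1024):
    # The approx_count_distinct analogue: estimate cardinality from the
    # k smallest normalized hash values (a KMV sketch, ~1/sqrt(k) error).
    hashes = set()
    for v in values:
        h = int.from_bytes(hashlib.sha1(v.encode()).digest()[:8], 'big')
        hashes.add(h / 2.0**64)  # map the hash into [0, 1)
    smallest = sorted(hashes)[:k]
    if len(smallest) < k:
        return len(smallest)  # saw fewer than k distinct values: exact
    return int((k - 1) / smallest[-1])

values = [str(uuid.uuid4()) for _ in range(20000)] * 4  # duplicates on purpose
print(exact_distinct(values))   # 20000
approx = kmv_estimate(values)   # close to 20000, but not exact
```

In pyspark the knob is the rsd argument of approx_count_distinct (default 0.05): a smaller rsd means a larger sketch and a tighter estimate.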

-- 
Regards,

Rishi Shah

Re: [pyspark 2.3+] CountDistinct

Posted by Abdeali Kothari <ab...@gmail.com>.
I can't exactly reproduce this. Here is what I tried quickly:

import uuid

import findspark
findspark.init()  # noqa
import pyspark
from pyspark.sql import functions as F  # noqa: N812

spark = pyspark.sql.SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [[str(uuid.uuid4())] for i in range(450000)],  # one single-column row per UUID
    ['col1'])

print('>>>> Spark version:', spark.sparkContext.version)
print('>>>> Null count:', df.filter(F.col('col1').isNull()).count())
print('>>>> Value count:', df.filter(F.col('col1').isNotNull()).count())
print('>>>> Distinct Count 1:',
      df.agg(F.countDistinct(F.col('col1'))).collect()[0][0])
print('>>>> Distinct Count 2:',
      df.agg(F.countDistinct(F.col('col1'))).collect()[0][0])

This always returns:
>>>> Spark version: 2.4.0
>>>> Null count: 0
>>>> Value count: 450000
>>>> Distinct Count 1: 450000
>>>> Distinct Count 2: 450000




On Sat, Jun 29, 2019 at 6:51 PM Rishi Shah <ri...@gmail.com> wrote:

> Thanks Abdeali! Please find details below:
>
> df.agg(countDistinct(col('col1'))).show() --> 450089
> df.agg(countDistinct(col('col1'))).show() --> 450076
> df.filter(col('col1').isNull()).count() --> 0
> df.filter(col('col1').isNotNull()).count() --> 450063
>
> col1 is a string
> Spark version 2.4.0
> datasize: ~ 500GB
>
>
> On Sat, Jun 29, 2019 at 5:33 AM Abdeali Kothari <ab...@gmail.com>
> wrote:
>
>> How large is the data frame and what data type are you counting distinct
>> for?
>> I use count distinct quite a bit and haven't noticed anything peculiar.
>>
>> Also, which exact version in 2.3.x?
>> And are you performing any operations on the DF before the countDistinct?
>>
>> I recall there was a bug when I did countDistinct(PythonUDF(x)) in the
>> same query, which was resolved in one of the minor versions in 2.3.x
>>
>> On Sat, Jun 29, 2019, 10:32 Rishi Shah <ri...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> Just wanted to check in to see if anyone has any insight about this
>>> behavior. Any pointers would help.
>>>
>>> Thanks,
>>> Rishi
>>>
>>> On Fri, Jun 14, 2019 at 7:05 AM Rishi Shah <ri...@gmail.com>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> Recently we noticed that countDistinct on a large dataframe doesn't
>>>> always return the same value. Any idea why? If the count isn't exact
>>>> anyway, what is the difference between countDistinct & approx_count_distinct?
>>>>
>>>> --
>>>> Regards,
>>>>
>>>> Rishi Shah
>>>>
>>>
>>>
>>> --
>>> Regards,
>>>
>>> Rishi Shah
>>>
>>
>
> --
> Regards,
>
> Rishi Shah
>

Re: [pyspark 2.3+] CountDistinct

Posted by Rishi Shah <ri...@gmail.com>.
Thanks Abdeali! Please find details below:

df.agg(countDistinct(col('col1'))).show() --> 450089
df.agg(countDistinct(col('col1'))).show() --> 450076
df.filter(col('col1').isNull()).count() --> 0
df.filter(col('col1').isNotNull()).count() --> 450063

col1 is a string
Spark version 2.4.0
datasize: ~ 500GB
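One pattern that can produce exactly these symptoms is something nondeterministic in the DataFrame's lineage (a nondeterministic UDF, sampling, or an input that changes between jobs): each action recomputes the lineage, so two counts can see different data, and persisting the DataFrame before aggregating usually stabilizes the numbers. A plain-Python sketch of that idea (the source, sizes, and names here are made up for illustration, not Spark APIs):

```python
import random

def fresh_source(n=450, value_space=600):
    # Stand-in for a nondeterministic source that is recomputed per action:
    # every call regenerates the data from scratch.
    return [random.randrange(value_space) for _ in range(n)]

# Two "actions" over the unpersisted lineage: each one recomputes the
# source, so the two distinct counts will typically disagree.
count1 = len(set(fresh_source()))
count2 = len(set(fresh_source()))

# Materializing the data once (the analogue of df.persist() followed by an
# action, before running the countDistinct aggregations) makes repeated
# counts agree.
snapshot = fresh_source()
stable1 = len(set(snapshot))
stable2 = len(set(snapshot))
```

A quick way to test this hypothesis on the real job would be to persist the DataFrame, force it with a count, and then rerun the two countDistinct aggregations.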



-- 
Regards,

Rishi Shah

Re: [pyspark 2.3+] CountDistinct

Posted by Abdeali Kothari <ab...@gmail.com>.
How large is the data frame and what data type are you counting distinct
for?
I use count distinct quite a bit and haven't noticed anything peculiar.

Also, which exact version in 2.3.x?
And are you performing any operations on the DF before the countDistinct?

I recall there was a bug when I did countDistinct(PythonUDF(x)) in the same
query, which was resolved in one of the minor versions in 2.3.x
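The failure mode behind that class of bug is easy to picture: if the values fed into the distinct count come from a Python function that is not a pure function of its input, every recomputation produces a new set of values. A plain-Python sketch of the effect (these functions are illustrative stand-ins, not Spark APIs):

```python
import random
from functools import lru_cache

def impure_udf(x):
    # Not a pure function of x: returns a fresh value on every call,
    # like a nondeterministic Python UDF re-evaluated per pass.
    return random.randrange(200)

@lru_cache(maxsize=None)
def cached_udf(x):
    # Evaluated once per input within this process, so repeated
    # passes over the same keys always see the same values.
    return random.randrange(200)

keys = list(range(100))
first = len({impure_udf(k) for k in keys})
second = len({impure_udf(k) for k in keys})
# first and second need not match: each pass generates new values.

stable_first = len({cached_udf(k) for k in keys})
stable_second = len({cached_udf(k) for k in keys})
# These always match.
```

The same reasoning applies to any nondeterministic expression (rand(), monotonically_increasing_id(), and so on) sitting under an aggregate.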


Re: [pyspark 2.3+] CountDistinct

Posted by Rishi Shah <ri...@gmail.com>.
Hi All,

Just wanted to check in to see if anyone has any insight about this
behavior. Any pointers would help.

Thanks,
Rishi



-- 
Regards,

Rishi Shah