Posted to user@spark.apache.org by Arunkumar Pillai <ar...@gmail.com> on 2016/01/05 10:11:17 UTC

finding distinct count using dataframe

Hi

Is there any function to find the distinct count of all the variables in a
DataFrame?

val sc = new SparkContext(conf) // Spark context
val options = Map("header" -> "true", "delimiter" -> delimiter,
  "inferSchema" -> "true")
val sqlContext = new org.apache.spark.sql.SQLContext(sc) // SQL context
val datasetDF = sqlContext.read
  .format("com.databricks.spark.csv")
  .options(options)
  .load(inputFile)


We are able to get the schema and the variable data types. Is there any
method to get the distinct count?



-- 
Thanks and Regards
        Arun

Re: finding distinct count using dataframe

Posted by Kristina Rogale Plazonic <kp...@gmail.com>.
I think it's an expression rather than a function you'd find in the API
(as a function you could do df.select(col).distinct.count).

This will give you the number of distinct combinations over both columns:
scala> df.select(countDistinct("name", "age"))
res397: org.apache.spark.sql.DataFrame = [COUNT(DISTINCT name,age): bigint]

Whereas this will give you the number of distinct values in each column:
scala> df.select(countDistinct("name"), countDistinct("age"))
res398: org.apache.spark.sql.DataFrame = [COUNT(DISTINCT name): bigint,
COUNT(DISTINCT age): bigint]
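
To make the difference concrete, here is what the two queries compute over a
toy dataset, sketched with plain Scala collections (hypothetical data, no
Spark required):

```scala
// Toy data standing in for a two-column DataFrame of (name, age).
case class Person(name: String, age: Int)
val people = Seq(Person("ann", 30), Person("ann", 31), Person("bob", 30))

// countDistinct("name", "age"): distinct (name, age) combinations.
val distinctPairs = people.map(p => (p.name, p.age)).distinct.size // 3

// countDistinct("name") and countDistinct("age"): per-column distinct values.
val distinctNames = people.map(_.name).distinct.size // 2
val distinctAges  = people.map(_.age).distinct.size  // 2
```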

Of course, when you need many columns at once, this expression becomes
tedious, so I find it easiest to construct an SQL statement from the column
names, like so:

df.registerTempTable("df")
val sqlStatement = "select " +
  df.columns.map(col => s"count(distinct $col) as ${col}_distinct").mkString(", ") +
  " from df"
sqlContext.sql(sqlStatement)
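
For reference, this is the statement the string-building produces for a
hypothetical two-column DataFrame (plain Scala; no Spark needed to inspect
the string):

```scala
// Stand-in for df.columns on a DataFrame with columns name and age.
val columns = Array("name", "age")
val sqlStatement = "select " +
  columns.map(c => s"count(distinct $c) as ${c}_distinct").mkString(", ") +
  " from df"
println(sqlStatement)
// select count(distinct name) as name_distinct, count(distinct age) as age_distinct from df
```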

But this is not efficient - see this Jira ticket
<https://issues.apache.org/jira/browse/SPARK-4243> and the fix.

On Tue, Jan 5, 2016 at 5:55 AM, Arunkumar Pillai <ar...@gmail.com>
wrote:

> Thanks Yanbo,
>
> Thanks for the help. But I'm not able to find the countDistinct or
> approxCountDistinct functions. Are these within the DataFrame API or in
> some other package?

Re: finding distinct count using dataframe

Posted by Arunkumar Pillai <ar...@gmail.com>.
Thanks Yanbo,

Thanks for the help. But I'm not able to find the countDistinct or
approxCountDistinct functions. Are these within the DataFrame API or in
some other package?

On Tue, Jan 5, 2016 at 3:24 PM, Yanbo Liang <yb...@gmail.com> wrote:

> Hi Arunkumar,
>
> You can use datasetDF.select(countDistinct(col1, col2, col3, ...)) or
> approxCountDistinct for an approximate result.


-- 
Thanks and Regards
        Arun

Re: finding distinct count using dataframe

Posted by Yanbo Liang <yb...@gmail.com>.
Hi Arunkumar,

You can use datasetDF.select(countDistinct(col1, col2, col3, ...)) or
approxCountDistinct for an approximate result.

2016-01-05 17:11 GMT+08:00 Arunkumar Pillai <ar...@gmail.com>:

> Is there any function to find the distinct count of all the variables in
> a DataFrame?