Posted to user@spark.apache.org by Arunkumar Pillai <ar...@gmail.com> on 2016/01/05 10:11:17 UTC
finding distinct count using dataframe
Hi,
Are there any functions to find the distinct count of all the variables in a
dataframe?
val sc = new SparkContext(conf) // Spark context
val options = Map("header" -> "true", "delimiter" -> delimiter, "inferSchema" -> "true")
val sqlContext = new org.apache.spark.sql.SQLContext(sc) // SQL context
val datasetDF = sqlContext.read
  .format("com.databricks.spark.csv")
  .options(options)
  .load(inputFile)
We are able to get the schema and the variable data types. Is there any method
to get the distinct count?
--
Thanks and Regards
Arun
Re: finding distinct count using dataframe
Posted by Kristina Rogale Plazonic <kp...@gmail.com>.
I think it's an expression rather than a function you'd find in the API
(as a function, you could do df.select(col).distinct.count).
This will give you the number of distinct rows in both columns:
scala> df.select(countDistinct("name", "age"))
res397: org.apache.spark.sql.DataFrame = [COUNT(DISTINCT name,age): bigint]
Whereas this will give you the number of distinct values in each column:
scala> df.select(countDistinct("name"), countDistinct("age"))
res398: org.apache.spark.sql.DataFrame = [COUNT(DISTINCT name): bigint,
COUNT(DISTINCT age): bigint]
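The difference between the two forms can be illustrated with plain Scala
collections; the (name, age) rows below are hypothetical stand-ins for the
DataFrame's columns:

```scala
// Hypothetical (name, age) rows standing in for a two-column DataFrame.
val rows = Seq(("alice", 30), ("bob", 25), ("alice", 30), ("alice", 31))

// countDistinct("name", "age"): distinct count over the pair of columns.
val distinctNameAge = rows.distinct.size // 3

// countDistinct("name") and countDistinct("age"): per-column distinct counts.
val distinctNames = rows.map(_._1).distinct.size // 2
val distinctAges  = rows.map(_._2).distinct.size // 3
```

Note that ("alice", 30) and ("alice", 31) count as two distinct rows even
though they share a name, which is why the pairwise count can exceed a
single column's count.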
Of course, when you need many columns at once, this expression becomes
tedious, so I find it easiest to construct an SQL statement from the column
names, like so:
df.registerTempTable("df")
val sqlStatement = "select " +
  df.columns.map(col => s"count(distinct $col) as ${col}_distinct").mkString(", ") +
  " from df"
sqlContext.sql(sqlStatement)
But this is not efficient - see this Jira ticket
<https://issues.apache.org/jira/browse/SPARK-4243> and the fix.
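The string construction itself is plain Scala and easy to check in isolation;
the column names below are hypothetical stand-ins for df.columns:

```scala
// Hypothetical column names; in practice this would be df.columns.
val columns = Array("name", "age", "city")

// Build "count(distinct <col>) as <col>_distinct" for each column,
// joined into a single select statement.
val sqlStatement = "select " +
  columns.map(col => s"count(distinct $col) as ${col}_distinct").mkString(", ") +
  " from df"
// select count(distinct name) as name_distinct, count(distinct age) as age_distinct, count(distinct city) as city_distinct from df
```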
Re: finding distinct count using dataframe
Posted by Arunkumar Pillai <ar...@gmail.com>.
Thanks Yanbo,
Thanks for the help, but I'm not able to find the countDistinct or
approxCountDistinct functions. Are these part of DataFrame or in some
other package?
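For reference, in Spark 1.x both functions are defined in the
org.apache.spark.sql.functions object, so they need an explicit import; a
minimal sketch, assuming datasetDF as defined earlier and hypothetical
column names "col1" and "col2":

```scala
import org.apache.spark.sql.functions.{countDistinct, approxCountDistinct}

// Exact distinct count for one column, approximate for another.
val counts = datasetDF.select(
  countDistinct("col1"),
  approxCountDistinct("col2")
)
counts.show()
```

approxCountDistinct trades exactness for speed and memory, which matters
when counting distinct values over many columns of a large dataset.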
Re: finding distinct count using dataframe
Posted by Yanbo Liang <yb...@gmail.com>.
Hi Arunkumar,
You can use datasetDF.select(countDistinct(col1, col2, col3, ...)) or
approxCountDistinct for an approximate result.