Posted to user@spark.apache.org by buntu <bu...@gmail.com> on 2014/07/15 20:14:01 UTC

Count distinct with groupBy usage

Hi --

New to Spark and trying to figure out how to generate unique counts per
page by date, given this raw data:

timestamp,page,userId
1405377264,google,user1
1405378589,google,user2
1405380012,yahoo,user1
..

I can do a groupBy on a field and get a count:

val lines = sc.textFile("data.csv")
val csv = lines.map(_.split(","))
// group by page (note: count here returns the number of distinct pages)
csv.groupBy(_(1)).count

But I'm not able to see how to do a count distinct on userId, or how to
apply another groupBy on the timestamp field. Please let me know how to
handle such cases.

Thanks!




Re: Count distinct with groupBy usage

Posted by buntu <bu...@gmail.com>.
Thanks Sean!! That's what I was looking for -- group by on multiple fields.

I'm gonna play with it now. Thanks again!




Re: Count distinct with groupBy usage

Posted by Sean Owen <so...@cloudera.com>.
If you are counting per time and per page, then you need to group by
time and page, not just page. Something more like:

csv.groupBy(row => (row(0), row(1))) ...

This gives the full rows per (time, page) key. As Nick suggests, you
then pull out the userId and count the distinct values for each key:

... .mapValues(_.map(_(2)).toSet.size)

If you can tolerate some approximation, countApproxDistinctByKey will be
a lot faster. It operates on a pair RDD directly, so skip the groupBy
and pair each (time, page) key with its userId:

csv.map(row => ((row(0), row(1)), row(2))).countApproxDistinctByKey()
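
Putting the pieces together, a minimal end-to-end sketch (assuming the
spark-shell's predefined sc and the data.csv layout from your original
post):

// parse timestamp,page,userId rows
val csv = sc.textFile("data.csv").map(_.split(","))

// pair each (timestamp, page) key with its userId
val keyed = csv.map(row => ((row(0), row(1)), row(2)))

// exact distinct-user count per key
val exact = keyed.groupByKey().mapValues(_.toSet.size)

// approximate count (HyperLogLog-based, default relative accuracy 0.05)
val approx = keyed.countApproxDistinctByKey()

exact.collect().foreach(println)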


Re: Count distinct with groupBy usage

Posted by buntu <bu...@gmail.com>.
We have CDH 5.0.2, which doesn't include Spark SQL yet; it may only be
available in CDH 5.1, which is yet to be released.

If Spark SQL is the only option, then I might need to hack around to add it
to the current CDH deployment, if that's possible.




Re: Count distinct with groupBy usage

Posted by Zongheng Yang <zo...@gmail.com>.
Sounds like a job for Spark SQL:
http://spark.apache.org/docs/latest/sql-programming-guide.html !
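
For reference, a hypothetical sketch of what that could look like with the
Spark 1.0 SchemaRDD API; the View case class, file name and table name are
made up for illustration, and the timestamp is assumed to be already
bucketed into a date string:

import org.apache.spark.sql.SQLContext

case class View(date: String, page: String, userId: String)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD -> SchemaRDD conversion

val views = sc.textFile("views.csv").map(_.split(","))
  .map(r => View(r(0), r(1), r(2)))
views.registerAsTable("views")

sqlContext.sql(
  "SELECT date, page, COUNT(DISTINCT userId) FROM views GROUP BY date, page"
).collect().foreach(println)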


Re: Count distinct with groupBy usage

Posted by buntu <bu...@gmail.com>.
That is correct, Raffy. Assuming I convert the timestamp field to a date in
the required format, is it possible to report it by date?




Re: Count distinct with groupBy usage

Posted by Raffael Marty <ra...@pixlcloud.com>.
> All I'm attempting is to report the number of unique visitors per page by date.

But the way you are doing it currently, you will get a count per second, because the raw timestamps are epoch seconds. You have to bucketize your timestamps at whatever time resolution you want.
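
For example, a minimal sketch of bucketing the epoch-seconds timestamp down
to a day (the yyyy-MM-dd format is just one choice of resolution, and csv is
assumed to be the split RDD from the original post):

import java.text.SimpleDateFormat
import java.util.Date

// epoch seconds -> "yyyy-MM-dd" day bucket
def toDay(epochSeconds: String): String =
  new SimpleDateFormat("yyyy-MM-dd").format(new Date(epochSeconds.toLong * 1000L))

// key by (day, page) instead of (second, page)
val byDay = csv.map(row => ((toDay(row(0)), row(1)), row(2)))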

  -raffy


Re: Count distinct with groupBy usage

Posted by buntu <bu...@gmail.com>.
Thanks Nick.

All I'm attempting is to report the number of unique visitors per page by date.




Re: Count distinct with groupBy usage

Posted by Nick Pentreath <ni...@gmail.com>.
You can use .distinct.count on your user RDD.
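
For example (a minimal sketch, assuming csv is the split RDD from your
post):

csv.map(_(2)).distinct.count  // distinct userIds across the whole data set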


What are you trying to achieve with the time group by?
—
Sent from Mailbox
