Posted to dev@spark.apache.org by Avshalom <av...@gmail.com> on 2016/06/16 08:57:42 UTC

Spark SQL Count Distinct

Hi all,

We would like to run a count-distinct query with a filter applied.
For example, our data is of the form:

userId,  Name,   Restaurant name,  Restaurant Type
===================================================
100,     John,   Pizza Hut,        Pizza
100,     John,   Del Pepe,         Pasta
100,     John,   Hagen Daz,        Ice Cream
100,     John,   Dominos,          Pasta
200,     Mandy,  Del Pepe,         Pasta

And we would like to know the number of distinct Pizza eaters.
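
For concreteness, the query we have in mind looks roughly like this (just a
sketch; the column names and the "visits" DataFrame are illustrative, not our
real schema):

    // Sketch only: "visits" stands for a DataFrame with the columns above.
    import org.apache.spark.sql.functions.{col, countDistinct}

    val distinctPizzaEaters = visits
      .filter(col("restaurantType") === "Pizza")
      .agg(countDistinct(col("userId")).as("distinct_pizza_eaters"))

    distinctPizzaEaters.show()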

The issue is, we have roughly 200 million entries, so even with a large
cluster we could still be at risk of memory overload if the distinct
implementation has to load all of the data into RAM.
The Spark Core approach, which reduces each user key to 1 and then sums,
doesn't carry this risk.
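
By "reduce to 1 and sum" we mean roughly the following (again a sketch;
"rows" stands for our parsed RDD of the four fields above):

    // Key by userId, collapse each key's values to a single 1, then sum.
    // Duplicates are merged map-side and the shuffle can spill to disk,
    // so no single executor has to hold all userIds in memory at once.
    val distinctPizzaEatersCount = rows
      .filter { case (_, _, _, restaurantType) => restaurantType == "Pizza" }
      .map { case (userId, _, _, _) => (userId, 1L) }
      .reduceByKey((a, b) => 1L)
      .map(_._2)
      .sum()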

I've found this old thread which compares Spark Core and Spark SQL count
distinct performance:

http://apache-spark-developers-list.1001551.n3.nabble.com/DataFrame-distinct-vs-RDD-distinct-td12098.html

From reading the source code, it seems that the current Spark SQL count
distinct implementation is no longer based on a hash set, but we're still
concerned that it won't be as safe as the Spark Core approach.
We don't mind waiting a long time for the computation to finish, but we
don't want to run into out-of-memory errors.
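
For reference, the physical plan of such a query can be inspected with
explain(), to see which aggregate operators Spark chooses:

    // Print the logical and physical plans of the count-distinct query above.
    distinctPizzaEaters.explain(true)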

Would highly appreciate any input.

Thanks
Avshalom






Re: Spark SQL Count Distinct

Posted by Reynold Xin <rx...@databricks.com>.
You should be fine from 1.6 onward. Count distinct doesn't require the data
to fit in memory there.

