You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hadoop.apache.org by "Natarajan, Prabakaran 1. (NSN - IN/Bangalore)" <pr...@nsn.com> on 2014/08/06 10:52:37 UTC

High performance Count Distinct - NO Error

Hi

I am looking for high performance count distinct solution on Hive Query.

Regular count distinct is very slow but if I use probabilistic count distinct has more error percentage (if the number of records are small).


Is there is any solution to have exact count distinct but using low memory and without error?

Thanks and Regards
Prabakaran.N

Re: High performance Count Distinct - NO Error

Posted by Edward Capriolo <ed...@gmail.com>.

A simple and parallel way to do this is by breaking the data into ranges or
hashes then do distinct counting on those. Hive should do something like
this automatically.

This is a rather naive way.

SELECT column from source_table_0 where row_key mod 10 = 0;
SELECT column from source_table_1 where row_key mod 10 = 1;

create table all as
select count(dstinct) from source_table_0
union all
select count(distinct) from source_table_1

select count(*) from all;




On Wed, Aug 6, 2014 at 10:23 AM, Sergey Murylev <se...@gmail.com>
wrote:

> Why do you think that default implementation of COUNT DISTINCT is slow? As
> far as I understand the most famous way to find number of distinct elements
> is to sort them and scan all sorted items consequently excluding duplicated
> elements. Assimptotics of this algoritm is O(n *log n ), I think that there
> is no way to do this faster in general case. I think that Hive should use
> map-reduce sort stage to make items sorted, but probably in your case we
> have only one reduce task because we need to aggregate result on single
> instance.
> 06 авг. 2014 г. 12:54 пользователь "Natarajan, Prabakaran 1. (NSN -
> IN/Bangalore)" <pr...@nsn.com> написал:
> >
> > Hi
> >
> > I am looking for high performance count distinct solution on Hive Query.
> >
> > Regular count distinct is very slow but if I use probabilistic count
> distinct has more error percentage (if the number of records are small).
> >
> >
> > Is there is any solution to have exact count distinct but using low
> memory and without error?
> >
> > Thanks and Regards
> > Prabakaran.N
> >
> >
> >
>

Re: High performance Count Distinct - NO Error

Posted by Edward Capriolo <ed...@gmail.com>.

A simple and parallel way to do this is by breaking the data into ranges or
hashes then do distinct counting on those. Hive should do something like
this automatically.

This is a rather naive way.

SELECT column from source_table_0 where row_key mod 10 = 0;
SELECT column from source_table_1 where row_key mod 10 = 1;

create table all as
select count(dstinct) from source_table_0
union all
select count(distinct) from source_table_1

select count(*) from all;




On Wed, Aug 6, 2014 at 10:23 AM, Sergey Murylev <se...@gmail.com>
wrote:

> Why do you think that default implementation of COUNT DISTINCT is slow? As
> far as I understand the most famous way to find number of distinct elements
> is to sort them and scan all sorted items consequently excluding duplicated
> elements. Assimptotics of this algoritm is O(n *log n ), I think that there
> is no way to do this faster in general case. I think that Hive should use
> map-reduce sort stage to make items sorted, but probably in your case we
> have only one reduce task because we need to aggregate result on single
> instance.
> 06 авг. 2014 г. 12:54 пользователь "Natarajan, Prabakaran 1. (NSN -
> IN/Bangalore)" <pr...@nsn.com> написал:
> >
> > Hi
> >
> > I am looking for high performance count distinct solution on Hive Query.
> >
> > Regular count distinct is very slow but if I use probabilistic count
> distinct has more error percentage (if the number of records are small).
> >
> >
> > Is there is any solution to have exact count distinct but using low
> memory and without error?
> >
> > Thanks and Regards
> > Prabakaran.N
> >
> >
> >
>

Re: High performance Count Distinct - NO Error

Posted by Edward Capriolo <ed...@gmail.com>.

A simple and parallel way to do this is by breaking the data into ranges or
hashes then do distinct counting on those. Hive should do something like
this automatically.

This is a rather naive way.

SELECT column from source_table_0 where row_key mod 10 = 0;
SELECT column from source_table_1 where row_key mod 10 = 1;

create table all as
select count(dstinct) from source_table_0
union all
select count(distinct) from source_table_1

select count(*) from all;




On Wed, Aug 6, 2014 at 10:23 AM, Sergey Murylev <se...@gmail.com>
wrote:

> Why do you think that default implementation of COUNT DISTINCT is slow? As
> far as I understand the most famous way to find number of distinct elements
> is to sort them and scan all sorted items consequently excluding duplicated
> elements. Assimptotics of this algoritm is O(n *log n ), I think that there
> is no way to do this faster in general case. I think that Hive should use
> map-reduce sort stage to make items sorted, but probably in your case we
> have only one reduce task because we need to aggregate result on single
> instance.
> 06 авг. 2014 г. 12:54 пользователь "Natarajan, Prabakaran 1. (NSN -
> IN/Bangalore)" <pr...@nsn.com> написал:
> >
> > Hi
> >
> > I am looking for high performance count distinct solution on Hive Query.
> >
> > Regular count distinct is very slow but if I use probabilistic count
> distinct has more error percentage (if the number of records are small).
> >
> >
> > Is there is any solution to have exact count distinct but using low
> memory and without error?
> >
> > Thanks and Regards
> > Prabakaran.N
> >
> >
> >
>

Re: High performance Count Distinct - NO Error

Posted by Edward Capriolo <ed...@gmail.com>.

A simple and parallel way to do this is by breaking the data into ranges or
hashes then do distinct counting on those. Hive should do something like
this automatically.

This is a rather naive way.

SELECT column from source_table_0 where row_key mod 10 = 0;
SELECT column from source_table_1 where row_key mod 10 = 1;

create table all as
select count(dstinct) from source_table_0
union all
select count(distinct) from source_table_1

select count(*) from all;




On Wed, Aug 6, 2014 at 10:23 AM, Sergey Murylev <se...@gmail.com>
wrote:

> Why do you think that default implementation of COUNT DISTINCT is slow? As
> far as I understand the most famous way to find number of distinct elements
> is to sort them and scan all sorted items consequently excluding duplicated
> elements. Assimptotics of this algoritm is O(n *log n ), I think that there
> is no way to do this faster in general case. I think that Hive should use
> map-reduce sort stage to make items sorted, but probably in your case we
> have only one reduce task because we need to aggregate result on single
> instance.
> 06 авг. 2014 г. 12:54 пользователь "Natarajan, Prabakaran 1. (NSN -
> IN/Bangalore)" <pr...@nsn.com> написал:
> >
> > Hi
> >
> > I am looking for high performance count distinct solution on Hive Query.
> >
> > Regular count distinct is very slow but if I use probabilistic count
> distinct has more error percentage (if the number of records are small).
> >
> >
> > Is there is any solution to have exact count distinct but using low
> memory and without error?
> >
> > Thanks and Regards
> > Prabakaran.N
> >
> >
> >
>

Re: High performance Count Distinct - NO Error

Posted by Sergey Murylev <se...@gmail.com>.

Why do you think that default implementation of COUNT DISTINCT is slow? As
far as I understand the most famous way to find number of distinct elements
is to sort them and scan all sorted items consequently excluding duplicated
elements. Assimptotics of this algoritm is O(n *log n ), I think that there
is no way to do this faster in general case. I think that Hive should use
map-reduce sort stage to make items sorted, but probably in your case we
have only one reduce task because we need to aggregate result on single
instance.
06 авг. 2014 г. 12:54 пользователь "Natarajan, Prabakaran 1. (NSN -
IN/Bangalore)" <pr...@nsn.com> написал:
>
> Hi
>
> I am looking for high performance count distinct solution on Hive Query.
>
> Regular count distinct is very slow but if I use probabilistic count
distinct has more error percentage (if the number of records are small).
>
>
> Is there is any solution to have exact count distinct but using low
memory and without error?
>
> Thanks and Regards
> Prabakaran.N
>
>
>

Re: High performance Count Distinct - NO Error

Posted by Sergey Murylev <se...@gmail.com>.

Why do you think that default implementation of COUNT DISTINCT is slow? As
far as I understand the most famous way to find number of distinct elements
is to sort them and scan all sorted items consequently excluding duplicated
elements. Assimptotics of this algoritm is O(n *log n ), I think that there
is no way to do this faster in general case. I think that Hive should use
map-reduce sort stage to make items sorted, but probably in your case we
have only one reduce task because we need to aggregate result on single
instance.
06 авг. 2014 г. 12:54 пользователь "Natarajan, Prabakaran 1. (NSN -
IN/Bangalore)" <pr...@nsn.com> написал:
>
> Hi
>
> I am looking for high performance count distinct solution on Hive Query.
>
> Regular count distinct is very slow but if I use probabilistic count
distinct has more error percentage (if the number of records are small).
>
>
> Is there is any solution to have exact count distinct but using low
memory and without error?
>
> Thanks and Regards
> Prabakaran.N
>
>
>

Re: High performance Count Distinct - NO Error

Posted by Sergey Murylev <se...@gmail.com>.

Why do you think that default implementation of COUNT DISTINCT is slow? As
far as I understand the most famous way to find number of distinct elements
is to sort them and scan all sorted items consequently excluding duplicated
elements. Assimptotics of this algoritm is O(n *log n ), I think that there
is no way to do this faster in general case. I think that Hive should use
map-reduce sort stage to make items sorted, but probably in your case we
have only one reduce task because we need to aggregate result on single
instance.
06 авг. 2014 г. 12:54 пользователь "Natarajan, Prabakaran 1. (NSN -
IN/Bangalore)" <pr...@nsn.com> написал:
>
> Hi
>
> I am looking for high performance count distinct solution on Hive Query.
>
> Regular count distinct is very slow but if I use probabilistic count
distinct has more error percentage (if the number of records are small).
>
>
> Is there is any solution to have exact count distinct but using low
memory and without error?
>
> Thanks and Regards
> Prabakaran.N
>
>
>

Re: High performance Count Distinct - NO Error

Posted by Prasanth Jayachandran <pj...@hortonworks.com>.

Keep an eye on https://issues.apache.org/jira/browse/HIVE-7402 for probabilistic count distinct UDAF.  Prototype implementation is here https://github.com/t3rmin4t0r/hive-hll-udf/
This uses HyperLogLog++ probabilistic algorithm for count distinct estimation. It has almost no error for small number of records (in the order of hundreds or few thousands) since it uses linear counting for
low cardinality. And has a much lesser error for large number of records (mostly <3% for order of millions or billions of records).

If you would like to know the exact error for your specific dataset then you might need to use https://github.com/prasanthj/hyperloglog (the prototype implementation mentioned above uses this but I don’t think it will print the error) cli tool. 
And yes, it uses very low memory when compared to full and accurate count distinct.

Thanks
Prasanth Jayachandran

On Aug 6, 2014, at 1:52 AM, Natarajan, Prabakaran 1. (NSN - IN/Bangalore) <pr...@nsn.com> wrote:

> Hi
>  
> I am looking for high performance count distinct solution on Hive Query.
>  
> Regular count distinct is very slow but if I use probabilistic count distinct has more error percentage (if the number of records are small).
>  
>  
> Is there is any solution to have exact count distinct but using low memory and without error?
>  
> Thanks and Regards
> Prabakaran.N 

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: High performance Count Distinct - NO Error

Posted by Sergey Murylev <se...@gmail.com>.

Why do you think that default implementation of COUNT DISTINCT is slow? As
far as I understand the most famous way to find number of distinct elements
is to sort them and scan all sorted items consequently excluding duplicated
elements. Assimptotics of this algoritm is O(n *log n ), I think that there
is no way to do this faster in general case. I think that Hive should use
map-reduce sort stage to make items sorted, but probably in your case we
have only one reduce task because we need to aggregate result on single
instance.
06 авг. 2014 г. 12:54 пользователь "Natarajan, Prabakaran 1. (NSN -
IN/Bangalore)" <pr...@nsn.com> написал:
>
> Hi
>
> I am looking for high performance count distinct solution on Hive Query.
>
> Regular count distinct is very slow but if I use probabilistic count
distinct has more error percentage (if the number of records are small).
>
>
> Is there is any solution to have exact count distinct but using low
memory and without error?
>
> Thanks and Regards
> Prabakaran.N
>
>
>