Posted to user@hive.apache.org by Nick Pentreath <ni...@gmail.com> on 2013/01/08 15:00:50 UTC

HyperLogLog Approximate Distinct Counting as a Hive UDAF

Hi


I've recently committed an implementation of a Hive UDAF that uses
HyperLogLog for approximate distinct counting (
https://github.com/MLnick/hive-udf), based on Clearspring's stream-lib
library (https://github.com/clearspring/stream-lib).


Perhaps it may prove useful for others. The most interesting use case with
respect to Hive is the ability to aggregate data while keeping a compact
sketch of distinct counts (say of user IDs or some similar column). Because
sketches can be merged, this allows further aggregation with reasonably
accurate distinct counts on the fly, without having to go back to the
original source.
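
To illustrate why a sketch can be aggregated further, here is a minimal,
self-contained HyperLogLog in Python. It is only a sketch of the general
technique, not the code in the UDAF or in stream-lib: the class name, the
choice of SHA-1 as the hash, the precision p=12 and the sample user IDs are
all assumptions made for the example. The point is that merging two sketches
(an element-wise max of the registers) gives the same registers as sketching
the combined data directly.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog (hypothetical class, not the stream-lib API)."""

    def __init__(self, p=12):
        self.p = p                      # number of bits used for the register index
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, value):
        # Derive a 64-bit hash from the value (SHA-1 is an arbitrary choice here).
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                    # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Merging is an element-wise max, so it never needs the raw data.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)            # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            return m * math.log(m / zeros)          # small-range (linear counting) correction
        return raw


# Two partial aggregates (e.g. two days) with overlapping, made-up user IDs.
sketch_day1 = HyperLogLog()
sketch_day2 = HyperLogLog()
full = HyperLogLog()
for user_id in range(0, 60000):
    sketch_day1.add(user_id)
    full.add(user_id)
for user_id in range(40000, 100000):
    sketch_day2.add(user_id)
    full.add(user_id)

merged = sketch_day1.merge(sketch_day2)
print("day 1 estimate: ", round(sketch_day1.count()))
print("day 2 estimate: ", round(sketch_day2.count()))
print("merged estimate:", round(merged.count()))
```

With 4096 registers the expected relative error is roughly 1.04 / sqrt(4096),
or about 1.6%, so the merged estimate should land close to the true 100,000
distinct IDs even though neither partial sketch saw all of the data.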


In the case of our data, this would reduce the number of rows from hundreds
of millions (aggregating up to user ID) down to tens of thousands.

If there is interest in including this in Hive, I could look at writing the
appropriate tests for the Hive generic UDAF suite and submitting a ticket.

Thanks

Nick