You are viewing a plain text version of this content. The canonical link for it is here.

Posted to notifications@couchdb.apache.org by "Nick Vatamaniuc (JIRA)" <ji...@apache.org> on 2016/03/21 20:28:25 UTC

[jira] [Commented] (COUCHDB-2971) Provide cardinality estimate (COUNT DISTINCT) as builtin reducer

    [ https://issues.apache.org/jira/browse/COUCHDB-2971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204938#comment-15204938 ] 

Nick Vatamaniuc commented on COUCHDB-2971:
------------------------------------------

That would be neat to have. So if HLLs can be unioned then they satisfy the associativity and commutativity bit. 

So the algorithm could be if rereduce is false, then create an HLL object as an accumulator and add all keys to it, return that.  If it is true, then merge (union) together all HLL registers to produce a single HLL register (the hyper implementation above has that function).  It seems they'd need to be a final transformation of before returning the final result, extract estimated cardinality from the HLL object. Multiple precisions could be hacked together as a selection of <<"_distinct_hll_4">> or <<"_distinct_hll_6"">>, etc variants. 

> Provide cardinality estimate (COUNT DISTINCT) as builtin reducer
> ----------------------------------------------------------------
>
>                 Key: COUCHDB-2971
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-2971
>             Project: CouchDB
>          Issue Type: Improvement
>            Reporter: Adam Kocoloski
>
> We’ve seen a number of applications now where a user needs to count the number of unique keys in a view. Currently the recommended approach is to add a trivial reduce function and then count the number of rows in a _list function or client-side application code, but of course that doesn’t scale nicely.
> It seems that in a majority of these cases all that’s required is an approximation of the number of distinct entries, which brings us into the space of hash sets, linear probabilistic counters, and the ever-popular “HyperLogLog” algorithm. Taking HLL specifically, this seems like quite a nice candidate for a builtin reduce. The size of the data structure is independent of the number of input elements and individual HLL filters can be unioned together. There’s already what seems to be a good MIT-licensed implementation on GitHub:
> https://github.com/GameAnalytics/hyper
> One caveat is that this reducer would not work for group_level reductions; it’d only give the correct result for the exact key. I don’t think that should preclude us from evaluating it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)