You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@phoenix.apache.org by "Ethan Wang (JIRA)" <ji...@apache.org> on 2017/06/20 23:13:00 UTC

[jira] [Comment Edited] (PHOENIX-3390) Custom UDAF for HyperLogLogPlus

    [ https://issues.apache.org/jira/browse/PHOENIX-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16056597#comment-16056597 ] 

Ethan Wang edited comment on PHOENIX-3390 at 6/20/17 11:12 PM:
---------------------------------------------------------------

Had a discussion with [~talktoswapna] today. Thanks to her time we clarified some items regarding this feature. Now I'd like to add some more detail to the description of this ticket.

First, HyperLogLog is an algorithm that is only good at one particular task: Count of distinct items in a multi-set. It does not help on generic user-defined aggregation function, such as APPROX_SUM, APPROX_MAX etc. (The APPROX_SUM example Swapna provided above is actually the count of the union of two multi-sets, but it's still count)

Secondly, HyperLogLog is optimizing on "Space usage", so that it is able to do the count on a huge data set (before it is impossible do so, because you have to remember each distinct item in memory). So HLL is a "can't-do" to "can-do" improvement, not a "slow" to "faster" improvement: It still has to full scan through every item, there is no "sampling" going on in the process!

Besides HLL, there do exist other algorithms and their implementations that will achieve other types of user defined approximate aggregation. Those works such as SuperLogLog, KLL(sketching algorithm by Zohar et al), the work by Manku, Rajagopalan et al (1), and the work by Munro and Paterson et al (2).  Among which, Yahoo's library DataSketches(https://datasketches.github.io. thanks for recommending @Rahul G) provides the following Approximate aggregations that may be particularly interesting for Phoenix. Besides HLL based APPROX_COUNT, there are :
    APPROX_MEDIAN
    APPROX_PERCENTILE (like95 Percentile)
    APPROX_MIN/MAX
    APPROX_KMV (the famous Kth Minimum Value)


=================================================================================
Now, everything above are space-optimized approaches. We could, have a different set of user defined functions for approximate aggregation that is time-optimized based.  i.e., run faster. 
Such as a use case below: "I'm able to do select count( * ) from A, but I don't want to, because full scan of A take too long. Can I just count on 10% sample and get a approximate that way?" 
Along this line, the similar approaches would be BlinkDB, PHOENIX-153 tablesample.

Phoenix community please advice the direction we are going. 

[~jamestaylor][~gjacoby]





(1). Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. Random sampling tech-niques for space efficient online computation of order statistics of large datasets.
(2). J.I. Munro and M.S. Paterson. Selection and sorting with limited storage.


was (Author: aertoria):
Had a discussion with [~swapna] today. Thanks to her time we clarified some items regarding this feature. Now I'd like to add some more detail to the description of this ticket.

First, HyperLogLog is an algorithm that is only good at one particular task: Count of distinct items in a multi-set. It does not help on generic user-defined aggregation function, such as APPROX_SUM, APPROX_MAX etc. (The APPROX_SUM example Swapna provided above is actually the count of the union of two multi-sets, but it's still count)

Secondly, HyperLogLog is optimizing on "Space usage", so that it is able to do the count on a huge data set (before it is impossible do so, because you have to remember each distinct item in memory). So HLL is a "can't-do" to "can-do" improvement, not a "slow" to "faster" improvement: It still has to full scan through every item, there is no "sampling" going on in the process!

Besides HLL, there do exist other algorithms and their implementations that will achieve other types of user defined approximate aggregation. Those works such as SuperLogLog, KLL(sketching algorithm by Zohar et al), the work by Manku, Rajagopalan et al (1), and the work by Munro and Paterson et al (2).  Among which, Yahoo's library DataSketches(https://datasketches.github.io. thanks for recommending @Rahul G) provides the following Approximate aggregations that may be particularly interesting for Phoenix. Besides HLL based APPROX_COUNT, there are :
    APPROX_MEDIAN
    APPROX_PERCENTILE (like95 Percentile)
    APPROX_MIN/MAX
    APPROX_KMV (the famous Kth Minimum Value)


=================================================================================
Now, everything above are space-optimized approaches. We could, have a different set of user defined functions for approximate aggregation that is time-optimized based.  i.e., run faster. 
Such as a use case below: "I'm able to do select count( * ) from A, but I don't want to, because full scan of A take too long. Can I just count on 10% sample and get a approximate that way?" 
Along this line, the similar approaches would be BlinkDB, PHOENIX-153 tablesample.

Phoenix community please advice the direction we are going. 

[~jamestaylor][~gjacoby]





(1). Gurmeet Singh Manku, Sridhar Rajagopalan, and Bruce G. Lindsay. Random sampling tech-niques for space efficient online computation of order statistics of large datasets.
(2). J.I. Munro and M.S. Paterson. Selection and sorting with limited storage.

> Custom UDAF for HyperLogLogPlus
> -------------------------------
>
>                 Key: PHOENIX-3390
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-3390
>             Project: Phoenix
>          Issue Type: New Feature
>            Reporter: Swapna Kasula
>            Priority: Minor
>
> With ref # PHOENIX-2069
> Custome UDAF to aggregate/union of Hyperloglog's of a column and returns a Hyperloglog.
> select hllUnion(col1) from table;  //returns a Hyperloglog, which is the union of all hyperloglog's from all rows for column 'col1'



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)