Posted to dev@spark.apache.org by Ryan Berti <rb...@netflix.com.INVALID> on 2023/01/11 23:23:58 UTC

Implementation for approx_count_distinct_sketch and associated functions

Hello!

I've recently wanted to persist the sketches that back the
approx_count_distinct function so that distinct counts can be re-aggregated. This
2019 Databricks post
<https://www.databricks.com/blog/2019/05/08/advanced-analytics-with-apache-spark.html>
proposes the use of spark-alchemy, and I've also seen other discussions that
propose building custom UDAFs/UDFs to achieve the desired effect. Trino
supports re-aggregating HLL sketches
<https://trino.io/docs/current/functions/hyperloglog.html> natively, and I
figured Spark should also provide this functionality natively.
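
As a rough illustration of why stored sketches make re-aggregation possible,
here is a minimal pure-Python sketch of plain HyperLogLog. It is not HLL++ and
not Spark's or Trino's implementation; the precision P and every function name
below are assumptions made for the example. Two sketches built with the same
precision and hash can be merged by taking the register-wise maximum, and the
result is identical to a sketch built over the union of the inputs.

```python
import hashlib
import math

P = 12          # assumed precision for this example: 2**P registers
M = 1 << P      # number of registers


def _hash64(value):
    """64-bit hash taken from SHA-1; any well-mixed 64-bit hash would do."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def new_sketch():
    """An empty sketch is a list of M zeroed registers."""
    return [0] * M


def add(sketch, value):
    """Update the register selected by the top P bits of the hash."""
    h = _hash64(value)
    idx = h >> (64 - P)
    rest = h & ((1 << (64 - P)) - 1)
    # 1-based position of the leftmost 1-bit within the remaining 64-P bits
    rank = (64 - P) - rest.bit_length() + 1
    if rank > sketch[idx]:
        sketch[idx] = rank


def merge(a, b):
    """Re-aggregate two sketches: register-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]


def estimate(sketch):
    """Raw HyperLogLog estimate with linear counting for small cardinalities."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)
    return raw


# Two overlapping "partitions" and a reference sketch over their union.
day1, day2, both = new_sketch(), new_sketch(), new_sketch()
for i in range(0, 6000):
    add(day1, i)
    add(both, i)
for i in range(4000, 10000):
    add(day2, i)
    add(both, i)

merged = merge(day1, day2)
print(merged == both)           # merged registers equal the union's registers
print(round(estimate(merged)))  # approximate count of the 10000 distinct values
```

Because merging only needs the registers, the per-partition sketches can be
written out once and combined later without rescanning the raw data.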

After searching the Spark JIRA and this dev list, I found a few requests
for this functionality:

   - Here's a ticket that was closed (due to inactivity?) in 2019
   <https://issues.apache.org/jira/browse/SPARK-16484>, where there seemed
   to be agreement that adding the requested implementation would be simple
   - Here's a discussion in this dev list from 2015
   <https://lists.apache.org/thread/pqjopxh897wx9b39y2tg1g4bot2d86df>,
   which focuses on implementing the functionality via legacy(?) APIs, and
   interoperability between HLL implementations.

I've implemented two new aggregate functions and a new misc function
<https://github.com/RyanBerti/spark/pull/1> that handle HLL++ sketches, and
I'd like to open my implementation up for review. Can someone help me
re-open SPARK-16484 <https://issues.apache.org/jira/browse/SPARK-16484>,
so that I can move forward with opening a PR against the main Spark repo?

Thanks!

Ryan Berti

Senior Data Engineer  |  Ads DE

M 7023217573

5808 W Sunset Blvd  |  Los Angeles, CA 90028

Re: Implementation for approx_count_distinct_sketch and associated functions

Posted by Ryan Berti <rb...@netflix.com.INVALID>.

Hello,

I wanted to follow up with a link to the Spark PR associated with these changes
<https://github.com/apache/spark/pull/39678>; I'm excited to open up the
implementation for community review!

For reference, I worked with @Daniel Tenedorio
<da...@databricks.com> and the Databricks team on a pre-review
<https://github.com/RyanBerti/spark/pull/1>, which yielded some interesting
discussions about the existing HyperLogLogPlusPlus implementation. I think
we're in agreement that having a cross-compatible HLL++ implementation
would be super valuable, though I didn't attempt to take that work on in
this PR. I've included a format identifier in this implementation's HLL++
sketches to set us up for migrating to a cross-compatible sketch format /
HLL++ implementation in the future.
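
To make the format-identifier idea concrete, here is a minimal Python sketch
of a versioned binary layout. The layout itself (one byte for the format
identifier, one byte for the precision, then one byte per register), the
constant FORMAT_ID_V1, and the function names are all hypothetical and are
not taken from the PR; the point is only that a reader can check the leading
identifier and reject or dispatch on formats it does not know.

```python
import struct

FORMAT_ID_V1 = 1  # hypothetical identifier for this illustrative layout


def serialize(registers, precision):
    """Prefix the register bytes with a format identifier and the precision."""
    if len(registers) != 1 << precision:
        raise ValueError("register count does not match precision")
    header = struct.pack(">BB", FORMAT_ID_V1, precision)
    return header + bytes(registers)


def deserialize(blob):
    """Check the format identifier before interpreting the rest of the bytes."""
    format_id, precision = struct.unpack_from(">BB", blob, 0)
    if format_id != FORMAT_ID_V1:
        raise ValueError(f"unsupported sketch format: {format_id}")
    registers = list(blob[2:])
    if len(registers) != 1 << precision:
        raise ValueError("sketch is truncated or has trailing bytes")
    return registers, precision


# Round trip with a tiny 16-register sketch (precision 4).
blob = serialize([0, 3, 1, 0] * 4, 4)
registers, precision = deserialize(blob)
print(precision, registers[:4])
```

A future cross-compatible format could then be introduced under a new
identifier while sketches written with the old one remain readable.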

Thanks

Ryan Berti

Senior Data Engineer  |  Ads DE

M 7023217573

5808 W Sunset Blvd  |  Los Angeles, CA 90028
