Posted to dev@spark.apache.org by Ryan Berti <rb...@netflix.com.INVALID> on 2023/01/11 23:23:58 UTC
Implementation for approx_count_distinct_sketch and associated functions
Hello!
I've recently wanted to write out the sketches that back the
approx_count_distinct function so that distinct counts can be re-aggregated. This
2019 Databricks post
<https://www.databricks.com/blog/2019/05/08/advanced-analytics-with-apache-spark.html>
proposes the use of spark-alchemy, and I've also seen other discussions that
propose building custom UDAFs/UDFs to achieve the same effect. Trino
supports re-aggregating HLL sketches
<https://trino.io/docs/current/functions/hyperloglog.html> natively, and I
figured Spark should provide this functionality natively as well.
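To illustrate why stored sketches allow re-aggregation, here is a minimal sketch of a toy HyperLogLog in Python. It is not Spark's HyperLogLogPlusPlus and not the code in the linked PR; the class name ToyHLL, the SHA-1 based hash, and the precision p=12 are all assumptions chosen for illustration. The point it shows is that merging two sketches is an element-wise max over their registers, so sketches built per partition (or per day) can be combined later without rescanning the rows.

```python
import hashlib
import math


class ToyHLL:
    """Toy HyperLogLog (illustrative only, not HLL++)."""

    def __init__(self, p=12):
        self.p = p                      # precision: 2**p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # Assumption: a 64-bit hash taken from SHA-1; any well-mixed hash works.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                   # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # 1-based position of the leftmost 1-bit within the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Re-aggregation: element-wise max of the two register arrays.
        assert self.p == other.p, "sketches must share the same precision"
        merged = ToyHLL(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)      # bias constant for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)   # small-range correction
        return raw


# Two overlapping "days" of ids, plus a sketch built over all rows at once.
day1, day2, both = ToyHLL(), ToyHLL(), ToyHLL()
for i in range(0, 60000):
    day1.add(i)
    both.add(i)
for i in range(40000, 100000):
    day2.add(i)
    both.add(i)

merged = day1.merge(day2)
print(merged.registers == both.registers)   # merged sketch matches the full scan
print(round(merged.estimate()))             # approximate distinct count of the union
```

Summing the two per-day estimates would double count the ids that appear on both days, whereas the merged sketch estimates the distinct count of the union.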
After searching the Spark JIRA and this dev list, I found a few requests
for this functionality:
- Here's a ticket that was closed (due to inactivity?) in 2019
<https://issues.apache.org/jira/browse/SPARK-16484>, where there seemed
to be agreement that adding the requested implementation would be simple
- Here's a discussion in this dev list from 2015
<https://lists.apache.org/thread/pqjopxh897wx9b39y2tg1g4bot2d86df>,
which focuses on implementing the functionality via legacy(?) APIs, and
interoperability between HLL implementations.
I've implemented two new aggregate functions and a new misc function
<https://github.com/RyanBerti/spark/pull/1> that handle HLL++ sketches, and
I'd like to open my implementation up for review. Can someone help me
re-open SPARK-16484 <https://issues.apache.org/jira/browse/SPARK-16484>,
so that I can then open a PR against the main Spark repo?
Thanks!
Ryan Berti
Senior Data Engineer | Ads DE
M 7023217573
5808 W Sunset Blvd | Los Angeles, CA 90028
Re: Implementation for approx_count_distinct_sketch and associated functions
Posted by Ryan Berti <rb...@netflix.com.INVALID>.
Hello,
I wanted to follow up and link the Spark PR associated with these changes
<https://github.com/apache/spark/pull/39678>; I'm excited to open up the
implementation for community review!
For reference, I worked with @Daniel Tenedorio
<da...@databricks.com> and the Databricks team on a pre-review
<https://github.com/RyanBerti/spark/pull/1>, which yielded some interesting
discussions about the existing HyperLogLogPlusPlus implementation. I think
we're in agreement that having a cross-compatible HLL++ implementation
would be super valuable, though I didn't attempt to take that work on in
this PR. I've included a format identifier in this implementation's HLL++
sketches to set us up for migrating to a cross-compatible sketch format /
HLL++ implementation in the future.
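As a rough illustration of what a format identifier buys, here is a minimal Python sketch of a serialized layout with a leading identifier byte. The layout (one identifier byte, one precision byte, then one byte per register), the constant FORMAT_V1, and the function names are hypothetical and are not taken from the PR; the idea is only that a reader can check the identifier first and reject or dispatch on formats it does not understand.

```python
import struct

# Hypothetical identifier value; the PR's actual value and layout may differ.
FORMAT_V1 = 1


def serialize(p, registers, format_id=FORMAT_V1):
    """Pack a sketch as: [format id: 1 byte][precision: 1 byte][registers]."""
    if len(registers) != 1 << p:
        raise ValueError("register count does not match precision")
    header = struct.pack(">BB", format_id, p)
    return header + bytes(registers)    # each register value must fit in one byte


def deserialize(blob):
    """Read the header first, then decode the body only for known formats."""
    format_id, p = struct.unpack_from(">BB", blob, 0)
    if format_id != FORMAT_V1:
        # A future reader could dispatch to another decoder here instead.
        raise ValueError(f"unsupported sketch format: {format_id}")
    registers = list(blob[2:])
    if len(registers) != 1 << p:
        raise ValueError("register count does not match precision")
    return p, registers


blob = serialize(4, list(range(16)))
precision, regs = deserialize(blob)
print(precision, regs == list(range(16)))
```

With this kind of header, sketches written today stay distinguishable from sketches written later in a different, cross-compatible format.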
Thanks
Ryan Berti