Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2023/01/20 18:05:00 UTC

[jira] [Commented] (SPARK-16484) Incremental cardinality estimation operations with HyperLogLog

    [ https://issues.apache.org/jira/browse/SPARK-16484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17679252#comment-17679252 ] 

Apache Spark commented on SPARK-16484:
--------------------------------------

User 'RyanBerti' has created a pull request for this issue:
https://github.com/apache/spark/pull/39678
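
For context: this line of work ultimately shipped in Spark 3.5 as the DataSketches-backed aggregate functions hll_sketch_agg, hll_union_agg and hll_sketch_estimate. A minimal sketch of the resulting workflow, assuming those built-ins; the events table with its event_time/user_id columns, and the hourly_sketches table, are hypothetical names:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Pre-aggregate once: one compact binary sketch per hour.
// 12 is the default lgConfigK (precision) for hll_sketch_agg.
spark.sql("""
  CREATE TABLE hourly_sketches AS
  SELECT date_trunc('hour', event_time) AS hour,
         hll_sketch_agg(user_id, 12)    AS sketch
  FROM events
  GROUP BY date_trunc('hour', event_time)
""")

// Any window is then a cheap merge of the stored sketches, never a rescan
// of the raw events; 'true' tolerates sketches built with different lgConfigK.
spark.sql("""
  SELECT hll_sketch_estimate(hll_union_agg(sketch, true)) AS distinct_users
  FROM hourly_sketches
  WHERE hour >= current_timestamp() - INTERVAL 90 DAYS
""").show()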

> Incremental cardinality estimation operations with HyperLogLog
> --------------------------------------------------------------
>
>                 Key: SPARK-16484
>                 URL: https://issues.apache.org/jira/browse/SPARK-16484
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Yongjia Wang
>            Priority: Major
>              Labels: bulk-closed
>
> Efficient cardinality estimation is very important, and Spark SQL has had approxCountDistinct, based on HyperLogLog, for quite some time. However, there is no way to do incremental estimation. For example, to keep an updated distinct count over the last 90 days, we currently have to re-aggregate the entire window over and over again. A more efficient approach is to serialize the counter for smaller time windows (such as hourly), so the counts can be updated incrementally for any time window by merging the stored counters.
> With support for custom UDAFs, the BinaryType data type, and the HyperLogLogPlusPlus implementation in the current Spark version, it is easy enough to extend the functionality to incremental counting, and even to other general set operations such as intersection and set difference. The Spark API is already as elegant as it can be, but it still takes quite some effort to build a custom implementation of the aforementioned operations, which should be in high demand. I have searched but failed to find a usable existing solution or any ongoing effort in this direction. The closest I found is the following, but it does not work with Spark 1.6 due to API changes:
> https://github.com/collectivemedia/spark-hyperloglog/blob/master/src/main/scala/org/apache/spark/sql/hyperloglog/aggregates.scala
> I wonder whether it is worth integrating such operations into Spark SQL. The only problem I see is that this depends on the serialization format of a specific HLL implementation, which could introduce compatibility issues. But as long as users are aware of that caveat, it should be fine.
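
To make the approach above concrete: a minimal Scala sketch of the serialize-then-merge pattern, using the Apache DataSketches HLL library purely as an illustration (the issue predates that library, and, as noted above, any choice of library pins the stored sketches to its serialization format):

import org.apache.datasketches.hll.{HllSketch, Union}

// One compact, serialized sketch per small time window (e.g. per hour).
// lgK = 12 gives roughly 1.6% relative standard error; raise it for
// better accuracy at the cost of larger sketches.
def hourlySketch(ids: Iterator[String], lgK: Int = 12): Array[Byte] = {
  val sketch = new HllSketch(lgK)
  ids.foreach(id => sketch.update(id))
  sketch.toCompactByteArray
}

// Estimating any window (say, the last 90 days) is then just a merge of
// the stored hourly sketches -- the raw events are never rescanned.
def estimateWindow(stored: Seq[Array[Byte]], lgMaxK: Int = 12): Double = {
  val union = new Union(lgMaxK)
  stored.foreach(bytes => union.update(HllSketch.heapify(bytes)))
  union.getResult.getEstimate
}

The set operations mentioned in the issue follow from the same primitives: intersection and difference can be approximated from union estimates via inclusion-exclusion, though the error compounds, so plain HLL is best suited to unions.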


