Posted to dev@datasketches.apache.org by "Lee Rhodes (Jira)" <ji...@apache.org> on 2020/07/20 17:49:00 UTC

[jira] [Assigned] (DATASKETCHES-8) HLL doesn't take empty strings as distinct values

     [ https://issues.apache.org/jira/browse/DATASKETCHES-8?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lee Rhodes reassigned DATASKETCHES-8:
-------------------------------------

    Assignee: Lee Rhodes

> HLL doesn't take empty strings as distinct values
> -------------------------------------------------
>
>                 Key: DATASKETCHES-8
>                 URL: https://issues.apache.org/jira/browse/DATASKETCHES-8
>             Project: Apache Datasketches
>          Issue Type: Bug
>            Reporter: Adam Tamas
>            Assignee: Lee Rhodes
>            Priority: Major
>
> When using ds_hll, Hive does not count empty strings as distinct values for string and varchar columns.
> Example:
> Given a table t with the following (string, char(1), varchar(1)) values:
> {code:java}
> +------+------+------+
> | t.s  | t.c  | t.v  |
> +------+------+------+
> |      |      |      |
> | a    | a    | a    |
> |      |      |      |
> | a    | a    | a    |
> | s    | s    | s    |
> | d    | d    | d    |
> +------+------+------+
> {code}
> select ds_hll_estimate(ds_hll_sketch(s)), ds_hll_estimate(ds_hll_sketch(c)), ds_hll_estimate(ds_hll_sketch(v)) from t;
> {code:java}
> +--------------------+--------------------+--------------------+
> |        _c0         |        _c1         |        _c2         |
> +--------------------+--------------------+--------------------+
> | 3.000000014901161  | 4.000000029802323  | 3.000000014901161  |
> +--------------------+--------------------+--------------------+
> {code}
> The problem could be here: https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/hll/BaseHllSketch.java#L351
> The char column works because its values are padded with spaces up to the declared length.
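
The reported numbers can be reproduced with a small stand-in, under two loudly stated assumptions: that the update path at the linked line skips null and empty strings, and that an exact HashSet is an acceptable substitute for the HLL sketch at this tiny cardinality. The class and method names below are hypothetical and are not part of the DataSketches API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class EmptyStringDemo {

    // Hypothetical stand-in for a sketch update path. An exact HashSet
    // replaces the HLL sketch; the guard below is an assumption about what
    // the linked line does, not a copy of the library code.
    public static int countSkippingEmpty(List<String> values) {
        Set<String> seen = new HashSet<>();
        for (String v : values) {
            if (v == null || v.isEmpty()) {
                continue; // assumed guard: empty input never reaches the sketch
            }
            seen.add(v);
        }
        return seen.size();
    }

    // Emulates char(n) storage by right-padding a value with spaces.
    public static String padToLength(String v, int n) {
        StringBuilder sb = new StringBuilder(v);
        while (sb.length() < n) {
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same six rows as the table in the report.
        List<String> column = Arrays.asList("", "a", "", "a", "s", "d");

        // string / varchar(1): the empty string is dropped by the guard.
        System.out.println("string/varchar: " + countSkippingEmpty(column));

        // char(1): the empty string becomes " " and is therefore counted.
        List<String> padded = new ArrayList<>();
        for (String v : column) {
            padded.add(padToLength(v, 1));
        }
        System.out.println("char(1): " + countSkippingEmpty(padded));
    }
}
```

With the six rows from the report, this prints 3 for the string/varchar case and 4 for the char(1) case, matching the rounded estimates above. If the assumption about the guard holds, the expected answer of 4 for all three columns would require the empty string to be treated as a valid value.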



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
