Posted to issues@impala.apache.org by "Adam Tamas (Jira)" <ji...@apache.org> on 2020/07/31 11:05:00 UTC

[jira] [Closed] (IMPALA-9942) DataSketches HLL shouldn't take empty strings as distinct values

     [ https://issues.apache.org/jira/browse/IMPALA-9942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Tamas closed IMPALA-9942.
------------------------------
    Resolution: Fixed

> DataSketches HLL shouldn't take empty strings as distinct values
> ----------------------------------------------------------------
>
>                 Key: IMPALA-9942
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9942
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 4.0
>            Reporter: Gabor Kaszab
>            Assignee: Adam Tamas
>            Priority: Major
>              Labels: newbie, ramp-up
>
> Let's consider a table that has string, char and varchar columns and some of the values in these columns are empty strings.
> {code:java}
> select * from strings;
> +-----+------------+-----+
> | s   | c          | v   |
> +-----+------------+-----+
> |     |            |     |
> | abc | abc        | abc |
> |     |            |     |
> +-----+------------+-----+
> {code}
> If I query the number of distinct values using DataSketches HLL, the empty string is counted as a distinct value and adds 1 to the result.
> {code:java}
> select ds_hll_estimate(ds_hll_sketch(s)) as hll_string, ds_hll_estimate(ds_hll_sketch(c)) as hll_char, ds_hll_estimate(ds_hll_sketch(v)) as hll_varchar from strings;
> +------------+----------+-------------+
> | hll_string | hll_char | hll_varchar |
> +------------+----------+-------------+
> | 2          | 2        | 2           |
> +------------+----------+-------------+
> {code}
> However, Hive's implementation omits empty strings, so for the example above Hive would return 1 for each column.
> I assume Hive omits empty strings because of this line:
> https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/hll/BaseHllSketch.java#L351
> The first step of this task would be to decide which approach is correct; the second step would be to adjust Impala if we decide to follow Hive's behaviour.
> By the way, in Impala this function adds strings to the HLL sketches:
> https://github.com/apache/impala/commit/7e456dfa9d932bcdb317ad6477abc3c399abacf2#diff-cb22c62db38ee853b857c3b2302244dfR1661
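
The difference between the two behaviours described above can be illustrated with a minimal sketch. It does not use the DataSketches library: an exact HashSet stands in for the HLL sketch, and the class name EmptyStringDistinct, the method countDistinct and the skipEmpty flag are hypothetical names introduced only for this illustration. The skipEmpty branch mimics the kind of early return on empty input that the issue suspects in BaseHllSketch; it is an assumption about the effect of that guard, not a copy of the library code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative stand-in: an exact HashSet replaces the HLL sketch so the
// effect of an empty-string guard is visible without the DataSketches library.
class EmptyStringDistinct {

    // Counts distinct non-null values. When skipEmpty is true, empty strings
    // are ignored as well (the Hive-like behaviour described in the issue).
    static int countDistinct(List<String> values, boolean skipEmpty) {
        Set<String> seen = new HashSet<>();
        for (String v : values) {
            if (v == null) {
                continue; // NULLs are never counted
            }
            if (skipEmpty && v.isEmpty()) {
                continue; // assumed guard: empty strings do not update the sketch
            }
            seen.add(v);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Same contents as column s of the example table: '', 'abc', ''
        List<String> column = Arrays.asList("", "abc", "");
        System.out.println(countDistinct(column, false)); // prints 2
        System.out.println(countDistinct(column, true));  // prints 1
    }
}
```

With the guard disabled the empty string is one more distinct value, which matches the result of 2 reported for Impala; with the guard enabled the result is 1, which matches the value the issue expects from Hive.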


