Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2020/07/28 18:32:00 UTC

[jira] [Commented] (IMPALA-9942) DataSketches HLL shouldn't take empty strings as distinct values

    [ https://issues.apache.org/jira/browse/IMPALA-9942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166620#comment-17166620 ] 

ASF subversion and git services commented on IMPALA-9942:
---------------------------------------------------------

Commit 21918ef18b166021577770cb55b70bb2ccad0213 in impala's branch refs/heads/master from Adam Tamas
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=21918ef ]

IMPALA-9942: DataSketches HLL shouldn't take empty strings as distinct values

In Hive, empty strings don't count as separate values when querying
count(distinct) estimates with the Apache DataSketches HLL algorithm
on strings and varchars.
For compatibility's sake, Impala should not count them either.

Tests:
- Added extra tests for HLL with empty strings

Change-Id: Ie7648217bbe2f66b817788f131c062f349b1e9ad
Reviewed-on: http://gerrit.cloudera.org:8080/16226
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> DataSketches HLL shouldn't take empty strings as distinct values
> ----------------------------------------------------------------
>
>                 Key: IMPALA-9942
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9942
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 4.0
>            Reporter: Gabor Kaszab
>            Assignee: Adam Tamas
>            Priority: Major
>              Labels: newbie, ramp-up
>
> Let's consider a table that has string, char and varchar columns and some of the values in these columns are empty strings.
> {code:java}
> select * from strings;
> +-----+------------+-----+
> | s   | c          | v   |
> +-----+------------+-----+
> |     |            |     |
> | abc | abc        | abc |
> |     |            |     |
> +-----+------------+-----+
> {code}
> If I query the number of distinct values with DataSketches HLL, the empty string adds 1 to the result.
> {code:java}
> select ds_hll_estimate(ds_hll_sketch(s)), ds_hll_estimate(ds_hll_sketch(c)), ds_hll_estimate(ds_hll_sketch(v)) from strings;
> +------------+----------+-------------+
> | hll_string | hll_char | hll_varchar |
> +------------+----------+-------------+
> | 2          | 2        | 2           |
> +------------+----------+-------------+
> {code}
> However, Hive's implementation omits empty strings, so for the example above Hive would return 1 for each column.
> I assume it omits empty strings because of this line:
> https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/hll/BaseHllSketch.java#L351
> The first step of this task would be to decide which approach is the correct one; the second step would be to adjust Impala if we decide to follow Hive.
> Btw, in Impala this function adds strings to the HLL sketches:
> https://github.com/apache/impala/commit/7e456dfa9d932bcdb317ad6477abc3c399abacf2#diff-cb22c62db38ee853b857c3b2302244dfR1661
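
The difference described above can be illustrated with a minimal Java sketch. It is not the Impala or DataSketches code: `ToySketch` is a hypothetical stand-in that counts distinct values exactly with a `HashSet` instead of running HLL, and the `skipEmpty` flag is an assumption introduced only to show both behaviours side by side. The one point it models is a guard in the update path that returns early for empty strings, which is what the linked `BaseHllSketch.java` line appears to do.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class EmptyStringGuard {

    /** Hypothetical stand-in for an HLL sketch: exact distinct counting, not HLL. */
    static final class ToySketch {
        private final Set<String> seen = new HashSet<>();
        private final boolean skipEmpty;

        ToySketch(boolean skipEmpty) {
            this.skipEmpty = skipEmpty;
        }

        /** Adds a value; nulls are always ignored, empty strings only if skipEmpty is set. */
        void update(String datum) {
            if (datum == null) {
                return;
            }
            if (skipEmpty && datum.isEmpty()) {
                return; // the guard under discussion: empty strings never reach the sketch
            }
            seen.add(datum);
        }

        long estimate() {
            return seen.size();
        }
    }

    /** Feeds all values into a fresh ToySketch and returns its distinct count. */
    static long countDistinct(List<String> values, boolean skipEmpty) {
        ToySketch sketch = new ToySketch(skipEmpty);
        for (String v : values) {
            sketch.update(v);
        }
        return sketch.estimate();
    }

    public static void main(String[] args) {
        // Same contents as the string column in the example table.
        List<String> column = Arrays.asList("", "abc", "");
        System.out.println(countDistinct(column, false)); // 2: empty string counted
        System.out.println(countDistinct(column, true));  // 1: empty string skipped
    }
}
```

With the guard disabled the toy count matches the result of 2 reported for Impala in the issue, and with the guard enabled it matches the value of 1 expected from Hive.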


