You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Gabor Kaszab (Code Review)" <ge...@cloudera.org> on 2020/05/29 14:32:11 UTC

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Gabor Kaszab has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16000


Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
12 files changed, 494 insertions(+), 8 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/1
-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 1
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 7:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc@1578
PS5, Line 1578: void AggregateFunctions::DsHllUpdate(
> I have experimented a bit with upstream Hive and it turned out that (apart 
I have found a way to disable the unsupported parameter types.
Done


http://gerrit.cloudera.org:8080/#/c/16000/6/be/src/exprs/datasketches-functions-ir.cc
File be/src/exprs/datasketches-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16000/6/be/src/exprs/datasketches-functions-ir.cc@37
PS6, Line 37: ctx->SetError("Unable to deserialize sketch
> I would vote on treating this always as error. We can relax this later if i
Sure, changed this to give error unconditionally.


http://gerrit.cloudera.org:8080/#/c/16000/5/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
File testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test:

http://gerrit.cloudera.org:8080/#/c/16000/5/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@64
PS5, Line 64: ---
> Hive seems to return NULL if there are no non-NULL values. I think that the
Apparently it wasn't complicated to take care.
Done.


http://gerrit.cloudera.org:8080/#/c/16000/6/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
File testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test:

http://gerrit.cloudera.org:8080/#/c/16000/6/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@121
PS6, Line 121: 
> typo
Dropped this part of the test due to another change.



-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 7
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 24 Jun 2020 15:47:25 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 12:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/6498/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 12
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 Jul 2020 13:20:11 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 10:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/6430/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 10
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 26 Jun 2020 12:55:45 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 3:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/16000/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16000/3//COMMIT_MSG@30
PS3, Line 30: In terms of correctness DataSketches HLL is most of the time in 2%
            : range from the correct result but there are occasional spikes where
            : the difference is bigger but never goes out of the range of 5%.
A link to https://datasketches.apache.org/docs/HLL/HLL.html could be useful if someone is curious about the details.

I could be also mentioned that while the HLL implementation can be parametrized, we are using it in a fixed way (HLL_4 with k=11).


http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1593
PS3, Line 1593:     // DataSketches doesn't accept 128bit integers so here we can sacrifice precision for
We don't need to use the convenient typed update function, we can simply hash this as a 16 byte string.
The underlying Murmur hash implementation treats all types as byte array + length as far as I know.
 
Do you know how Hive handles this? Even if it doesn't support it we should still mention in a comment that we are emulating Hive here.


http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1595
PS3, Line 1595:     // Raise this warning only once per query.
I think that this can be raised a few time in a query if more than one instances are working on the table.


http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1625
PS3, Line 1625:   double val;
              :   const Timezone* tz = ctx->impl()->state()->time_zone_for_unix_time_conversions();
              :   if (tm_src.ToSubsecondUnixTime(tz, &val)) {
Does Hive also convert to double (as seconds from Unix epoch)? Unless we are emulating Hive, I think that this is a not the best way to do it, as this conversion loses precision for timestamps far from 1970-01-01 (the same issue exists for AVG too).


http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1650
PS3, Line 1650:   std::string serialized_src_sketch_str(reinterpret_cast<char*>(src.ptr), src.len);
              :   std::stringstream serialized_src_sketch(serialized_src_sketch_str);
optional: I was looking for ways to avoid the 2 copies that occur here and at similar places. There is no istream implementation that uses a simple byte array + length, but it is actually very simple to create one: https://tuttlem.github.io/2014/08/18/getting-istream-to-work-off-a-byte-array.html Something similar should be possible for serialization too.


http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1684
PS3, Line 1684:   ctx->Free(src.ptr);
It would be nice to add a comment somewhere that describes the way hll_sketch works - e.g. unlike NDV, the StringVal does not contain the HLL array directly, but a wrapper classes that uses heap for the objects it uses, including the HLL array itself.


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
File testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test:

http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@18
PS3, Line 18:     ds_hll_estimate(ds_hll_sketch(timestamp_col)) 
whitespace


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@21
PS3, Line 21: 100
Are you sure that this won't be flaky? The value is the same as with count distinct, but NDV is not accurate here (101). Adding a filter that limits the query to a single partition (and thus single file) could ensure that the value is the same regardless of the order of merges.


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@90
PS3, Line 90: UDF ERROR: 128bit Decimal precision not supported.
Do you know if this is also the case in Hive?


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@115
PS3, Line 115: # Write sketches to a table as string and get an estimate from the written sketch.
Can you mention that this will change in the future when BINARY will be implemented (IMPALA-9482), as Hive writes BINARY instead of STRING?


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@142
PS3, Line 142: # Check that ds_hll_estimate returns null for strings that are not serialized sketches.
Do you know how Hive handles this? I would consider returning an error instead if don't support ds_hll for a type.



-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 3
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 03 Jun 2020 13:16:23 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 16:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6099/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 16
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Jul 2020 08:57:52 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 13: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 13
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 Jul 2020 14:21:18 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 13:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6093/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 13
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 Jul 2020 14:21:19 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 14: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 14
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Jul 2020 06:16:37 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 6:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/6346/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 6
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 17 Jun 2020 12:22:52 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16000

to look at the new patch set (#10).

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized
currently this implementation hard-codes these parameters and use
HLL_4 and lg_k=12.

For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
14 files changed, 477 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/10
-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 10
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/16000/1/tests/query_test/test_datasketches.py
File tests/query_test/test_datasketches.py:

http://gerrit.cloudera.org:8080/#/c/16000/1/tests/query_test/test_datasketches.py@1
PS1, Line 1: import pytest
flake8: F401 'pytest' imported but unused


http://gerrit.cloudera.org:8080/#/c/16000/1/tests/query_test/test_datasketches.py@8
PS1, Line 8: class TestDatasketches(ImpalaTestSuite):
flake8: E302 expected 2 blank lines, found 1


http://gerrit.cloudera.org:8080/#/c/16000/1/tests/query_test/test_datasketches.py@24
PS1, Line 24: 
flake8: W391 blank line at end of file



-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 1
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 29 May 2020 14:33:35 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 2:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/6163/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 2
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 29 May 2020 15:35:33 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 5:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc@1578
PS5, Line 1578: void AggregateFunctions::DsHllUpdate(
Don't support decimal in the first iteration. Open a Jira for further support.


http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc@1611
PS5, Line 1611: void AggregateFunctions::DsHllUpdate(
Don't support timestamp in the first iteration. Open a Jira for further support.


http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions.h
File be/src/exprs/aggregate-functions.h:

http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions.h@210
PS5, Line 210: S_SKETCH_CONFIG = 11
this is 12 in Hive


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
File testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test:

http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@142
PS3, Line 142: 
> We currently give an error for types not supported for ds_hll_functions. E.
Update: Implemented that I return null and a warning when abort_on_error=false and return an error otherwise.



-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 5
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 16 Jun 2020 15:23:55 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 15:

Apparently, test_ddl.py/test_functions_ddl failed with IllegalStateException because for unsupported functions there were some unset members.


-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 15
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Jul 2020 08:57:08 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 11:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/6497/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 11
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 Jul 2020 13:20:09 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16000

to look at the new patch set (#2).

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
12 files changed, 492 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/2
-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 2
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16000

to look at the new patch set (#4).

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized
currently this implementation hard-codes these parameters and use
HLL_4 and lg_k=11.

For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
12 files changed, 465 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/4
-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 4
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 8:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/6411/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 8
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 24 Jun 2020 16:17:44 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16000/7/tests/query_test/test_datasketches.py
File tests/query_test/test_datasketches.py:

http://gerrit.cloudera.org:8080/#/c/16000/7/tests/query_test/test_datasketches.py@18
PS7, Line 18: from copy import deepcopy
flake8: F401 'copy.deepcopy' imported but unused



-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 7
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 24 Jun 2020 15:47:12 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/6223/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 4
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 05 Jun 2020 16:27:08 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16000

to look at the new patch set (#3).

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
12 files changed, 509 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/3
-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 3
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16000

to look at the new patch set (#8).

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized
currently this implementation hard-codes these parameters and use
HLL_4 and lg_k=12.

For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
14 files changed, 471 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/8
-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 8
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16000

to look at the new patch set (#9).

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized
currently this implementation hard-codes these parameters and use
HLL_4 and lg_k=12.

For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
14 files changed, 477 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/9
-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 9
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16000

to look at the new patch set (#15).

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized
currently this implementation hard-codes these parameters and use
HLL_4 and lg_k=12.

For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M testdata/data/README
A testdata/data/hll_sketches_from_hive.parquet
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
17 files changed, 505 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/15
-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 15
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16000

to look at the new patch set (#12).

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized
currently this implementation hard-codes these parameters and use
HLL_4 and lg_k=12.

For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M testdata/data/README
A testdata/data/hll_sketches_from_hive.parquet
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
17 files changed, 503 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/12
-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 12
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16000

to look at the new patch set (#6).

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized
currently this implementation hard-codes these parameters and use
HLL_4 and lg_k=12.

For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
12 files changed, 452 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/6
-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 6
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/6192/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 3
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 02 Jun 2020 12:33:30 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16000

to look at the new patch set (#11).

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized
currently this implementation hard-codes these parameters and use
HLL_4 and lg_k=12.

For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M testdata/data/README
A testdata/data/hll_sketches_from_hive.parquet
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
17 files changed, 504 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/11
-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 11
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 13: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6093/


-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 13
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 Jul 2020 19:23:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 5:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/6303/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 5
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 12 Jun 2020 14:37:51 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16000

to look at the new patch set (#7).

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized
currently this implementation hard-codes these parameters and use
HLL_4 and lg_k=12.

For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
14 files changed, 473 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/7
-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 7
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 12: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 12
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 Jul 2020 14:13:05 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 16: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 16
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Jul 2020 14:01:48 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 1:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/6161/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 1
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 29 May 2020 15:17:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 6:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc@1578
PS5, Line 1578: void AggregateFunctions::DsHllUpdate(
> Don't support decimal in the first iteration. Open a Jira for further suppo
Removed decimal support, however, Impala converts them automatically to bigintval and still executes update(). There isn't much I can do about it in my opinion.


http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc@1611
PS5, Line 1611:   union_sketch.update(src_sketch);
> Don't support timestamp in the first iteration. Open a Jira for further sup
Timestamp support removed.


http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions.h
File be/src/exprs/aggregate-functions.h:

http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions.h@210
PS5, Line 210: S_SKETCH_CONFIG = 12
> this is 12 in Hive
Changed this to 12.



-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 6
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 17 Jun 2020 11:36:33 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized
currently this implementation hard-codes these parameters and use
HLL_4 and lg_k=12.

For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Reviewed-on: http://gerrit.cloudera.org:8080/16000
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Csaba Ringhofer <cs...@cloudera.com>
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/Function.java
M testdata/data/README
A testdata/data/hll_sketches_from_hive.parquet
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
17 files changed, 505 insertions(+), 8 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Csaba Ringhofer: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 17
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 7:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/6410/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 7
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 24 Jun 2020 16:13:35 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 11:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16000/11/tests/query_test/test_datasketches.py
File tests/query_test/test_datasketches.py:

http://gerrit.cloudera.org:8080/#/c/16000/11/tests/query_test/test_datasketches.py@38
PS11, Line 38: 
flake8: W391 blank line at end of file



-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 11
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 Jul 2020 12:52:30 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 4:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/16000/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16000/3//COMMIT_MSG@30
PS3, Line 30: In terms of correctness DataSketches HLL is most of the time in 2%
            : range from the correct result but there are occasional spikes where
            : the difference is bigger but never goes out of the range of 5%.
> A link to https://datasketches.apache.org/docs/HLL/HLL.html could be useful
Sure, done.


http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1593
PS3, Line 1593:     sketch_ptr->update((void*)(&src.val16), sizeof(impala::int128_t));
> We don't need to use the convenient typed update function, we can simply ha
Good point! I don't have to loose precision here for int128 as casting it to void* and giving the length to update() would also work.
I can also get rid of this warning/error functionality then.


http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1595
PS3, Line 1595: }
> I think that this can be raised a few time in a query if more than one inst
Indeed, I just tried this locally on a bigger dataset and apparently there is one warning per instance. In my opinion we can leave it as it is as the purpose here was not to flood the screen with one error per row.
Modified the comment, though.


http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1625
PS3, Line 1625: 
              : StringVal AggregateFunctions::DsHllSerialize(FunctionContext* ctx,
              :     const StringVal& src) {
> Does Hive also convert to double (as seconds from Unix epoch)? Unless we ar
Leave this question open until we find out how Hive tackles the same if it does.


http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1650
PS3, Line 1650:   union_sketch.update(*dst_sketch_ptr);
              : 
> optional: I was looking for ways to avoid the 2 copies that occur here and 
I found that there is another interface for deserialization:
deserialize(const void* bytes, size_t len)
I can use this to get rid of the 2 copies.


http://gerrit.cloudera.org:8080/#/c/16000/3/be/src/exprs/aggregate-functions-ir.cc@1684
PS3, Line 1684:   static const uint32
> It would be nice to add a comment somewhere that describes the way hll_sket
Sure, added a comment about this to DsHllInit where I first allocate memory for hll_sketch.


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
File testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test:

http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@18
PS3, Line 18:     ds_hll_estimate(ds_hll_sketch(timestamp_col))
> whitespace
Done


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@21
PS3, Line 21: 100
> Are you sure that this won't be flaky? The value is the same as with count 
I've run this a few times and appeared stable. In theory until a certain size of data HLL will pre-populate its cache and during this phase it will give correct results. It will start to approximate only after this certain size.

Apparently, alltypessmall is small enough to always give the correct result;


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@90
PS3, Line 90: insert into sketch_store
> Do you know if this is also the case in Hive?
This error scenario could be dropped using the update(void* bytes, size_t length) interface.


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@115
PS3, Line 115: where date_string_col = "03/01/09";
> Can you mention that this will change in the future when BINARY will be imp
Done


http://gerrit.cloudera.org:8080/#/c/16000/3/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@142
PS3, Line 142: 
> Do you know how Hive handles this? I would consider returning an error inst
We currently give an error for types not supported for ds_hll_functions. E.g. ds_hll_estimate() only accepts strings and for other types it gives an AnalysisException.
For strings we can decide if it's a serialized sketch only when we read that particular row so we can't return earlier. What I can do here is to write a warning or give an error depending on how ABORT_ON_ERROR is set.
What do you think?



-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 4
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 08 Jun 2020 11:53:09 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 15:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/6510/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 15
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Jul 2020 09:25:19 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 10:

PS10 is rebase with master and conflict resolution.


-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 10
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 26 Jun 2020 12:27:06 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 5:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16000/5/be/src/exprs/aggregate-functions-ir.cc@1578
PS5, Line 1578: void AggregateFunctions::DsHllUpdate(
> Removed decimal support, however, Impala converts them automatically to big
I have experimented a bit with upstream Hive and it turned out that (apart from TIMESTAMP and DECIMAL) the following types are also not supported:
  - BOOLEAN
  - SMALLINT
  - DATE

I think it would be great if we could disable DECIMAL too, e.g. by adding a new parameter "strict_type_check" to functions. The rationale is that in functions that hash values, implicit casts shouldn't be allowed, as even if the cast is lossless, the change of representation can still change the results. We can go another way if this turns out to be very hard, but I think it worth to take a deeper look.


http://gerrit.cloudera.org:8080/#/c/16000/6/be/src/exprs/datasketches-functions-ir.cc
File be/src/exprs/datasketches-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/16000/6/be/src/exprs/datasketches-functions-ir.cc@37
PS6, Line 37: if (ctx->impl()->state()->abort_on_error())
I would vote on treating this always as error. We can relax this later if it turns out that there are occasionally corrupt sketches and we want to ignore them.


http://gerrit.cloudera.org:8080/#/c/16000/5/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
File testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test:

http://gerrit.cloudera.org:8080/#/c/16000/5/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@64
PS5, Line 64: 0,0
Hive seems to return NULL if there are no non-NULL values. I think that the two different solution could interoperate well, but I would prefer to work the same if it is not very hard to do it.


http://gerrit.cloudera.org:8080/#/c/16000/6/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
File testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test:

http://gerrit.cloudera.org:8080/#/c/16000/6/testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test@121
PS6, Line 121: 
typo



-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 5
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 23 Jun 2020 16:50:58 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 9: Code-Review+1

(1 comment)

Thanks for the changes!

http://gerrit.cloudera.org:8080/#/c/16000/9/fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java
File fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java:

http://gerrit.cloudera.org:8080/#/c/16000/9/fe/src/main/java/org/apache/impala/catalog/AggregateFunction.java@79
PS9, Line 79:   private boolean isUnupported_;
For me it seems more logical to move this to base class Function, even if only aggregates use it right now.



-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 9
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 26 Jun 2020 11:13:56 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Hello Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16000

to look at the new patch set (#5).

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................

IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()

These functions can be used to get cardinality estimates of data
using HLL algorithm from Apache DataSketches. ds_hll_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized HLL sketch in string format. This can be written to a
table or be fed directly to ds_hll_estimate() that returns the
cardinality estimate for that sketch.

Comparing to ndv() these functions bring more flexibility as once we
fed data to the sketch it can be written to a table and next time we
can save scanning through the dataset and simply return the estimate
using the sketch. This doesn't come for free, however, as perfomance
measurements show that ndv() is 2x-3.5x faster than sketching. On the
other hand if we query the estimate from an existing sketch then the
runtime is negligible.
Another flexibility with these sketches is that they can be merged
together so e.g. if we had saved a sketch for each of the partitions
of a table then they can be combined with each other based on the
query without touching the actual data.
DataSketches HLL is sensitive for the order of the data fed to the
sketch and as a result running these algorithms in Impala gets
non-deterministic results within the error bounds of the algorithm.
In terms of correctness DataSketches HLL is most of the time in 2%
range from the correct result but there are occasional spikes where
the difference is bigger but never goes out of the range of 5%.
Even though the DataSketches HLL algorithm could be parameterized
currently this implementation hard-codes these parameters and use
HLL_4 and lg_k=11.

For more details about Apache DataSketches' HLL implementation see:
https://datasketches.apache.org/docs/HLL/HLL.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on TPCH25.lineitem to compare perfomance with
   ndv(). Depending on data characteristics ndv() appears 2x-3.5x
   faster. The lower the cardinality of the dataset the bigger the
   difference between the 2 algorithms is.
 - Ran manual tests on TPCH25.lineitem and
   functional_parquet.alltypes to compare correctness with ndv(). See
   results above.

Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
---
M be/src/codegen/impala-ir.cc
M be/src/exprs/CMakeLists.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/exprs/datasketches-functions-ir.cc
A be/src/exprs/datasketches-functions.h
M be/src/exprs/datasketches-test.cc
M be/src/exprs/scalar-expr-evaluator.cc
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
A testdata/workloads/functional-query/queries/QueryTest/datasketches-hll.test
A tests/query_test/test_datasketches.py
12 files changed, 493 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/00/16000/5
-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 5
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 16: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 16
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Jul 2020 14:08:49 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 9:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/6428/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 9
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 26 Jun 2020 10:10:17 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9632: Implement ds hll sketch() and ds hll estimate()

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16000 )

Change subject: IMPALA-9632: Implement ds_hll_sketch() and ds_hll_estimate()
......................................................................


Patch Set 14:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6098/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/16000
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic602cb6eb2bfbeab37e5e4cba11fbf0ca40b03fe
Gerrit-Change-Number: 16000
Gerrit-PatchSet: 14
Gerrit-Owner: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Jul 2020 06:16:39 +0000
Gerrit-HasComments: No