You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Fucun Chu (Code Review)" <ge...@cloudera.org> on 2020/11/20 15:13:05 UTC

[Impala-ASF-CR] IMPALA-10282: Implement ds cpc sketch() and ds cpc estimate() functions

Fucun Chu has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16656


Change subject: IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions
......................................................................

IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions

These functions can be used to get cardinality estimates of data
using CPC algorithm from Apache DataSketches. ds_cpc_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized CPC sketch in string format. This can be written to a
table or be fed directly to ds_cpc_estimate() that returns the
cardinality estimate for that sketch.

Similar to the HLL sketch, the primary use-case for the CPC sketch
is for counting distinct values as a stream, and then merging
multiple sketches together for a total distinct count.

For more details about Apache DataSketches' CPC see:
http://datasketches.apache.org/docs/CPC/CPC.html
Figures-of-Merit Comparison of the HLL and CPC Sketches see:
https://datasketches.apache.org/docs/DistinctCountMeritComparisons.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on tpch_parquet.lineitem to compare perfomance
   with ndv(). Depending on data characteristics ndv() appears 2x-3x
   faster. CPC gives closer estimate than current ndv(). CPC is more
   accurate than HLL in some cases

Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
---
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
M be/src/exprs/datasketches-common.cc
M be/src/exprs/datasketches-common.h
M be/src/exprs/datasketches-functions-ir.cc
M be/src/exprs/datasketches-functions.h
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M testdata/data/README
A testdata/data/cpc_sketches_from_hive.parquet
A testdata/workloads/functional-query/queries/QueryTest/datasketches-cpc.test
M tests/query_test/test_datasketches.py
12 files changed, 398 insertions(+), 0 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/56/16656/4
-- 
To view, visit http://gerrit.cloudera.org:8080/16656
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Gerrit-Change-Number: 16656
Gerrit-PatchSet: 4
Gerrit-Owner: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-10282: Implement ds cpc sketch() and ds cpc estimate() functions

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16656 )

Change subject: IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions
......................................................................


Patch Set 8: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/16656
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Gerrit-Change-Number: 16656
Gerrit-PatchSet: 8
Gerrit-Owner: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 11 May 2021 12:19:51 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10282: Implement ds cpc sketch() and ds cpc estimate() functions

Posted by "Fucun Chu (Code Review)" <ge...@cloudera.org>.
Fucun Chu has uploaded a new patch set (#6). ( http://gerrit.cloudera.org:8080/16656 )

Change subject: IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions
......................................................................

IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions

These functions can be used to get cardinality estimates of data
using CPC algorithm from Apache DataSketches. ds_cpc_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized CPC sketch in string format. This can be written to a
table or be fed directly to ds_cpc_estimate() that returns the
cardinality estimate for that sketch.

Similar to the HLL sketch, the primary use-case for the CPC sketch
is for counting distinct values as a stream, and then merging
multiple sketches together for a total distinct count.

For more details about Apache DataSketches' CPC see:
http://datasketches.apache.org/docs/CPC/CPC.html
Figures-of-Merit Comparison of the HLL and CPC Sketches see:
https://datasketches.apache.org/docs/DistinctCountMeritComparisons.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on tpch_parquet.lineitem to compare perfomance
   with ndv(). Depending on data characteristics ndv() appears 2x-3x
   faster. CPC gives closer estimate than current ndv(). CPC is more
   accurate than HLL in some cases

Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
---
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
M be/src/exprs/datasketches-common.cc
M be/src/exprs/datasketches-common.h
M be/src/exprs/datasketches-functions-ir.cc
M be/src/exprs/datasketches-functions.h
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M testdata/data/README
A testdata/data/cpc_sketches_from_hive.parquet
A testdata/workloads/functional-query/queries/QueryTest/datasketches-cpc.test
M tests/query_test/test_datasketches.py
12 files changed, 398 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/56/16656/6
-- 
To view, visit http://gerrit.cloudera.org:8080/16656
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Gerrit-Change-Number: 16656
Gerrit-PatchSet: 6
Gerrit-Owner: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-10282: Implement ds cpc sketch() and ds cpc estimate() functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16656 )

Change subject: IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions
......................................................................

IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions

These functions can be used to get cardinality estimates of data
using CPC algorithm from Apache DataSketches. ds_cpc_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized CPC sketch in string format. This can be written to a
table or be fed directly to ds_cpc_estimate() that returns the
cardinality estimate for that sketch.

Similar to the HLL sketch, the primary use-case for the CPC sketch
is for counting distinct values as a stream, and then merging
multiple sketches together for a total distinct count.

For more details about Apache DataSketches' CPC see:
http://datasketches.apache.org/docs/CPC/CPC.html
Figures-of-Merit Comparison of the HLL and CPC Sketches see:
https://datasketches.apache.org/docs/DistinctCountMeritComparisons.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on tpch_parquet.lineitem to compare perfomance
   with ndv(). Depending on data characteristics ndv() appears 2x-3x
   faster. CPC gives closer estimate than current ndv(). CPC is more
   accurate than HLL in some cases

Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Reviewed-on: http://gerrit.cloudera.org:8080/16656
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
M be/src/exprs/datasketches-common.cc
M be/src/exprs/datasketches-common.h
M be/src/exprs/datasketches-functions-ir.cc
M be/src/exprs/datasketches-functions.h
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M testdata/data/README
A testdata/data/cpc_sketches_from_hive.parquet
A testdata/workloads/functional-query/queries/QueryTest/datasketches-cpc.test
M tests/query_test/test_datasketches.py
12 files changed, 398 insertions(+), 0 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/16656
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Gerrit-Change-Number: 16656
Gerrit-PatchSet: 10
Gerrit-Owner: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-10282: Implement ds cpc sketch() and ds cpc estimate() functions

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16656 )

Change subject: IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions
......................................................................


Patch Set 4: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16656/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16656/4//COMMIT_MSG@28
PS4, Line 28:  Ran manual tests on tpch_parquet.lineitem to compare perfomance
            :    with ndv(). Depending on data characteristics ndv() appears 2x-3x
            :    faster. CPC gives closer estimate than current ndv(). CPC is more
            :    accurate than HLL in some cases
Have you compared CPC and HLL in terms of runtime performance? It would be nice to see if any of them is faster. I see the link for comparison above, I just wanted to see some numbers when we run these algorithms through Impala.

Additionally, could you share more details when you compare CPC an HLL in terms of accuracy. You mention that in some cases CPC is more accurate. Could you mention which are these cases and what is the difference between the algorithms?



-- 
To view, visit http://gerrit.cloudera.org:8080/16656
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Gerrit-Change-Number: 16656
Gerrit-PatchSet: 4
Gerrit-Owner: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 02 Dec 2020 08:07:17 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10282: Implement ds cpc sketch() and ds cpc estimate() functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16656 )

Change subject: IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions
......................................................................


Patch Set 5: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/16656
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Gerrit-Change-Number: 16656
Gerrit-PatchSet: 5
Gerrit-Owner: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 15 Feb 2021 08:51:50 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10282: Implement ds cpc sketch() and ds cpc estimate() functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16656 )

Change subject: IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions
......................................................................


Patch Set 5: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6889/


-- 
To view, visit http://gerrit.cloudera.org:8080/16656
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Gerrit-Change-Number: 16656
Gerrit-PatchSet: 5
Gerrit-Owner: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 15 Feb 2021 14:37:50 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10282: Implement ds cpc sketch() and ds cpc estimate() functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16656 )

Change subject: IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions
......................................................................


Patch Set 8:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/8558/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16656
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Gerrit-Change-Number: 16656
Gerrit-PatchSet: 8
Gerrit-Owner: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 13 Apr 2021 06:43:17 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10282: Implement ds cpc sketch() and ds cpc estimate() functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16656 )

Change subject: IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions
......................................................................


Patch Set 9:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7128/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/16656
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Gerrit-Change-Number: 16656
Gerrit-PatchSet: 9
Gerrit-Owner: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 11 May 2021 12:21:12 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10282: Implement ds cpc sketch() and ds cpc estimate() functions

Posted by "Fucun Chu (Code Review)" <ge...@cloudera.org>.
Fucun Chu has posted comments on this change. ( http://gerrit.cloudera.org:8080/16656 )

Change subject: IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions
......................................................................


Patch Set 4:

The test is being processed, update the document after completion


-- 
To view, visit http://gerrit.cloudera.org:8080/16656
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Gerrit-Change-Number: 16656
Gerrit-PatchSet: 4
Gerrit-Owner: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 10 Feb 2021 14:07:30 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10282: Implement ds cpc sketch() and ds cpc estimate() functions

Posted by "Fucun Chu (Code Review)" <ge...@cloudera.org>.
Fucun Chu has posted comments on this change. ( http://gerrit.cloudera.org:8080/16656 )

Change subject: IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions
......................................................................


Patch Set 4:

Performance comparison between ds_hll_* and ds_cpc_* functions see: https://issues.apache.org/jira/browse/IMPALA-10500


-- 
To view, visit http://gerrit.cloudera.org:8080/16656
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Gerrit-Change-Number: 16656
Gerrit-PatchSet: 4
Gerrit-Owner: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Thu, 11 Feb 2021 06:58:12 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10282: Implement ds cpc sketch() and ds cpc estimate() functions

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16656 )

Change subject: IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions
......................................................................


Patch Set 4: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/16656
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Gerrit-Change-Number: 16656
Gerrit-PatchSet: 4
Gerrit-Owner: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 15 Feb 2021 08:51:13 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10282: Implement ds cpc sketch() and ds cpc estimate() functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16656 )

Change subject: IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions
......................................................................


Patch Set 9: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/16656
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Gerrit-Change-Number: 16656
Gerrit-PatchSet: 9
Gerrit-Owner: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 11 May 2021 12:21:11 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10282: Implement ds cpc sketch() and ds cpc estimate() functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16656 )

Change subject: IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/7700/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/16656
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Gerrit-Change-Number: 16656
Gerrit-PatchSet: 4
Gerrit-Owner: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 20 Nov 2020 15:36:47 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10282: Implement ds cpc sketch() and ds cpc estimate() functions

Posted by "Fucun Chu (Code Review)" <ge...@cloudera.org>.
Fucun Chu has posted comments on this change. ( http://gerrit.cloudera.org:8080/16656 )

Change subject: IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions
......................................................................


Patch Set 5:

DataSketches 3.0 has fixed this problem, need to wait for IMPALA-10631 to complete. Will be updated soon.


-- 
To view, visit http://gerrit.cloudera.org:8080/16656
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Gerrit-Change-Number: 16656
Gerrit-PatchSet: 5
Gerrit-Owner: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 09 Apr 2021 02:39:08 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10282: Implement ds cpc sketch() and ds cpc estimate() functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16656 )

Change subject: IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions
......................................................................


Patch Set 6:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/8529/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/16656
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Gerrit-Change-Number: 16656
Gerrit-PatchSet: 6
Gerrit-Owner: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Fri, 09 Apr 2021 05:27:34 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10282: Implement ds cpc sketch() and ds cpc estimate() functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16656 )

Change subject: IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions
......................................................................


Patch Set 5:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6889/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/16656
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Gerrit-Change-Number: 16656
Gerrit-PatchSet: 5
Gerrit-Owner: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 15 Feb 2021 08:51:51 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10282: Implement ds cpc sketch() and ds cpc estimate() functions

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16656 )

Change subject: IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions
......................................................................


Patch Set 4:

Hey,
This review has been open for a while now. Do you have any updates on my comment/question? I see you pushed another review to include Theta sketch for cardinality estimates. With that we'll have 3 different algorithms for DataSketches for the same purpose and it makes me wonder which one is better for which purpose.


-- 
To view, visit http://gerrit.cloudera.org:8080/16656
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Gerrit-Change-Number: 16656
Gerrit-PatchSet: 4
Gerrit-Owner: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 09 Feb 2021 13:52:58 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10282: Implement ds cpc sketch() and ds cpc estimate() functions

Posted by "Fucun Chu (Code Review)" <ge...@cloudera.org>.
Fucun Chu has uploaded a new patch set (#8). ( http://gerrit.cloudera.org:8080/16656 )

Change subject: IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions
......................................................................

IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions

These functions can be used to get cardinality estimates of data
using CPC algorithm from Apache DataSketches. ds_cpc_sketch()
receives a dataset, e.g. a column from a table, and returns a
serialized CPC sketch in string format. This can be written to a
table or be fed directly to ds_cpc_estimate() that returns the
cardinality estimate for that sketch.

Similar to the HLL sketch, the primary use-case for the CPC sketch
is for counting distinct values as a stream, and then merging
multiple sketches together for a total distinct count.

For more details about Apache DataSketches' CPC see:
http://datasketches.apache.org/docs/CPC/CPC.html
Figures-of-Merit Comparison of the HLL and CPC Sketches see:
https://datasketches.apache.org/docs/DistinctCountMeritComparisons.html

Testing:
 - Added some tests running estimates for small datasets where the
   amount of data is small enough to get the correct results.
 - Ran manual tests on tpch_parquet.lineitem to compare perfomance
   with ndv(). Depending on data characteristics ndv() appears 2x-3x
   faster. CPC gives closer estimate than current ndv(). CPC is more
   accurate than HLL in some cases

Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
---
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
M be/src/exprs/datasketches-common.cc
M be/src/exprs/datasketches-common.h
M be/src/exprs/datasketches-functions-ir.cc
M be/src/exprs/datasketches-functions.h
M common/function-registry/impala_functions.py
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M testdata/data/README
A testdata/data/cpc_sketches_from_hive.parquet
A testdata/workloads/functional-query/queries/QueryTest/datasketches-cpc.test
M tests/query_test/test_datasketches.py
12 files changed, 398 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/56/16656/8
-- 
To view, visit http://gerrit.cloudera.org:8080/16656
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Gerrit-Change-Number: 16656
Gerrit-PatchSet: 8
Gerrit-Owner: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-10282: Implement ds cpc sketch() and ds cpc estimate() functions

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/16656 )

Change subject: IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions
......................................................................


Patch Set 9: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/16656
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Gerrit-Change-Number: 16656
Gerrit-PatchSet: 9
Gerrit-Owner: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 11 May 2021 18:07:39 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10282: Implement ds cpc sketch() and ds cpc estimate() functions

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/16656 )

Change subject: IMPALA-10282: Implement ds_cpc_sketch() and ds_cpc_estimate() functions
......................................................................


Patch Set 5:

hey, Do you have any plans to take a look at this verify job failure?


-- 
To view, visit http://gerrit.cloudera.org:8080/16656
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I731e66fbadc74bc339c973f4d9337db9b7dd715a
Gerrit-Change-Number: 16656
Gerrit-PatchSet: 5
Gerrit-Owner: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Fucun Chu <ch...@hotmail.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 07 Apr 2021 06:52:59 +0000
Gerrit-HasComments: No