You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Alex Behm (Code Review)" <ge...@cloudera.org> on 2017/11/16 06:29:46 UTC

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Alex Behm has uploaded this change for review. ( http://gerrit.cloudera.org:8080/8569


Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................

IMPALA-5310: Part 2: Add SAMPLED_NDV() function.

Adds a new SAMPLED_NDV() aggregate function that is
intended to be used in COMPUTE STATS TABLESAMPLE.
This patch only adds the function itself. Integration
with COMPUTE STATS will come in a separate patch.

SAMPLED_NDV() estimates the number of distinct values (NDV)
based on a sample of data and the corresponding sampling rate.
The main idea is to collect several x/y data points where x is
the number of rows and y is the corresponding NDV estimate.
These data points are used to fit an objective function to the
data such that the true NDV can be extrapolated.
The aggregate function maintains a fixed number of HyperLogLog
intermediates to compute the x/y points.
Several objective functions are fit and the best-fit one is
used for extrapolation.

Adds the MPFIT C library to perform curve fitting:
https://www.physics.wisc.edu/~craigm/idl/cmpfit.html

The library is a C port from Fortran. Scipy uses the
Fortran version of the library for curve fitting.

Testing:
- added functional tests
- core/hdfs run passed

Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
---
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/thirdparty/mpfit/DISCLAIMER
A be/src/thirdparty/mpfit/README
A be/src/thirdparty/mpfit/mpfit.c
A be/src/thirdparty/mpfit/mpfit.h
M be/src/util/CMakeLists.txt
A be/src/util/mpfit-util.cc
A be/src/util/mpfit-util.h
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M tests/query_test/test_aggregation.py
15 files changed, 3,885 insertions(+), 19 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/8569/1
-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 1
Gerrit-Owner: Alex Behm <al...@cloudera.com>

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Hello Bharath Vissapragada, Jim Apple, Dan Hecht, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8569

to look at the new patch set (#8).

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................

IMPALA-5310: Part 2: Add SAMPLED_NDV() function.

Adds a new SAMPLED_NDV() aggregate function that is
intended to be used in COMPUTE STATS TABLESAMPLE.
This patch only adds the function itself. Integration
with COMPUTE STATS will come in a separate patch.

SAMPLED_NDV() estimates the number of distinct values (NDV)
based on a sample of data and the corresponding sampling rate.
The main idea is to collect several x/y data points where x is
the number of rows and y is the corresponding NDV estimate.
These data points are used to fit an objective function to the
data such that the true NDV can be extrapolated.
The aggregate function maintains a fixed number of HyperLogLog
intermediates to compute the x/y points.
Several objective functions are fit and the best-fit one is
used for extrapolation.

Adds the MPFIT C library to perform curve fitting:
https://www.physics.wisc.edu/~craigm/idl/cmpfit.html

The library is a C port from Fortran. Scipy uses the
Fortran version of the library for curve fitting.

Testing:
- added functional tests
- core/hdfs run passed

Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
---
M NOTICE.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/thirdparty/mpfit/DISCLAIMER
A be/src/thirdparty/mpfit/README
A be/src/thirdparty/mpfit/mpfit.c
A be/src/thirdparty/mpfit/mpfit.h
M be/src/util/CMakeLists.txt
A be/src/util/mpfit-util.cc
A be/src/util/mpfit-util.h
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M tests/query_test/test_aggregation.py
16 files changed, 3,889 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/8569/8
-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 8
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 12:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1611/


-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 12
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Comment-Date: Tue, 12 Dec 2017 18:18:16 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8569/9/bin/rat_exclude_files.txt
File bin/rat_exclude_files.txt:

http://gerrit.cloudera.org:8080/#/c/8569/9/bin/rat_exclude_files.txt@24
PS9, Line 24: be/src/thirdparty/mpfit/*
> Sorry, I was unclear. I think it has to be copied to the LICENSE.txt file i
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 9
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Comment-Date: Tue, 12 Dec 2017 18:07:26 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 9:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8569/9/bin/rat_exclude_files.txt
File bin/rat_exclude_files.txt:

http://gerrit.cloudera.org:8080/#/c/8569/9/bin/rat_exclude_files.txt@24
PS9, Line 24: be/src/thirdparty/mpfit/*
> The license part of the DISCLAIMER file needs to be put in LICENSE.txt in o
Done


http://gerrit.cloudera.org:8080/#/c/8569/9/bin/run_clang_tidy.sh
File bin/run_clang_tidy.sh:

http://gerrit.cloudera.org:8080/#/c/8569/9/bin/run_clang_tidy.sh@58
PS9, Line 58: #
> I think this is leftovers.
Sorry. Yes.



-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 9
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Comment-Date: Tue, 12 Dec 2017 17:54:37 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Bharath Vissapragada has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 2:

(9 comments)

I'm still looking at the tests, just want to publish what I have so far. Overall the patch (and approach) looks good to me.

http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/exprs/aggregate-functions-ir.cc@1464
PS1, Line 1464: of a SampledNdvState with 'num_buckets' HLL bu
> Good idea done. I did some more similar cleanup in this class.
thanks, easier to follow.


http://gerrit.cloudera.org:8080/#/c/8569/2/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/8569/2/be/src/exprs/aggregate-functions-ir.cc@1443
PS2, Line 1443: count
IMO 'count' sounds a little generic and overlaps with total_count. May be rename them to "bucket_row_count" and "total_row_count" or something similar?


http://gerrit.cloudera.org:8080/#/c/8569/2/be/src/exprs/aggregate-functions-ir.cc@1459
PS2, Line 1459:   double sample_perc;
static?


http://gerrit.cloudera.org:8080/#/c/8569/2/be/src/exprs/aggregate-functions-ir.cc@1465
PS2, Line 1465:   static int64_t GetStateSize(int64_t num_buckets) {
Given num_buckets is always a const, may be make this a static const too like BUCKET size?

static const in64_t STATE_SIZE = NUM_HLL_BUCKETS * BUCKET_SIZE .....


http://gerrit.cloudera.org:8080/#/c/8569/2/be/src/exprs/aggregate-functions-ir.cc@1530
PS2, Line 1530: ndv
may be call ndv_estimate, just to be clear


http://gerrit.cloudera.org:8080/#/c/8569/2/be/src/exprs/aggregate-functions-ir.cc@1552
PS2, Line 1552:  int bucket_idx = (i + j) % SampledNdvState::NUM_HLL_BUCKETS;
              :       merged_count += *state->GetCountPtr(bucket_idx);
              :       counts[pidx] = merged_count;
              :       StringVal hll = StringVal(state->GetHllPtr(bucket_idx), HLL_LEN);
              :       HllMerge(ctx, hll, &merged_hll);
              :       ndvs[pidx] = HllFinalEstimate(merged_hll.ptr, HLL_LEN);
Wondering if we can avoid duplicates in data points since hll_to_merge for (i,j) = (j,i). Do they add any additional value?

We could probably optimize the loop a little further to avoid that

for (int i=0, i < NUM_HLL_BUCKETS; ++i) {
  for (int j=i, j < NUM_HLL_BUCKETS, ++j) {

}
}


http://gerrit.cloudera.org:8080/#/c/8569/2/be/src/exprs/aggregate-functions-ir.cc@1563
PS2, Line 1563:   int64_t max_count = counts[num_points - 1];
not sure I understand this, what invariant guarantees counts[num_points-1] is the max_count?


http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h
File be/src/util/mpfit-util.h:

http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h@50
PS1, Line 50:   /// Returns true if fitting was successful, false otherwise.
> Done here. I could not find another place, did I miss one?
nope, just here.


http://gerrit.cloudera.org:8080/#/c/8569/2/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

http://gerrit.cloudera.org:8080/#/c/8569/2/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@506
PS2, Line 506: SAMPED_NDV
nit: typo



-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 2
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Comment-Date: Wed, 22 Nov 2017 22:36:41 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 4: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.cc
File be/src/util/mpfit-util.cc:

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.cc@33
PS3, Line 33: function
> Added comment but did not rename. Not sure if "FittingFunction" would be cl
I just meant to say "fitting function" in the comment where I highlighted, not to actually rename the code.



-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 4
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Comment-Date: Wed, 06 Dec 2017 20:13:30 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/thirdparty/mpfit/DISCLAIMER
File be/src/thirdparty/mpfit/DISCLAIMER:

PS3: 
> I'm not sure if it counts as a licence under ASF rules. Maybe it'll get the
Ok let's wait for the JIRA. The text in here sure makes it sound like the authors intended for anyone to do anything with this code.



-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 3
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Comment-Date: Thu, 30 Nov 2017 16:51:01 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 3:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1443
PS3, Line 1443: bucket_
> I don't understand why we have "bucket" as a prefix here.  Both the row_cou
Done


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1456
PS3, Line 1456: existing issues with passing constant arguments to all
              :   /// aggregation phases
> what does that mean? is there a JIRA we can just reference?
Yes. IMPALA-6179. Added.


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1469
PS3, Line 1469:   int64_t* GetCountPtr(int bucket_idx) {
              :     return reinterpret_cast<int64_t*>(GetBucketPtr(bucket_idx));
              :   }
              : 
              :   uint8_t* GetHllPtr(int bucket_idx) {
              :     return GetBucketPtr(bucket_idx) + sizeof(int64_t);
              :   }
              : 
              :   uint8_t* GetBucketPtr(int bucket_idx) {
              :     return reinterpret_cast<uint8_t*>(this) +
              :         sizeof(SampledNdvState) + bucket_idx * BUCKET_SIZE;
              :   }
> given that NUM_HLL_BUCKETS is predetermined, why not just use an array:
Much better. Done.


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1500
PS3, Line 1500: void AggregateFunctions::SampledNdvUpdate(FunctionContext* ctx, const T& src,
> I think a quick comment explaining what we're doing here would be helpful, 
Done


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1503
PS3, Line 1503: state->total_row_count % SampledNdvState::NUM_HLL_BUCKETS;
> I wonder if picking a random bucket is better? We talked a bit about this. 
Interesting idea that I'll investigate. Ideally I'd not want to hold up this patch since the current simpler approach has shown good results and already took a long time to tune.


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1533
PS3, Line 1533:   int64_t counts[num_points];
              :   memset(counts, 0, num_points * sizeof(int64_t));
> how about:
Done


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1542
PS3, Line 1542: diminishing returns from creating additional data points
> not sure what that means. do you mean additional to 'num_points'? i.e. are 
Yes. Rephrased to clarify.


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions.h
File be/src/exprs/aggregate-functions.h:

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions.h@209
PS3, Line 209: x/y
> I think using a notation like "(x, y)" maybe clearer since this sounds like
Done


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.h
File be/src/util/mpfit-util.h:

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.h@74
PS3, Line 74:   /// Human-readable name of this function. Used for debugging.
> should any of these members be private?
Oops. Yes, all of them. Done.


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.cc
File be/src/util/mpfit-util.cc:

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.cc@33
PS3, Line 33: function
> but what is this function suppose to do?  Does it make sense to call it a "
Added comment but did not rename. Not sure if "FittingFunction" would be clearer. This user supplied function is a little weird imo.

Happy to rename if you prefer FittingFunction(), lmk.


http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@519
PS3, Line 519: children_.get(1)
> samplePerc?
Done


http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@524
PS3, Line 524: children_.get(1)
> samplePerc?
Done.

Setting child 1 for clarify but it's not necessary since the cast modifies the literal in-place.


http://gerrit.cloudera.org:8080/#/c/8569/3/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

http://gerrit.cloudera.org:8080/#/c/8569/3/tests/query_test/test_aggregation.py@337
PS3, Line 337: * sample_perc
> even with the comment, I'm not sure what's going on here.
Tried to clarify in comment. Lmk if still unclear.


http://gerrit.cloudera.org:8080/#/c/8569/3/tests/query_test/test_aggregation.py@339
PS3, Line 339:   def __appx_equals(self, a, b, diff_perc):
> a quick comment for this would be helpful.
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 3
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Comment-Date: Wed, 06 Dec 2017 06:57:52 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 7: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 7
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Comment-Date: Mon, 11 Dec 2017 23:58:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Hello Bharath Vissapragada, Jim Apple, Dan Hecht, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8569

to look at the new patch set (#5).

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................

IMPALA-5310: Part 2: Add SAMPLED_NDV() function.

Adds a new SAMPLED_NDV() aggregate function that is
intended to be used in COMPUTE STATS TABLESAMPLE.
This patch only adds the function itself. Integration
with COMPUTE STATS will come in a separate patch.

SAMPLED_NDV() estimates the number of distinct values (NDV)
based on a sample of data and the corresponding sampling rate.
The main idea is to collect several x/y data points where x is
the number of rows and y is the corresponding NDV estimate.
These data points are used to fit an objective function to the
data such that the true NDV can be extrapolated.
The aggregate function maintains a fixed number of HyperLogLog
intermediates to compute the x/y points.
Several objective functions are fit and the best-fit one is
used for extrapolation.

Adds the MPFIT C library to perform curve fitting:
https://www.physics.wisc.edu/~craigm/idl/cmpfit.html

The library is a C port from Fortran. Scipy uses the
Fortran version of the library for curve fitting.

Testing:
- added functional tests
- core/hdfs run passed

Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
---
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/thirdparty/mpfit/DISCLAIMER
A be/src/thirdparty/mpfit/README
A be/src/thirdparty/mpfit/mpfit.c
A be/src/thirdparty/mpfit/mpfit.h
M be/src/util/CMakeLists.txt
A be/src/util/mpfit-util.cc
A be/src/util/mpfit-util.h
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M tests/query_test/test_aggregation.py
15 files changed, 3,900 insertions(+), 19 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/8569/5
-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 5
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Hello Bharath Vissapragada, Jim Apple, Dan Hecht, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8569

to look at the new patch set (#6).

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................

IMPALA-5310: Part 2: Add SAMPLED_NDV() function.

Adds a new SAMPLED_NDV() aggregate function that is
intended to be used in COMPUTE STATS TABLESAMPLE.
This patch only adds the function itself. Integration
with COMPUTE STATS will come in a separate patch.

SAMPLED_NDV() estimates the number of distinct values (NDV)
based on a sample of data and the corresponding sampling rate.
The main idea is to collect several x/y data points where x is
the number of rows and y is the corresponding NDV estimate.
These data points are used to fit an objective function to the
data such that the true NDV can be extrapolated.
The aggregate function maintains a fixed number of HyperLogLog
intermediates to compute the x/y points.
Several objective functions are fit and the best-fit one is
used for extrapolation.

Adds the MPFIT C library to perform curve fitting:
https://www.physics.wisc.edu/~craigm/idl/cmpfit.html

The library is a C port from Fortran. Scipy uses the
Fortran version of the library for curve fitting.

Testing:
- added functional tests
- core/hdfs run passed

Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
---
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/thirdparty/mpfit/DISCLAIMER
A be/src/thirdparty/mpfit/README
A be/src/thirdparty/mpfit/mpfit.c
A be/src/thirdparty/mpfit/mpfit.h
M be/src/util/CMakeLists.txt
A be/src/util/mpfit-util.cc
A be/src/util/mpfit-util.h
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M tests/query_test/test_aggregation.py
15 files changed, 3,885 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/8569/6
-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 6
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 12: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 12
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Comment-Date: Tue, 12 Dec 2017 22:20:17 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Hello Bharath Vissapragada, Jim Apple, Dan Hecht, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8569

to look at the new patch set (#10).

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................

IMPALA-5310: Part 2: Add SAMPLED_NDV() function.

Adds a new SAMPLED_NDV() aggregate function that is
intended to be used in COMPUTE STATS TABLESAMPLE.
This patch only adds the function itself. Integration
with COMPUTE STATS will come in a separate patch.

SAMPLED_NDV() estimates the number of distinct values (NDV)
based on a sample of data and the corresponding sampling rate.
The main idea is to collect several x/y data points where x is
the number of rows and y is the corresponding NDV estimate.
These data points are used to fit an objective function to the
data such that the true NDV can be extrapolated.
The aggregate function maintains a fixed number of HyperLogLog
intermediates to compute the x/y points.
Several objective functions are fit and the best-fit one is
used for extrapolation.

Adds the MPFIT C library to perform curve fitting:
https://www.physics.wisc.edu/~craigm/idl/cmpfit.html

The library is a C port from Fortran. Scipy uses the
Fortran version of the library for curve fitting.

Testing:
- added functional tests
- core/hdfs run passed

Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
---
M NOTICE.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/thirdparty/mpfit/LICENSE.txt
A be/src/thirdparty/mpfit/README
A be/src/thirdparty/mpfit/mpfit.c
A be/src/thirdparty/mpfit/mpfit.h
M be/src/util/CMakeLists.txt
A be/src/util/mpfit-util.cc
A be/src/util/mpfit-util.h
M bin/rat_exclude_files.txt
M bin/run_clang_tidy.sh
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M tests/query_test/test_aggregation.py
18 files changed, 3,894 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/8569/10
-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 10
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Bharath Vissapragada has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8569/2/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
File fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java:

http://gerrit.cloudera.org:8080/#/c/8569/2/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java@2063
PS2, Line 2063: 0.0
Wondering if 0.0 is a valid sample %. What do we scan in that case? Do we assign some default ndvs?


http://gerrit.cloudera.org:8080/#/c/8569/2/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java@2066
PS2, Line 2066:       for (Column col: allScalarTypes.getColumns()) {
-ve test cases on non scalar columns, just for completeness?



-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 2
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Comment-Date: Thu, 23 Nov 2017 00:02:53 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/thirdparty/mpfit/DISCLAIMER
File be/src/thirdparty/mpfit/DISCLAIMER:

PS3: 
I'm not sure this meets ASF rules, in particular the teh translation to C license, which looks to me just "no restrictions placed on distribution".

I'd suggest a close reading of https://www.apache.org/legal/resolved.html. If that doesn't settle it, and I suspect it wont, you might file a "LEGAL" JIRA: https://issues.apache.org/jira/browse/LEGAL



-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 3
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Comment-Date: Wed, 29 Nov 2017 22:20:03 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................

IMPALA-5310: Part 2: Add SAMPLED_NDV() function.

Adds a new SAMPLED_NDV() aggregate function that is
intended to be used in COMPUTE STATS TABLESAMPLE.
This patch only adds the function itself. Integration
with COMPUTE STATS will come in a separate patch.

SAMPLED_NDV() estimates the number of distinct values (NDV)
based on a sample of data and the corresponding sampling rate.
The main idea is to collect several x/y data points where x is
the number of rows and y is the corresponding NDV estimate.
These data points are used to fit an objective function to the
data such that the true NDV can be extrapolated.
The aggregate function maintains a fixed number of HyperLogLog
intermediates to compute the x/y points.
Several objective functions are fit and the best-fit one is
used for extrapolation.

Adds the MPFIT C library to perform curve fitting:
https://www.physics.wisc.edu/~craigm/idl/cmpfit.html

The library is a C port from Fortran. Scipy uses the
Fortran version of the library for curve fitting.

Testing:
- added functional tests
- core/hdfs run passed

Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Reviewed-on: http://gerrit.cloudera.org:8080/8569
Reviewed-by: Alex Behm <al...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M LICENSE.txt
M NOTICE.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/thirdparty/mpfit/DISCLAIMER
A be/src/thirdparty/mpfit/README
A be/src/thirdparty/mpfit/mpfit.c
A be/src/thirdparty/mpfit/mpfit.h
M be/src/util/CMakeLists.txt
A be/src/util/mpfit-util.cc
A be/src/util/mpfit-util.h
M bin/rat_exclude_files.txt
M bin/run_clang_tidy.sh
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M tests/query_test/test_aggregation.py
19 files changed, 3,976 insertions(+), 12 deletions(-)

Approvals:
  Alex Behm: Looks good to me, approved
  Impala Public Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 13
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Jim Apple <jb...@apache.org>

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 9:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8569/9/bin/rat_exclude_files.txt
File bin/rat_exclude_files.txt:

http://gerrit.cloudera.org:8080/#/c/8569/9/bin/rat_exclude_files.txt@24
PS9, Line 24: be/src/thirdparty/mpfit/*
The license part of the DISCLAIMER file needs to be put in LICENSE.txt in order for the comment on line 22 of this file to be accurate.


http://gerrit.cloudera.org:8080/#/c/8569/9/bin/run_clang_tidy.sh
File bin/run_clang_tidy.sh:

http://gerrit.cloudera.org:8080/#/c/8569/9/bin/run_clang_tidy.sh@58
PS9, Line 58: #
I think this is leftovers.



-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 9
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Comment-Date: Tue, 12 Dec 2017 17:49:35 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@524
PS3, Line 524: children_.set(1,
> Done.
I see. It seems dangerous to rely on knowing the implementation for LiteralExpr since (a) it could change, and (b) that's not the general contract (other Expr's don't work that way) and so kinda violates the abstraction. So, thanks for adding the explicit set. I see we have other places in the code that we rely on knowing how uncheckedCastTo() works at the caller that we may want to clean up later.



-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 4
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Comment-Date: Wed, 06 Dec 2017 17:08:44 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Hello Bharath Vissapragada, Jim Apple, Dan Hecht, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8569

to look at the new patch set (#9).

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................

IMPALA-5310: Part 2: Add SAMPLED_NDV() function.

Adds a new SAMPLED_NDV() aggregate function that is
intended to be used in COMPUTE STATS TABLESAMPLE.
This patch only adds the function itself. Integration
with COMPUTE STATS will come in a separate patch.

SAMPLED_NDV() estimates the number of distinct values (NDV)
based on a sample of data and the corresponding sampling rate.
The main idea is to collect several x/y data points where x is
the number of rows and y is the corresponding NDV estimate.
These data points are used to fit an objective function to the
data such that the true NDV can be extrapolated.
The aggregate function maintains a fixed number of HyperLogLog
intermediates to compute the x/y points.
Several objective functions are fit and the best-fit one is
used for extrapolation.

Adds the MPFIT C library to perform curve fitting:
https://www.physics.wisc.edu/~craigm/idl/cmpfit.html

The library is a C port from Fortran. Scipy uses the
Fortran version of the library for curve fitting.

Testing:
- added functional tests
- core/hdfs run passed

Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
---
M NOTICE.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/thirdparty/mpfit/DISCLAIMER
A be/src/thirdparty/mpfit/README
A be/src/thirdparty/mpfit/mpfit.c
A be/src/thirdparty/mpfit/mpfit.h
M be/src/util/CMakeLists.txt
A be/src/util/mpfit-util.cc
A be/src/util/mpfit-util.h
M bin/rat_exclude_files.txt
M bin/run_clang_tidy.sh
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M tests/query_test/test_aggregation.py
18 files changed, 3,896 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/8569/9
-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 9
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Hello Bharath Vissapragada, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8569

to look at the new patch set (#2).

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................

IMPALA-5310: Part 2: Add SAMPLED_NDV() function.

Adds a new SAMPLED_NDV() aggregate function that is
intended to be used in COMPUTE STATS TABLESAMPLE.
This patch only adds the function itself. Integration
with COMPUTE STATS will come in a separate patch.

SAMPLED_NDV() estimates the number of distinct values (NDV)
based on a sample of data and the corresponding sampling rate.
The main idea is to collect several x/y data points where x is
the number of rows and y is the corresponding NDV estimate.
These data points are used to fit an objective function to the
data such that the true NDV can be extrapolated.
The aggregate function maintains a fixed number of HyperLogLog
intermediates to compute the x/y points.
Several objective functions are fit and the best-fit one is
used for extrapolation.

Adds the MPFIT C library to perform curve fitting:
https://www.physics.wisc.edu/~craigm/idl/cmpfit.html

The library is a C port from Fortran. Scipy uses the
Fortran version of the library for curve fitting.

Testing:
- added functional tests
- core/hdfs run passed

Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
---
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/thirdparty/mpfit/DISCLAIMER
A be/src/thirdparty/mpfit/README
A be/src/thirdparty/mpfit/mpfit.c
A be/src/thirdparty/mpfit/mpfit.h
M be/src/util/CMakeLists.txt
A be/src/util/mpfit-util.cc
A be/src/util/mpfit-util.h
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M tests/query_test/test_aggregation.py
15 files changed, 3,892 insertions(+), 19 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/8569/2
-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 2
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 12: Code-Review+2

rebase


-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 12
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Comment-Date: Tue, 12 Dec 2017 18:17:31 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 1:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/exprs/aggregate-functions-ir.cc@1464
PS1, Line 1464: sizeof(int64_t) + AggregateFunctions::HLL_LEN)
> maybe factor this out into a GetSingleBucketSize() or something for readabi
Good idea done. I did some more similar cleanup in this class.


http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/exprs/aggregate-functions-ir.cc@1469
PS1, Line 1469:         sizeof(SampledNdvState) + bucket_idx * (sizeof(int64_t) + AggregateFunctions::HLL_LEN));
> nit: longline
Done


http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h
File be/src/util/mpfit-util.h:

http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h@50
PS1, Line 50:   bool LmsFit(const double* xs, const double* ys, int num_points);
> WARN_UNUSED_RESULT, here and elsewhere
Done here. I could not find another place, did I miss one?


http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h@82
PS1, Line 82:   mp_result result_;
> Looks like this struct has a bunch of ptrs, whats the lifecycle of those?
Added comment.


http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h@90
PS1, Line 90:   const double* ys_;
> DISALLOW_COPY_ASSIGN
That's needed for putting into std::vector.


http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.cc
File be/src/util/mpfit-util.cc:

http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.cc@29
PS1, Line 29:     num_params_(num_params),
            :     fn_(fn),
            :     num_points_(0),
            :     xs_(nullptr),
            :     ys_(nullptr) {
> nit: I think we could condense them into fewer lines (other places in the c
Done


http://gerrit.cloudera.org:8080/#/c/8569/1/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

http://gerrit.cloudera.org:8080/#/c/8569/1/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@507
PS1, Line 507: && children_.size() == 2
> Isn't this more like a preconditions check inside the if {}?
No. We're parsing a generic FunctionCallExpr that can be invoked with any number of args and we want to return a "Could not resolve function" error when SAMPLED_NDV() is invoked with an invalid number. That's done in L527.

Added comment.



-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 1
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Comment-Date: Tue, 21 Nov 2017 19:03:34 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/thirdparty/mpfit/DISCLAIMER
File be/src/thirdparty/mpfit/DISCLAIMER:

PS3: 
> I'm not sure this meets ASF rules, in particular the teh translation to C l
Thanks for bringing this up. Can you be more specific what you are concerned about?

No restrictions on distribution seems like a good thing.

I don't see any violation of: https://opensource.org/osd-annotated

Regardless I've filed LEGAL-348 to clarify.



-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 3
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Comment-Date: Thu, 30 Nov 2017 00:41:30 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 11:

Thanks for the quick and thorough review Jim! Much appreciated.


-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 11
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Comment-Date: Tue, 12 Dec 2017 18:17:06 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 5:

Before merging, I'll run the tests one more time and give the Apache legal team some time to respond regarding including the MpFit library.


-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 5
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Comment-Date: Wed, 06 Dec 2017 20:41:35 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Hello Bharath Vissapragada, Jim Apple, Dan Hecht, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8569

to look at the new patch set (#11).

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................

IMPALA-5310: Part 2: Add SAMPLED_NDV() function.

Adds a new SAMPLED_NDV() aggregate function that is
intended to be used in COMPUTE STATS TABLESAMPLE.
This patch only adds the function itself. Integration
with COMPUTE STATS will come in a separate patch.

SAMPLED_NDV() estimates the number of distinct values (NDV)
based on a sample of data and the corresponding sampling rate.
The main idea is to collect several x/y data points where x is
the number of rows and y is the corresponding NDV estimate.
These data points are used to fit an objective function to the
data such that the true NDV can be extrapolated.
The aggregate function maintains a fixed number of HyperLogLog
intermediates to compute the x/y points.
Several objective functions are fit and the best-fit one is
used for extrapolation.

Adds the MPFIT C library to perform curve fitting:
https://www.physics.wisc.edu/~craigm/idl/cmpfit.html

The library is a C port from Fortran. Scipy uses the
Fortran version of the library for curve fitting.

Testing:
- added functional tests
- core/hdfs run passed

Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
---
M LICENSE.txt
M NOTICE.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/thirdparty/mpfit/DISCLAIMER
A be/src/thirdparty/mpfit/README
A be/src/thirdparty/mpfit/mpfit.c
A be/src/thirdparty/mpfit/mpfit.h
M be/src/util/CMakeLists.txt
A be/src/util/mpfit-util.cc
A be/src/util/mpfit-util.h
M bin/rat_exclude_files.txt
M bin/run_clang_tidy.sh
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M tests/query_test/test_aggregation.py
19 files changed, 3,976 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/8569/11
-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 11
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Bharath Vissapragada has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 1:

(7 comments)

I have a bunch of nits. Although I follow the basic flow of the patch, I think I need to spend more time by running some examples to get a deeper understanding, as I'm not fully familiar with how the curve fitting math works.

http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/exprs/aggregate-functions-ir.cc@1464
PS1, Line 1464: sizeof(int64_t) + AggregateFunctions::HLL_LEN)
maybe factor this out into a GetSingleBucketSize() or something for readability? (here and below)


http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/exprs/aggregate-functions-ir.cc@1469
PS1, Line 1469:         sizeof(SampledNdvState) + bucket_idx * (sizeof(int64_t) + AggregateFunctions::HLL_LEN));
nit: longline


http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h
File be/src/util/mpfit-util.h:

http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h@50
PS1, Line 50:   bool LmsFit(const double* xs, const double* ys, int num_points);
WARN_UNUSED_RESULT, here and elsewhere


http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h@82
PS1, Line 82:   mp_result result_;
Looks like this struct has a bunch of ptrs, whats the lifecycle of those?


http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.h@90
PS1, Line 90:   const double* ys_;
DISALLOW_COPY_ASSIGN


http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.cc
File be/src/util/mpfit-util.cc:

http://gerrit.cloudera.org:8080/#/c/8569/1/be/src/util/mpfit-util.cc@29
PS1, Line 29:     num_params_(num_params),
            :     fn_(fn),
            :     num_points_(0),
            :     xs_(nullptr),
            :     ys_(nullptr) {
nit: I think we could condense them into fewer lines (other places in the codebase use multiple args per line)


http://gerrit.cloudera.org:8080/#/c/8569/1/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

http://gerrit.cloudera.org:8080/#/c/8569/1/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@507
PS1, Line 507: && children_.size() == 2
Isn't this more like a preconditions check inside the if {}?



-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 1
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Comment-Date: Thu, 16 Nov 2017 08:28:37 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8569/2/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/8569/2/be/src/exprs/aggregate-functions-ir.cc@1552
PS2, Line 1552:  int bucket_idx = (i + j) % SampledNdvState::NUM_HLL_BUCKETS;
              :       merged_count += *state->GetCountPtr(bucket_idx);
              :       counts[pidx] = merged_count;
              :       StringVal hll = StringVal(state->GetHllPtr(bucket_idx), HLL_LEN);
              :       HllMerge(ctx, hll, &merged_hll);
              :       ndvs[pidx] = HllFinalEstimate(merged_hll.ptr, HLL_LEN);
> Ah, I misread the part where we do HllMerge(ctx, hll, &merged_hll). I read 
Your original comment was totally valid. It's easy to misread the code. As a result of your comment the code is now hopefully easier to understand for others, so definitely worthwhile!



-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 2
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Comment-Date: Wed, 29 Nov 2017 19:19:10 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 3:

(14 comments)

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1443
PS3, Line 1443: bucket_
I don't understand why we have "bucket" as a prefix here.  Both the row_count and hll_state are per bucket, right?


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1456
PS3, Line 1456: existing issues with passing constant arguments to all
              :   /// aggregation phases
what does that mean? is there a JIRA we can just reference?


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1469
PS3, Line 1469:   int64_t* GetCountPtr(int bucket_idx) {
              :     return reinterpret_cast<int64_t*>(GetBucketPtr(bucket_idx));
              :   }
              : 
              :   uint8_t* GetHllPtr(int bucket_idx) {
              :     return GetBucketPtr(bucket_idx) + sizeof(int64_t);
              :   }
              : 
              :   uint8_t* GetBucketPtr(int bucket_idx) {
              :     return reinterpret_cast<uint8_t*>(this) +
              :         sizeof(SampledNdvState) + bucket_idx * BUCKET_SIZE;
              :   }
given that NUM_HLL_BUCKETS is predetermined, why not just use an array:

struct { int64_t row_count, uint8_t hll[HLL_LEN] } buckets[NUM_HLL_BUCKETS];


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1500
PS3, Line 1500: void AggregateFunctions::SampledNdvUpdate(FunctionContext* ctx, const T& src,
I think a quick comment explaining what we're doing here would be helpful, something like

Incorporate the row into one of the intermediate HLLs, which will be used by Finalize to generate a set of the (x,y) points.


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1503
PS3, Line 1503: state->total_row_count % SampledNdvState::NUM_HLL_BUCKETS;
I wonder if picking a random bucket is better? We talked a bit about this. Up to you whether you think it's worth investigating.


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1533
PS3, Line 1533:   int64_t counts[num_points];
              :   memset(counts, 0, num_points * sizeof(int64_t));
how about:

int64_t counts[num_points] = { 0 };


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions-ir.cc@1542
PS3, Line 1542: diminishing returns from creating additional data points
not sure what that means. do you mean additional to 'num_points'? i.e. are you saying that empirically, 'num_points' is a sufficient number of points?


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions.h
File be/src/exprs/aggregate-functions.h:

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/exprs/aggregate-functions.h@209
PS3, Line 209: x/y
I think using a notation like "(x, y)" maybe clearer since this sounds like divide.


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.h
File be/src/util/mpfit-util.h:

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.h@74
PS3, Line 74:   /// Human-readable name of this function. Used for debugging.
should any of these members be private?


http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.cc
File be/src/util/mpfit-util.cc:

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.cc@33
PS3, Line 33: function
but what is this function suppose to do?  Does it make sense to call it a "fitting function"?


http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@519
PS3, Line 519: children_.get(1)
samplePerc?


http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@524
PS3, Line 524: children_.get(1)
samplePerc?

why don't we have to do anything with the return value of uncheckedCastTo() (like set it as the children_ at index 1)?


http://gerrit.cloudera.org:8080/#/c/8569/3/tests/query_test/test_aggregation.py
File tests/query_test/test_aggregation.py:

http://gerrit.cloudera.org:8080/#/c/8569/3/tests/query_test/test_aggregation.py@337
PS3, Line 337: * sample_perc
even with the comment, I'm not sure what's going on here.


http://gerrit.cloudera.org:8080/#/c/8569/3/tests/query_test/test_aggregation.py@339
PS3, Line 339:   def __appx_equals(self, a, b, diff_perc):
a quick comment for this would be helpful.



-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 3
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Comment-Date: Sat, 02 Dec 2017 00:20:45 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 6: Code-Review+2

Rebase and trivial size fix. Tests passed. Still waiting for legal.


-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 6
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Comment-Date: Thu, 07 Dec 2017 19:55:02 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 2:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/8569/2/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/8569/2/be/src/exprs/aggregate-functions-ir.cc@1443
PS2, Line 1443: count
> IMO 'count' sounds a little generic and overlaps with total_count. May be r
Done


http://gerrit.cloudera.org:8080/#/c/8569/2/be/src/exprs/aggregate-functions-ir.cc@1459
PS2, Line 1459:   double sample_perc;
> static?
This is the local state of a single aggregation instance, so static does not make sense.


http://gerrit.cloudera.org:8080/#/c/8569/2/be/src/exprs/aggregate-functions-ir.cc@1465
PS2, Line 1465:   static int64_t GetStateSize(int64_t num_buckets) {
> Given num_buckets is always a const, may be make this a static const too li
Can't do that because sizeof(SampledNdvState) fails since SampledNdvState is considered to be (size not known).

This is the error:
error: invalid application of ‘sizeof’ to incomplete type ‘impala::SampledNdvState’
       sizeof(SampledNdvState) + NUM_HLL_BUCKETS * BUCKET_SIZE;


http://gerrit.cloudera.org:8080/#/c/8569/2/be/src/exprs/aggregate-functions-ir.cc@1530
PS2, Line 1530: ndv
> may be call ndv_estimate, just to be clear
Done


http://gerrit.cloudera.org:8080/#/c/8569/2/be/src/exprs/aggregate-functions-ir.cc@1552
PS2, Line 1552:  int bucket_idx = (i + j) % SampledNdvState::NUM_HLL_BUCKETS;
              :       merged_count += *state->GetCountPtr(bucket_idx);
              :       counts[pidx] = merged_count;
              :       StringVal hll = StringVal(state->GetHllPtr(bucket_idx), HLL_LEN);
              :       HllMerge(ctx, hll, &merged_hll);
              :       ndvs[pidx] = HllFinalEstimate(merged_hll.ptr, HLL_LEN);
> Wondering if we can avoid duplicates in data points since hll_to_merge for 
I expanded the comment here to explain and justify.

Not sure I follow your point generation method. I think in my method only the last data point is guaranteed to be a duplicate (which is a good thing, now explained in comment). Others may or may not be duplicates (probably not duplicates though).

Consider this example input and my point generation method:

input counts:
1, 2, 3, 4

counts of 1st rolling window:
1, 3, 6, 10

counts of 2nd rolling window
2, 5, 9, 10

counts of 3rd rolling window
3, 7, 8, 10

counts of 4th rolling window
4, 5, 7, 10

For each "count" value above, we mere the corresponding HLL intermediates to get an NDV estimate for that subset of the data.

Note that the merged NDVs corresponding to the same "count" value are likely to be different because they represent different subsets of the data.


http://gerrit.cloudera.org:8080/#/c/8569/2/be/src/exprs/aggregate-functions-ir.cc@1563
PS2, Line 1563:   int64_t max_count = counts[num_points - 1];
> not sure I understand this, what invariant guarantees counts[num_points-1] 
Added comment.


http://gerrit.cloudera.org:8080/#/c/8569/2/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

http://gerrit.cloudera.org:8080/#/c/8569/2/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@506
PS2, Line 506: SAMPED_NDV
> nit: typo
Done


http://gerrit.cloudera.org:8080/#/c/8569/2/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
File fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java:

http://gerrit.cloudera.org:8080/#/c/8569/2/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java@2063
PS2, Line 2063: 0.0
> Wondering if 0.0 is a valid sample %. What do we scan in that case? Do we a
The value passed to this agg fn is independent of how much data is scanned. You can think of the sample percent as an "extrapolation factor".

We will eventually combine TABLESAMPLE and SAMPLED_NDV() in COMPUTE STATS and there the effective sampling rate will be plugged into the SAMPLED_NDV() function.

But for now, let's view SAMPLED_NDV() as a standalone aggregate function.

I think 0.0 is a valid value. It will give you the estimated NDV at x=0. Not sure how useful that is, but I don't think it's invalid (0 is a valid "sampling ratio").

I'm not sure it's worth special casing 0, what do you think?


http://gerrit.cloudera.org:8080/#/c/8569/2/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java@2066
PS2, Line 2066:       for (Column col: allScalarTypes.getColumns()) {
> -ve test cases on non scalar columns, just for completeness?
Maybe I'm misunderstanding but we don't support Exprs on non-scalar types. Even if we have nested collections, we must explode them to scalars.



-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 2
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Comment-Date: Thu, 23 Nov 2017 01:08:29 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.cc
File be/src/util/mpfit-util.cc:

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/util/mpfit-util.cc@33
PS3, Line 33: function
> I just meant to say "fitting function" in the comment where I highlighted, 
Clarified in comment as discussed.



-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 3
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Comment-Date: Wed, 06 Dec 2017 20:32:30 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8569/3/be/src/thirdparty/mpfit/DISCLAIMER
File be/src/thirdparty/mpfit/DISCLAIMER:

PS3: 
> Thanks for bringing this up. Can you be more specific what you are concerne
I'm not sure if it counts as a licence under ASF rules. Maybe it'll get the same treatment from the legal team as WTFPL, which is definitely allowed:

http://www.wtfpl.net/txt/copying/



-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 3
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Comment-Date: Thu, 30 Nov 2017 14:53:25 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Hello Bharath Vissapragada, Jim Apple, Dan Hecht, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8569

to look at the new patch set (#4).

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................

IMPALA-5310: Part 2: Add SAMPLED_NDV() function.

Adds a new SAMPLED_NDV() aggregate function that is
intended to be used in COMPUTE STATS TABLESAMPLE.
This patch only adds the function itself. Integration
with COMPUTE STATS will come in a separate patch.

SAMPLED_NDV() estimates the number of distinct values (NDV)
based on a sample of data and the corresponding sampling rate.
The main idea is to collect several x/y data points where x is
the number of rows and y is the corresponding NDV estimate.
These data points are used to fit an objective function to the
data such that the true NDV can be extrapolated.
The aggregate function maintains a fixed number of HyperLogLog
intermediates to compute the x/y points.
Several objective functions are fit and the best-fit one is
used for extrapolation.

Adds the MPFIT C library to perform curve fitting:
https://www.physics.wisc.edu/~craigm/idl/cmpfit.html

The library is a C port from Fortran. Scipy uses the
Fortran version of the library for curve fitting.

Testing:
- added functional tests
- core/hdfs run passed

Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
---
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/thirdparty/mpfit/DISCLAIMER
A be/src/thirdparty/mpfit/README
A be/src/thirdparty/mpfit/mpfit.c
A be/src/thirdparty/mpfit/mpfit.h
M be/src/util/CMakeLists.txt
A be/src/util/mpfit-util.cc
A be/src/util/mpfit-util.h
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M tests/query_test/test_aggregation.py
15 files changed, 3,900 insertions(+), 19 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/8569/4
-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 4
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Bharath Vissapragada (Code Review)" <ge...@cloudera.org>.
Bharath Vissapragada has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 3: Code-Review+1

(3 comments)

LGTM.

http://gerrit.cloudera.org:8080/#/c/8569/2/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

http://gerrit.cloudera.org:8080/#/c/8569/2/be/src/exprs/aggregate-functions-ir.cc@1465
PS2, Line 1465:   static int64_t GetStateSize() {
> Can't do that because sizeof(SampledNdvState) fails since SampledNdvState i
oops, yes.


http://gerrit.cloudera.org:8080/#/c/8569/2/be/src/exprs/aggregate-functions-ir.cc@1552
PS2, Line 1552: tringVal merged_hll(merged_hll_data, HLL_LEN);
              :     int64_t merged_count = 0;
              :     for (int j = 0; j < SampledNdvState::NUM_HLL_BUCKETS; ++j) {
              :       int bucket_idx = (i + j) % SampledNdvState::NUM_HLL_BUCKETS;
              :       merged_count += *state->GetCountPtr(bucket_idx);
              :       counts[pidx] = merged_count;
> I expanded the comment here to explain and justify.
Ah, I misread the part where we do HllMerge(ctx, hll, &merged_hll). I read it as HllMerge(ctx, hll, &src_hll[i]). It makes sense now, thanks. I agree the chance of dupes is pretty less and that also explains the max_count invariant. My bad.


http://gerrit.cloudera.org:8080/#/c/8569/2/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
File fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java:

http://gerrit.cloudera.org:8080/#/c/8569/2/fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java@2063
PS2, Line 2063: 0.0
> The value passed to this agg fn is independent of how much data is scanned.
Makes sense, I too think the same. Thanks for explaining.



-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 3
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Comment-Date: Tue, 28 Nov 2017 22:58:29 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
File fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java:

http://gerrit.cloudera.org:8080/#/c/8569/3/fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java@524
PS3, Line 524: children_.get(1)
> I see. It seems dangerous to rely on knowing the implementation for Literal
Agree. You have a valid point, hence the change :)



-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 3
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Comment-Date: Wed, 06 Dec 2017 18:49:05 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Hello Bharath Vissapragada, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8569

to look at the new patch set (#3).

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................

IMPALA-5310: Part 2: Add SAMPLED_NDV() function.

Adds a new SAMPLED_NDV() aggregate function that is
intended to be used in COMPUTE STATS TABLESAMPLE.
This patch only adds the function itself. Integration
with COMPUTE STATS will come in a separate patch.

SAMPLED_NDV() estimates the number of distinct values (NDV)
based on a sample of data and the corresponding sampling rate.
The main idea is to collect several x/y data points where x is
the number of rows and y is the corresponding NDV estimate.
These data points are used to fit an objective function to the
data such that the true NDV can be extrapolated.
The aggregate function maintains a fixed number of HyperLogLog
intermediates to compute the x/y points.
Several objective functions are fit and the best-fit one is
used for extrapolation.

Adds the MPFIT C library to perform curve fitting:
https://www.physics.wisc.edu/~craigm/idl/cmpfit.html

The library is a C port from Fortran. Scipy uses the
Fortran version of the library for curve fitting.

Testing:
- added functional tests
- core/hdfs run passed

Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
---
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/thirdparty/mpfit/DISCLAIMER
A be/src/thirdparty/mpfit/README
A be/src/thirdparty/mpfit/mpfit.c
A be/src/thirdparty/mpfit/mpfit.h
M be/src/util/CMakeLists.txt
A be/src/util/mpfit-util.cc
A be/src/util/mpfit-util.h
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M tests/query_test/test_aggregation.py
15 files changed, 3,897 insertions(+), 19 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/8569/3
-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 3
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8569/9/bin/rat_exclude_files.txt
File bin/rat_exclude_files.txt:

http://gerrit.cloudera.org:8080/#/c/8569/9/bin/rat_exclude_files.txt@24
PS9, Line 24: be/src/thirdparty/mpfit/*
> Done
Sorry, I was unclear. I think it has to be copied to the LICENSE.txt file in the root directory. I expect you can leave the file in that directory as-is, but the text should be copied to /LICENSE.txt.

If you want you can update the comment on line 22 to clarify.



-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 9
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Comment-Date: Tue, 12 Dec 2017 18:01:14 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Jim Apple (Code Review)" <ge...@cloudera.org>.
Jim Apple has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 11: Code-Review+2

+1 from me on tidy and RAT, carry Dan's +2

Thanks - I know it's bureaucratic.


-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 11
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Comment-Date: Tue, 12 Dec 2017 18:16:10 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change. ( http://gerrit.cloudera.org:8080/8569 )

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................


Patch Set 5: Code-Review+2

That's clearer, thanks.


-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 5
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>
Gerrit-Comment-Date: Wed, 06 Dec 2017 20:35:24 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5310: Part 2: Add SAMPLED NDV() function.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Hello Bharath Vissapragada, Jim Apple, Dan Hecht, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8569

to look at the new patch set (#7).

Change subject: IMPALA-5310: Part 2: Add SAMPLED_NDV() function.
......................................................................

IMPALA-5310: Part 2: Add SAMPLED_NDV() function.

Adds a new SAMPLED_NDV() aggregate function that is
intended to be used in COMPUTE STATS TABLESAMPLE.
This patch only adds the function itself. Integration
with COMPUTE STATS will come in a separate patch.

SAMPLED_NDV() estimates the number of distinct values (NDV)
based on a sample of data and the corresponding sampling rate.
The main idea is to collect several x/y data points where x is
the number of rows and y is the corresponding NDV estimate.
These data points are used to fit an objective function to the
data such that the true NDV can be extrapolated.
The aggregate function maintains a fixed number of HyperLogLog
intermediates to compute the x/y points.
Several objective functions are fit and the best-fit one is
used for extrapolation.

Adds the MPFIT C library to perform curve fitting:
https://www.physics.wisc.edu/~craigm/idl/cmpfit.html

The library is a C port from Fortran. Scipy uses the
Fortran version of the library for curve fitting.

Testing:
- added functional tests
- core/hdfs run passed

Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
---
M NOTICE.txt
M be/src/exprs/aggregate-functions-ir.cc
M be/src/exprs/aggregate-functions.h
A be/src/thirdparty/mpfit/DISCLAIMER
A be/src/thirdparty/mpfit/README
A be/src/thirdparty/mpfit/mpfit.c
A be/src/thirdparty/mpfit/mpfit.h
M be/src/util/CMakeLists.txt
A be/src/util/mpfit-util.cc
A be/src/util/mpfit-util.h
M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java
M fe/src/main/java/org/apache/impala/catalog/BuiltinsDb.java
M fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
M fe/src/test/java/org/apache/impala/analysis/AnalyzeStmtsTest.java
M fe/src/test/java/org/apache/impala/common/FrontendTestBase.java
M tests/query_test/test_aggregation.py
16 files changed, 3,888 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/69/8569/7
-- 
To view, visit http://gerrit.cloudera.org:8080/8569
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia51d56ee67ec6073e92f90bebb4005484138b820
Gerrit-Change-Number: 8569
Gerrit-PatchSet: 7
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Bharath Vissapragada <bh...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Jim Apple <jb...@apache.org>