You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Anonymous Coward (Code Review)" <ge...@cloudera.org> on 2020/02/27 04:01:15 UTC

[Impala-ASF-CR] IMPALA-8205: Support number of true and false statistics for boolean column

583424568@qq.com has posted comments on this change. ( http://gerrit.cloudera.org:8080/14666 )

Change subject: IMPALA-8205:  Support number of true and false statistics for boolean column
......................................................................


Patch Set 6:

(11 comments)

> (8 comments)
 > 
 > does this include computing stats for boolean columns in Kudu as
 > well? (if not thats okay, just wondering)

No, I didn't consider kudu in this patch currently;
Do you think it is valuable or necessary if I make this feature also support kudu in another patch?

http://gerrit.cloudera.org:8080/#/c/14666/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14666/1//COMMIT_MSG@9
PS1, Line 9: 
> Please wrap the commit message at 72 characters per line and replace ";" wi
Done


http://gerrit.cloudera.org:8080/#/c/14666/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14666/3//COMMIT_MSG@11
PS3, Line 11: .
> nit: .
Done


http://gerrit.cloudera.org:8080/#/c/14666/1/be/src/exec/catalog-op-executor.cc
File be/src/exec/catalog-op-executor.cc:

http://gerrit.cloudera.org:8080/#/c/14666/1/be/src/exec/catalog-op-executor.cc@286
PS1, Line 286: ,
> nit: .
Done


http://gerrit.cloudera.org:8080/#/c/14666/1/be/src/exec/incr-stats-util.cc
File be/src/exec/incr-stats-util.cc:

http://gerrit.cloudera.org:8080/#/c/14666/1/be/src/exec/incr-stats-util.cc@164
PS1, Line 164: num_trues
> num_new_trues?
Done


http://gerrit.cloudera.org:8080/#/c/14666/1/be/src/exec/incr-stats-util.cc@165
PS1, Line 165: num_falses
> num_new_falses?
Done


http://gerrit.cloudera.org:8080/#/c/14666/1/common/thrift/CatalogObjects.thrift
File common/thrift/CatalogObjects.thrift:

http://gerrit.cloudera.org:8080/#/c/14666/1/common/thrift/CatalogObjects.thrift@172
PS1, Line 172: required
> Shoule this be optional? It's just set for boolean column.
Even for non-boolean columns, we will return -1 to indicate the statistics is missing;
I will double check it;


http://gerrit.cloudera.org:8080/#/c/14666/6/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java
File fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java:

http://gerrit.cloudera.org:8080/#/c/14666/6/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@293
PS6, Line 293: "COUNT(CASE WHEN " + colRefSql + " = TRUE THEN 1 ELSE NULL END)");
> why not just use "sum(" + colRefSql ")");? I'm not sure if one way is bette
Hi Sahil, I do conduct a test to compare their performance;
Conclusion: I didn't find obvious performance difference;

Below is my test detail:

I made test based on parquet and textfile tables;
The queries are as below 

I added a boolean column for a tpch table which is previously used for tpch test which 60 billion lines;

Query 1:
select sum(case when bool_test = true then 1 else 0 end) as numTrue, sum(case when bool_test = false then 1 else 0 end) as numFalse from [TEST TABLE];


Query 2:
 select sum(bool_test) as numTrue, count(bool_test) - sum(bool_test)  from lineitem_boolean_text;
Query: select sum(bool_test) as numTrue, count(bool_test) - sum(bool_test)  from [TEST TABLE]

In my impala cluster, both above queries' runtime varies from 4s ~ 10s, no obvious difference occured; Somtimes the Query 1 is faster than Query 2, and somtimes the opposite.


http://gerrit.cloudera.org:8080/#/c/14666/6/fe/src/main/java/org/apache/impala/analysis/ComputeStatsStmt.java@297
PS6, Line 297:  columnStatsSelectList.add("NULL");
             :         columnStatsSelectList.add("NULL");
> what is this for?
The null is used to mark that this column is not a boolean column so the numTrue and numFalse stats is meaningless;
You know that numTrue and numFalse is very different from other columns because it definitely depends on the column type. For non-boolean column, we use NULL to mark it.


http://gerrit.cloudera.org:8080/#/c/14666/1/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
File fe/src/main/java/org/apache/impala/catalog/ColumnStats.java:

http://gerrit.cloudera.org:8080/#/c/14666/1/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java@512
PS1, Line 512: ing();
> nit: numTrues_
Done


http://gerrit.cloudera.org:8080/#/c/14666/1/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java@513
PS1, Line 513: 
> nit: numFalses_
Done


http://gerrit.cloudera.org:8080/#/c/14666/3/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java
File fe/src/main/java/org/apache/impala/catalog/ColumnStats.java:

http://gerrit.cloudera.org:8080/#/c/14666/3/fe/src/main/java/org/apache/impala/catalog/ColumnStats.java@39
PS3, Line 39: /**
> Unused imports?
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/14666
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I991bee8e7fdc644d908289f5fe2ee8032cc2c431
Gerrit-Change-Number: 14666
Gerrit-PatchSet: 6
Gerrit-Owner: Anonymous Coward <58...@qq.com>
Gerrit-Reviewer: Anonymous Coward <58...@qq.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Sahil Takiar <st...@cloudera.com>
Gerrit-Comment-Date: Thu, 27 Feb 2020 04:01:15 +0000
Gerrit-HasComments: Yes