You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org> on 2018/02/07 17:56:19 UTC
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Zoltan Borok-Nagy has uploaded this change for review. ( http://gerrit.cloudera.org:8080/9239
Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................
IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Tuple pointers in the generated row batches may not be initialized
if a tuple has byte size 0. There are some codes which compare these
uninitialized pointers against nullptr so having them uninitialized
may return wrong (and non-deterministic) results, e.g.:
impala::TupleIsNullPredicate::GetBooleanVal()
The following query produces non-deterministic results currently:
SELECT count(v.x) FROM functional.alltypestiny t3 LEFT OUTER JOIN (
SELECT true AS x FROM functional.alltypestiny t1 LEFT OUTER JOIN
functional.alltypestiny t2 ON (true)) v ON (v.x = t3.bool_col)
WHERE t3.bool_col = true;
The alltypestiny table has 8 records, 4 records of them has the true
boolean value for bool_col. Therefore, the above query is a fancy
way of saying "8 * 8 * 4", i.e. it should return 256.
The solution is that scanners initialize tuple pointers to a non-null
value when there are no materialized slots. This non-null value is
provided by the new static method called Tuple::Poison().
I extended QueryTest/scanners.test with the above query. This test
executes the query against all table formats.
Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
---
M be/src/exec/hdfs-scanner.cc
M be/src/exec/kudu-scanner.cc
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M testdata/workloads/functional-query/queries/QueryTest/scanners.test
5 files changed, 47 insertions(+), 3 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/9239/1
--
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )
Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................
Patch Set 5:
Seems like it was hit by IMPALA-6532
--
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 16 Feb 2018 14:25:55 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )
Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................
Patch Set 6:
Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1948/
--
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 16 Feb 2018 16:41:10 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )
Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................
Patch Set 6: Code-Review+2
--
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 16 Feb 2018 16:40:31 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )
Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................
Patch Set 3:
(4 comments)
Logic of the patch looks good to me, did a final pass for stylistic nits. I'll +2 once those are fixed.
http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:
http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/exec/hdfs-scanner.cc@228
PS3, Line 228: // Let's not leave tuple ptrs uninitialized.
Can you add the JIRA to this comment to better explain the consequences of not doing this?
http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:
http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/exec/kudu-scanner.cc@259
PS3, Line 259: // Initialize tuple ptrs to a dummy non-null value
Can you add the JIRA there too?
http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/runtime/tuple.h
File be/src/runtime/tuple.h:
http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/runtime/tuple.h@89
PS3, Line 89: Poison
This should either be lower-case if we're treating it as a variable or upper-case if we're treating it as a constant. It's a weird edge case but it feels like it should be a constant to me.
http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/runtime/tuple.h@91
PS3, Line 91: void Init(int size) {
Unnecessary formatting change here.
--
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 12 Feb 2018 19:48:17 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )
Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................
Patch Set 5: Verified-1
Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/1947/
--
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 16 Feb 2018 09:45:53 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )
Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................
Patch Set 2:
(1 comment)
http://gerrit.cloudera.org:8080/#/c/9239/2//COMMIT_MSG
Commit Message:
http://gerrit.cloudera.org:8080/#/c/9239/2//COMMIT_MSG@57
PS2, Line 57: 00:SCAN KUDU 3 90.856ms 107.409ms 6.00M 6.00M 512.00 KB 0 tpch_kudu.lineitem
I tried to do a similar experiment with a larger Kudu scale factor (I created a new Kudu table like lineitem and expanded it by inserting duplicate data):
insert into biglineitem select l_orderkey + max_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment from biglineitem, (select max(l_orderkey) as max_orderkey from biglineitem) v
I can definitely see some time spent in the HandleEmptyProjection() function in "perf top" but the delta in performance seems smaller than your experiment showed. I saw it around 5% slower.
The count(*) optimisation sounds good but not sure if the regression is severe enough to block this going in. Maybe Thomas can weigh in on how important he thinks this is.
--
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 09 Feb 2018 23:22:38 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )
Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................
Patch Set 2:
(1 comment)
http://gerrit.cloudera.org:8080/#/c/9239/2//COMMIT_MSG
Commit Message:
http://gerrit.cloudera.org:8080/#/c/9239/2//COMMIT_MSG@57
PS2, Line 57: 00:SCAN KUDU 3 90.856ms 107.409ms 6.00M 6.00M 512.00 KB 0 tpch_kudu.lineitem
> I tried to do a similar experiment with a larger Kudu scale factor (I creat
Yes, I measured it against a debug build...
Re-run the query against release versions of Impala and now the difference is much smaller.
Updated the commit with the new data.
--
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 12 Feb 2018 16:26:03 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )
Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................
Patch Set 1:
(4 comments)
Thanks for the comments!
http://gerrit.cloudera.org:8080/#/c/9239/1//COMMIT_MSG
Commit Message:
http://gerrit.cloudera.org:8080/#/c/9239/1//COMMIT_MSG@32
PS1, Line 32:
> It would be good to do a microbenchmark to make sure we haven't regressed p
Extended the commit with a measurement. Do you think we should develop a count(*) optimisation for kudu like we do for Parquet?
http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.h
File be/src/runtime/tuple.h:
http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.h@225
PS1, Line 225: void SetNull(const NullIndicatorOffset& offset) {
> I'm not sure about adding DCHECKs to all these functions. They're definitel
Removed the DCHECKs, pointer value is 42 now, such a low address should segfault (it does segfault on my system).
http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.cc
File be/src/runtime/tuple.cc:
http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.cc@49
PS1, Line 49: Tuple* Tuple::Poison() {
> Can we avoid this function call indirection? MemPool just uses a bogus cons
Now it is just a pointer.
http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.cc@87
PS1, Line 87: if (UNLIKELY(this == Poison())) return Poison();
> We can skip this case. MemPool::Allocate() actually already returns a poiso
Yeah I saw that, just thought maybe it is better to keep our own poison (that sounds wrong:) ).
--
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 08 Feb 2018 17:28:12 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )
Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................
Patch Set 3:
I don't think there's a JIRA. Filed IMPALA-6501
--
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 12 Feb 2018 19:41:51 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )
Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................
Patch Set 3:
> I don't think there's a JIRA. Filed IMPALA-6501
Great, thanks.
--
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 12 Feb 2018 19:45:25 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )
Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................
Patch Set 3:
(5 comments)
http://gerrit.cloudera.org:8080/#/c/9239/2//COMMIT_MSG
Commit Message:
http://gerrit.cloudera.org:8080/#/c/9239/2//COMMIT_MSG@57
PS2, Line 57: | 03:AGGREGATE | 1 | 129.01us | 129.01us | 1 | 1 | 28.00 KB | 10.00 MB | FINALIZE |
> With the new numbers, I think that this seems okay.
Done, Tim filed IMPALA-6501
http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:
http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/exec/hdfs-scanner.cc@228
PS3, Line 228: // Let's not leave tuple ptrs uninitialized.
> Can you add the JIRA to this comment to better explain the consequences of
Added the JIRA number
http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:
http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/exec/kudu-scanner.cc@259
PS3, Line 259: // Initialize tuple ptrs to a dummy non-null value
> Can you add the JIRA there too?
yes, added it
http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/runtime/tuple.h
File be/src/runtime/tuple.h:
http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/runtime/tuple.h@89
PS3, Line 89: Poison
> This should either be lower-case if we're treating it as a variable or uppe
Renamed it to all upper-case, I also think it is more of a constant. And we'll never modify the pointed object since it would raise an error instantly.
http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/runtime/tuple.h@91
PS3, Line 91: void Init(int size) {
> Unnecessary formatting change here.
Oops, it became multi-line when I added those DCHECKs everywhere.
--
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 13 Feb 2018 17:39:20 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Thomas Tauber-Marshall, Tim Armstrong,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/9239
to look at the new patch set (#4).
Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................
IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Tuple pointers in the generated row batches may not be initialized
if a tuple has byte size 0. There are some codes which compare these
uninitialized pointers against nullptr so having them uninitialized
may return wrong (and non-deterministic) results, e.g.:
impala::TupleIsNullPredicate::GetBooleanVal()
The following query produces non-deterministic results currently:
SELECT count(v.x) FROM functional.alltypestiny t3 LEFT OUTER JOIN (
SELECT true AS x FROM functional.alltypestiny t1 LEFT OUTER JOIN
functional.alltypestiny t2 ON (true)) v ON (v.x = t3.bool_col)
WHERE t3.bool_col = true;
The alltypestiny table has 8 records, 4 records of them has the true
boolean value for bool_col. Therefore, the above query is a fancy
way of saying "8 * 8 * 4", i.e. it should return 256.
The solution is that scanners initialize tuple pointers to a non-null
value when there are no materialized slots. This non-null value is
provided by the new static member Tuple::POISON.
I extended QueryTest/scanners.test with the above query. This test
executes the query against all table formats.
This change has the biggest performance impact on count(*) queries on
large kudu tables. For my quick benchmark I copied tpch_kudu.lineitem
and doubled its data. The resulting table has 12,002,430 rows.
Without this patch 'select count(*) from biglineitem' runs for ~0.12s.
With the patch applied, the overhead is around a dozens of ms. I measured
the query on my desktop PC using a relase build of Impala. On debug builds,
the execution time of the patched version is around 160% of the original
version.
Without this patch:
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| 03:AGGREGATE | 1 | 127.50us | 127.50us | 1 | 1 | 28.00 KB | 10.00 MB | FINALIZE |
| 02:EXCHANGE | 1 | 22.32ms | 22.32ms | 3 | 1 | 0 B | 0 B | UNPARTITIONED |
| 01:AGGREGATE | 3 | 1.78ms | 1.89ms | 3 | 1 | 16.00 KB | 10.00 MB | |
| 00:SCAN KUDU | 3 | 8.00ms | 8.28ms | 12.00M | -1 | 512.00 KB | 0 B | default.biglineitem |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
With this patch:
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| 03:AGGREGATE | 1 | 129.01us | 129.01us | 1 | 1 | 28.00 KB | 10.00 MB | FINALIZE |
| 02:EXCHANGE | 1 | 33.00ms | 33.00ms | 3 | 1 | 0 B | 0 B | UNPARTITIONED |
| 01:AGGREGATE | 3 | 1.99ms | 2.13ms | 3 | 1 | 16.00 KB | 10.00 MB | |
| 00:SCAN KUDU | 3 | 13.13ms | 13.97ms | 12.00M | -1 | 512.00 KB | 0 B | default.biglineitem |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
---
M be/src/exec/hdfs-scanner.cc
M be/src/exec/kudu-scanner.cc
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M testdata/workloads/functional-query/queries/QueryTest/scanners.test
5 files changed, 27 insertions(+), 2 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/9239/4
--
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )
Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................
Patch Set 5: Code-Review+2
--
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 16 Feb 2018 06:03:35 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )
Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................
Patch Set 6: Verified+1
--
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 16 Feb 2018 20:23:20 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )
Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................
Patch Set 5:
Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1947/
--
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 16 Feb 2018 06:03:47 +0000
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Thomas Tauber-Marshall, Tim Armstrong,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/9239
to look at the new patch set (#3).
Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................
IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Tuple pointers in the generated row batches may not be initialized
if a tuple has byte size 0. There are some codes which compare these
uninitialized pointers against nullptr so having them uninitialized
may return wrong (and non-deterministic) results, e.g.:
impala::TupleIsNullPredicate::GetBooleanVal()
The following query produces non-deterministic results currently:
SELECT count(v.x) FROM functional.alltypestiny t3 LEFT OUTER JOIN (
SELECT true AS x FROM functional.alltypestiny t1 LEFT OUTER JOIN
functional.alltypestiny t2 ON (true)) v ON (v.x = t3.bool_col)
WHERE t3.bool_col = true;
The alltypestiny table has 8 records, 4 records of them has the true
boolean value for bool_col. Therefore, the above query is a fancy
way of saying "8 * 8 * 4", i.e. it should return 256.
The solution is that scanners initialize tuple pointers to a non-null
value when there are no materialized slots. This non-null value is
provided by the new static method called Tuple::Poison().
I extended QueryTest/scanners.test with the above query. This test
executes the query against all table formats.
This change has the biggest performance impact on count(*) queries on
large kudu tables. For my quick benchmark I copied tpch_kudu.lineitem
and doubled its data. The resulting table has 12,002,430 rows.
Without this patch 'select count(*) from biglineitem' runs for ~0.12s.
With the patch applied, the overhead is around a dozens of ms. I measured
the query on my desktop PC using a relase build of Impala. On debug builds,
the execution time of the patched version is around 160% of the original
version.
Without this patch:
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| 03:AGGREGATE | 1 | 127.50us | 127.50us | 1 | 1 | 28.00 KB | 10.00 MB | FINALIZE |
| 02:EXCHANGE | 1 | 22.32ms | 22.32ms | 3 | 1 | 0 B | 0 B | UNPARTITIONED |
| 01:AGGREGATE | 3 | 1.78ms | 1.89ms | 3 | 1 | 16.00 KB | 10.00 MB | |
| 00:SCAN KUDU | 3 | 8.00ms | 8.28ms | 12.00M | -1 | 512.00 KB | 0 B | default.biglineitem |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
With this patch:
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| 03:AGGREGATE | 1 | 129.01us | 129.01us | 1 | 1 | 28.00 KB | 10.00 MB | FINALIZE |
| 02:EXCHANGE | 1 | 33.00ms | 33.00ms | 3 | 1 | 0 B | 0 B | UNPARTITIONED |
| 01:AGGREGATE | 3 | 1.99ms | 2.13ms | 3 | 1 | 16.00 KB | 10.00 MB | |
| 00:SCAN KUDU | 3 | 13.13ms | 13.97ms | 12.00M | -1 | 512.00 KB | 0 B | default.biglineitem |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
---
M be/src/exec/hdfs-scanner.cc
M be/src/exec/kudu-scanner.cc
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M testdata/workloads/functional-query/queries/QueryTest/scanners.test
5 files changed, 30 insertions(+), 3 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/9239/3
--
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )
Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................
Patch Set 1:
(4 comments)
Looks good. Mainly just questions about perf.
http://gerrit.cloudera.org:8080/#/c/9239/1//COMMIT_MSG
Commit Message:
http://gerrit.cloudera.org:8080/#/c/9239/1//COMMIT_MSG@32
PS1, Line 32:
It would be good to do a microbenchmark to make sure we haven't regressed performance (or if we have, so that we know how much). I was trying to think of the worst possible case. I think it might be count(*) from a large kudu table, since I think that can be served from metadata only and we currently create a bunch of empty row batches.
Parquet has a special count(*) optimisation and other file formats have to decode rows to compute the count, so they seem less likely to regress.
http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.h
File be/src/runtime/tuple.h:
http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.h@225
PS1, Line 225: void SetNull(const NullIndicatorOffset& offset) {
I'm not sure about adding DCHECKs to all these functions. They're definitely valid but I'm concerned there might be a real impact on the performance of the debug build (which can be annoying for testing, etc). If we set the pointer to a bogus constant we could potentially catch the same set of bugs via the dereference of invalid memory.
http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.cc
File be/src/runtime/tuple.cc:
http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.cc@49
PS1, Line 49: Tuple* Tuple::Poison() {
Can we avoid this function call indirection? MemPool just uses a bogus constant pointer value for this purpose.
http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.cc@87
PS1, Line 87: if (UNLIKELY(this == Poison())) return Poison();
We can skip this case. MemPool::Allocate() actually already returns a poison value for 0 allocations.
--
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 07 Feb 2018 23:53:16 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/9239 )
Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................
IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Tuple pointers in the generated row batches may not be initialized
if a tuple has byte size 0. There are some codes which compare these
uninitialized pointers against nullptr so having them uninitialized
may return wrong (and non-deterministic) results, e.g.:
impala::TupleIsNullPredicate::GetBooleanVal()
The following query produces non-deterministic results currently:
SELECT count(v.x) FROM functional.alltypestiny t3 LEFT OUTER JOIN (
SELECT true AS x FROM functional.alltypestiny t1 LEFT OUTER JOIN
functional.alltypestiny t2 ON (true)) v ON (v.x = t3.bool_col)
WHERE t3.bool_col = true;
The alltypestiny table has 8 records, 4 records of them has the true
boolean value for bool_col. Therefore, the above query is a fancy
way of saying "8 * 8 * 4", i.e. it should return 256.
The solution is that scanners initialize tuple pointers to a non-null
value when there are no materialized slots. This non-null value is
provided by the new static member Tuple::POISON.
I extended QueryTest/scanners.test with the above query. This test
executes the query against all table formats.
This change has the biggest performance impact on count(*) queries on
large kudu tables. For my quick benchmark I copied tpch_kudu.lineitem
and doubled its data. The resulting table has 12,002,430 rows.
Without this patch 'select count(*) from biglineitem' runs for ~0.12s.
With the patch applied, the overhead is around a dozens of ms. I measured
the query on my desktop PC using a relase build of Impala. On debug builds,
the execution time of the patched version is around 160% of the original
version.
Without this patch:
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| 03:AGGREGATE | 1 | 127.50us | 127.50us | 1 | 1 | 28.00 KB | 10.00 MB | FINALIZE |
| 02:EXCHANGE | 1 | 22.32ms | 22.32ms | 3 | 1 | 0 B | 0 B | UNPARTITIONED |
| 01:AGGREGATE | 3 | 1.78ms | 1.89ms | 3 | 1 | 16.00 KB | 10.00 MB | |
| 00:SCAN KUDU | 3 | 8.00ms | 8.28ms | 12.00M | -1 | 512.00 KB | 0 B | default.biglineitem |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
With this patch:
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| Operator | #Hosts | Avg Time | Max Time | #Rows | Est. #Rows | Peak Mem | Est. Peak Mem | Detail |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| 03:AGGREGATE | 1 | 129.01us | 129.01us | 1 | 1 | 28.00 KB | 10.00 MB | FINALIZE |
| 02:EXCHANGE | 1 | 33.00ms | 33.00ms | 3 | 1 | 0 B | 0 B | UNPARTITIONED |
| 01:AGGREGATE | 3 | 1.99ms | 2.13ms | 3 | 1 | 16.00 KB | 10.00 MB | |
| 00:SCAN KUDU | 3 | 13.13ms | 13.97ms | 12.00M | -1 | 512.00 KB | 0 B | default.biglineitem |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Reviewed-on: http://gerrit.cloudera.org:8080/9239
Reviewed-by: Tim Armstrong <ta...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M be/src/exec/hdfs-scanner.cc
M be/src/exec/kudu-scanner.cc
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M testdata/workloads/functional-query/queries/QueryTest/scanners.test
5 files changed, 27 insertions(+), 2 deletions(-)
Approvals:
Tim Armstrong: Looks good to me, approved
Impala Public Jenkins: Verified
--
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )
Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................
Patch Set 3:
(1 comment)
http://gerrit.cloudera.org:8080/#/c/9239/2//COMMIT_MSG
Commit Message:
http://gerrit.cloudera.org:8080/#/c/9239/2//COMMIT_MSG@57
PS2, Line 57: | 03:AGGREGATE | 1 | 129.01us | 129.01us | 1 | 1 | 28.00 KB | 10.00 MB | FINALIZE |
> Yes, I measured it against a debug build...
With the new numbers, I think that this seems okay.
Is there a JIRA for the count(*) optimization for Kudu? (and if not, could you file one?)
--
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 12 Feb 2018 17:25:27 +0000
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/9239
to look at the new patch set (#2).
Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................
IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
Tuple pointers in the generated row batches may not be initialized
if a tuple has byte size 0. There are some codes which compare these
uninitialized pointers against nullptr so having them uninitialized
may return wrong (and non-deterministic) results, e.g.:
impala::TupleIsNullPredicate::GetBooleanVal()
The following query produces non-deterministic results currently:
SELECT count(v.x) FROM functional.alltypestiny t3 LEFT OUTER JOIN (
SELECT true AS x FROM functional.alltypestiny t1 LEFT OUTER JOIN
functional.alltypestiny t2 ON (true)) v ON (v.x = t3.bool_col)
WHERE t3.bool_col = true;
The alltypestiny table has 8 records, 4 records of them has the true
boolean value for bool_col. Therefore, the above query is a fancy
way of saying "8 * 8 * 4", i.e. it should return 256.
The solution is that scanners initialize tuple pointers to a non-null
value when there are no materialized slots. This non-null value is
provided by the new static method called Tuple::Poison().
I extended QueryTest/scanners.test with the above query. This test
executes the query against all table formats.
This change has the biggest performance impact on count(*) queries on
large kudu tables. For my quick benchmark I chose tpch_kudu.lineitem
which has 6001215 rows.
Without this patch 'select count(*) from tpch_kudu.lineitem' runs
around 0.15s. With the patch applied, it runs around 0.24s. I ran the
query on my desktop PC.
Without this patch:
ExecSummary:
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail
-------------------------------------------------------------------------------------------------------------
03:AGGREGATE 1 432.840us 432.840us 1 1 28.00 KB 10.00 MB FINALIZE
02:EXCHANGE 1 31.375ms 31.375ms 3 1 0 0 UNPARTITIONED
01:AGGREGATE 3 5.013ms 5.502ms 3 1 16.00 KB 10.00 MB
00:SCAN KUDU 3 16.581ms 20.199ms 6.00M 6.00M 512.00 KB 0 tpch_kudu.lineitem
With this patch:
ExecSummary:
Operator #Hosts Avg Time Max Time #Rows Est. #Rows Peak Mem Est. Peak Mem Detail
-------------------------------------------------------------------------------------------------------------
03:AGGREGATE 1 472.898us 472.898us 1 1 28.00 KB 10.00 MB FINALIZE
02:EXCHANGE 1 122.700ms 122.700ms 3 1 0 0 UNPARTITIONED
01:AGGREGATE 3 6.104ms 6.151ms 3 1 16.00 KB 10.00 MB
00:SCAN KUDU 3 90.856ms 107.409ms 6.00M 6.00M 512.00 KB 0 tpch_kudu.lineitem
Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
---
M be/src/exec/hdfs-scanner.cc
M be/src/exec/kudu-scanner.cc
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M testdata/workloads/functional-query/queries/QueryTest/scanners.test
5 files changed, 30 insertions(+), 3 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/9239/2
--
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>