You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org> on 2018/02/07 17:56:19 UTC

[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Zoltan Borok-Nagy has uploaded this change for review. ( http://gerrit.cloudera.org:8080/9239


Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................

IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Tuple pointers in the generated row batches may not be initialized
if a tuple has byte size 0. There are some codes which compare these
uninitialized pointers against nullptr so having them uninitialized
may return wrong (and non-deterministic) results, e.g.:
impala::TupleIsNullPredicate::GetBooleanVal()

The following query produces non-deterministic results currently:

SELECT count(v.x) FROM functional.alltypestiny t3 LEFT OUTER JOIN (
  SELECT true AS x FROM functional.alltypestiny t1 LEFT OUTER JOIN
  functional.alltypestiny t2 ON (true)) v ON (v.x = t3.bool_col)
WHERE t3.bool_col = true;

The alltypestiny table has 8 records, 4 records of them has the true
boolean value for bool_col. Therefore, the above query is a fancy
way of saying "8 * 8 * 4", i.e. it should return 256.

The solution is that scanners initialize tuple pointers to a non-null
value when there are no materialized slots. This non-null value is
provided by the new static method called Tuple::Poison().

I extended QueryTest/scanners.test with the above query. This test
executes the query against all table formats.

Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
---
M be/src/exec/hdfs-scanner.cc
M be/src/exec/kudu-scanner.cc
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M testdata/workloads/functional-query/queries/QueryTest/scanners.test
5 files changed, 47 insertions(+), 3 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/9239/1
-- 
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )

Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................


Patch Set 5:

Seems like it was hit by IMPALA-6532


-- 
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 16 Feb 2018 14:25:55 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )

Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................


Patch Set 6:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1948/


-- 
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 16 Feb 2018 16:41:10 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )

Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................


Patch Set 6: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 16 Feb 2018 16:40:31 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )

Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................


Patch Set 3:

(4 comments)

Logic of the patch looks good to me, did a final pass for stylistic nits. I'll +2 once those are fixed.

http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/exec/hdfs-scanner.cc@228
PS3, Line 228:     // Let's not leave tuple ptrs uninitialized.
Can you add the JIRA to this comment to better explain the consequences of not doing this?


http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/exec/kudu-scanner.cc@259
PS3, Line 259:     // Initialize tuple ptrs to a dummy non-null value
Can you add the JIRA there too?


http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/runtime/tuple.h
File be/src/runtime/tuple.h:

http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/runtime/tuple.h@89
PS3, Line 89: Poison
This should either be lower-case if we're treating it as a variable or upper-case if we're treating it as a constant. It's a weird edge case but it feels like it should be a constant to me.


http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/runtime/tuple.h@91
PS3, Line 91:   void Init(int size) {
Unnecessary formatting change here.



-- 
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 12 Feb 2018 19:48:17 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )

Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................


Patch Set 5: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/1947/


-- 
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 16 Feb 2018 09:45:53 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )

Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9239/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9239/2//COMMIT_MSG@57
PS2, Line 57: 00:SCAN KUDU        3   90.856ms  107.409ms  6.00M       6.00M  512.00 KB              0  tpch_kudu.lineitem
I tried to do a similar experiment with a larger Kudu scale factor (I created a new Kudu table like lineitem and expanded it by inserting duplicate data):

   insert into biglineitem select l_orderkey + max_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus, l_shipdate, l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment from biglineitem, (select max(l_orderkey) as max_orderkey from biglineitem) v

I can definitely see some time spent in the HandleEmptyProjection() function in "perf top" but the delta in performance seems smaller than your experiment showed. I saw it around 5% slower.

The count(*) optimisation sounds good but not sure if the regression is severe enough to block this going in. Maybe Thomas can weigh in on how important he thinks this is.



-- 
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 09 Feb 2018 23:22:38 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )

Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9239/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9239/2//COMMIT_MSG@57
PS2, Line 57: 00:SCAN KUDU        3   90.856ms  107.409ms  6.00M       6.00M  512.00 KB              0  tpch_kudu.lineitem
> I tried to do a similar experiment with a larger Kudu scale factor (I creat
Yes, I measured it against a debug build...
Re-run the query against release versions of Impala and now the difference is much smaller.
Updated the commit with the new data.



-- 
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 12 Feb 2018 16:26:03 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )

Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................


Patch Set 1:

(4 comments)

Thanks for the comments!

http://gerrit.cloudera.org:8080/#/c/9239/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9239/1//COMMIT_MSG@32
PS1, Line 32: 
> It would be good to do a microbenchmark to make sure we haven't regressed p
Extended the commit with a measurement. Do you think we should develop a count(*) optimisation for kudu like we do for Parquet?


http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.h
File be/src/runtime/tuple.h:

http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.h@225
PS1, Line 225:   void SetNull(const NullIndicatorOffset& offset) {
> I'm not sure about adding DCHECKs to all these functions. They're definitel
Removed the DCHECKs, pointer value is 42 now, such a low address should segfault (it does segfault on my system).


http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.cc
File be/src/runtime/tuple.cc:

http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.cc@49
PS1, Line 49: Tuple* Tuple::Poison() {
> Can we avoid this function call indirection? MemPool just uses a bogus cons
Now it is just a pointer.


http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.cc@87
PS1, Line 87:   if (UNLIKELY(this == Poison())) return Poison();
> We can skip this case. MemPool::Allocate() actually already returns a poiso
Yeah I saw that, just thought maybe it is better to keep our own poison (that sounds wrong:) ).



-- 
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 08 Feb 2018 17:28:12 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )

Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................


Patch Set 3:

I don't think there's a JIRA. Filed IMPALA-6501


-- 
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 12 Feb 2018 19:41:51 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )

Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................


Patch Set 3:

> I don't think there's a JIRA. Filed IMPALA-6501

Great, thanks.


-- 
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 12 Feb 2018 19:45:25 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )

Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................


Patch Set 3:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/9239/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9239/2//COMMIT_MSG@57
PS2, Line 57: | 03:AGGREGATE | 1      | 129.01us | 129.01us | 1      | 1          | 28.00 KB  | 10.00 MB      | FINALIZE            |
> With the new numbers, I think that this seems okay.
Done, Tim filed IMPALA-6501


http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/exec/hdfs-scanner.cc@228
PS3, Line 228:     // Let's not leave tuple ptrs uninitialized.
> Can you add the JIRA to this comment to better explain the consequences of 
Added the JIRA number


http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/exec/kudu-scanner.cc@259
PS3, Line 259:     // Initialize tuple ptrs to a dummy non-null value
> Can you add the JIRA there too?
yes, added it


http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/runtime/tuple.h
File be/src/runtime/tuple.h:

http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/runtime/tuple.h@89
PS3, Line 89: Poison
> This should either be lower-case if we're treating it as a variable or uppe
Renamed it to all upper-case, I also think it is more of a constant. And we'll never modify the pointed object since it would raise an error instantly.


http://gerrit.cloudera.org:8080/#/c/9239/3/be/src/runtime/tuple.h@91
PS3, Line 91:   void Init(int size) {
> Unnecessary formatting change here.
Oops, it became multi-line when I added those DCHECKs everywhere.



-- 
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 13 Feb 2018 17:39:20 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Thomas Tauber-Marshall, Tim Armstrong, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/9239

to look at the new patch set (#4).

Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................

IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Tuple pointers in the generated row batches may not be initialized
if a tuple has byte size 0. There are some codes which compare these
uninitialized pointers against nullptr so having them uninitialized
may return wrong (and non-deterministic) results, e.g.:
impala::TupleIsNullPredicate::GetBooleanVal()

The following query produces non-deterministic results currently:

SELECT count(v.x) FROM functional.alltypestiny t3 LEFT OUTER JOIN (
  SELECT true AS x FROM functional.alltypestiny t1 LEFT OUTER JOIN
  functional.alltypestiny t2 ON (true)) v ON (v.x = t3.bool_col)
WHERE t3.bool_col = true;

The alltypestiny table has 8 records, 4 records of them has the true
boolean value for bool_col. Therefore, the above query is a fancy
way of saying "8 * 8 * 4", i.e. it should return 256.

The solution is that scanners initialize tuple pointers to a non-null
value when there are no materialized slots. This non-null value is
provided by the new static member Tuple::POISON.

I extended QueryTest/scanners.test with the above query. This test
executes the query against all table formats.

This change has the biggest performance impact on count(*) queries on
large kudu tables. For my quick benchmark I copied tpch_kudu.lineitem
and doubled its data. The resulting table has 12,002,430 rows.

Without this patch 'select count(*) from biglineitem' runs for ~0.12s.
With the patch applied, the overhead is around a dozens of ms. I measured
the query on my desktop PC using a relase build of Impala. On debug builds,
the execution time of the patched version is around 160% of the original
version.

Without this patch:
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| Operator     | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail              |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| 03:AGGREGATE | 1      | 127.50us | 127.50us | 1      | 1          | 28.00 KB  | 10.00 MB      | FINALIZE            |
| 02:EXCHANGE  | 1      | 22.32ms  | 22.32ms  | 3      | 1          | 0 B       | 0 B           | UNPARTITIONED       |
| 01:AGGREGATE | 3      | 1.78ms   | 1.89ms   | 3      | 1          | 16.00 KB  | 10.00 MB      |                     |
| 00:SCAN KUDU | 3      | 8.00ms   | 8.28ms   | 12.00M | -1         | 512.00 KB | 0 B           | default.biglineitem |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+

With this patch:
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| Operator     | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail              |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| 03:AGGREGATE | 1      | 129.01us | 129.01us | 1      | 1          | 28.00 KB  | 10.00 MB      | FINALIZE            |
| 02:EXCHANGE  | 1      | 33.00ms  | 33.00ms  | 3      | 1          | 0 B       | 0 B           | UNPARTITIONED       |
| 01:AGGREGATE | 3      | 1.99ms   | 2.13ms   | 3      | 1          | 16.00 KB  | 10.00 MB      |                     |
| 00:SCAN KUDU | 3      | 13.13ms  | 13.97ms  | 12.00M | -1         | 512.00 KB | 0 B           | default.biglineitem |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+

Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
---
M be/src/exec/hdfs-scanner.cc
M be/src/exec/kudu-scanner.cc
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M testdata/workloads/functional-query/queries/QueryTest/scanners.test
5 files changed, 27 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/9239/4
-- 
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )

Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................


Patch Set 5: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 16 Feb 2018 06:03:35 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )

Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................


Patch Set 6: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 16 Feb 2018 20:23:20 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )

Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................


Patch Set 5:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1947/


-- 
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 16 Feb 2018 06:03:47 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Thomas Tauber-Marshall, Tim Armstrong, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/9239

to look at the new patch set (#3).

Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................

IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Tuple pointers in the generated row batches may not be initialized
if a tuple has byte size 0. There are some codes which compare these
uninitialized pointers against nullptr so having them uninitialized
may return wrong (and non-deterministic) results, e.g.:
impala::TupleIsNullPredicate::GetBooleanVal()

The following query produces non-deterministic results currently:

SELECT count(v.x) FROM functional.alltypestiny t3 LEFT OUTER JOIN (
  SELECT true AS x FROM functional.alltypestiny t1 LEFT OUTER JOIN
  functional.alltypestiny t2 ON (true)) v ON (v.x = t3.bool_col)
WHERE t3.bool_col = true;

The alltypestiny table has 8 records, 4 records of them has the true
boolean value for bool_col. Therefore, the above query is a fancy
way of saying "8 * 8 * 4", i.e. it should return 256.

The solution is that scanners initialize tuple pointers to a non-null
value when there are no materialized slots. This non-null value is
provided by the new static method called Tuple::Poison().

I extended QueryTest/scanners.test with the above query. This test
executes the query against all table formats.

This change has the biggest performance impact on count(*) queries on
large kudu tables. For my quick benchmark I copied tpch_kudu.lineitem
and doubled its data. The resulting table has 12,002,430 rows.

Without this patch 'select count(*) from biglineitem' runs for ~0.12s.
With the patch applied, the overhead is around a dozens of ms. I measured
the query on my desktop PC using a relase build of Impala. On debug builds,
the execution time of the patched version is around 160% of the original
version.

Without this patch:
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| Operator     | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail              |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| 03:AGGREGATE | 1      | 127.50us | 127.50us | 1      | 1          | 28.00 KB  | 10.00 MB      | FINALIZE            |
| 02:EXCHANGE  | 1      | 22.32ms  | 22.32ms  | 3      | 1          | 0 B       | 0 B           | UNPARTITIONED       |
| 01:AGGREGATE | 3      | 1.78ms   | 1.89ms   | 3      | 1          | 16.00 KB  | 10.00 MB      |                     |
| 00:SCAN KUDU | 3      | 8.00ms   | 8.28ms   | 12.00M | -1         | 512.00 KB | 0 B           | default.biglineitem |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+

With this patch:
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| Operator     | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail              |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| 03:AGGREGATE | 1      | 129.01us | 129.01us | 1      | 1          | 28.00 KB  | 10.00 MB      | FINALIZE            |
| 02:EXCHANGE  | 1      | 33.00ms  | 33.00ms  | 3      | 1          | 0 B       | 0 B           | UNPARTITIONED       |
| 01:AGGREGATE | 3      | 1.99ms   | 2.13ms   | 3      | 1          | 16.00 KB  | 10.00 MB      |                     |
| 00:SCAN KUDU | 3      | 13.13ms  | 13.97ms  | 12.00M | -1         | 512.00 KB | 0 B           | default.biglineitem |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+

Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
---
M be/src/exec/hdfs-scanner.cc
M be/src/exec/kudu-scanner.cc
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M testdata/workloads/functional-query/queries/QueryTest/scanners.test
5 files changed, 30 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/9239/3
-- 
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )

Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................


Patch Set 1:

(4 comments)

Looks good. Mainly just questions about perf.

http://gerrit.cloudera.org:8080/#/c/9239/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9239/1//COMMIT_MSG@32
PS1, Line 32: 
It would be good to do a microbenchmark to make sure we haven't regressed performance (or if we have, so that we know how much). I was trying to think of the worst possible case. I think it might be count(*) from a large kudu table, since I think that can be served from metadata only and we currently create a bunch of empty row batches.

Parquet has a special count(*) optimisation and other file formats have to decode rows to compute the count, so they seem less likely to regress.


http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.h
File be/src/runtime/tuple.h:

http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.h@225
PS1, Line 225:   void SetNull(const NullIndicatorOffset& offset) {
I'm not sure about adding DCHECKs to all these functions. They're definitely valid but I'm concerned there might be a real impact on the performance of the debug build (which can be annoying for testing, etc). If we set the pointer to a bogus constant we could potentially catch the same set of bugs via the dereference of invalid memory.


http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.cc
File be/src/runtime/tuple.cc:

http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.cc@49
PS1, Line 49: Tuple* Tuple::Poison() {
Can we avoid this function call indirection? MemPool just uses a bogus constant pointer value for this purpose.


http://gerrit.cloudera.org:8080/#/c/9239/1/be/src/runtime/tuple.cc@87
PS1, Line 87:   if (UNLIKELY(this == Poison())) return Poison();
We can skip this case. MemPool::Allocate() actually already returns a poison value for 0 allocations.



-- 
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 07 Feb 2018 23:53:16 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/9239 )

Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................

IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Tuple pointers in the generated row batches may not be initialized
if a tuple has byte size 0. There are some codes which compare these
uninitialized pointers against nullptr so having them uninitialized
may return wrong (and non-deterministic) results, e.g.:
impala::TupleIsNullPredicate::GetBooleanVal()

The following query produces non-deterministic results currently:

SELECT count(v.x) FROM functional.alltypestiny t3 LEFT OUTER JOIN (
  SELECT true AS x FROM functional.alltypestiny t1 LEFT OUTER JOIN
  functional.alltypestiny t2 ON (true)) v ON (v.x = t3.bool_col)
WHERE t3.bool_col = true;

The alltypestiny table has 8 records, 4 records of them has the true
boolean value for bool_col. Therefore, the above query is a fancy
way of saying "8 * 8 * 4", i.e. it should return 256.

The solution is that scanners initialize tuple pointers to a non-null
value when there are no materialized slots. This non-null value is
provided by the new static member Tuple::POISON.

I extended QueryTest/scanners.test with the above query. This test
executes the query against all table formats.

This change has the biggest performance impact on count(*) queries on
large kudu tables. For my quick benchmark I copied tpch_kudu.lineitem
and doubled its data. The resulting table has 12,002,430 rows.

Without this patch 'select count(*) from biglineitem' runs for ~0.12s.
With the patch applied, the overhead is around a dozens of ms. I measured
the query on my desktop PC using a relase build of Impala. On debug builds,
the execution time of the patched version is around 160% of the original
version.

Without this patch:
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| Operator     | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail              |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| 03:AGGREGATE | 1      | 127.50us | 127.50us | 1      | 1          | 28.00 KB  | 10.00 MB      | FINALIZE            |
| 02:EXCHANGE  | 1      | 22.32ms  | 22.32ms  | 3      | 1          | 0 B       | 0 B           | UNPARTITIONED       |
| 01:AGGREGATE | 3      | 1.78ms   | 1.89ms   | 3      | 1          | 16.00 KB  | 10.00 MB      |                     |
| 00:SCAN KUDU | 3      | 8.00ms   | 8.28ms   | 12.00M | -1         | 512.00 KB | 0 B           | default.biglineitem |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+

With this patch:
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| Operator     | #Hosts | Avg Time | Max Time | #Rows  | Est. #Rows | Peak Mem  | Est. Peak Mem | Detail              |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+
| 03:AGGREGATE | 1      | 129.01us | 129.01us | 1      | 1          | 28.00 KB  | 10.00 MB      | FINALIZE            |
| 02:EXCHANGE  | 1      | 33.00ms  | 33.00ms  | 3      | 1          | 0 B       | 0 B           | UNPARTITIONED       |
| 01:AGGREGATE | 3      | 1.99ms   | 2.13ms   | 3      | 1          | 16.00 KB  | 10.00 MB      |                     |
| 00:SCAN KUDU | 3      | 13.13ms  | 13.97ms  | 12.00M | -1         | 512.00 KB | 0 B           | default.biglineitem |
+--------------+--------+----------+----------+--------+------------+-----------+---------------+---------------------+

Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Reviewed-on: http://gerrit.cloudera.org:8080/9239
Reviewed-by: Tim Armstrong <ta...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M be/src/exec/hdfs-scanner.cc
M be/src/exec/kudu-scanner.cc
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M testdata/workloads/functional-query/queries/QueryTest/scanners.test
5 files changed, 27 insertions(+), 2 deletions(-)

Approvals:
  Tim Armstrong: Looks good to me, approved
  Impala Public Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change. ( http://gerrit.cloudera.org:8080/9239 )

Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9239/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9239/2//COMMIT_MSG@57
PS2, Line 57: | 03:AGGREGATE | 1      | 129.01us | 129.01us | 1      | 1          | 28.00 KB  | 10.00 MB      | FINALIZE            |
> Yes, I measured it against a debug build...
With the new numbers, I think that this seems okay.

Is there a JIRA for the count(*) optimization for Kudu? (and if not, could you file one?)



-- 
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 12 Feb 2018 17:25:27 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/9239

to look at the new patch set (#2).

Change subject: IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows
......................................................................

IMPALA-6258: Uninitialized tuple pointers in row batch for empty rows

Tuple pointers in the generated row batches may not be initialized
if a tuple has byte size 0. There are some codes which compare these
uninitialized pointers against nullptr so having them uninitialized
may return wrong (and non-deterministic) results, e.g.:
impala::TupleIsNullPredicate::GetBooleanVal()

The following query produces non-deterministic results currently:

SELECT count(v.x) FROM functional.alltypestiny t3 LEFT OUTER JOIN (
  SELECT true AS x FROM functional.alltypestiny t1 LEFT OUTER JOIN
  functional.alltypestiny t2 ON (true)) v ON (v.x = t3.bool_col)
WHERE t3.bool_col = true;

The alltypestiny table has 8 records, 4 records of them has the true
boolean value for bool_col. Therefore, the above query is a fancy
way of saying "8 * 8 * 4", i.e. it should return 256.

The solution is that scanners initialize tuple pointers to a non-null
value when there are no materialized slots. This non-null value is
provided by the new static method called Tuple::Poison().

I extended QueryTest/scanners.test with the above query. This test
executes the query against all table formats.

This change has the biggest performance impact on count(*) queries on
large kudu tables. For my quick benchmark I chose tpch_kudu.lineitem
which has 6001215 rows.

Without this patch 'select count(*) from tpch_kudu.lineitem' runs
around 0.15s. With the patch applied, it runs around 0.24s. I ran the
query on my desktop PC.

Without this patch:
ExecSummary:
Operator       #Hosts   Avg Time   Max Time  #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail
-------------------------------------------------------------------------------------------------------------
03:AGGREGATE        1  432.840us  432.840us      1           1   28.00 KB       10.00 MB  FINALIZE
02:EXCHANGE         1   31.375ms   31.375ms      3           1          0              0  UNPARTITIONED
01:AGGREGATE        3    5.013ms    5.502ms      3           1   16.00 KB       10.00 MB
00:SCAN KUDU        3   16.581ms   20.199ms  6.00M       6.00M  512.00 KB              0  tpch_kudu.lineitem

With this patch:
ExecSummary:
Operator       #Hosts   Avg Time   Max Time  #Rows  Est. #Rows   Peak Mem  Est. Peak Mem  Detail
-------------------------------------------------------------------------------------------------------------
03:AGGREGATE        1  472.898us  472.898us      1           1   28.00 KB       10.00 MB  FINALIZE
02:EXCHANGE         1  122.700ms  122.700ms      3           1          0              0  UNPARTITIONED
01:AGGREGATE        3    6.104ms    6.151ms      3           1   16.00 KB       10.00 MB
00:SCAN KUDU        3   90.856ms  107.409ms  6.00M       6.00M  512.00 KB              0  tpch_kudu.lineitem

Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
---
M be/src/exec/hdfs-scanner.cc
M be/src/exec/kudu-scanner.cc
M be/src/runtime/tuple.cc
M be/src/runtime/tuple.h
M testdata/workloads/functional-query/queries/QueryTest/scanners.test
5 files changed, 30 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/9239/2
-- 
To view, visit http://gerrit.cloudera.org:8080/9239
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I298122aaaa7e62eb5971508e0698e189519755de
Gerrit-Change-Number: 9239
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>