You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Norbert Luksa (Code Review)" <ge...@cloudera.org> on 2020/01/17 17:16:31 UTC

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Norbert Luksa has uploaded this change for review. ( http://gerrit.cloudera.org:8080/15051


Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................

WIP: IMPALA-9226: Improve string allocations of the ORC scanner

Currently the OrcColumnReader copies values from the
orc::StringVectorBatch one-by-one. Since ORC 1.6, the blob which
contains the pointed values is moved to the StringVectorBatch,
so we can copy it.

This commit beside the above improvement also enables the
LazyEncoding option for the ORC reader. This way, for stripes
with DICTIONARY_ENCODING[_V2], EncodedStringVectorBatch contains
the data in a dictionaryBlob from which the data can be aquired
with the given indices and lengths.

Tests:
 * TODO: performance tests.

Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
---
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
3 files changed, 118 insertions(+), 49 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/51/15051/1
-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 1
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 12: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 12
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 26 Feb 2020 15:28:26 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 10:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/5286/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 10
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 05 Feb 2020 10:06:14 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Norbert Luksa (Code Review)" <ge...@cloudera.org>.
Norbert Luksa has uploaded a new patch set (#4). ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................

IMPALA-9226: Improve string allocations of the ORC scanner

Currently the OrcColumnReader copies values from the
orc::StringVectorBatch one-by-one. Since ORC 1.6, the blob which
contains the pointed values is moved to the StringVectorBatch,
so we can copy it.

This commit beside the above improvement also enables the
LazyEncoding option for the ORC reader. This way, for stripes
with DICTIONARY_ENCODING[_V2], EncodedStringVectorBatch contains
the data in a dictionaryBlob from which the data can be acquired
with the given indices and lengths.

Tests:
 * Run ORC scanner tests (query_tests/test_scanners.py::TestOrc)
 * Tested performance on tpch.lineitem table with scale=25,
   running queries that selects min of string columns.
   Some results:
   col_name     | encoding | before | after | speedup
   =============================================================
   l_comment      DIRECT     97s      30s     3.23
   l_shipinstruct DICTIONARY 86s      18s     4.77
   l_commitdate   DICTIONARY 85s      20s     4.25

   The queries were run on my machine with MT_DOP and NUM_NODES
   set to 1.

Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
---
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
3 files changed, 56 insertions(+), 14 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/51/15051/4
-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 4
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Norbert Luksa (Code Review)" <ge...@cloudera.org>.
Norbert Luksa has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15051/7/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/15051/7/be/src/exec/orc-column-readers.h@212
PS7, Line 212: static_cast<orc::EncodedStringVectorBatch*>(batch_) ==
             :         dynamic_cast<orc::EncodedStringVectorBatch*>(orc_batch)
> it will be true even if orc_batch is just a StringVectorBatch, because it w
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 8
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 03 Feb 2020 10:07:49 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5512/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 4
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 24 Jan 2020 10:56:02 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Norbert Luksa (Code Review)" <ge...@cloudera.org>.
Norbert Luksa has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 14:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/15051/12/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/15051/12/be/src/exec/orc-column-readers.cc@180
PS12, Line 180: >(
> I think that >= offsets.size() - 1 is needed, because of the offsets[index 
Done


http://gerrit.cloudera.org:8080/#/c/15051/12/be/src/exec/orc-column-readers.cc@184
PS12, Line 184:     src_ptr = blob_ + offsets[index];
              :     src_len = offsets[index + 1] - offsets[index];
> I think that we cannot trust completely in the length values at the moment 
Thanks Csaba for investigating, uploaded a PS for the above check, will open ORC jiras for both of your comments.



-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 14
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 27 Feb 2020 11:02:05 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 14:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5357/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 14
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 27 Feb 2020 11:48:49 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Norbert Luksa (Code Review)" <ge...@cloudera.org>.
Norbert Luksa has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 6:

(5 comments)

Thank you all for taking a look, addressed your comments.

http://gerrit.cloudera.org:8080/#/c/15051/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15051/3//COMMIT_MSG@31
PS3, Line 31:   39.06s  
> nit: probably 'a desktop PC', or 'a desktop PC with CPU X'
Done


http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/hdfs-orc-scanner.cc@613
PS2, Line 613: 
> why did you restore the original narrow try-block?
I do not call the ORC library anymore in the UpdateInputBatch, therefore the change here is not necessary.


http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/hdfs-orc-scanner.cc@625
PS3, Line 625:   orc_root_batch_ = row_reader_->createRowBatch(row_batch->capacity());
> nit: this extra new line is not needed
Done


http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/hdfs-orc-scanner.cc@683
PS3, Line 683: 
> As we have discussed offline, this is no longer correct. Using the current 
Thank you, Csaba for the detailed comment.
Added mempools for the dictionary and the non-dictionary cases, looks like it works now.


http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/orc-column-readers.h@198
PS3, Line 198:   HdfsOrcScanne
> Looking at the ORC library, it should be possible, since it passes the dict
Added mempools for the scanner, now we can reuse the blobs.



-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 6
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 30 Jan 2020 12:04:37 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 9: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 9
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 03 Feb 2020 11:33:44 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 1:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/5458/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 1
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 17 Jan 2020 17:47:28 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 15: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 15
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 28 Feb 2020 14:28:14 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/15051/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15051/3//COMMIT_MSG@31
PS3, Line 31: my machine
nit: probably 'a desktop PC', or 'a desktop PC with CPU X'


http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/hdfs-orc-scanner.cc@613
PS2, Line 613: orc_r
why did you restore the original narrow try-block?


http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/orc-column-readers.h@198
PS3, Line 198: blob_ = nullptr
So we cannot reuse blobs?

Did you measure different scan times with/without this optimization? (the commit message still contains the measurements from PS2 which had this optimization)



-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 3
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 22 Jan 2020 14:06:09 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 8:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5586/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 8
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 03 Feb 2020 10:53:47 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 11: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 11
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 25 Feb 2020 14:12:29 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 8: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 8
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 03 Feb 2020 10:13:32 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 2:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.h@207
PS2, Line 207: /*&& !orc_batch->isEncoded*/
nit: pls remove commented code


http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.cc@150
PS2, Line 150:   if (blob_ == nullptr) {
nit: It is probably just personal preference but I find it more readable to return early in edge cases. This saves some indentation later on. In this case if (blob_ != nullptr) return Status::OK()


http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.cc@151
PS2, Line 151: TODO
Is this TODO going to be taken care within this patch or do you expect this to be handled somewhere in the future unrelated to this change? If the second one then opening a Jira and adding the Jira ID to the comment would make sense. (I see that the TODO was added after a review comment requested it, though).


http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.cc@167
PS2, Line 167:  
nit : too many spaces here and below


http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.cc@176
PS2, Line 176: InitBlob
I'm wondering if calling InitBlob() could move into UpdateInputBatch() from here and from L186. This way you don't have to do the "if (blob_ == nullptr)" check in every ReadValue() through the InitBlob() functions just simply you can DCHECK that it's not null.



-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 2
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 20 Jan 2020 14:19:42 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 11: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/5397/


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 11
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 25 Feb 2020 18:39:04 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Norbert Luksa (Code Review)" <ge...@cloudera.org>.
Norbert Luksa has uploaded a new patch set (#3). ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................

IMPALA-9226: Improve string allocations of the ORC scanner

Currently the OrcColumnReader copies values from the
orc::StringVectorBatch one-by-one. Since ORC 1.6, the blob which
contains the pointed values is moved to the StringVectorBatch,
so we can copy it.

This commit beside the above improvement also enables the
LazyEncoding option for the ORC reader. This way, for stripes
with DICTIONARY_ENCODING[_V2], EncodedStringVectorBatch contains
the data in a dictionaryBlob from which the data can be acquired
with the given indices and lengths.

Tests:
 * Run ORC scanner tests (query_tests/test_scanners.py::TestOrc)
 * Tested performance on tpch.lineitem table with scale=25,
   running queries that selects min of string columns.
   Some results:
   col_name     | encoding | before | after | speedup
   =============================================================
   l_comment      DIRECT     97s      30s     3.23
   l_shipinstruct DICTIONARY 86s      18s     4.77
   l_commitdate   DICTIONARY 85s      20s     4.25

   The queries were run on my machine with MT_DOP and NUM_NODES
   set to 1.

Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
---
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
3 files changed, 57 insertions(+), 14 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/51/15051/3
-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 3
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 15:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/5428/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 15
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 28 Feb 2020 09:33:28 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Norbert Luksa (Code Review)" <ge...@cloudera.org>.
Norbert Luksa has uploaded a new patch set (#2). ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................

IMPALA-9226: Improve string allocations of the ORC scanner

Currently the OrcColumnReader copies values from the
orc::StringVectorBatch one-by-one. Since ORC 1.6, the blob which
contains the pointed values is moved to the StringVectorBatch,
so we can copy it.

This commit beside the above improvement also enables the
LazyEncoding option for the ORC reader. This way, for stripes
with DICTIONARY_ENCODING[_V2], EncodedStringVectorBatch contains
the data in a dictionaryBlob from which the data can be aquired
with the given indices and lengths.

Tests:
 * Run ORC scanner tests (query_tests/test_scanners.py::TestOrc)
 * Tested performance on tpch.lineitem table with scale=25,
   running queries that selects min of string columns.
   Some results:
   col_name      | encoding | before      | after
   =====================================================
   l_comment       DIRECT     14.05s (97s)  4.11s (30s)
   l_shipinstruct  DICTIONARY 12.36s (86s)  2.16s (18s)
   l_commitdate    DICTIONARY 12.56s (85s)  2.54s (20s)

   The queries are run on my machine with 8 core, while
   in the brackets the durations are when MT_DOP and NUM_NODES
   are set to 1.

Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
---
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
3 files changed, 95 insertions(+), 44 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/51/15051/2
-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 2
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Norbert Luksa (Code Review)" <ge...@cloudera.org>.
Norbert Luksa has uploaded a new patch set (#8). ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................

IMPALA-9226: Improve string allocations of the ORC scanner

Currently the OrcColumnReader copies values from the
orc::StringVectorBatch one-by-one. Since ORC 1.6, the blob which
contains the pointed values is moved to the StringVectorBatch,
so we can copy it.

This commit beside the above improvement also enables the
LazyEncoding option for the ORC reader. This way, for stripes
with DICTIONARY_ENCODING[_V2], EncodedStringVectorBatch contains
the data in a dictionaryBlob from which the data can be acquired
with the given indices and lengths.

Tests:
 * Run ORC scanner tests (query_tests/test_scanners.py::TestOrc)
   and tpch query tests.
 * Tested performance on tpch.lineitem table with scale=25,
   running queries that selects min of string columns.
   Some results:
   col_name     | encoding | before | after | speedup
   =============================================================
   l_comment      DIRECT     16.42s   14.38s  14%
   l_shipinstruct DICTIONARY 5.26s    3.80s   32%
   l_commitdate   DICTIONARY 5.46s    5.19s   5%
   all string col BOTH       39.06s   32.18s  21%

   The queries were run on a desktop PC with MT_DOP and NUM_NODES
   set to 1.
 * Also run TPC-H queries on the TPC-H benchmark where some
   queries' runtime improved by around 10-15%, while there were
   no regression for the others.

Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
---
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
4 files changed, 135 insertions(+), 42 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/51/15051/8
-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 8
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................

IMPALA-9226: Improve string allocations of the ORC scanner

Currently the OrcColumnReader copies values from the
orc::StringVectorBatch one-by-one. Since ORC 1.6, the blob which
contains the pointed values is moved to the StringVectorBatch,
so we can copy it.

This commit beside the above improvement also enables the
LazyEncoding option for the ORC reader. This way, for stripes
with DICTIONARY_ENCODING[_V2], EncodedStringVectorBatch contains
the data in a dictionaryBlob from which the data can be acquired
with the given indices and lengths.

Tests:
 * Run ORC scanner tests (query_tests/test_scanners.py::TestOrc)
   and tpch query tests.
 * Tested performance on tpch.lineitem table with scale=25,
   running queries that selects min of string columns.
   Some results:
   col_name     | encoding | before | after | speedup
   =============================================================
   l_comment      DIRECT     16.42s   14.38s  14%
   l_shipinstruct DICTIONARY 5.26s    3.80s   32%
   l_commitdate   DICTIONARY 5.46s    5.19s   5%
   all string col BOTH       39.06s   32.18s  21%

   The queries were run on a desktop PC with MT_DOP and NUM_NODES
   set to 1.
 * Also run TPC-H queries on the TPC-H benchmark where some
   queries' runtime improved by around 10-15%, while there were
   no regression for the others.

Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Reviewed-on: http://gerrit.cloudera.org:8080/15051
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
4 files changed, 135 insertions(+), 42 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 16
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 2:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/5459/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 2
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 17 Jan 2020 17:47:03 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 12:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/5409/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 12
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 26 Feb 2020 15:28:26 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Norbert Luksa (Code Review)" <ge...@cloudera.org>.
Norbert Luksa has uploaded a new patch set (#14). ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................

IMPALA-9226: Improve string allocations of the ORC scanner

Currently the OrcColumnReader copies values from the
orc::StringVectorBatch one-by-one. Since ORC 1.6, the blob which
contains the pointed values is moved to the StringVectorBatch,
so we can copy it.

This commit beside the above improvement also enables the
LazyEncoding option for the ORC reader. This way, for stripes
with DICTIONARY_ENCODING[_V2], EncodedStringVectorBatch contains
the data in a dictionaryBlob from which the data can be acquired
with the given indices and lengths.

Tests:
 * Run ORC scanner tests (query_tests/test_scanners.py::TestOrc)
   and tpch query tests.
 * Tested performance on tpch.lineitem table with scale=25,
   running queries that selects min of string columns.
   Some results:
   col_name     | encoding | before | after | speedup
   =============================================================
   l_comment      DIRECT     16.42s   14.38s  14%
   l_shipinstruct DICTIONARY 5.26s    3.80s   32%
   l_commitdate   DICTIONARY 5.46s    5.19s   5%
   all string col BOTH       39.06s   32.18s  21%

   The queries were run on a desktop PC with MT_DOP and NUM_NODES
   set to 1.
 * Also run TPC-H queries on the TPC-H benchmark where some
   queries' runtime improved by around 10-15%, while there were
   no regression for the others.

Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
---
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
4 files changed, 135 insertions(+), 42 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/51/15051/14
-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 14
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15051/1/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/15051/1/be/src/exec/orc-column-readers.cc@184
PS1, Line 184:         static_cast<uint64_t>(index) >= currentBatch->dictionary->dictionaryOffset.size()) {
line too long (92 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 1
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 17 Jan 2020 17:17:10 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 7:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5550/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 7
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 30 Jan 2020 18:01:57 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 12: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/5409/


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 12
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 26 Feb 2020 20:17:10 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 15: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 15
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 28 Feb 2020 09:33:27 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 3:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/5483/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 3
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 21 Jan 2020 15:34:25 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Norbert Luksa (Code Review)" <ge...@cloudera.org>.
Norbert Luksa has uploaded a new patch set (#7). ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................

IMPALA-9226: Improve string allocations of the ORC scanner

Currently the OrcColumnReader copies values from the
orc::StringVectorBatch one-by-one. Since ORC 1.6, the blob which
contains the pointed values is moved to the StringVectorBatch,
so we can copy it.

This commit beside the above improvement also enables the
LazyEncoding option for the ORC reader. This way, for stripes
with DICTIONARY_ENCODING[_V2], EncodedStringVectorBatch contains
the data in a dictionaryBlob from which the data can be acquired
with the given indices and lengths.

Tests:
 * Run ORC scanner tests (query_tests/test_scanners.py::TestOrc)
   and tpch query tests.
 * Tested performance on tpch.lineitem table with scale=25,
   running queries that selects min of string columns.
   Some results:
   col_name     | encoding | before | after | speedup
   =============================================================
   l_comment      DIRECT     16.42s   14.38s  14%
   l_shipinstruct DICTIONARY 5.26s    3.80s   32%
   l_commitdate   DICTIONARY 5.46s    5.19s   5%
   all string col BOTH       39.06s   32.18s  21%

   The queries were run on a desktop PC with MT_DOP and NUM_NODES
   set to 1.
 * Also run TPC-H queries on the TPC-H benchmark where some
   queries' runtime improved by around 10-15%, while there were
   no regression for the others.

Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
---
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
4 files changed, 135 insertions(+), 42 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/51/15051/7
-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 7
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 2:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/15051/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15051/2//COMMIT_MSG@17
PS2, Line 17: aquired
nit: acquired


http://gerrit.cloudera.org:8080/#/c/15051/2//COMMIT_MSG@25
PS2, Line 25:    col_name      | encoding | before      | after
nit: you could add a 'speedup' column


http://gerrit.cloudera.org:8080/#/c/15051/2//COMMIT_MSG@31
PS2, Line 31: 8 core
Since you measured with MT_DOP and NUM_NUDES being 1, having 8 core is not significant here.


http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.cc@167
PS2, Line 167:  
nit: extra space


http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.cc@175
PS2, Line 175: =
nit: put the '=' sign to the previous line



-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 2
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 20 Jan 2020 14:04:50 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 10: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/5286/


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 10
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 05 Feb 2020 15:07:23 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15051/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15051/6//COMMIT_MSG@35
PS6, Line 35: Also run TPC-H queries on the TPC-H benchmark where some
            :    queries' runtime improved by around 10-15%.
You could also state explicitly whether other queries have regressed or not.



-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 6
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 30 Jan 2020 15:44:10 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15051/7/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/15051/7/be/src/exec/orc-column-readers.h@212
PS7, Line 212: b_ = nullptr;
             : 
it will be true even if orc_batch is just a StringVectorBatch, because it will be nullptr == nullptr



-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 3
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 31 Jan 2020 13:18:04 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Norbert Luksa (Code Review)" <ge...@cloudera.org>.
Norbert Luksa has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/hdfs-orc-scanner.cc@683
PS3, Line 683:  dst_batch->tuple_data_pool()
> Thank you, Csaba for the detailed comment.
As I see (and as you also suggested offline), we do not need to pass the mempool here, right?



-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 3
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 30 Jan 2020 12:31:51 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Norbert Luksa (Code Review)" <ge...@cloudera.org>.
Norbert Luksa has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 4:

Rebased.


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 4
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 24 Jan 2020 10:10:39 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 9:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/5489/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 9
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 03 Feb 2020 11:33:45 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Norbert Luksa (Code Review)" <ge...@cloudera.org>.
Norbert Luksa has uploaded a new patch set (#6). ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................

IMPALA-9226: Improve string allocations of the ORC scanner

Currently the OrcColumnReader copies values from the
orc::StringVectorBatch one-by-one. Since ORC 1.6, the blob which
contains the pointed values is moved to the StringVectorBatch,
so we can copy it.

This commit beside the above improvement also enables the
LazyEncoding option for the ORC reader. This way, for stripes
with DICTIONARY_ENCODING[_V2], EncodedStringVectorBatch contains
the data in a dictionaryBlob from which the data can be acquired
with the given indices and lengths.

Tests:
 * Run ORC scanner tests (query_tests/test_scanners.py::TestOrc)
   and tpch query tests.
 * Tested performance on tpch.lineitem table with scale=25,
   running queries that selects min of string columns.
   Some results:
   col_name     | encoding | before | after | speedup
   =============================================================
   l_comment      DIRECT     16.42s   14.38s  14%
   l_shipinstruct DICTIONARY 5.26s    3.80s   32%
   l_commitdate   DICTIONARY 5.46s    5.19s   5%
   all string col BOTH       39.06s   32.18s  21%

   The queries were run on a desktop PC with MT_DOP and NUM_NODES
   set to 1.
 * Also run TPC-H queries on the TPC-H benchmark where some
   queries' runtime improved by around 10-15%.

Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
---
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
4 files changed, 137 insertions(+), 42 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/51/15051/6
-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 6
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 12:

(2 comments)

I looked around for things that could go wrong here and in the Orc lib.

http://gerrit.cloudera.org:8080/#/c/15051/12/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/15051/12/be/src/exec/orc-column-readers.cc@180
PS12, Line 180: >=
I think that >= offsets.size() - 1 is needed, because of the offsets[index + 1] in line 185.

The same too permissive error check also exists in the Orc lib: https://github.com/apache/orc/blob/bd4825568dca4ce06f8d3428f5e9ced2f53bb6f2/c%2B%2B/include/orc/Vector.hh#L137

dictionaryOffset is initialized to have place for one extra element at the end: https://github.com/apache/orc/blob/f349dc65af43911f8b839d25bacbd39bcc3dddd9/c%2B%2B/src/ColumnReader.cc#L582 , so normally we shouldn't hit this issue.


BTW, this error condition could be UNLIKELY


http://gerrit.cloudera.org:8080/#/c/15051/12/be/src/exec/orc-column-readers.cc@184
PS12, Line 184:     src_ptr = blob_ + offsets[index];
              :     src_len = offsets[index + 1] - offsets[index];
I think that we cannot trust completely in the length values at the moment due to a possible overflow in the Orc lib:
https://github.com/apache/orc/blob/f349dc65af43911f8b839d25bacbd39bcc3dddd9/c%2B%2B/src/ColumnReader.cc#L590

The issue there is that corrupt huge length values can cause some elements to overflow, but at the end of the vector it can return to a "sane" range, and only the last element is used in later code, so the issue can pass uncaught in the Orc lib.



-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 12
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 26 Feb 2020 23:02:00 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 14: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 14
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 27 Feb 2020 13:20:31 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Norbert Luksa (Code Review)" <ge...@cloudera.org>.
Norbert Luksa has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/15051/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15051/3//COMMIT_MSG@24
PS3, Line 24: Some results:
> As discussed offline: These measurement were run with a debug build. Let' s
Right, unfortunately when I run the queries in release build, the results are not so bright. There is still a little speedup, but it's far away from the speedup in the debug build.
Will upload the exact numbers when I tested it with more runs.


http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/orc-column-readers.h@198
PS3, Line 198: blob_ = nullptr
> So we cannot reuse blobs?
Looking at the ORC library, it should be possible, since it passes the dictionary along the way.
However, I did not find a simple solution for the same in Impala. After some digging, I found that the memory where the blob is currently allocated is owned by the row batch, and we cannot use this memory with the next row batch.
I'll try to check if I keep track of the row batch and reuse the blob between them, will come back to you with the results.



-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 3
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 24 Jan 2020 15:51:05 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 11:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/5397/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 11
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 25 Feb 2020 14:12:30 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/hdfs-orc-scanner.cc@683
PS3, Line 683:  dst_batch->tuple_data_pool()
As we have discussed offline, this is no longer correct. Using the current batch's mempool means that the memory can be reclaimed by the consumer of this row batch (the caller of this node's GetNext()) after the current GetNext() call returned. So it cannot be used in future batches.

Before your optimization, only those string were allocated here, which were used in the current row batch, so there was no problem. Now if batch_one starts the read rows from the Orc vector, and batch_two finishes it, the blob will belong to batch_one, so it can be already freed/reused when batch_two starts to use it.

The solution is to give the scanner its own mempool, use that to initialize the blobs, and attach the blobs to the last row batch that reads rows from the given vector. 

For Parquet this is mempool is https://github.com/apache/impala/blob/4fa6b5260d9e28dee63a87de3bea1434706a9d05/be/src/exec/parquet/parquet-column-chunk-reader.h#L140

Parquet dictionaries use a different pool: https://github.com/apache/impala/blob/4fa6b5260d9e28dee63a87de3bea1434706a9d05/be/src/exec/parquet/hdfs-parquet-scanner.h#L430

Doing this in Orc will be much simpler, as the library hides compression/decoding of data, but the way buffers are attached to row batches should be the same.



-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 3
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 28 Jan 2020 16:24:07 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Norbert Luksa (Code Review)" <ge...@cloudera.org>.
Norbert Luksa has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 2:

As Zoltán suggested, I tried to add an optimisation, where I would not have to copy the blob in each batch, but only for stripes. I have not managed to figure it out, looks like the memory is freed between the batches. I left this part of the code in the commit in a comment for now, will try to solve it next week.


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 2
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 17 Jan 2020 17:21:31 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Gabor Kaszab (Code Review)" <ge...@cloudera.org>.
Gabor Kaszab has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/15051/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15051/3//COMMIT_MSG@24
PS3, Line 24: Some results:
As discussed offline: These measurement were run with a debug build. Let' see the same with release builds as that is the one we would like to optimize.
I also experienced a big difference in performance between debug and perf.


http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/hdfs-orc-scanner.cc@625
PS3, Line 625: 
nit: this extra new line is not needed



-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 3
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 24 Jan 2020 14:50:25 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Norbert Luksa (Code Review)" <ge...@cloudera.org>.
Norbert Luksa has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 7:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/15051/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15051/6//COMMIT_MSG@35
PS6, Line 35: Also run TPC-H queries on the TPC-H benchmark where some
            :    queries' runtime improved by around 10-15%,
> You could also state explicitly whether other queries have regressed or not
Done


http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/hdfs-orc-scanner.cc@683
PS3, Line 683: 
> It is still used for CollectionValueBuilder.
Oh, right.


http://gerrit.cloudera.org:8080/#/c/15051/6/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/15051/6/be/src/exec/orc-column-readers.h@202
PS6, Line 202:     batch_ = static_cast<orc::StringVectorBatch*>(orc_batch);
             :     if (orc_batch == nullptr) return Status::OK();
> Is it ok that batch_ is not set to null if orc_batch is null?
Swapped the two lines.


http://gerrit.cloudera.org:8080/#/c/15051/6/be/src/exec/orc-column-readers.h@210
PS6, Line 210: &batch_-
> Does Orc support mixed encoding, e.g. a stripe that has both dictionary and
Yes, according to the ORC documentation it cannot change through stripes.
I'm not sure if it's worth to add an extra variable just to make sure that this behaviour is preserved by the ORC lib. There are DCHECKs which should also fail in that case (dynamic casting to the Vector batch is based on the encoding).
I think a comment to clarify this behaviour would be enough, but please raise your concern if you think otherwise.


http://gerrit.cloudera.org:8080/#/c/15051/6/be/src/exec/orc-column-readers.h@211
PS6, Line 211: 
> "RETURN_IF_ERROR" + "return Status::OK()" could be simplified to "return".
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 7
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 30 Jan 2020 17:17:10 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 6:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/15051/3/be/src/exec/hdfs-orc-scanner.cc@683
PS3, Line 683: 
> As I see (and as you also suggested offline), we do not need to pass the me
It is still used for CollectionValueBuilder.


http://gerrit.cloudera.org:8080/#/c/15051/6/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/15051/6/be/src/exec/orc-column-readers.h@202
PS6, Line 202:     if (orc_batch == nullptr) return Status::OK();
             :     batch_ = static_cast<orc::StringVectorBatch*>(orc_batch);
Is it ok that batch_ is not set to null if orc_batch is null?


http://gerrit.cloudera.org:8080/#/c/15051/6/be/src/exec/orc-column-readers.h@210
PS6, Line 210: InitBlob
Does Orc support mixed encoding, e.g. a stripe that has both dictionary and direct encoded encoded chunks? In that case if there were already dictionary encoded vectors, then we "forget" the dictionary blob here. If it is possible that there will be more dictionary encoded chunks, then blob_ will still point to a direct encoded blob instead of the dictionary.

According to https://orc.apache.org/specification/ORCv2/ the same encoding will be used for a column in the whole stripe, so the issue above is not possible, but the code suggests that encoding can vary from ColumnVectorBatch to ColumnVectorBatch.

It would be clearer to add a state to OrcStringColumnReader that stores the encoding, and treat it as an error if it changes within a stripe. In Parquet encoding can vary from page to page, it generally starts with dictionary encoded pages and falls back to "plain" when the dictionary gets too large, so readers (like me) may assume that Orc works similarly.


http://gerrit.cloudera.org:8080/#/c/15051/6/be/src/exec/orc-column-readers.h@211
PS6, Line 211: return Status::OK();
"RETURN_IF_ERROR" + "return Status::OK()" could be simplified to "return".



-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 6
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 30 Jan 2020 15:20:45 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 9: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/5489/


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 9
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 03 Feb 2020 15:50:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Daniel Becker (Code Review)" <ge...@cloudera.org>.
Daniel Becker has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.h@207
PS2, Line 207: /*&& !orc_batch->isEncoded*/
Do we need this condition?


http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.h@219
PS2, Line 219: char*
Don't we usually use uint8_t* to refer to raw bytes? Or is it specifically for strings?


http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/15051/2/be/src/exec/orc-column-readers.cc@180
PS2, Line 180: index < 0
Index is unsigned so it is never less than zero.

Why do we cast currentBatch->index[row_idx] to uint64_t? Can't it remain int64_t?



-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 2
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 20 Jan 2020 12:20:10 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 6:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5546/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 6
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 30 Jan 2020 12:49:24 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9226: Improve string allocations of the ORC scanner

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15051 )

Change subject: IMPALA-9226: Improve string allocations of the ORC scanner
......................................................................


Patch Set 10: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/15051
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If2d975946fb6f4104d8dc98895285b3a0c6bef7f
Gerrit-Change-Number: 15051
Gerrit-PatchSet: 10
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Daniel Becker <da...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 05 Feb 2020 10:06:13 +0000
Gerrit-HasComments: No