You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Amogh Margoor (Code Review)" <ge...@cloudera.org> on 2021/09/21 00:00:14 UTC

[Impala-ASF-CR] [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Amogh Margoor has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17860


Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................

[WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Currently, entire row is materialized, before filtering upon it during
scan. Instead, cost can be saved if only the columns required for
filtering are materialized first and then rest of the columns are
materialized only for rows surviving after filter.

Performance: TBD

Testing: TBD

Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/hdfs-columnar-scanner-ir.cc
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
M be/src/exec/scratch-tuple-batch.h
12 files changed, 344 insertions(+), 58 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17860/1
-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 1
Gerrit-Owner: Amogh Margoor <am...@gmail.com>

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@253
PS3, Line 253:     vector<ParquetColumnReader*>* non_filter_readers) const {
             :   vector<ScalarExpr*> conjuncts;
             :   conjuncts.reserve(scan_node_->conjuncts().size() +
> Great!  May add a test for the following query. Since the filter is on a.wr
sure I have added a test for it. Thanks.



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 8
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 19 Oct 2021 18:48:36 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 12:

Wrote the following code for randomly generating selected rows. Have not get a chance to test it out  yet (some runtime issue with by dev box). 

// This tests checks conversion of 'selected_rows' with randomly generated                    
// 'true' values to 'ScratchMicroBatch';                                                      
TEST_F(ScratchTupleBatchTest, TestRandomGeneratedMicroBatches) {                              
  const int BATCH_SIZE = 1024;                                                                
  scoped_ptr<ScratchTupleBatch> scratch_batch(                                                
      new ScratchTupleBatch(*desc_, BATCH_SIZE, &tracker_));                                  
  scratch_batch->num_tuples = BATCH_SIZE;                                                     
  // gaps to try                                                                              
  vector<int> gaps = {5, 16, 29, 37, 1025};                                                   
  for (auto n : gaps) {                                                                       
    // Set randomly locations as selected.                                                    
    srand (time(NULL));                                                                       
    for (int batch_idx = 0; batch_idx < BATCH_SIZE; ++batch_idx) {                            
      scratch_batch->selected_rows[batch_idx] = rand() < (RAND_MAX / 2);                      
    }                                                                                         
    ScratchMicroBatch micro_batches[BATCH_SIZE];                                              
    int batches = scratch_batch->GetMicroBatches(n, micro_batches);                           
    EXPECT_TRUE(batches > 1);                                                                 
    EXPECT_TRUE(batches <= BATCH_SIZE);                                                       
    // Verify every batch                                                                     
    for (int i = 0; i < batches; i++) {                                                       
      const ScratchMicroBatch& batch = micro_batches[i];                                      
      EXPECT_TRUE(batch.start <= batch.end);                                                  
      EXPECT_TRUE(batch.length == batch.end - batch.start + 1);                               
      EXPECT_TRUE(batch.start);                                                               
      EXPECT_TRUE(batch.end);                                                                 
      int last_true_idx = batch.start;                                                        
      for (int j = batch.start + 1; j < batch.end; j++) {                                     
        if (scratch_batch->selected_rows[j]) {                                                
          EXPECT_TRUE(j - last_true_idx + 1 <= n);                                            
          last_true_idx = j;                                                                  
        }                                                                                     
      }                                                                                       
    }                                                                                         
    // Verify any two consecutive batches i and i+1                                           
    for (int i = 0; i < batches - 1; i++) {                                                   
      const ScratchMicroBatch& batch = micro_batches[i];                                      
      const ScratchMicroBatch& nbatch = micro_batches[i + 1];                                 
      EXPECT_TRUE(batch.end < nbatch.start);                                                  
      EXPECT_TRUE(nbatch.start - batch.end >= n);                                             
      // Any row in betweeen the two batches should not be selected                           
      for (int j=batch.end+1; j<nbatch.start; j++) {                                          
        EXPECT_FALSE(scratch_batch->selected_rows[j]);                                        
      }                                                                                       
    }                                                                                         
  }                                                                                           
}


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 12
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Oct 2021 14:34:11 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 12:

(5 comments)

Looks great. We may need to beef up the test section a bit.

http://gerrit.cloudera.org:8080/#/c/17860/12/be/src/exec/scratch-tuple-batch-test.cc
File be/src/exec/scratch-tuple-batch-test.cc:

http://gerrit.cloudera.org:8080/#/c/17860/12/be/src/exec/scratch-tuple-batch-test.cc@69
PS12, Line 69: 2, 4, 8, 16, 32
nit. I wonder if we can construct a test with randomly assigned true/false for the batch as follows. 

1. Random assign true/false: for each element, flip a coin;
2. Construct the micro batches;
3. Verify the micro batches as follows: each one has valid start/end and length per selected_rows[start, end], and the gap between batch i and i+1 is correct.


http://gerrit.cloudera.org:8080/#/c/17860/12/be/src/exec/scratch-tuple-batch-test.cc@86
PS12, Line 86: Creates
nit. Expect to observe


http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h
File be/src/exec/scratch-tuple-batch.h:

http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h@29
PS10, Line 29: ScratchMicroBatch
> Using aggregate initialisers instead of constructor accepting arguments as 
Done


http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h@203
PS10, Line 203:                     << "should be true";
              :     /// Add the last micro batch which was b
> We can avoid that extra branch and condition and also extra condition on cl
Done


http://gerrit.cloudera.org:8080/#/c/17860/12/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
File testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test:

http://gerrit.cloudera.org:8080/#/c/17860/12/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test@436
PS12, Line 436: row_regex:.* RF00.\[min_max\] -. .\.wr_item_sk.*
In addition, I wonder if we can grab a few counters on late materialized rows.  For example a total of 100 rows is returned out of a total 1000 rows, then

row_regex: .*LATE_MATERIALIZED_ROWS=100.* 

Such test probably can be done for a mix of each kind of expressions handled in the code, and the return type of scan column.



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 12
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 26 Oct 2021 13:24:52 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 12:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17860/12//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17860/12//COMMIT_MSG@24
PS12, Line 24: TPCH scale 42
> Hi Zoltan,
Thanks Amogh for doing the benchmarking. It's nice to see the improvements and that there's no regressions.



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 12
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 02 Nov 2021 10:16:22 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has uploaded a new patch set (#9). ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................

IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Currently, entire row is materialized before filtering during scan.
Instead of paying the cost of materializing upfront, for columnar
formats we can avoid doing it for rows that are filtered out.
Columns that are required for filtering are the only ones that are
needed to be materialized before filtering. For rest of the columns,
materialization can be delayed and be done only for rows that survive.
This patch implements this technique for Parquet format only.

New configuration 'parquet_materialization_threshold' is introduced,
which is minimum number of consecutive rows that are filtered out
to avoid materialization. If set to less that 0, it disables the
late materialization.

Performance:
Peformance measured for single daemon, single threaded impalad
upon TPCH scale 42 lineitem table with 252 million rows,
unsorted data. Upto 2.5x improvement for non-page indexed and
upto 4x improvement in page index seen. Queries for page index
borrowed from blog:
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
More details:
https://docs.google.com/spreadsheets/d/17s5OLaFOPo-64kimAPP6n3kJA42vM-iVT24OvsQgfuA/edit?usp=sharing

Testing:
 1. Ran existing tests
 2. Added UT for 'ScratchTupleBatch::GetMicroBatch'

Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-columnar-scanner-ir.cc
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-chunk-reader.cc
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
A be/src/exec/scratch-tuple-batch-test.cc
M be/src/exec/scratch-tuple-batch.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/tuple-row-compare.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
20 files changed, 895 insertions(+), 51 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17860/9
-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 9
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 11:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/17860/11/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/11/be/src/exec/parquet/hdfs-parquet-scanner.cc@2291
PS11, Line 2291:         int num_micro_batches = scratch_batch_->GetMicroBatches(late_materialization_threshold_,
line too long (96 > 90)


http://gerrit.cloudera.org:8080/#/c/17860/11/be/src/exec/scratch-tuple-batch-test.cc
File be/src/exec/scratch-tuple-batch-test.cc:

http://gerrit.cloudera.org:8080/#/c/17860/11/be/src/exec/scratch-tuple-batch-test.cc@84
PS11, Line 84:       EXPECT_EQ(scratch_batch->GetMicroBatches(10 /*Skip Length*/, micro_batches), 1024/n);
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17860/11/be/src/exec/scratch-tuple-batch-test.cc@116
PS11, Line 116:     EXPECT_EQ(scratch_batch->GetMicroBatches(10 /*Skip Length*/, micro_batches), 1024/(n * 2));
line too long (95 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 11
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 25 Oct 2021 10:30:35 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has uploaded a new patch set (#12). ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................

IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Currently, entire row is materialized before filtering during scan.
Instead of paying the cost of materializing upfront, for columnar
formats we can avoid doing it for rows that are filtered out.
Columns that are required for filtering are the only ones that are
needed to be materialized before filtering. For rest of the columns,
materialization can be delayed and be done only for rows that survive.
This patch implements this technique for Parquet format only.

New configuration 'parquet_materialization_threshold' is introduced,
which is minimum number of consecutive rows that are filtered out
to avoid materialization. If set to less than 0, it disables the
late materialization.

Performance:
Peformance measured for single daemon, single threaded impalad
upon TPCH scale 42 lineitem table with 252 million rows,
unsorted data. Upto 2.5x improvement for non-page indexed and
upto 4x improvement in page index seen. Queries for page index
borrowed from blog:
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
More details:
https://docs.google.com/spreadsheets/d/17s5OLaFOPo-64kimAPP6n3kJA42vM-iVT24OvsQgfuA/edit?usp=sharing

Testing:
 1. Ran existing tests
 2. Added UT for 'ScratchTupleBatch::GetMicroBatch'

Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-columnar-scanner-ir.cc
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-chunk-reader.cc
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
A be/src/exec/scratch-tuple-batch-test.cc
M be/src/exec/scratch-tuple-batch.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/tuple-row-compare.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
20 files changed, 935 insertions(+), 51 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17860/12
-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 12
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 4:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/9565/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 4
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 06 Oct 2021 09:38:26 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has uploaded a new patch set (#4). ( http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................

[WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Currently, entire row is materialized, before filtering upon it during
scan. Instead, cost can be saved if only the columns required for
filtering are materialized first and then rest of the columns are
materialized only for rows surviving after filter.

Performance:
Peformance measured for single daemon, single threaded impalad
upon TPCH scale 42 lineitem table with 252 million rows,
unsorted data. Upto 2.5x improvement for non-page indexed and
upto 4x improvement in page index seen. Queries for page index
borrowed from blog:
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
More details:
https://docs.google.com/spreadsheets/d/17s5OLaFOPo-64kimAPP6n3kJA42vM-iVT24OvsQgfuA/edit?usp=sharing

Testing: TBD

Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/hdfs-columnar-scanner-ir.cc
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-chunk-reader.cc
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
M be/src/exec/scratch-tuple-batch.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
18 files changed, 774 insertions(+), 121 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17860/4
-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 4
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 3:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/9543/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 3
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 01 Oct 2021 16:25:10 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 7:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9593/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 7
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 12 Oct 2021 13:45:38 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 5:

(7 comments)

Looks good!

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2330
PS3, Line 2330: 
> >> By reading the code, my guess is that each batch covers a >> number of, 
Okay. Thanks for the clarification on skip length. 

My guess is that converting to batches may not be beneficial, especially when the T and F are interleaved tightly (the common case). In this case, you may need to recheck the selected rows in the batch.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2381
PS3, Line 2381: if (micro_batches[0].start > 0) {
              :           if (UNLIKELY(!col_reader->SkipRows(micro_batches[0].start, -1))) {
              :             return Status(Substitute("Couldn't skip rows in file $0.", filename()));
              :           }
              :         }
> It is possible to have micro_batches[0].start==0, but in that case we don't
Done


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2417
PS3, Line 2417: return Status::OK();
> This behaviour is retained from earlier code. You will find the code in Ass
Done


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h
File be/src/exec/parquet/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h@553
PS3, Line 553: 
> sure. Not sure if this happened due to clang-format.
Normally, I just highlight a section of code modified (in vi: shift-v and scroll down with j) and clang format (in vi: control-k) :-).


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1151
PS3, Line 1151: Status::OK();
> Not really. It depends on if abort_on_error is set as Query option. LogCorr
Good to know! Done.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1484
PS3, Line 1484: num_buffered_values_ == 0
> This signifies end of RowGroup. Same logic is currently being used (check N
Done


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1538
PS3, Line 1538: }
> We don't need to check, we can just return it and client will check it. But
Hmm. It seems a false returning status means likely the data in the page is corrupted. Sounds like we should bail out? The return status from SkipTopLevelRows() in the code below is checked. 

1218 bool BaseScalarColumnReader::SkipTopLevelRows(int64_t num_rows) {                          
1219   DCHECK_GE(num_buffered_values_, num_rows);                                               
1220   DCHECK_GT(num_rows, 0);   
.. .. ..
1272     }                                                                                      
1273   }                                                                                        
1274   return SkipEncodedValuesInPage(num_values_to_skip);                                      
1275 }   

 385 template <typename InternalType, parquet::Type::type PARQUET_TYPE, bool MATERIALIZED>      
 386 bool ScalarColumnReader<InternalType, PARQUET_TYPE,                                        
 387     MATERIALIZED>::SkipEncodedValuesInPage(int64_t num_values) {                           
 388   if (bool_decoder_) {                                                                     
 389     return bool_decoder_->SkipValues(num_values);                                          
 390   }                                                                                        
 391   if (IsDictionaryEncoding(page_encoding_)) {                                              
 392     return dict_decoder_.SkipValues(num_values);                                           
 393   } else {                                                                                 
 394     DCHECK_EQ(page_encoding_, Encoding::PLAIN);                                            
 395     int64_t encoded_len = ParquetPlainEncoder::EncodedLen<PARQUET_TYPE>(                   
 396         data_, data_end_, fixed_len_size_, num_values);                                    
 397     if (encoded_len < 0) return false;                                                     
 398     data_ += encoded_len;                                                                  
 399   }                                                                                        
 400   return true;                                                                             
 401 }



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 5
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 07 Oct 2021 13:46:47 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 12:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17860/12/be/src/exec/scratch-tuple-batch-test.cc
File be/src/exec/scratch-tuple-batch-test.cc:

http://gerrit.cloudera.org:8080/#/c/17860/12/be/src/exec/scratch-tuple-batch-test.cc@69
PS12, Line 69: 2, 4, 8, 16, 32
> Verification of micro batches created for randomly assigned true/false woul
I see. Let us assume the following: 

0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 
F F F F T T F F T T F F F F F F T T F T

gap=5

batches formed

[4, 9], [16, 19] 

How do we verify?

1. For any batch i [begin, end, length]
     a. selected_rows[begin] == selected_rows[end] == T
     b. The distance of any consecutive T within selected_rows[begin, end] should be no more than gap

2. For any batch i and i+1
     a. i.end < (i+1).begin
     b. (i+1).begin - i.end +1 >= gap

Will the above work?



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 12
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 26 Oct 2021 16:16:28 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 12: Code-Review+1

(2 comments)

Thanks Amogh, this work is truly marvellous.

And thanks for measuring code coverage. I think you used test_scanners.py (as we discussed offline). With test_nested_types.py as well I think we would get probably complete code coverage.

The change looks great, I'm ready to upgrade my +1 to +2 once it looks OK for the other reviewers.

http://gerrit.cloudera.org:8080/#/c/17860/12//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17860/12//COMMIT_MSG@24
PS12, Line 24: TPCH scale 42
I think it would be good to execute the whole benchmark with bin/single_node_perf_run.py on your dev machine, just to make sure this patch won't cause any regressions. And of course it'd be also interesting to see if there are any significant perf gains in the benchmark queries.


http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/parquet/hdfs-parquet-scanner.cc@2223
PS10, Line 2223: c.
> We don't need to re-filter after step 3. I will explain it in comment.
Thanks!



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 12
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 26 Oct 2021 16:56:27 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 1:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/9476/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 1
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 21 Sep 2021 00:25:12 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has uploaded a new patch set (#5). ( http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................

[WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Currently, entire row is materialized, before filtering upon it during
scan. Instead, cost can be saved if only the columns required for
filtering are materialized first and then rest of the columns are
materialized only for rows surviving after filter.

Performance:
Peformance measured for single daemon, single threaded impalad
upon TPCH scale 42 lineitem table with 252 million rows,
unsorted data. Upto 2.5x improvement for non-page indexed and
upto 4x improvement in page index seen. Queries for page index
borrowed from blog:
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
More details:
https://docs.google.com/spreadsheets/d/17s5OLaFOPo-64kimAPP6n3kJA42vM-iVT24OvsQgfuA/edit?usp=sharing

Testing: TBD

Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/hdfs-columnar-scanner-ir.cc
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-chunk-reader.cc
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
M be/src/exec/scratch-tuple-batch.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/tuple-row-compare.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
19 files changed, 797 insertions(+), 109 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17860/5
-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 5
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has uploaded a new patch set (#6). ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................

IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Currently, entire row is materialized before filtering during scan.
Instead of paying the cost of materializing upfront, for columnar
formats we can avoid doing it for rows that are filtered out.
Columns that are required for filtering are the only ones that are
needed to be materialized before filtering. For rest of the columns,
materialization can be delayed and be done only for rows that survive.
This patch implements this technique for Parquet format only.

New configuration 'parquet_materialization_threshold' is introduced,
which is minimum number of consecutive rows that are filtered out
to avoid materialization. If set to less that 0, it disables the
late materialization.

Performance:
Peformance measured for single daemon, single threaded impalad
upon TPCH scale 42 lineitem table with 252 million rows,
unsorted data. Upto 2.5x improvement for non-page indexed and
upto 4x improvement in page index seen. Queries for page index
borrowed from blog:
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
More details:
https://docs.google.com/spreadsheets/d/17s5OLaFOPo-64kimAPP6n3kJA42vM-iVT24OvsQgfuA/edit?usp=sharing

Testing: TBD

Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-columnar-scanner-ir.cc
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-chunk-reader.cc
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
A be/src/exec/scratch-tuple-batch-test.cc
M be/src/exec/scratch-tuple-batch.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/tuple-row-compare.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
19 files changed, 897 insertions(+), 91 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17860/6
-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 6
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 6:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9577/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 6
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 08 Oct 2021 16:47:07 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 16:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/9683/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 16
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Oct 2021 20:09:27 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 17:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/9684/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 17
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Oct 2021 20:19:54 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.cc@994
PS4, Line 994:     
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.cc@2266
PS4, Line 2266:         int num_micro_batches = 
line has trailing whitespace



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 4
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 06 Oct 2021 09:17:30 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has uploaded a new patch set (#8). ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................

IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Currently, entire row is materialized before filtering during scan.
Instead of paying the cost of materializing upfront, for columnar
formats we can avoid doing it for rows that are filtered out.
Columns that are required for filtering are the only ones that are
needed to be materialized before filtering. For rest of the columns,
materialization can be delayed and be done only for rows that survive.
This patch implements this technique for Parquet format only.

New configuration 'parquet_materialization_threshold' is introduced,
which is minimum number of consecutive rows that are filtered out
to avoid materialization. If set to less that 0, it disables the
late materialization.

Performance:
Peformance measured for single daemon, single threaded impalad
upon TPCH scale 42 lineitem table with 252 million rows,
unsorted data. Upto 2.5x improvement for non-page indexed and
upto 4x improvement in page index seen. Queries for page index
borrowed from blog:
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
More details:
https://docs.google.com/spreadsheets/d/17s5OLaFOPo-64kimAPP6n3kJA42vM-iVT24OvsQgfuA/edit?usp=sharing

Testing: TBD

Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-columnar-scanner-ir.cc
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-chunk-reader.cc
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
A be/src/exec/scratch-tuple-batch-test.cc
M be/src/exec/scratch-tuple-batch.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/tuple-row-compare.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
20 files changed, 896 insertions(+), 71 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17860/8
-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 8
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 14:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9681/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 14
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Oct 2021 19:41:28 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 15:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17860/15/tests/query_test/test_parquet_late_materialization.py
File tests/query_test/test_parquet_late_materialization.py:

http://gerrit.cloudera.org:8080/#/c/17860/15/tests/query_test/test_parquet_late_materialization.py@2
PS15, Line 2: class TestParquetLateMaterialization(ImpalaTestSuite):
flake8: E302 expected 2 blank lines, found 0


http://gerrit.cloudera.org:8080/#/c/17860/15/tests/query_test/test_parquet_late_materialization.py@14
PS15, Line 14: 
flake8: W292 no newline at end of file



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 15
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Oct 2021 19:41:33 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 17: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7575/


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 17
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 28 Oct 2021 02:43:58 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 18:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9689/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 18
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 28 Oct 2021 11:58:34 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has uploaded a new patch set (#18). ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.
......................................................................

IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Currently, entire row is materialized before filtering during scan.
Instead of paying the cost of materializing upfront, for columnar
formats we can avoid doing it for rows that are filtered out.
Columns that are required for filtering are the only ones that are
needed to be materialized before filtering. For rest of the columns,
materialization can be delayed and be done only for rows that survive.
This patch implements this technique for Parquet format only.

New configuration 'parquet_materialization_threshold' is introduced,
which is minimum number of consecutive rows that are filtered out
to avoid materialization. If set to less than 0, it disables the
late materialization.

Performance:
Peformance measured for single daemon, single threaded impalad
upon TPCH scale 42 lineitem table with 252 million rows,
unsorted data. Upto 2.5x improvement for non-page indexed and
upto 4x improvement in page index seen. Queries for page index
borrowed from blog:
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
More details:
https://docs.google.com/spreadsheets/d/17s5OLaFOPo-64kimAPP6n3kJA42vM-iVT24OvsQgfuA/edit?usp=sharing

Testing:
 1. Ran existing tests
 2. Added UT for 'ScratchTupleBatch::GetMicroBatch'
 3. Added end-to-end test for late materialization.
Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-columnar-scanner-ir.cc
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-chunk-reader.cc
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
A be/src/exec/scratch-tuple-batch-test.cc
M be/src/exec/scratch-tuple-batch.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/tuple-row-compare.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-local-tz-conversion.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-late-materialization.test
A tests/query_test/test_parquet_late_materialization.py
22 files changed, 1,070 insertions(+), 52 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17860/18
-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 18
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 3:

(9 comments)

Was able to complete the review. Have some comments.

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-columnar-scanner.h
File be/src/exec/hdfs-columnar-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-columnar-scanner.h@60
PS3, Line 60: bool* selected_rows
Need to document the use of this new argument in the comment above.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@249
PS3, Line 249: vector<ParquetColumnReader*>& 
nit. const vector<ParquetColumnReader*>&.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@250
PS3, Line 250: vector<ParquetColumnReader*>& filter_readers,
             :     vector<ParquetColumnReader*>& non_filter_readers)
In Impala BE, prefer to use vector<ParquetColumnReader*>* when the argument is to be modified.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@253
PS3, Line 253:   conjuncts.insert(std::end(conjuncts), std::begin(scan_node_->conjuncts()),
             :       std::end(scan_node_->conjuncts()));
             :   conjuncts.insert(std::end(conjuncts), std::begin(scan_node_->filter_exprs()),
Can you please check if the code is sufficient to deal with min/max filters from HJs? 

For min/max filter F from HJ, we allocate two slots in minMaxTuple_ (see HdfsScanNode.java  line 843 to 846, and save the start slot index, together with the filter Id, in overlapPredicateDescs_. 

Thus, we may need to iterate over BE version of  overlapPredicateDescs_ (see GetOverlapPredicateDescs() in parquet/hdfs-parquet-scanner.h) to grab the pair of the slots from which the column path can be found.

Is it possible to later materialize rows surviving a min/max filter?


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@270
PS3, Line 270: vector<ScalarExpr*> conjuncts
should use const T&.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2330
PS3, Line 2330: int skip_length)
By reading the code, my guess is that each batch covers a number of, up to skip length, selected rows. 

When the selected rows interleave with non-selected rows (for example, T, T, F, T, F, F, T), then use of selected_rows as the representation should be better, since here in the code, the batch covers both rows T and F. 

When the selected rows form a cluster (such as on a sorted column), then the current implementation selects on span of rows which is very nice.

Is it possible to work with selected_rows directly?


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2377
PS3, Line 2377: continue_execution
should init this boolean, as the code below conditionally set it.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2381
PS3, Line 2381: if (micro_batches[0].start > 0) {
              :           if (UNLIKELY(!col_reader->SkipRows(micro_batches[0].start, -1))) {
              :             return Status(Substitute("Couldn't skip rows in file $0.", filename()));
              :           }
              :         }
Do we need an else branch to deal with micro_batches[0].start==0? If it is not possible, a DCHECK() on it probably should be added.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2417
PS3, Line 2417: return Status::OK();
I wonder if a non Status::OK() object should return here.



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 3
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Oct 2021 19:38:15 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Kurt Deschler (Code Review)" <ge...@cloudera.org>.
Kurt Deschler has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 1:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/hdfs-columnar-scanner-ir.cc
File be/src/exec/hdfs-columnar-scanner-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/hdfs-columnar-scanner-ir.cc@25
PS1, Line 25: selected_rows
Can you make this a member of RowBatch or a devived classso you don't need to pass the 2 (related) arguments separately?


http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/hdfs-orc-scanner.cc@629
PS1, Line 629:     bool selected_rows[scratch_batch_->capacity];
Should check end of stack here or allocate memory if capacity is anything substantial. If not, comment/DCHECK that size is small. Non-issue if this moves into RowBatch.


http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.h@431
PS1, Line 431:   /// Column readers among 'column_readers_' used for filtering
Consider packaging these in a class if they are going to be passed together.


http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.h@911
PS1, Line 911:   vector<SlotId> GetSlotIdsForConjuncts(vector<ScalarExpr*> conjuncts);
Pass vector as reference to avoid copying.


http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@246
PS1, Line 246:     vector<ParquetColumnReader*>& non_filter_readers) {
Declare function const.


http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@254
PS1, Line 254:   slot_ids.insert(std::begin(conjunct_slot_ids), std::end(conjunct_slot_ids));
use conjunct_slot_ids instead of making a copy.


http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@2147
PS1, Line 2147:       // Old logic
Old vs new not meaningful comment here. This doesn't look like the old logic..


http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@2148
PS1, Line 2148:       ScratchMicroBatch batches[1];
change array[1] to single variable.


http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@2151
PS1, Line 2151:       batches[0].length = scratch_batch_->capacity;
Probably best to leave the old logic in-tact here both to avoid any potential performance regressions and so that the new logic can be completely disabled via parameter. Plus calling the new interface for a single batch seems clunky.


http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@2166
PS1, Line 2166:       ScratchMicroBatch batches[1];
Move this new logic into a function as well.


http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@2238
PS1, Line 2238:         // continue the old range as 'last' is within 'skip_length' of last range.
How does it end up in this case?


http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@2253
PS1, Line 2253:     batches[range].start = start;
Can we only get here if batch_size==0?


http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@2275
PS1, Line 2275:             return Status(Substitute("Couldn't skip rows in file $0.", filename()));
Add comments how this can occur



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 1
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Comment-Date: Wed, 29 Sep 2021 18:07:13 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 2:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/9523/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 2
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 28 Sep 2021 22:01:12 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 3:

Thanks for the comments Kurt. Will take care of them.


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 3
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 01 Oct 2021 16:14:16 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 7:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/17860/6/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/6/be/src/exec/parquet/hdfs-parquet-scanner.cc@253
PS6, Line 253:   conjuncts.reserve(scan_node_->conjuncts().size() +
> Pass reserve to ctor instead.
Avoiding it for the reason mentioned in another similar comment.


http://gerrit.cloudera.org:8080/#/c/17860/6/be/src/exec/parquet/hdfs-parquet-scanner.cc@274
PS6, Line 274:     const vector<ScalarExpr*>& conjuncts) const {
> Pass reserve to ctor instead.
Passing size to ctor would not only reserve the memory it would also initialise it with some values (0 for int, default constructor for user defined class etc ). So using push_back and insert after that would give a different result: https://onlinegdb.com/uyvYdBNZ1


http://gerrit.cloudera.org:8080/#/c/17860/6/be/src/exec/parquet/parquet-column-chunk-reader.h
File be/src/exec/parquet/parquet-column-chunk-reader.h:

http://gerrit.cloudera.org:8080/#/c/17860/6/be/src/exec/parquet/parquet-column-chunk-reader.h@124
PS6, Line 124:   Status ReadNextDataPage(
> Probably not necessary/good to templatize page-level functions like this un
I agree. Since it is at the page level we can avoid it. Removed it now.


http://gerrit.cloudera.org:8080/#/c/17860/5/be/src/exec/parquet/parquet-column-chunk-reader.h
File be/src/exec/parquet/parquet-column-chunk-reader.h:

http://gerrit.cloudera.org:8080/#/c/17860/5/be/src/exec/parquet/parquet-column-chunk-reader.h@125
PS5, Line 125:       bool* eos, uint8_t** data, int* data_size, bool read_data = true);
> What was the motivation for inlining this (page-level) function?
inlining was side effect of using template; since definition for templated function has to be visible everywhere it is used, it was pulled into header. But since I removed template I have removed inlining too.


http://gerrit.cloudera.org:8080/#/c/17860/6/be/src/exec/scratch-tuple-batch-test.cc
File be/src/exec/scratch-tuple-batch-test.cc:

http://gerrit.cloudera.org:8080/#/c/17860/6/be/src/exec/scratch-tuple-batch-test.cc@64
PS6, Line 64:   scoped_ptr<ScratchTupleBatch> scratch_batch(
> line too long (100 > 90)
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 7
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 12 Oct 2021 13:24:29 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 10:

(10 comments)

Looks great!

On testing, I wonder if we can add a counter on # of rows (or amount of data) not surviving the materialization. This will be useful to safe guard the feature and demonstrate its usefulness.

http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch-test.cc
File be/src/exec/scratch-tuple-batch-test.cc:

http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch-test.cc@67
PS10, Line 67: // set every 16th row as selected.
I wonder if we can add two more tests for the following situations. 

1. Clustered: over 1024 values, 200 consecutive are true and the rest is false;
2. interleaved: over 1024 values, randomly set 10%, 30%, and 70% to true.


http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h
File be/src/exec/scratch-tuple-batch.h:

http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h@29
PS10, Line 29: ScratchMicroBatch
May need a cstr to properly init these fields.


http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h@171
PS10, Line 171:   /// Consecutive bits set are used to create ranges. Ranges that differ by less than
nit (or micro batches).


http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h@172
PS10, Line 172:  E.g., for ranges 1-8, 11-20, 35-100 derived
              :   /// from 'selected_rows' and 'skip_length' as 10, first two ranges would be merged
              :   /// into 1-20 as they differ by 3 (11 - 8) which is less than 10 ('skip_length').
This is wonderful.


http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h@176
PS10, Line 176: atleast
nit.


http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h@178
PS10, Line 178: range
nit. batch_idx may be a better name in this method.


http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h@203
PS10, Line 203: DCHECK(start != -1) << "Atleast one of the 'scratch_batch_->selected_rows'"
              :                         << "should be true";
nit. An alternative is the following, which is more robust. 

if (start == -1) {
 return 0;
}


http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/service/query-options.h
File be/src/service/query-options.h:

http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/service/query-options.h@50
PS10, Line 50: PARQUET_MATERIALIZATION_THRESHOLD
nit: PARQUET_LATE_MATERIALIZATION_THRESHOLD?


http://gerrit.cloudera.org:8080/#/c/17860/10/common/thrift/ImpalaService.thrift
File common/thrift/ImpalaService.thrift:

http://gerrit.cloudera.org:8080/#/c/17860/10/common/thrift/ImpalaService.thrift@701
PS10, Line 701:   // of columns in parquet
nit. -1 to turn off the feature.


http://gerrit.cloudera.org:8080/#/c/17860/10/common/thrift/Query.thrift
File common/thrift/Query.thrift:

http://gerrit.cloudera.org:8080/#/c/17860/10/common/thrift/Query.thrift@554
PS10, Line 554:   // of columns in parquet
nit. -1 to turn off the feature.



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 10
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 22 Oct 2021 14:43:30 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 11:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/parquet/hdfs-parquet-scanner.cc@2223
PS10, Line 2223: c.
> Could you please explain where do we filter out the rows in the merged micr
We don't need to re-filter after step 3. I will explain it in comment.


http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch-test.cc
File be/src/exec/scratch-tuple-batch-test.cc:

http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch-test.cc@67
PS10, Line 67: scratch_batch->num_tuples = BATCH_
> I wonder if we can add two more tests for the following situations. 
Done


http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h
File be/src/exec/scratch-tuple-batch.h:

http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h@29
PS10, Line 29: ScratchMicroBatch
> May need a cstr to properly init these fields.
Using aggregate initialisers instead of constructor accepting arguments as we need default constructor too. Plus we don't want many function calls on hot path (GetMicroBatches).


http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h@171
PS10, Line 171:   /// bits set are used to create micro batches. Micro batches that differ by less than
> nit (or micro batches).
Done


http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h@176
PS10, Line 176: present
> nit.
Done


http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h@178
PS10, Line 178: batch
> nit. batch_idx may be a better name in this method.
Done


http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/scratch-tuple-batch.h@203
PS10, Line 203:                     << "should be true";
              :     /// Add the last micro batch which was b
> nit. An alternative is the following, which is more robust. 
We can avoid that extra branch and condition and also extra condition on client side to handle 0 being returned, as it is anyways going to be dead code and also mentioned as precondition for method. DCHECK is to ensure that precondition and in future this dead code doesn't get activated.


http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/service/query-options.h
File be/src/service/query-options.h:

http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/service/query-options.h@50
PS10, Line 50: PARQUET_LATE_MATERIALIZATION_THRE
> nit: PARQUET_LATE_MATERIALIZATION_THRESHOLD?
Done


http://gerrit.cloudera.org:8080/#/c/17860/10/common/thrift/ImpalaService.thrift
File common/thrift/ImpalaService.thrift:

http://gerrit.cloudera.org:8080/#/c/17860/10/common/thrift/ImpalaService.thrift@701
PS10, Line 701:   ENABLE_ASYNC_DDL_EXECUTION = 136
> nit. -1 to turn off the feature.
Done


http://gerrit.cloudera.org:8080/#/c/17860/10/common/thrift/Query.thrift
File common/thrift/Query.thrift:

http://gerrit.cloudera.org:8080/#/c/17860/10/common/thrift/Query.thrift@554
PS10, Line 554:   137: optional bool enable_async_ddl_execution = true;
> nit. -1 to turn off the feature.
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 11
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 25 Oct 2021 10:30:16 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 3:

(8 comments)

Spotted a couple of nits. Will do a deeper review in the following days.

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h@560
PS3, Line 560: materilizing
materializing


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h@560
PS3, Line 560: it
nit: them?


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h@562
PS3, Line 562: materilizing
materializing


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h@565
PS3, Line 565: Mmterializing
materializing


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h@568
PS3, Line 568: [1]
Why is it a one-element array? The name also misses a last underscore.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h@927
PS3, Line 927: 
nit: Unnecessary empty line.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h@930
PS3, Line 930: ScratchMicroBatches
nit: 'micro_batches'


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h@941
PS3, Line 941: fisrt
nit: first



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 3
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 05 Oct 2021 16:35:50 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 13:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9680/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 13
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Oct 2021 19:41:20 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has uploaded a new patch set (#16). ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.
......................................................................

IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Currently, entire row is materialized before filtering during scan.
Instead of paying the cost of materializing upfront, for columnar
formats we can avoid doing it for rows that are filtered out.
Columns that are required for filtering are the only ones that are
needed to be materialized before filtering. For rest of the columns,
materialization can be delayed and be done only for rows that survive.
This patch implements this technique for Parquet format only.

New configuration 'parquet_materialization_threshold' is introduced,
which is minimum number of consecutive rows that are filtered out
to avoid materialization. If set to less than 0, it disables the
late materialization.

Performance:
Peformance measured for single daemon, single threaded impalad
upon TPCH scale 42 lineitem table with 252 million rows,
unsorted data. Upto 2.5x improvement for non-page indexed and
upto 4x improvement in page index seen. Queries for page index
borrowed from blog:
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
More details:
https://docs.google.com/spreadsheets/d/17s5OLaFOPo-64kimAPP6n3kJA42vM-iVT24OvsQgfuA/edit?usp=sharing

Testing:
 1. Ran existing tests
 2. Added UT for 'ScratchTupleBatch::GetMicroBatch'
 3. Added end-to-end test for late materialization.
Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-columnar-scanner-ir.cc
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-chunk-reader.cc
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
A be/src/exec/scratch-tuple-batch-test.cc
M be/src/exec/scratch-tuple-batch.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/tuple-row-compare.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
A testdata/workloads/functional-query/queries/QueryTest/parquet-late-materialization.test
A tests/query_test/test_parquet_late_materialization.py
21 files changed, 1,049 insertions(+), 51 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17860/16
-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 16
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 11:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9647/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 11
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 25 Oct 2021 10:51:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 18:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7577/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 18
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 28 Oct 2021 11:41:02 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@253
PS3, Line 253:   conjuncts.reserve(scan_node_->conjuncts().size() +
             :     scan_node_->filter_exprs().size());
             :   conjuncts.insert(std::end(conjuncts), std::begin(scan_node_->conjuncts()),
> I tested this for Min/Max filters at row level and this code handles it as 
Great!  May add a test for the following query. Since the filter is on a.wr_item_sk, the later materialization should help save on wr_reason_sk.  

select a.wr_reason_sk, a.wr_item_sk from web_returns a, web_returns b where 
a.wr_item_sk = b.wr_item_sk and 
b.wr_return_ship_cost < 10;


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2330
PS3, Line 2330: 
> There is no recheck happening once the batch is formed even if they have fe
Okay. That makes sense.



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 7
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 13 Oct 2021 17:53:28 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Kurt Deschler (Code Review)" <ge...@cloudera.org>.
Kurt Deschler has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 6:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/17860/6/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/6/be/src/exec/parquet/hdfs-parquet-scanner.cc@253
PS6, Line 253:   conjuncts.reserve(scan_node_->conjuncts().size() +
Pass reserve to ctor instead.


http://gerrit.cloudera.org:8080/#/c/17860/6/be/src/exec/parquet/hdfs-parquet-scanner.cc@274
PS6, Line 274:   slot_ids.reserve(conjuncts.size());
Pass reserve to ctor instead.


http://gerrit.cloudera.org:8080/#/c/17860/6/be/src/exec/parquet/parquet-column-chunk-reader.h
File be/src/exec/parquet/parquet-column-chunk-reader.h:

http://gerrit.cloudera.org:8080/#/c/17860/6/be/src/exec/parquet/parquet-column-chunk-reader.h@124
PS6, Line 124:   template <bool READ_DATA_PAGE = true>
Probably not necessary/good to templatize page-level functions like this unless they are confirmed hotspots. Can lead to code bloat issues.


http://gerrit.cloudera.org:8080/#/c/17860/5/be/src/exec/parquet/parquet-column-chunk-reader.h
File be/src/exec/parquet/parquet-column-chunk-reader.h:

http://gerrit.cloudera.org:8080/#/c/17860/5/be/src/exec/parquet/parquet-column-chunk-reader.h@125
PS5, Line 125:   Status ReadNextDataPage(bool* eos, uint8_t** data, int* data_size) {
What was the motivation for inlining this (page-level) function?



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 6
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 12 Oct 2021 02:17:00 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has uploaded a new patch set (#3). ( http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................

[WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Currently, entire row is materialized, before filtering upon it during
scan. Instead, cost can be saved if only the columns required for
filtering are materialized first and then rest of the columns are
materialized only for rows surviving after filter.

Performance: TBD

Testing: TBD

Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/hdfs-columnar-scanner-ir.cc
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-chunk-reader.cc
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
M be/src/exec/scratch-tuple-batch.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
18 files changed, 774 insertions(+), 121 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17860/3
-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 3
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 3:

(17 comments)

Just had time to go through about 2/3 of the changes. Will resume next week.

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-chunk-reader.h
File be/src/exec/parquet/parquet-column-chunk-reader.h:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-chunk-reader.h@127
PS3, Line 127: should
             :     // be
nit. format


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h
File be/src/exec/parquet/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h@427
PS3, Line 427: ReadNextPageHeader
Name it as ReadNextDataPageHeader()?


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h@439
PS3, Line 439: would
will?


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h@439
PS3, Line 439: .
nit. to insert "regardless of the page type" to make it clear.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h@513
PS3, Line 513: virtual bool SetRowGroupAtEnd() override;
Missing comment?


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h@526
PS3, Line 526:   int64_t LastRowIdxInCurrentPage() const {
missing comment?


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h@527
PS3, Line 527:     DCHECK(!candidate_data_pages_.empty());
May add a DCHECK(candidate_page_idx_ <= candidate_data_pages_.size() - 1) here.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h@553
PS3, Line 553: bool DoesPageFiltering() const { return !candidate_data_pages_.empty(); }
nit. Noticed that this change and all the ones following this one are due to format. Is it possible to restore the original format to avoid these changes?


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1151
PS3, Line 1151: Status::OK();
We should return a non Status::OK() object here.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1160
PS3, Line 1160: num_values < 0
nit. UNLIKELY()


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1210
PS3, Line 1210: remaining
init to 0.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1372
PS3, Line 1372: && status
This is redundant.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1479
PS3, Line 1479: JumpNextPageHeader
AdvanceNextPageHeader() sounds better.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1484
PS3, Line 1484: num_buffered_values_ == 0
I wonder if we could skip this empty page and advance to the next one.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1509
PS3, Line 1509: bool BaseScalarColumnReader::SkipRows(int64_t num_rows, int64_t skip_row_id) {
              :   if (max_rep_level() > 0) {
              :     return SkipRowsInternal<true>(num_rows, skip_row_id);
              :   } else {
              :     return SkipRowsInternal<false>(num_rows, skip_row_id);
              :   }
              : };
This probably should be inlined.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1517
PS3, Line 1517: /// Wrapper around 'SkipTopLevelRows' to skip across multiple pages.
Please add a description for the algorithm used in the method, as well as ones for each data block inside the method.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1538
PS3, Line 1538: result = 
Do we need to check the status here?



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 3
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 01 Oct 2021 19:24:48 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 18: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 18
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 28 Oct 2021 18:04:39 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 11:

> (10 comments)
 > 
 > Looks great!
 > 
 > On testing, I wonder if we can add a counter on # of rows (or
 > amount of data) not surviving the materialization. This will be
 > useful to safe guard the feature and demonstrate its usefulness.

Thanks Qifan for the review and the suggestion of counter is good and something I pondered about earlier. Issue is that we don't skip decoding rows, instead we skip decoding values where one row may constitute hundreds of values out of which some will be read and others might be skipped. But we cannot accurately keep track number of values being skipped in current scheme of things without incurring significant performance penalty. For instance, we sometimes skip pages without decompressing it - if skipped page has page index with candidate rows we will need to decompress the page to get the accurate values skipped due to late materialisation. In that scenario where we directly skip pages, even if page is not compressed, figuring out number of values for corresponding candidate range can be time consuming. Hence, using timed counters would be more appropriate here, which are already present.


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 11
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 25 Oct 2021 10:45:53 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 12:

Fix Jenkins indent comments.


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 12
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 25 Oct 2021 10:58:06 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has uploaded a new patch set (#14). ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.
......................................................................

IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Currently, entire row is materialized before filtering during scan.
Instead of paying the cost of materializing upfront, for columnar
formats we can avoid doing it for rows that are filtered out.
Columns that are required for filtering are the only ones that are
needed to be materialized before filtering. For rest of the columns,
materialization can be delayed and be done only for rows that survive.
This patch implements this technique for Parquet format only.

New configuration 'parquet_materialization_threshold' is introduced,
which is minimum number of consecutive rows that are filtered out
to avoid materialization. If set to less than 0, it disables the
late materialization.

Performance:
Peformance measured for single daemon, single threaded impalad
upon TPCH scale 42 lineitem table with 252 million rows,
unsorted data. Upto 2.5x improvement for non-page indexed and
upto 4x improvement in page index seen. Queries for page index
borrowed from blog:
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
More details:
https://docs.google.com/spreadsheets/d/17s5OLaFOPo-64kimAPP6n3kJA42vM-iVT24OvsQgfuA/edit?usp=sharing

Testing:
 1. Ran existing tests
 2. Added UT for 'ScratchTupleBatch::GetMicroBatch'

Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-columnar-scanner-ir.cc
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-chunk-reader.cc
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
A be/src/exec/scratch-tuple-batch-test.cc
M be/src/exec/scratch-tuple-batch.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/tuple-row-compare.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
19 files changed, 986 insertions(+), 51 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17860/14
-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 14
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 8:

(5 comments)

Left some minor comments.
When adding tests, I think it'd be useful to measure code coverage:

buildall.sh -notests -codecoverage

# To generate reports:
bin/coverage_helper.sh

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.cc
File be/src/exec/hdfs-columnar-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.cc@110
PS8, Line 110: int HdfsColumnarScanner::ProcessScratchBatchCodegenOrInterpret(
nit: unintended new line?


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.h
File be/src/exec/parquet/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.h@509
PS8, Line 509: remaining
Please add a comment about 'remaining'


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc@1282
PS8, Line 1282: &
Output parameters should be pointers.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/scratch-tuple-batch.h
File be/src/exec/scratch-tuple-batch.h:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/scratch-tuple-batch.h@74
PS8, Line 74:   boost::scoped_array<bool> selected_rows;
I wonder if using vector<bool> would be better since it's uses a space-efficient representation, i.e. 1 bool = 1 bit.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/scratch-tuple-batch.h@177
PS8, Line 177: ScratchMicroBatch* batches
nit: usually we put output parameters at the end of the param list



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 8
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 20 Oct 2021 13:22:04 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/scratch-tuple-batch.h
File be/src/exec/scratch-tuple-batch.h:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/scratch-tuple-batch.h@74
PS8, Line 74:   boost::scoped_array<bool> selected_rows;
> I wonder if using vector<bool> would be better since it's uses a space-effi
So I had thought about using std::bitset, vector<bool> and 
current boolean array. bitset needs length at compile time so I discarded it. Based on this stack overflow discussion, I had decided to use vector<bool> itself: https://stackoverflow.com/a/36933356/17210459. But when I remeasured the same Prime function code in answer and another benchmark I created to mimick simple pattern that we use I found boolean array to be faster on gcc 7.5 on CPU times. Benchmarks:
https://quick-bench.com/q/ejXNWbgFJlDqC0jsHCieLTr_aK0
https://quick-bench.com/q/EJsSeRrjbqU1eXsOH2ySLtWdR1M

I agree vector<bool> is more space efficient but I think bit manipulation to set and read values might be consuming more time. I was hoping in second benchmark vector<bool> may be faster as it may fit within cacheline but even there it was 1.3 times slower.



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 8
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Oct 2021 11:21:30 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 9:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/17860/9/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/9/be/src/exec/parquet/hdfs-parquet-scanner.cc@248
PS9, Line 248: // page index are not supported for late materialization. 
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/17860/9/be/src/exec/parquet/hdfs-parquet-scanner.cc@263
PS9, Line 263:         && std::find(conjunct_slot_ids_.begin(), conjunct_slot_ids_.end(), slot_desc->id())
line too long (91 > 90)


http://gerrit.cloudera.org:8080/#/c/17860/9/be/src/exec/parquet/hdfs-parquet-scanner.cc@2277
PS9, Line 2277:       if (USES_PAGE_INDEX || !scratch_batch_->AtEnd() || num_row_to_commit == scratch_batch_->num_tuples) {
line too long (107 > 90)


http://gerrit.cloudera.org:8080/#/c/17860/9/be/src/exec/parquet/hdfs-parquet-scanner.cc@2281
PS9, Line 2281:         RETURN_IF_ERROR(FillScratchMicroBatches(non_filter_readers_, row_batch, skip_row_group,
line too long (95 > 90)


http://gerrit.cloudera.org:8080/#/c/17860/9/be/src/exec/parquet/hdfs-parquet-scanner.cc@2288
PS9, Line 2288:         RETURN_IF_ERROR(FillScratchMicroBatches(non_filter_readers_, row_batch, skip_row_group,
line too long (95 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 9
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Oct 2021 19:18:50 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 5:

(44 comments)

http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/hdfs-columnar-scanner-ir.cc
File be/src/exec/hdfs-columnar-scanner-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/hdfs-columnar-scanner-ir.cc@25
PS1, Line 25: selected_rows
> Can you make this a member of RowBatch or a devived classso you don't need 
'selected_rows' is not related to RowBatch. RowBatch is also used in many other operators and are serialized/deserialized where selected_rows would not have any relevance. Additionally, they hold on to memory much longer than the VLA  (variable length array) as they are streamed from one operator to another until it hits blocking operator or root of fragment. However 'selected_rows' and this function is related to ScratchTupleBatch. If scratch batch is not being shared by threads, we can make it a member of it - just that getting that memory from malloc/new might be slower than VLA.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-columnar-scanner-ir.cc
File be/src/exec/hdfs-columnar-scanner-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-columnar-scanner-ir.cc@25
PS3, Line 25: selected_rows
> Can you make this a member of RowBatch or a devived classso you don't need 
'selected_rows' is not related to RowBatch. Additionally, RowBatch is used at many other operators, are serialized/deserialized etc where selected_rows has no significance and also can keep holding onto memory much longer than VLA (variable length array) currently used as it is streamed from operator to another until it reaches root fragment or blocking operator. However, I had thought about pushing it into ScratchTupleBatch to which both the function and 'selected_rows' are related. I didn't do it as VLA is faster than malloc.


http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/hdfs-orc-scanner.cc@629
PS1, Line 629:     bool selected_rows[scratch_batch_->capacity];
> Should check end of stack here or allocate memory if capacity is anything s
Thinking about making it a member of ScratchTupleBatch as mentioned in another discussion.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h@430
PS3, Line 430:   std::vector<ParquetColumnReader*> column_readers_;
> Consider packaging these in a class if they are going to be passed together
It's not being passed together in current change except for being initialised and not being used outside this class, so we can avoid extra indirection.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h@560
PS3, Line 560: d materializ
> materializing
changed it throughout.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h@560
PS3, Line 560: th
> nit: them?
Done


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h@562
PS3, Line 562: skip materia
> materializing
Done


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h@565
PS3, Line 565: Materializing
> materializing
Done


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h@568
PS3, Line 568: _;
> Why is it a one-element array? The name also misses a last underscore.
Changed it.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h@927
PS3, Line 927:   /// Micro batches are sub ranges in 0..num_tuples-1 which needs to be read.
> nit: Unnecessary empty line.
Done


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h@928
PS3, Line 928:   /// Tuple memory to write to is specified by 'scratch_batch->tuple_mem'.
> Pass vector as reference to avoid copying.
Done


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h@930
PS3, Line 930: ch* row_batch, bool
> nit: 'micro_batches'
Done


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h@941
PS3, Line 941: 
> nit: first
Done


http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.h@431
PS1, Line 431:   /// Column readers among 'column_readers_' used for filtering
> Consider packaging these in a class if they are going to be passed together
In recent change, they are not being passed together except when being initialised.


http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.h@911
PS1, Line 911:   /// values that pass the conjuncts.
> Pass vector as reference to avoid copying.
Done


http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@246
PS1, Line 246: }
> Declare function const.
Done


http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@254
PS1, Line 254:       std::end(scan_node_->conjuncts()));
> use conjunct_slot_ids instead of making a copy.
Done


http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@2147
PS1, Line 2147:       bool continue_execution;
> Old vs new not meaningful comment here. This doesn't look like the old logi
This is changed now.


http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@2148
PS1, Line 2148:       if (col_reader->max_rep_level() > 0) {
> change array[1] to single variable.
It has been pulled to member variable.


http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@2151
PS1, Line 2151:             &scratch_batch_->num_tuples);
> Probably best to leave the old logic in-tact here both to avoid any potenti
makes sense. I have kept the old code intact in AssemblyRowInternal.


http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@2166
PS1, Line 2166:           Status err(Substitute("Corrupt Parquet file '$0': column '$1' "
> Move this new logic into a function as well.
This function has only the new logic now.


http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@2238
PS1, Line 2238:     if (num_row_to_commit == 0) {
> How does it end up in this case?
It ends up here even for simple case like consecutive true values. But for more complicated cases: assume skip_length as 5 and batch of 10: TTTFFFTTTT - it will enter here for all true values except the first one.


http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@2253
PS1, Line 2253:         // When using Page Index, materialize the entire batch. Currently, only avoiding
> Can we only get here if batch_size==0?
We will get here for pretty much everything except when batch_size is 0 or when all values in batch are false. I am assuming you meant to ask when will we hit the false branch here. However, I am removing condition (start = -1) and adding DCHECK to ensure batch sizes are not 0 and values are not false.


http://gerrit.cloudera.org:8080/#/c/17860/1/be/src/exec/parquet/hdfs-parquet-scanner.cc@2275
PS1, Line 2275:     if (row_end) {
> Add comments how this can occur
Added comment to the function declaration of 'SkipRows'.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@249
PS3, Line 249: const vector<ParquetColumnRead
> nit. const vector<ParquetColumnReader*>&.
Done


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@250
PS3, Line 250: vector<ParquetColumnReader*>* filter_readers,
             :     vector<ParquetColumnReader*>* non_filter_readers)
> Declare function const.
Done


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@250
PS3, Line 250: vector<ParquetColumnReader*>* filter_readers,
             :     vector<ParquetColumnReader*>* non_filter_readers)
> In Impala BE, prefer to use vector<ParquetColumnReader*>* when the argument
Done


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@259
PS3, Line 259:     ParquetColumnReader* column_reader = column_readers[column_idx];
> use conjunct_slot_ids instead of making a copy.
Done


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@270
PS3, Line 270: 
> should use const T&.
Done


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-chunk-reader.h
File be/src/exec/parquet/parquet-column-chunk-reader.h:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-chunk-reader.h@127
PS3, Line 127: should be called after we know that the first page is not a dictionary page.
             :     // Th
> nit. format
Done


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h
File be/src/exec/parquet/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h@439
PS3, Line 439:  and 
> will?
Done


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h@439
PS3, Line 439:  
> nit. to insert "regardless of the page type" to make it clear.
its going to be data page only. its just a wrapper over ReadNextPageHeader. changed the comment.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h@526
PS3, Line 526: 
> missing comment?
added.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h@527
PS3, Line 527:   /// Skip values in the page data. Returns true on success, false otherwise.
> May add a DCHECK(candidate_page_idx_ <= candidate_data_pages_.size() - 1) h
Done


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h@553
PS3, Line 553: 
> nit. Noticed that this change and all the ones following this one are due t
sure. Not sure if this happened due to clang-format.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1151
PS3, Line 1151: Status::OK();
> We should return a non Status::OK() object here.
Not really. It depends on if abort_on_error is set as Query option. LogCorruptNumValuesInMetadataError is used similarly in other places.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1160
PS3, Line 1160: UNLIKELY(num_v
> nit. UNLIKELY()
Done


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1210
PS3, Line 1210: remaining
> init to 0.
Done


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1372
PS3, Line 1372: 
> This is redundant.
right. removed it. thanks.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1479
PS3, Line 1479: AdvanceNextPageHea
> AdvanceNextPageHeader() sounds better.
For sure. Changed it. I was under the influence of 'JumpToNextPage'.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1484
PS3, Line 1484: num_buffered_values_ == 0
> I wonder if we could skip this empty page and advance to the next one.
This signifies end of RowGroup. Same logic is currently being used (check NextPage()).


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1509
PS3, Line 1509: /// Wrapper around 'SkipTopLevelRows' to skip across multiple pages.
              : /// Function handles 3 scenarios:
              : /// 1. Page Filtering: When this is enabled this function can be used
              : ///    to skip to a particular 'skip_row_id'.
              : /// 2. Collection: When this scalar reader is reading elements of a collection
              : /// 3. Rest of the cases.
              : //
> This probably should be inlined.
makes sense. done.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1517
PS3, Line 1517: /// At that point, we can just skip to the required row id.
> Please add a description for the algorithm used in the method, as well as o
Done


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1538
PS3, Line 1538: }
> Do we need to check the status here?
We don't need to check, we can just return it and client will check it. But before that we can fix the candidate_row_range even when skipping happened partially and failed.



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 5
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 06 Oct 2021 14:45:00 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Kurt Deschler (Code Review)" <ge...@cloudera.org>.
Kurt Deschler has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 3:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-columnar-scanner-ir.cc
File be/src/exec/hdfs-columnar-scanner-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-columnar-scanner-ir.cc@25
PS3, Line 25: selected_rows
Can you make this a member of RowBatch or a devived classso you don't need to pass the 2 (related) arguments separately?


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-orc-scanner.cc@629
PS3, Line 629:     bool selected_rows[scratch_batch_->capacity];
Should check end of stack here or allocate memory if capacity is anything substantial. If not, comment/DCHECK that size is small. Non-issue if this moves into RowBatch.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h@430
PS3, Line 430:   std::vector<ParquetColumnReader*> column_readers_;
Consider packaging these in a class if they are going to be passed together.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.h@928
PS3, Line 928:   vector<SlotId> GetSlotIdsForConjuncts(vector<ScalarExpr*> conjuncts);
Pass vector as reference to avoid copying.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@250
PS3, Line 250: vector<ParquetColumnReader*>& filter_readers,
             :     vector<ParquetColumnReader*>& non_filter_readers)
> In Impala BE, prefer to use vector<ParquetColumnReader*>* when the argument
Declare function const.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@259
PS3, Line 259:   slot_ids.insert(std::begin(conjunct_slot_ids), std::end(conjunct_slot_ids));
use conjunct_slot_ids instead of making a copy.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2342
PS3, Line 2342:         // continue the old range as 'last' is within 'skip_length' of last range.
How does it end up in this case?


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2357
PS3, Line 2357:     batches[range].start = start;
Can we only get here if batch_size==0?


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2388
PS3, Line 2388:           return Status(Substitute("Couldn't skip rows in file $0.", filename()));
Add comments how this can occur



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 3
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 05 Oct 2021 17:09:17 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 10:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17860/10//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17860/10//COMMIT_MSG@19
PS10, Line 19: that
nit: than


http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/10/be/src/exec/parquet/hdfs-parquet-scanner.cc@2223
PS10, Line 2223: c.
Could you please explain where do we filter out the rows in the merged micro batches?



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 10
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 22 Oct 2021 14:30:13 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 10:

> Uploaded patch set 10.

Fixes the format here.


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 10
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Oct 2021 21:35:45 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 12:

> (5 comments)
 > 
 > Left some minor comments.
 > When adding tests, I think it'd be useful to measure code coverage:
 > 
 > buildall.sh -notests -codecoverage
 > 
 > # To generate reports:
 > bin/coverage_helper.sh

I could finally get the coverage: https://drive.google.com/drive/folders/1SQgqCh44VEYYvmqJyTX284adLiVipWcm?usp=sharing. Existing tests covers all the functions introduced in this patch quite well: AssembleRows, AssembleRowsWithoutLateMaterialization, SkipTopLevelRows, ReadNextDataPageHeader etc.


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 12
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 26 Oct 2021 11:23:29 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 12:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17860/12/be/src/exec/scratch-tuple-batch-test.cc
File be/src/exec/scratch-tuple-batch-test.cc:

http://gerrit.cloudera.org:8080/#/c/17860/12/be/src/exec/scratch-tuple-batch-test.cc@69
PS12, Line 69: 2, 4, 8, 16, 32
> nit. I wonder if we can construct a test with randomly assigned true/false 
Verification of micro batches created for randomly assigned true/false would need us to group and merge true values of selected_rows using exactly same algorithm as that in GetMicroBatches (the function we are testing in the first place). In other words, Step 3 needs same algorithm to create expected micro batches as that being invoked in Step 2 - which is not the right thing to do while testing. That is the reason, while verifying the output of 'GetMicroBatches' input/output is predetermined and output of GetMicroBatches is checked against the predetermined output.



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 12
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 26 Oct 2021 14:28:42 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 17:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7575/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 17
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Oct 2021 20:35:33 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 12:

> Wrote the following code for randomly generating selected rows.
 > Have not get a chance to test it out  yet (some runtime issue with
 > by dev box).
 > 
 > // This tests checks conversion of 'selected_rows' with randomly
 > generated
 > // 'true' values to 'ScratchMicroBatch';
 > TEST_F(ScratchTupleBatchTest, TestRandomGeneratedMicroBatches) {
 > const int BATCH_SIZE = 1024;
 > scoped_ptr<ScratchTupleBatch> scratch_batch(
 > new ScratchTupleBatch(*desc_, BATCH_SIZE, &tracker_));
 > scratch_batch->num_tuples = BATCH_SIZE;
 > // gaps to try
 > vector<int> gaps = {5, 16, 29, 37, 1025};
 > for (auto n : gaps) {
 > // Set randomly locations as selected.
 > srand (time(NULL));
 > for (int batch_idx = 0; batch_idx < BATCH_SIZE; ++batch_idx) {
 > scratch_batch->selected_rows[batch_idx] = rand() < (RAND_MAX / 2);
 > }
 > ScratchMicroBatch micro_batches[BATCH_SIZE];
 > int batches = scratch_batch->GetMicroBatches(n, micro_batches);
 > EXPECT_TRUE(batches > 1);
 > EXPECT_TRUE(batches <= BATCH_SIZE);
 > // Verify every batch
 > for (int i = 0; i < batches; i++) {
 > const ScratchMicroBatch& batch = micro_batches[i];
 > EXPECT_TRUE(batch.start <= batch.end);
 > EXPECT_TRUE(batch.length == batch.end - batch.start + 1);
 > EXPECT_TRUE(batch.start);
 > EXPECT_TRUE(batch.end);
 > int last_true_idx = batch.start;
 > for (int j = batch.start + 1; j < batch.end; j++) {
 > if (scratch_batch->selected_rows[j]) {
 > EXPECT_TRUE(j - last_true_idx + 1 <= n);
 > last_true_idx = j;
 > }
 > }
 > }
 > // Verify any two consecutive batches i and i+1
 > for (int i = 0; i < batches - 1; i++) {
 > const ScratchMicroBatch& batch = micro_batches[i];
 > const ScratchMicroBatch& nbatch = micro_batches[i + 1];
 > EXPECT_TRUE(batch.end < nbatch.start);
 > EXPECT_TRUE(nbatch.start - batch.end >= n);
 > // Any row in betweeen the two batches should not be selected
 > for (int j=batch.end+1; j<nbatch.start; j++) {
 > EXPECT_FALSE(scratch_batch->selected_rows[j]);
 > }
 > }
 > }
 > }

hey Qifan, Thanks a lot for this snippet. I almost wrote the code - will merge your snippet to it. Huge thanks for both  - detailed description of the verfication algo earlier and also for this snippet.


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 12
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Oct 2021 14:47:58 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@253
PS3, Line 253:   conjuncts.reserve(scan_node_->conjuncts().size() +
             :     scan_node_->filter_exprs().size());
             :   conjuncts.insert(std::end(conjuncts), std::begin(scan_node_->conjuncts()),
> Oh thanks for this info - didn't know. Will take a look at this and revert 
I tested this for Min/Max filters at row level and this code handles it as filter_exprs() will contain ScalarExprs on the required slotId. We can use that slotId to figure out the column that needs to be read for filtering.



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 7
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 12 Oct 2021 23:52:05 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 8:

(23 comments)

Looks great. I was able to look at 11 files and plan to finish the rest 9 files tomorrow. 

Most comments in this set are minors.

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.h
File be/src/exec/hdfs-columnar-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.h@59
PS8, Line 59: rows
nit. use of 'tuples' instead of 'row' here should be better as tuples are mentioned earlier in the comment.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.cc
File be/src/exec/hdfs-columnar-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.cc@66
PS8, Line 66: must not be called when 
nit. Is it possible to DCHECK() it?


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.h@562
PS8, Line 562: has
nit. have.

May also include a sentence at the end to illustrate:

For example, if the threshold is 10, then ... ...


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.h@935
PS8, Line 935:   /// are the readers reading columns involved in either static filter or runtime filter.
nit. Will be useful to describe non_filter_readers: All 'non_filter_readers' are responsible for reading surviving rows from those columns that are not involved in filtering.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@117
PS8, Line 117:  complete_micro_batch_.start = 0;
             :   complete_micro_batch_.end = scratch_batch_->capacity - 1;
             :   complete_micro_batch_.length = scratch_batch_->capacity;
This probably should be moved to the cst.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@262
PS8, Line 262: for (int column_idx = 0; column_idx < column_readers.size(); column_idx++) {
             :     ParquetColumnReader* column_reader = column_readers[column_idx];
             :     auto slot_desc = column_reader->slot_desc();
this loop can be written as 
for (auto column_reader : column_readers) {
...


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@268
PS8, Line 268: filter_readers
Should we do the following prior to the for loop or DCHECK() they are empty?

filter_readers->erase()
non_filter_reders->erase()


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2215
PS8, Line 2215: Instead
nit. Suggest to remove as it can cause confusion on the action taken in this step.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2218
PS8, Line 2218: in earlier iteratio
remove?


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2236
PS8, Line 2236: disabled
nit: "not needed" is better.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2253
PS8, Line 2253: Status fill_status = FillScratchMicroBatches(filter_readers_, row_batch,
              :         skip_row_group, &complete_micro_batch_, 1, scratch_batch_->capacity,
              :         scratch_batch_->num_tuples);
              :     if (!fill_status.ok() || *skip_row_group) {
              :       return fill_status;
              :     }
nit Suggest to use the following pattern to enhance code readability. 

RETURN_IF_ERROR(foo( skip_row_group));
if (*skip_row_group) return Status::OK();


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2282
PS8, Line 2282: num_tuples
Suggest to use int* = nullptr instead of int& for the function, so that we can drop int num_tuples at line 2274.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2291
PS8, Line 2291: if (!fill_status.ok() || *skip_row_group) {
              :         return fill_status;
nit. same comment on code readability.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2334
PS8, Line 2334:  for (int c = 0; c < column_readers.size(); ++c) {
              :       ParquetColumnReader* col_reader = column_readers[c];
nit. Use for (auto col_reader : column_readers) {


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2336
PS8, Line 2336: due
nit or due to


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2336
PS8, Line 2336: in
nit to


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2358
PS8, Line 2358:   for (int c = 0; c < column_readers.size(); ++c) {
              :     ParquetColumnReader* col_reader = column_readers[c]
nit. Same comment above.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2363
PS8, Line 2363: if (r == 0)
UNLIKELY


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2364
PS8, Line 2364:  if (micro_batches[0].start > 0) {
I wonder why we need this test.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.h
File be/src/exec/parquet/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.h@168
PS8, Line 168:   /// row after 'skip_row_id'.
nit. add "and ignores 'num_rows'".


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc@1151
PS8, Line 1151: Status::OK();
Should we return error instead OK here?


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc@1156
PS8, Line 1156: // Read next data page's header only. 'false' passed in call below signifies that.
nit. Can put the comment next to the bool argument for better readability.

 RETURN_IF_ERROR(col_chunk_reader_.ReadNextDataPage(&eos, &data_, &data_size, false /* read next page header only */));


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc@1189
PS8, Line 1189: // Data can be empty if the column contains all NULLs
nit. I wonder if this comment is correct. For example, if the filter is NOT NULL, then this page should be looked at and survive.



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 8
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Oct 2021 17:26:51 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/scratch-tuple-batch.h
File be/src/exec/scratch-tuple-batch.h:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/scratch-tuple-batch.h@74
PS8, Line 74:   boost::scoped_array<bool> selected_rows;
> I wouldn't spend too much time on row-level micro optimizations at this poi
I see, thanks for testing this out!



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 8
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Oct 2021 15:10:24 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has uploaded a new patch set (#10). ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................

IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Currently, entire row is materialized before filtering during scan.
Instead of paying the cost of materializing upfront, for columnar
formats we can avoid doing it for rows that are filtered out.
Columns that are required for filtering are the only ones that are
needed to be materialized before filtering. For rest of the columns,
materialization can be delayed and be done only for rows that survive.
This patch implements this technique for Parquet format only.

New configuration 'parquet_materialization_threshold' is introduced,
which is minimum number of consecutive rows that are filtered out
to avoid materialization. If set to less that 0, it disables the
late materialization.

Performance:
Peformance measured for single daemon, single threaded impalad
upon TPCH scale 42 lineitem table with 252 million rows,
unsorted data. Upto 2.5x improvement for non-page indexed and
upto 4x improvement in page index seen. Queries for page index
borrowed from blog:
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
More details:
https://docs.google.com/spreadsheets/d/17s5OLaFOPo-64kimAPP6n3kJA42vM-iVT24OvsQgfuA/edit?usp=sharing

Testing:
 1. Ran existing tests
 2. Added UT for 'ScratchTupleBatch::GetMicroBatch'

Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-columnar-scanner-ir.cc
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-chunk-reader.cc
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
A be/src/exec/scratch-tuple-batch-test.cc
M be/src/exec/scratch-tuple-batch.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/tuple-row-compare.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
20 files changed, 898 insertions(+), 51 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17860/10
-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 10
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has uploaded a new patch set (#15). ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.
......................................................................

IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Currently, entire row is materialized before filtering during scan.
Instead of paying the cost of materializing upfront, for columnar
formats we can avoid doing it for rows that are filtered out.
Columns that are required for filtering are the only ones that are
needed to be materialized before filtering. For rest of the columns,
materialization can be delayed and be done only for rows that survive.
This patch implements this technique for Parquet format only.

New configuration 'parquet_materialization_threshold' is introduced,
which is minimum number of consecutive rows that are filtered out
to avoid materialization. If set to less than 0, it disables the
late materialization.

Performance:
Peformance measured for single daemon, single threaded impalad
upon TPCH scale 42 lineitem table with 252 million rows,
unsorted data. Upto 2.5x improvement for non-page indexed and
upto 4x improvement in page index seen. Queries for page index
borrowed from blog:
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
More details:
https://docs.google.com/spreadsheets/d/17s5OLaFOPo-64kimAPP6n3kJA42vM-iVT24OvsQgfuA/edit?usp=sharing

Testing:
 1. Ran existing tests
 2. Added UT for 'ScratchTupleBatch::GetMicroBatch'
 3. Added end-to-end test for late materialization.
Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-columnar-scanner-ir.cc
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-chunk-reader.cc
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
A be/src/exec/scratch-tuple-batch-test.cc
M be/src/exec/scratch-tuple-batch.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/tuple-row-compare.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
A testdata/workloads/functional-query/queries/QueryTest/parquet-late-materialization.test
A tests/query_test/test_parquet_late_materialization.py
21 files changed, 1,045 insertions(+), 51 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17860/15
-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 15
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 16:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17860/12/be/src/exec/scratch-tuple-batch-test.cc
File be/src/exec/scratch-tuple-batch-test.cc:

http://gerrit.cloudera.org:8080/#/c/17860/12/be/src/exec/scratch-tuple-batch-test.cc@69
PS12, Line 69: 
> yeah, that is a good point. 
Have added this code now by integrating snippet you provided.


http://gerrit.cloudera.org:8080/#/c/17860/12/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
File testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test:

http://gerrit.cloudera.org:8080/#/c/17860/12/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test@436
PS12, Line 436: 
> Good point on performance. 
# pages skipped is a good counter to add even if it doesn't give complete picture of saving provided. I have added it. Regarding the testing for different data types, code added is independent of any scalar data type so testing that combination is not needed. I have added end-to-end tests for other combinations (test_parquet_late_materialization.py).



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 16
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Oct 2021 19:52:13 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 9:

(24 comments)

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.h
File be/src/exec/hdfs-columnar-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.h@59
PS8, Line 59: rows
> nit. use of 'tuples' instead of 'row' here should be better as tuples are m
Done.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.cc
File be/src/exec/hdfs-columnar-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.cc@66
PS8, Line 66: 
> nit. Is it possible to DCHECK() it?
It is being checked in FilterScratchBatch. Moved comment to the same function too.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/hdfs-columnar-scanner.cc@110
PS8, Line 110: int HdfsColumnarScanner::ProcessScratchBatchCodegenOrInterpret(RowBatch* dst_batch) {
> nit: unintended new line?
Right. This function was changed in earlier patch with an extra argument. Hence formatting went for toss. Changed it now.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.h@562
PS8, Line 562: es 
> nit. have.
Done


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.h@935
PS8, Line 935:   /// Tuple memory to write to is specified by 'scratch_batch->tuple_mem'.
> nit. Will be useful to describe non_filter_readers: All 'non_filter_readers
Done


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@117
PS8, Line 117:  complete_micro_batch_.start = 0;
             :   complete_micro_batch_.end = scratch_batch_->capacity - 1;
             :   complete_micro_batch_.length = scratch_batch_->capacity;
> This probably should be moved to the cst.
what is cst ?


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@262
PS8, Line 262:   if (DoesNotSupportLateMaterialization(column_reader) || (slot_desc != nullptr
             :         && std::find(conjunct_slot_ids_.begin(), conjunct_slot_ids_.end(), slot_desc->id())
             :             != conjunct_slot_ids_.end())) {
> this loop can be written as 
Done


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@268
PS8, Line 268: 
> Should we do the following prior to the for loop or DCHECK() they are empty
Done


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2215
PS8, Line 2215: These c
> nit. Suggest to remove as it can cause confusion on the action taken in thi
Done


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2218
PS8, Line 2218:  materializing tupl
> remove?
Done


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2236
PS8, Line 2236: ptr);
> nit: "not needed" is better.
it is both actually. have updated comment.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2253
PS8, Line 2253: // 1. Filter rows only materializing the columns in 'filter_readers_'
              :     // 2. Transfer the surviving rows
              :     // 3. Materialize rest of the columns only for surviving rows.
              : 
              :     RETURN_IF_ERROR(FillScratchMicroBatches(filter_readers_, row_batch,
              :      
> nit Suggest to use the following pattern to enhance code readability. 
Done


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2282
PS8, Line 2282: omplete_mi
> Suggest to use int* = nullptr instead of int& for the function, so that we 
It cannot be nullptr and will be modified by existing APIs like ReadValueBatch. but it has been changed to pointer though.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2291
PS8, Line 2291: if (*skip_row_group) { return Status::OK(); }
              :     }
> nit. same comment on code readability.
Done


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2336
PS8, Line 2336: >S
> nit to
Done


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2358
PS8, Line 2358:     bool continue_execution = false;
              :     int last = -1;
> nit. Same comment above.
We need the index here unlike the above place.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2363
PS8, Line 2363:     if (UNL
> UNLIKELY
In many cases it would be micro batch of size 1.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@2364
PS8, Line 2364:      return Status(Substitute("Cou
> I wonder why we need this test.
First batch might start from 100th row, in which case we need to skip first 100 rows.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.h
File be/src/exec/parquet/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.h@168
PS8, Line 168:   /// row after 'skip_row_id' and ignores 'num_rows'.
> nit. add "and ignores 'num_rows'".
Done


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.h@509
PS8, Line 509: re not st
> Please add a comment about 'remaining'
Done.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc@1151
PS8, Line 1151: Status::OK();
> Should we return error instead OK here?
Nope. As commented earlier if error needs to be returned LogCorruptNumValuesInMetadataError will do it.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc@1156
PS8, Line 1156: RETURN_IF_ERROR(col_chunk_reader_.ReadNextDataPage(&eos, &data_, &data_size,
> nit. Can put the comment next to the bool argument for better readability.
Done


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/parquet/parquet-column-readers.cc@1282
PS8, Line 1282: *
> Output parameters should be pointers.
Done. changed it at other places too.


http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/scratch-tuple-batch.h
File be/src/exec/scratch-tuple-batch.h:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/scratch-tuple-batch.h@177
PS8, Line 177: int skip_length, ScratchMi
> nit: usually we put output parameters at the end of the param list
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 9
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Oct 2021 19:18:16 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 11:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17860/10//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17860/10//COMMIT_MSG@19
PS10, Line 19: than
> nit: than
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 11
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 25 Oct 2021 10:46:27 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Kurt Deschler (Code Review)" <ge...@cloudera.org>.
Kurt Deschler has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/scratch-tuple-batch.h
File be/src/exec/scratch-tuple-batch.h:

http://gerrit.cloudera.org:8080/#/c/17860/8/be/src/exec/scratch-tuple-batch.h@74
PS8, Line 74:   boost::scoped_array<bool> selected_rows;
> So I had thought about using std::bitset, vector<bool> and 
I wouldn't spend too much time on row-level micro optimizations at this point unless we have a CPU profile that is showing a hot spot.



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 8
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Oct 2021 14:19:00 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has uploaded a new patch set (#7). ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................

IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Currently, entire row is materialized before filtering during scan.
Instead of paying the cost of materializing upfront, for columnar
formats we can avoid doing it for rows that are filtered out.
Columns that are required for filtering are the only ones that are
needed to be materialized before filtering. For rest of the columns,
materialization can be delayed and be done only for rows that survive.
This patch implements this technique for Parquet format only.

New configuration 'parquet_materialization_threshold' is introduced,
which is minimum number of consecutive rows that are filtered out
to avoid materialization. If set to less that 0, it disables the
late materialization.

Performance:
Peformance measured for single daemon, single threaded impalad
upon TPCH scale 42 lineitem table with 252 million rows,
unsorted data. Upto 2.5x improvement for non-page indexed and
upto 4x improvement in page index seen. Queries for page index
borrowed from blog:
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
More details:
https://docs.google.com/spreadsheets/d/17s5OLaFOPo-64kimAPP6n3kJA42vM-iVT24OvsQgfuA/edit?usp=sharing

Testing: TBD

Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-columnar-scanner-ir.cc
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-chunk-reader.cc
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
A be/src/exec/scratch-tuple-batch-test.cc
M be/src/exec/scratch-tuple-batch.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/tuple-row-compare.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
19 files changed, 884 insertions(+), 71 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17860/7
-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 7
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 6:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/17860/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17860/4//COMMIT_MSG@17
PS4, Line 17: on 'parquet_materialization_threshold' is introduced,
            : which is minimum number of consecutive
> Great results! Are these whole query speedups or only scan-time speedups?
For high selectivity queries, I measured complete query time speedup. For Low selectivity queries I measured just the scan time as queries for them were not simple select on filter queries.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-columnar-scanner.h
File be/src/exec/hdfs-columnar-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-columnar-scanner.h@60
PS3, Line 60: 
> I am thinking to push it to ScratchTupleBatch. If not will document it in n
Pushed it to ScratchTupleBatch.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2377
PS3, Line 2377:    col_reader->Rea
> right, missed this one. will fix it in next revision. will keep this open f
Done


http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.cc@252
PS4, Line 252: conjuncts
> We could reserve() capacity for this vector.
Done


http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.cc@258
PS4, Line 258:       std::end(scan_node_->filter_exprs()));
             :   auto conjunct_slot_ids = GetSlotIdsForConjuncts(conjuncts);
> nit: unordered_set has a constructor that takes a two iterators.
not using it anymore. removed it.


http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.cc@271
PS4, Line 271: ctor<SlotId> HdfsParquet
> We should reserve() capacity for the vector.
Done


http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.cc@2131
PS4, Line 2131: er::Read*ValueBatch(
> Maybe AssembleRowsWithoutLateMaterialization()?
Done


http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.cc@2193
PS4, Line 2193:   row_group_rows_read_ += num_rows_rea
> This comments needs to be extended with the explanation of late materializa
Done


http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.cc@2234
PS4, Line 2234: 
> nit: magic number
commented it.


http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.cc@2240
PS4, Line 2240: !column
> nit: row_group_end?
Done


http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.cc@2255
PS4, Line 2255:     num_rows_read += scratch_batch_->num_tuples;
> We already have a 'fill_status' at L2233. Maybe we can just reuse that one.
Done


http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.cc@2329
PS4, Line 2329:  0) {
> Could be static, then maybe we could add backend tests for this.
Have moved it to ScratchTupleBatch as 'GetMicroBatches' and added tests for it.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.cc@1538
PS3, Line 1538: }
> Hmm. It seems a false returning status means likely the data in the page is
ok. bailing out if it returns false.



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 6
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 08 Oct 2021 16:26:03 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 10:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9639/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 10
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Oct 2021 21:57:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 9:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9638/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 9
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Oct 2021 19:37:15 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 12:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9648/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 12
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 25 Oct 2021 11:20:12 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 17: Code-Review+2

(1 comment)

Looks great and thanks a lot for adding the tests. 

+2 to carry Zoltan's +1.

http://gerrit.cloudera.org:8080/#/c/17860/12/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
File testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test:

http://gerrit.cloudera.org:8080/#/c/17860/12/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test@436
PS12, Line 436: 
> # pages skipped is a good counter to add even if it doesn't give complete p
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 17
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Oct 2021 20:33:33 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 18: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 18
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 28 Oct 2021 20:09:40 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17860/6/be/src/exec/scratch-tuple-batch-test.cc
File be/src/exec/scratch-tuple-batch-test.cc:

http://gerrit.cloudera.org:8080/#/c/17860/6/be/src/exec/scratch-tuple-batch-test.cc@64
PS6, Line 64:   scoped_ptr<ScratchTupleBatch> scratch_batch(new ScratchTupleBatch(*desc_, BATCH_SIZE, &tracker_));
line too long (100 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 6
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 08 Oct 2021 16:26:23 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 5:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-columnar-scanner.h
File be/src/exec/hdfs-columnar-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-columnar-scanner.h@60
PS3, Line 60: bool* selected_rows
> Need to document the use of this new argument in the comment above.
I am thinking to push it to ScratchTupleBatch. If not will document it in next revision. I will keep this open for now.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@253
PS3, Line 253:   conjuncts.insert(std::end(conjuncts), std::begin(scan_node_->conjuncts()),
             :       std::end(scan_node_->conjuncts()));
             :   conjuncts.insert(std::end(conjuncts), std::begin(scan_node_->filter_exprs()),
> Can you please check if the code is sufficient to deal with min/max filters
Oh thanks for this info - didn't know. Will take a look at this and revert back.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2330
PS3, Line 2330: 
> By reading the code, my guess is that each batch covers a number of, up to 
>> By reading the code, my guess is that each batch covers a >> number of, up to skip length, selected rows. 
Actually skip_length is not the max length of batch. It is just the minimum gap between two ranges. If two clusters of True values have a gap less than skip_length we merge them into 1 cluster.

>> Is it possible to work with selected_rows directly?

Do you mean to say directly use selected rows instead of micro batches in FillScratchBatch ?


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2342
PS3, Line 2342:         batches[range].start = start;
> How does it end up in this case?
I just realised you had to re-comment as versions moved ahead. Sorry for the trouble, was not intentional - I generally go back to all the comments in older versions too. I will paste response from comment in older version for completeness.
"It ends up here even for simple case like consecutive true values. But for more complicated cases: assume skip_length as 5 and batch of 10: TTTFFFTTTT - it will enter here for all true values except the first one."


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2357
PS3, Line 2357:   /// TTTTTTTTTT or even FFFFFTTTTT. In both cases we would need below.
> Can we only get here if batch_size==0?
commenting from earlier:
We will get here for pretty much everything except when batch_size is 0 or when all values in batch are false. I am assuming you meant to ask when will we hit the false branch here. However, I am removing condition (start = -1) and adding DCHECK to ensure batch sizes are not 0 and values are not false.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2377
PS3, Line 2377: continue_execution
> should init this boolean, as the code below conditionally set it.
right, missed this one. will fix it in next revision. will keep this open for now.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2381
PS3, Line 2381: if (micro_batches[0].start > 0) {
              :           if (UNLIKELY(!col_reader->SkipRows(micro_batches[0].start, -1))) {
              :             return Status(Substitute("Couldn't skip rows in file $0.", filename()));
              :           }
              :         }
> Do we need an else branch to deal with micro_batches[0].start==0? If it is 
It is possible to have micro_batches[0].start==0, but in that case we don't need any Skip call. Skip call is needed only when micro_batches[0].start > 0 i.e., everything from 0 to micro_batches[0].start needs to be skipped. Hence else is not required, neither is DCHECK reqd.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2388
PS3, Line 2388:           return Status(Substitute("Couldn't skip rows in file $0.", filename()));
> Add comments how this can occur
have added comment to the SkipRow declaration for 'false' return value.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2417
PS3, Line 2417: return Status::OK();
> I wonder if a non Status::OK() object should return here.
This behaviour is retained from earlier code. You will find the code in AssembleRows in old code.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h
File be/src/exec/parquet/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h@427
PS3, Line 427: ReadNextDataPageHe
> Name it as ReadNextDataPageHeader()?
Done


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/parquet-column-readers.h@513
PS3, Line 513: /// is more than the rows left in current
> Missing comment?
Added it now.



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 5
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 06 Oct 2021 15:37:57 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 2:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/17860/2/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/2/be/src/exec/parquet/hdfs-parquet-scanner.cc@985
PS2, Line 985:     
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/17860/2/be/src/exec/parquet/hdfs-parquet-scanner.cc@2149
PS2, Line 2149:     // Check if filter_readers exist or late materilization is disabled 
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/17860/2/be/src/exec/parquet/hdfs-parquet-scanner.cc@2150
PS2, Line 2150:     if (filter_readers.empty() || non_filter_readers.empty() 
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/17860/2/be/src/exec/parquet/hdfs-parquet-scanner.cc@2210
PS2, Line 2210:         int num_micro_batches = 
line has trailing whitespace



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 2
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 28 Sep 2021 21:40:24 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 19:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17860/12//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17860/12//COMMIT_MSG@24
PS12, Line 24: TPCH scale 42
> I think it would be good to execute the whole benchmark with bin/single_nod
Hi Zoltan,
Sorry for the delay with benchmark. I ran the entire tpch bechmark at scale 42. This was the summary of report (Delta is the change).

Report Generated on 2021-10-28
Run Description: "78ce235db6d5b720f3e3319ff571a2da054a2602 vs c46d765dccd5739c848d8c1c82043e72394b8397"

Cluster Name: UNKNOWN
Lab Run Info: UNKNOWN
Impala Version:          impalad version 4.1.0-SNAPSHOT RELEASE (2021-10-28)
Baseline Impala Version: impalad version 4.1.0-SNAPSHOT RELEASE (2021-10-27)

+----------+-----------------------+---------+------------+------------+----------------+
| Workload | File Format           | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+-----------------------+---------+------------+------------+----------------+
| TPCH(42) | parquet / none / none | 12.83   | -1.54%     | 8.26       | -1.48%         |
+----------+-----------------------+---------+------------+------------+----------------+

Very slight improvement overall and major improvements in these 2 queries:

(I) Improvement: TPCH(42) TPCH-Q6 [parquet / none / none] (1.85s -> 1.72s [-7.30%])
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+-------+-----------+
| Operator     | % of Query | Avg   | Base Avg | Delta(Avg) | StdDev(%) | Max   | Base Max | Delta(Max) | #Hosts | #Inst | #Rows | Est #Rows |
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+-------+-----------+
| 00:SCAN HDFS | 94.83%     | 1.50s | 1.62s    | -7.75%     |   2.07%   | 1.56s | 1.73s    | -9.58%     | 1      | 1     | 4.79M | 29.96M    |
+--------------+------------+-------+----------+------------+-----------+-------+----------+------------+--------+-------+-------+-----------+

(I) Improvement: TPCH(42) TPCH-Q19 [parquet / none / none] (4.73s -> 4.18s [-11.72%])
+--------------+------------+----------+----------+------------+-----------+----------+----------+------------+--------+-------+--------+-----------+
| Operator     | % of Query | Avg      | Base Avg | Delta(Avg) | StdDev(%) | Max      | Base Max | Delta(Max) | #Hosts | #Inst | #Rows  | Est #Rows |
+--------------+------------+----------+----------+------------+-----------+----------+----------+------------+--------+-------+--------+-----------+
| 01:SCAN HDFS | 22.68%     | 729.91ms | 736.69ms | -0.92%     |   1.61%   | 751.55ms | 747.34ms | +0.56%     | 1      | 1     | 20.33K | 1.50M     |
| 00:SCAN HDFS | 74.84%     | 2.41s    | 2.97s    | -18.98%    |   0.67%   | 2.44s    | 3.00s    | -18.70%    | 1      | 1     | 13.07K | 29.96M    |
+--------------+------------+----------+----------+------------+-----------+----------+----------+------------+--------+-------+--------+-----------+

There was no regression reported as such just these 2 improvements and couple of queries with high variability in runtime (not related to our change).



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 19
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 01 Nov 2021 17:51:22 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 8:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9621/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 8
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 18 Oct 2021 23:35:28 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 5:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/9566/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 5
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 06 Oct 2021 15:05:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has uploaded a new patch set (#11). ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................

IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Currently, entire row is materialized before filtering during scan.
Instead of paying the cost of materializing upfront, for columnar
formats we can avoid doing it for rows that are filtered out.
Columns that are required for filtering are the only ones that are
needed to be materialized before filtering. For rest of the columns,
materialization can be delayed and be done only for rows that survive.
This patch implements this technique for Parquet format only.

New configuration 'parquet_materialization_threshold' is introduced,
which is minimum number of consecutive rows that are filtered out
to avoid materialization. If set to less than 0, it disables the
late materialization.

Performance:
Peformance measured for single daemon, single threaded impalad
upon TPCH scale 42 lineitem table with 252 million rows,
unsorted data. Upto 2.5x improvement for non-page indexed and
upto 4x improvement in page index seen. Queries for page index
borrowed from blog:
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
More details:
https://docs.google.com/spreadsheets/d/17s5OLaFOPo-64kimAPP6n3kJA42vM-iVT24OvsQgfuA/edit?usp=sharing

Testing:
 1. Ran existing tests
 2. Added UT for 'ScratchTupleBatch::GetMicroBatch'

Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-columnar-scanner-ir.cc
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-chunk-reader.cc
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
A be/src/exec/scratch-tuple-batch-test.cc
M be/src/exec/scratch-tuple-batch.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/tuple-row-compare.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
20 files changed, 933 insertions(+), 51 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17860/11
-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 11
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 12:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17860/12/be/src/exec/scratch-tuple-batch-test.cc
File be/src/exec/scratch-tuple-batch-test.cc:

http://gerrit.cloudera.org:8080/#/c/17860/12/be/src/exec/scratch-tuple-batch-test.cc@69
PS12, Line 69: 2, 4, 8, 16, 32
> I see. Let us assume the following: 
Ah, got it! It may not be sufficient though. For instance,

0 1 2 3 4 5 6 7 8 9 0 1 2 3
T T F T F F F T T T T T F F - > we will verify these 2 batches [1,3] and [10, 11] with gap of 5 as correct result even if they are not. Probably some extra conditions might be needed.


http://gerrit.cloudera.org:8080/#/c/17860/12/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
File testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test:

http://gerrit.cloudera.org:8080/#/c/17860/12/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test@436
PS12, Line 436: row_regex:.* RF00.\[min_max\] -. .\.wr_item_sk.*
> In addition, I wonder if we can grab a few counters on late materialized ro
I had commented on the issue with counters earlier (pasting it below). Let me know your thoughts:

--- PASTED ---
Thanks Qifan for the review and the suggestion of counter is good and something I pondered about earlier. Issue is that we don't skip decoding rows, instead we skip decoding values where one row may constitute hundreds of values out of which some will be read and others might be skipped. But we cannot accurately keep track number of values being skipped in current scheme of things without incurring significant performance penalty. For instance, we sometimes skip pages without decompressing it - if skipped page has page index with candidate rows we will need to decompress the page to get the accurate values skipped due to late materialisation. In that scenario where we directly skip pages, even if page is not compressed, figuring out number of values for corresponding candidate range can be time consuming. Hence, using timed counters would be more appropriate here, which are already present.



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 12
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 26 Oct 2021 18:02:14 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 15:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/9682/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 15
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Oct 2021 20:01:16 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has uploaded a new patch set (#17). ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.
......................................................................

IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Currently, entire row is materialized before filtering during scan.
Instead of paying the cost of materializing upfront, for columnar
formats we can avoid doing it for rows that are filtered out.
Columns that are required for filtering are the only ones that are
needed to be materialized before filtering. For rest of the columns,
materialization can be delayed and be done only for rows that survive.
This patch implements this technique for Parquet format only.

New configuration 'parquet_materialization_threshold' is introduced,
which is minimum number of consecutive rows that are filtered out
to avoid materialization. If set to less than 0, it disables the
late materialization.

Performance:
Peformance measured for single daemon, single threaded impalad
upon TPCH scale 42 lineitem table with 252 million rows,
unsorted data. Upto 2.5x improvement for non-page indexed and
upto 4x improvement in page index seen. Queries for page index
borrowed from blog:
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
More details:
https://docs.google.com/spreadsheets/d/17s5OLaFOPo-64kimAPP6n3kJA42vM-iVT24OvsQgfuA/edit?usp=sharing

Testing:
 1. Ran existing tests
 2. Added UT for 'ScratchTupleBatch::GetMicroBatch'
 3. Added end-to-end test for late materialization.
Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-columnar-scanner-ir.cc
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-chunk-reader.cc
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
A be/src/exec/scratch-tuple-batch-test.cc
M be/src/exec/scratch-tuple-batch.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/tuple-row-compare.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
A testdata/workloads/functional-query/queries/QueryTest/parquet-late-materialization.test
A tests/query_test/test_parquet_late_materialization.py
21 files changed, 1,051 insertions(+), 51 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17860/17
-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 17
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 16:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17860/16/tests/query_test/test_parquet_late_materialization.py
File tests/query_test/test_parquet_late_materialization.py:

http://gerrit.cloudera.org:8080/#/c/17860/16/tests/query_test/test_parquet_late_materialization.py@3
PS16, Line 3: class TestParquetLateMaterialization(ImpalaTestSuite):
flake8: E302 expected 2 blank lines, found 1


http://gerrit.cloudera.org:8080/#/c/17860/16/tests/query_test/test_parquet_late_materialization.py@18
PS16, Line 18: 
flake8: W292 no newline at end of file



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 16
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 27 Oct 2021 19:49:48 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has uploaded a new patch set (#13). ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................

IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Currently, entire row is materialized before filtering during scan.
Instead of paying the cost of materializing upfront, for columnar
formats we can avoid doing it for rows that are filtered out.
Columns that are required for filtering are the only ones that are
needed to be materialized before filtering. For rest of the columns,
materialization can be delayed and be done only for rows that survive.
This patch implements this technique for Parquet format only.

New configuration 'parquet_materialization_threshold' is introduced,
which is minimum number of consecutive rows that are filtered out
to avoid materialization. If set to less than 0, it disables the
late materialization.

Performance:
Peformance measured for single daemon, single threaded impalad
upon TPCH scale 42 lineitem table with 252 million rows,
unsorted data. Upto 2.5x improvement for non-page indexed and
upto 4x improvement in page index seen. Queries for page index
borrowed from blog:
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
More details:
https://docs.google.com/spreadsheets/d/17s5OLaFOPo-64kimAPP6n3kJA42vM-iVT24OvsQgfuA/edit?usp=sharing

Testing:
 1. Ran existing tests
 2. Added UT for 'ScratchTupleBatch::GetMicroBatch'

Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-columnar-scanner-ir.cc
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-chunk-reader.cc
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
A be/src/exec/scratch-tuple-batch-test.cc
M be/src/exec/scratch-tuple-batch.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/tuple-row-compare.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
19 files changed, 986 insertions(+), 51 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17860/13
-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 13
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.
......................................................................

IMPALA-9873: Avoid materialization of columns for filtered out rows in Parquet table.

Currently, entire row is materialized before filtering during scan.
Instead of paying the cost of materializing upfront, for columnar
formats we can avoid doing it for rows that are filtered out.
Columns that are required for filtering are the only ones that are
needed to be materialized before filtering. For rest of the columns,
materialization can be delayed and be done only for rows that survive.
This patch implements this technique for Parquet format only.

New configuration 'parquet_materialization_threshold' is introduced,
which is minimum number of consecutive rows that are filtered out
to avoid materialization. If set to less than 0, it disables the
late materialization.

Performance:
Peformance measured for single daemon, single threaded impalad
upon TPCH scale 42 lineitem table with 252 million rows,
unsorted data. Upto 2.5x improvement for non-page indexed and
upto 4x improvement in page index seen. Queries for page index
borrowed from blog:
https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
More details:
https://docs.google.com/spreadsheets/d/17s5OLaFOPo-64kimAPP6n3kJA42vM-iVT24OvsQgfuA/edit?usp=sharing

Testing:
 1. Ran existing tests
 2. Added UT for 'ScratchTupleBatch::GetMicroBatch'
 3. Added end-to-end test for late materialization.
Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Reviewed-on: http://gerrit.cloudera.org:8080/17860
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Qifan Chen <qc...@cloudera.com>
---
M be/src/exec/CMakeLists.txt
M be/src/exec/hdfs-columnar-scanner-ir.cc
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-chunk-reader.cc
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
A be/src/exec/scratch-tuple-batch-test.cc
M be/src/exec/scratch-tuple-batch.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M be/src/util/tuple-row-compare.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
M testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-local-tz-conversion.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-late-materialization.test
A tests/query_test/test_parquet_late_materialization.py
22 files changed, 1,070 insertions(+), 52 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Qifan Chen: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 19
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 12:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17860/12/be/src/exec/scratch-tuple-batch-test.cc
File be/src/exec/scratch-tuple-batch-test.cc:

http://gerrit.cloudera.org:8080/#/c/17860/12/be/src/exec/scratch-tuple-batch-test.cc@69
PS12, Line 69: 2, 4, 8, 16, 32
> Ah, got it! It may not be sufficient though. For instance,
yeah, that is a good point. 

So the extra condition is that anything outside micro batches should be F.


http://gerrit.cloudera.org:8080/#/c/17860/12/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test
File testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test:

http://gerrit.cloudera.org:8080/#/c/17860/12/testdata/workloads/functional-query/queries/QueryTest/min_max_filters.test@436
PS12, Line 436: row_regex:.* RF00.\[min_max\] -. .\.wr_item_sk.*
> I had commented on the issue with counters earlier (pasting it below). Let 
Good point on performance. 

In the example mentioned, I wonder if we keep a counter on # of pages skipped. That would be a good indicator too. 

Lastly, one can always measure the scan performance with the feature on and off and compare the result. This is one way to verify the feature works.



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 12
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 26 Oct 2021 18:40:26 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@994
PS3, Line 994:     
line has trailing whitespace


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2266
PS3, Line 2266:         int num_micro_batches = 
line has trailing whitespace



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 3
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 01 Oct 2021 16:03:56 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 4:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/17860/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17860/4//COMMIT_MSG@17
PS4, Line 17: Upto 2.5x improvement for non-page indexed and
            : upto 4x improvement in page index seen
Great results! Are these whole query speedups or only scan-time speedups?


http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.h@928
PS4, Line 928: (vector<ScalarExpr*>
chould be const&


http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.h@942
PS4, Line 942: 1-20
nit: maybe you could say a few words about how rows 9-10 are filtered out in this case.


http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.cc@252
PS4, Line 252: conjuncts
We could reserve() capacity for this vector.


http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.cc@258
PS4, Line 258:   unordered_set<SlotId> slot_ids;
             :   slot_ids.insert(std::begin(conjunct_slot_ids), std::end(conjunct_slot_ids));
nit: unordered_set has a constructor that takes a two iterators.


http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.cc@271
PS4, Line 271: vector<SlotId> slot_ids;
We should reserve() capacity for the vector.


http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.cc@2131
PS4, Line 2131: AssembleRowsInternal
Maybe AssembleRowsWithoutLateMaterialization()?


http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.cc@2193
PS4, Line 2193: /// High-level steps of this function:
This comments needs to be extended with the explanation of late materialization.


http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.cc@2234
PS4, Line 2234: 1
nit: magic number


http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.cc@2240
PS4, Line 2240: row_end
nit: row_group_end?


http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.cc@2255
PS4, Line 2255:       Status fill_status;
We already have a 'fill_status' at L2233. Maybe we can just reuse that one.


http://gerrit.cloudera.org:8080/#/c/17860/4/be/src/exec/parquet/hdfs-parquet-scanner.cc@2329
PS4, Line 2329: ConvertToRange
Could be static, then maybe we could add backend tests for this.



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 4
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 06 Oct 2021 14:14:10 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has posted comments on this change. ( http://gerrit.cloudera.org:8080/17860 )

Change subject: IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-orc-scanner.cc
File be/src/exec/hdfs-orc-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/hdfs-orc-scanner.cc@629
PS3, Line 629:     return Status::OK();
> Should check end of stack here or allocate memory if capacity is anything s
Has been moved to scratch_batch_.


http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17860/3/be/src/exec/parquet/hdfs-parquet-scanner.cc@2330
PS3, Line 2330: 
> Okay. Thanks for the clarification on skip length. 
There is no recheck happening once the batch is formed even if they have few  False values. Secondly,  Batch needs to be formed as the interface to materialize values have been optimized to read in batch. Reading it individually instead of batch causes massive slowdown. Check the section 'Materialization threshold' in the design doc for details: https://docs.google.com/document/d/1QFu_Zu9nHuMpu5Pqb3qe62MbZPA88j_o7NtpZ2a2zSA/edit#heading=h.qdtalwag0ooo



-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 7
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Kurt Deschler <kd...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 13 Oct 2021 11:40:10 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Posted by "Amogh Margoor (Code Review)" <ge...@cloudera.org>.
Amogh Margoor has uploaded a new patch set (#2). ( http://gerrit.cloudera.org:8080/17860 )

Change subject: [WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.
......................................................................

[WIP] IMPALA-9873: Avoid materilization of columns for filtered out rows in Parquet table.

Currently, entire row is materialized, before filtering upon it during
scan. Instead, cost can be saved if only the columns required for
filtering are materialized first and then rest of the columns are
materialized only for rows surviving after filter.

Performance: TBD

Testing: TBD

Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/exec/hdfs-columnar-scanner-ir.cc
M be/src/exec/hdfs-columnar-scanner.cc
M be/src/exec/hdfs-columnar-scanner.h
M be/src/exec/hdfs-orc-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.cc
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-collection-column-reader.cc
M be/src/exec/parquet/parquet-collection-column-reader.h
M be/src/exec/parquet/parquet-column-chunk-reader.cc
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
M be/src/exec/scratch-tuple-batch.h
M be/src/service/query-options.cc
M be/src/service/query-options.h
M common/thrift/ImpalaService.thrift
M common/thrift/Query.thrift
18 files changed, 608 insertions(+), 111 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/60/17860/2
-- 
To view, visit http://gerrit.cloudera.org:8080/17860
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I46406c913297d5bbbec3ccae62a83bb214ed2c60
Gerrit-Change-Number: 17860
Gerrit-PatchSet: 2
Gerrit-Owner: Amogh Margoor <am...@gmail.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>