You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Pranay Singh (Code Review)" <ge...@cloudera.org> on 2017/09/11 23:28:07 UTC

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Pranay Singh has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/8034

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch introduces a new MemTracker to track the memory used
by std::vector in DictDecoder class, whereas an existing memory pool
for DictEncoder class is used when allocating memory for std::vector
in DictEncoder class.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/exec-node.cc
M be/src/exec/exec-node.h
M be/src/exec/parquet-column-readers.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
5 files changed, 52 insertions(+), 17 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/1
-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pranay Singh

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 8:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc@274
PS8, Line 274:   // Before closing the Column readers, the accounted bytes in mem_tracker for
             :   // dict_decoder_ is released.
             :   if (mem_tracker_ != nullptr) {
             :     for (ParquetColumnReader* col_reader : column_readers_) col_reader->Cleanup();
             :   }
I think this code won't be necessary if col_reader->Close() does the dictionary close.


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc@348
PS8, Line 348: CloseAndUnregisterFromParent
Same question here as elsewhere: Do we really need CloseAndUnregisterFromParent() or will Close() do?


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-table-writer.cc@178
PS8, Line 178: dict_encoder_base_->ClearIndices();
This should call the new base Close() function (which should do ClearIndices()).


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-table-writer.cc@1057
PS8, Line 1057: CloseAndUnregisterFromParent();
I think we want to limit use of CloseAndUnregisterFromParent() to cases that really need it. I think the ordinary Close() should work here. Check if that works.


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/parquet-column-readers.h@234
PS8, Line 234:   /// Delete the bytes used for memory tracking.
             :   virtual void Cleanup() {}
I think this can be folded into Close() and removed.


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/parquet-column-readers.cc@269
PS8, Line 269:   virtual void Cleanup() {
             :     dict_decoder_.Close();
             :   }
I think this can be folded into Close() without a separate Cleanup() function.


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@107
PS8, Line 107:   DictEncoder(MemPool* pool, int encoded_value_size, MemTracker* mem_tracker) :
             :       DictEncoderBase(pool), buckets_(HASH_TABLE_SIZE, Node::INVALID_INDEX),
             :       encoded_value_size_(encoded_value_size),
             :       dict_mem_tracker_(mem_tracker),
             :       dict_bytes_cnt_(0) { }
As I understand it, the DictEncoderBase/DictEncoder split is that DictEncoderBase contains everything that does not rely on T and DictEncoder contains the type-specific stuff. The same split applies for DictDecoderBase/DictDecoder. Since the memory tracking is not type specific, I think it makes sense to put it in the base classes. This should simplify the code in HdfsParquetWriter, because then you can call Close() on the DictEncoderBase rather than having to go to the typed version. Close() on DictEncoderBase should also do ClearIndices().


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@251
PS8, Line 251: ReleaseBytes();
Is there any codepath that should legitimately use this? If not, DCHECK that dict_bytes_cnt_ is zero.



-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 8
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Comment-Date: Thu, 12 Oct 2017 17:23:41 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 6:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-scanner.h@485
PS6, Line 485:   /// Tracks all the memory allocation used by dictionary in DictDecoder.
             :   struct DictMemTrack {
             :     std::unique_ptr<MemTracker> mem_tracker;
             : 
             :     DictMemTrack(MemTracker *parent_memtrack):mem_tracker(new MemTracker(-1,
             :         "Decoder parquet dict", parent_memtrack, false)) { };
             : 
             :     MemTracker*  GetMemTracker() {
             :       return  mem_tracker.get();
             :     }
             : 
             :     ~DictMemTrack() {
             :       mem_tracker->CloseAndUnregisterFromParent();
             :     }
             :   };
I dont think we need to add a new struct for this, you can just call CloseAndUnregisterFromParent() on the dict_mem_tracker in HdfsParquetScanner::Close

Similarly for the writer


http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-table-writer.cc@113
PS6, Line 113:  // Bytes used for memory tracking by dictionary in DictEncoder is decremented.
             :   virtual void Cleanup() {
             :     if (dict_encoder_base_ != nullptr) {
             :       dict_encoder_base_->Cleanup();
             :     }
             :   }
you can move this to the BaseColumnWriter::Close() method


http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-table-writer.cc@767
PS6, Line 767:   // Below code will decrement the bytes used for memory tracking by dictionary in
             :   // DictEncoder class for each ColumnWriter.
             :   for (int i = 0; i < columns_.size(); i++) {
             :      if (columns_[i] != nullptr) {
             :        columns_[i]->Cleanup();
             :      }
             :   }
similarly you can move this to the HdfsParquetTableWriter::Close() method


http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/util/dict-encoding.h@138
PS6, Line 138: SizeofDict
I think we can use a more precise name here, SizeOfDict might be confused with returning the number of elements in the Dictionary.
Something on the lines of DictByteSize?


http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/util/dict-encoding.h@438
PS6, Line 438: bytes_alloc += sizeof(value);
can you switch to what joe suggested earlier, that is, instead of bytes_alloc, using something like the function SizeofDict() that you wrote to update ConsumeBytes at the end.
Refer Joe's first comment.



-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 6
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Comment-Date: Wed, 04 Oct 2017 22:29:59 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@412
PS3, Line 412: ConsumeBytes(sizeof(value));
> The parquet DictionaryPageHeader contains a num_values field. Look where we
using num_values * size of type might not work when dealing with variable sized type like string. 

I would recommend keeping count of the bytes used in a local variable and then do a ConsumeBytes when we exit the loop. This is because mtrackp loops on all its parent trackers to update mem usage every time Consume(num_bytes) is called. Not a huge optimization but I think it might be worth avoiding another loop inside the hot path.


http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-test.cc
File be/src/util/dict-test.cc:

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-test.cc@39
PS3, Line 39: tracker
can you add test cases that verify that the encoder/decoder is keeping track correctly.
You can do this by using tracker.consumption() to get the num of bytes consumed and compare it to the expected size you calculate separately.



-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 3
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Comment-Date: Fri, 29 Sep 2017 01:16:40 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 8:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc@274
PS8, Line 274:   // Before closing the Column readers, the accounted bytes in mem_tracker for
             :   // dict_decoder_ is released.
             :   if (mem_tracker_ != nullptr) {
             :     for (ParquetColumnReader* col_reader : column_readers_) col_reader->Cleanup();
             :   }
> I think this code won't be necessary if col_reader->Close() does the dictio
Removed this code and wrapped it up with the col_reader->Close()


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc@348
PS8, Line 348: CloseAndUnregisterFromParent
> Same question here as elsewhere: Do we really need CloseAndUnregisterFromPa
I found that MemTracker::Close() is not unregistering with the parent MemTracker , so I found CloseAndUnregisterFromParent to be more appropriate.


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-table-writer.cc@178
PS8, Line 178: dict_encoder_base_->ClearIndices();
> This should call the new base Close() function (which should do ClearIndice
Done


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-table-writer.cc@1057
PS8, Line 1057: CloseAndUnregisterFromParent();
> I think we want to limit use of CloseAndUnregisterFromParent() to cases tha
I checked the MemTracker::Close(), it does not unregister the mem_tracker_ from it's parent,

 Since this mem_tracker_ should not be used beyond this point I have reset it to NULL


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/parquet-column-readers.h@234
PS8, Line 234:   /// Delete the bytes used for memory tracking.
             :   virtual void Cleanup() {}
> I think this can be folded into Close() and removed.
Done


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/parquet-column-readers.cc@269
PS8, Line 269:   virtual void Cleanup() {
             :     dict_decoder_.Close();
             :   }
> I think this can be folded into Close() without a separate Cleanup() functi
Done


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@107
PS8, Line 107:   DictEncoder(MemPool* pool, int encoded_value_size, MemTracker* mem_tracker) :
             :       DictEncoderBase(pool), buckets_(HASH_TABLE_SIZE, Node::INVALID_INDEX),
             :       encoded_value_size_(encoded_value_size),
             :       dict_mem_tracker_(mem_tracker),
             :       dict_bytes_cnt_(0) { }
> As I understand it, the DictEncoderBase/DictEncoder split is that DictEncod
Moved the functionality to the base class as you suggested


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@251
PS8, Line 251: ReleaseBytes();
> Is there any codepath that should legitimately use this? If not, DCHECK tha
Checking the dict_bytes_cnt_ to be zero seems to be tricky say even if it is not used in a function it can have a non zero value.



-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 8
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Comment-Date: Fri, 13 Oct 2017 20:45:53 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Hello Joe McDonnell, Tim Armstrong, Bikramjeet Vig, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#13).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch uses the memory tracker of HdfsScanner to track the memory used
by dictionary in DictDecoder class. Similary it uses memory tracker of
HdfsTableSink to track the memory used by dictionary in DictEncoder class.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
-------
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
5 files changed, 137 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/13
-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 13
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.
Taras Bobrovytsky has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 17:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8034/17//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8034/17//COMMIT_MSG@7
PS17, Line 7: :U
nit: put a space after the semi colon



-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 17
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 08 Dec 2017 00:57:08 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 14:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/util/dict-encoding.h@63
PS12, Line 63:   /// dict_encoded_size() bytes.
             :   virtual void Writ
> Close() should call ClearIndices() as one of its steps, ClearIndices() stil
My bad didn't realize it, now that's fixed.



-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 14
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 01 Nov 2017 20:44:24 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 18: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/1593/


-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 18
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 08 Dec 2017 04:28:55 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Hello Joe McDonnell, Bikramjeet Vig, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#4).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch introduces memory tracker to track the memory used
by dictionary used in DictDecoder class and DictEncoder class
when a parquet file is read from or written to.

Memory tracker is introduced at the class HdfsParquetScanner level to
track all the memory used by dictonary in DictDecoders. Similarly
memory tracker is introduced in class HdfsParquetTableWriter to
track the memory used by dictionary in DictEncoders.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
-------
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-readers.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
7 files changed, 171 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/4
-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 4
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Hello Joe McDonnell, Bikramjeet Vig, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#6).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch introduces memory tracker to track the memory used
by dictionary used in DictDecoder class and DictEncoder class
when a parquet file is read from or written to.

Memory tracker is introduced at the class HdfsParquetScanner level to
track all the memory used by dictonary in DictDecoders. Similarly
memory tracker is introduced in class HdfsParquetTableWriter to
track the memory used by dictionary in DictEncoders.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
-------
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-readers.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
7 files changed, 170 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/6
-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 6
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 12:

(5 comments)

I think this is very close.

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-scanner.h@352
PS12, Line 352:   MemTracker* dict_mem_tracker() { return scan_node_->mem_tracker(); }
See comment in HdfsParquetTableWriter. Also, see how we pass the mem_tracker to data_page_pool_.


http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.h
File be/src/exec/hdfs-parquet-table-writer.h:

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.h@81
PS12, Line 81:   /// Memory tracker to track the memory used by encoder.
             :   MemTracker* dict_mem_tracker();
When I'm reading the code, I'm thinking that this is hiding more than it is illuminating. We use this in exactly one place. I would rather have the exact code right there with a good comment explaining which mem_tracker we are using.


http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.cc@179
PS12, Line 179:       dict_encoder_base_->ClearIndices();
              :       dict_encoder_base_->Close();
When Close() does a call to ClearIndices(), this can be simplified to just a Close() call.


http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.cc@322
PS12, Line 322:                            plain_encoded_value_size_,
              :                            parent_->dict_mem_tracker()));
Nit: Indentation in this case should be even with the "D" in new DictEncoder. It goes against every editor's typical behavior, but it's the standard we have. Also, feel free to keep plain_encoded_value_size_ on the first line. (It's on the edge of our 90 char limit.)


http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/util/dict-encoding.h@63
PS12, Line 63:   void Close() {
             :     ReleaseBytes();
I think Close() should do ClearIndices().



-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 12
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 27 Oct 2017 23:59:23 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Hello Joe McDonnell, Tim Armstrong, Bikramjeet Vig, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#17).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch uses the memory tracker of HdfsScanner to track the memory used
by dictionary in DictDecoder class. Similary it uses memory tracker of
HdfsTableSink to track the memory used by dictionary in DictEncoder class.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
-------
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
5 files changed, 140 insertions(+), 17 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/17
-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 17
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 17: Code-Review+1

Did you want to have another look Joe?


-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 17
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 07 Dec 2017 00:21:51 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Hello Joe McDonnell, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#3).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch introduces memory tracker to track the memory used
by dictionary used in DictDecoder class and DictEncoder class
when a parquet file is read from or written to.

Memory tracker is introduced at the class ExecNode level to
track all the memory used by dictonary in DictDecoders. Similarly
memory tracker is introduced in class DataSink to track the memory
used by dictionary in DictEncoders.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
-------
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/data-sink.cc
M be/src/exec/data-sink.h
M be/src/exec/exec-node.cc
M be/src/exec/exec-node.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-readers.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
8 files changed, 93 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/3
-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 3
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 17: Code-Review+1

This looks good to me.


-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 17
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 08 Dec 2017 00:41:46 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Hello Joe McDonnell, Tim Armstrong, Bikramjeet Vig, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#14).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch uses the memory tracker of HdfsScanner to track the memory used
by dictionary in DictDecoder class. Similary it uses memory tracker of
HdfsTableSink to track the memory used by dictionary in DictEncoder class.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
-------
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
5 files changed, 134 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/14
-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 14
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 9:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc@348
PS8, Line 348: f the column.
> We should be using Close() here instead of unregistering every tracker (thi
Removed the Close() since I'm using the MemTracker of 
scan_node_->mem_tracker so won't be using 1000's of memtrackers now


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@133
PS8, Line 133:     di
> nit: inline here and nearby isn't necessary for a couple of reason. First, 
Removed inline


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@184
PS8, Line 184:     Node(const T& v, const NodeIndex& n) : value(v), next(n) { }
> Nit: check isn't necessary - Release() does this for you.
Removed the check


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@194
PS8, Line 194:     enum { INVALID_INDEX = 40000 };
> The TODO should be to use TryConsume(). The asynchronous checks are not the
Changed the TODO to use TryConsume()


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@251
PS8, Line 251: / This function
> I agree with Joe's comment. All codepaths should call Close(), so we can ju
Added a DCHECK found during the testing that some of the column readers do not call Close, so I've added ReleaseBytes() so that accounting of memory used for dictionary is release when DictDecoder is destroyed.



-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 9
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 18 Oct 2017 05:28:47 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Pranay Singh has uploaded a new patch set (#2).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch uses existing dictionary_pool_ to track the memory used
by std::vector in DictDecoder class, whereas an existing memory pool
for DictEncoder class is used when allocating memory for std::vector
in DictEncoder class.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/parquet-column-readers.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
3 files changed, 42 insertions(+), 18 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/2
-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 15:

(8 comments)

Pretty close, have some formatting nits (nothing horrible, just trying to keep the codebase consistent) and one perf concern.

http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@54
PS15, Line 54:     : dict_encoded_size_(0), pool_(NULL), dict_bytes_cnt_(0), dict_mem_tracker_(NULL) {}
Ok to leave for now but we've generally been moving towards using the recent C++ extension that allows initialising member variables to constants at their declartion.


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@59
PS15, Line 59: DCHECK
DCHECK_EQ(dict_bytes_cnt_, 0);


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@77
PS15, Line 77:   void ClearIndices() {
We generally use the more concise one-line formatting for very short single-statement functions like this one.


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@123
PS15, Line 123:  
formatting nit: * goes on the left with the type name.


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@168
PS15, Line 168: inline
nit: inline isn't necessary. It isn't harmful but can be confusing to have unnecessary modifiers.


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@269
PS15, Line 269:  *
nit: * should be on left.


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@330
PS15, Line 330: inline
nit: unnecessary inline


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@374
PS15, Line 374:   ConsumeBytes(sizeof(node));
I'm concerned that calling MemTracker::Consume for every node added to the table could hurt performance in some workloads since it will end up incrementing the memory consumption counter in the root process-wide MemTracker up to 40k times per dictionary.

(In contrast, MemPool::Allocate() is generally fine since it allocates memory in chunks).

How about we track the memory for some number of nodes at a time. E.g.

  const int NODE_MEM_TRACKING_GRANULARITY = 4096;
  ...
  if (nodes % NODE_MEM_TRACKING_GRANULARITY == 0) {
    ConsumeBytes(sizeof(node) * NODE_MEM_TRACKING_GRANULARITY);
  }



-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 15
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 03 Nov 2017 23:07:31 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 12:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-scanner.h@352
PS12, Line 352:   MemTracker* dict_mem_tracker() { return scan_node_->mem_tracker(); }
> See comment in HdfsParquetTableWriter. Also, see how we pass the mem_tracke
Done


http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.h
File be/src/exec/hdfs-parquet-table-writer.h:

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.h@81
PS12, Line 81:   /// Memory tracker to track the memory used by encoder.
             :   MemTracker* dict_mem_tracker();
> When I'm reading the code, I'm thinking that this is hiding more than it is
Removed the function made it in line where the memtracker is used so it's more explicit


http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.cc@179
PS12, Line 179:       dict_encoder_base_->ClearIndices();
              :       dict_encoder_base_->Close();
> When Close() does a call to ClearIndices(), this can be simplified to just 
implemented the Close() logic inside ClearIndices()


http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/exec/hdfs-parquet-table-writer.cc@322
PS12, Line 322:                            plain_encoded_value_size_,
              :                            parent_->dict_mem_tracker()));
> Nit: Indentation in this case should be even with the "D" in new DictEncode
aligned the indentation with D in next line.


http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/util/dict-encoding.h@63
PS12, Line 63:   void Close() {
             :     ReleaseBytes();
> I think Close() should do ClearIndices().
implemented the logic of Close() inside ClearIndices()



-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 12
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 01 Nov 2017 01:07:09 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Hello Joe McDonnell, Bikramjeet Vig, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#7).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch introduces memory tracker to track the memory used
by dictionary used in DictDecoder class and DictEncoder class
when a parquet file is read from or written to.

Memory tracker is introduced at the class HdfsParquetScanner level to
track all the memory used by dictonary in DictDecoders. Similarly
memory tracker is introduced in class HdfsParquetTableWriter to
track the memory used by dictionary in DictEncoders.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
-------
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
8 files changed, 158 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/7
-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 7
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 10:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.h@352
PS10, Line 352: return mem_tracker_;
if this is the same mem tracker in scan_node_ perhaps we can just do 
scan_node_->mem_tracker(); 
over here and avoid adding a new member variable?
Same in the hdfs-table-writer.


http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.cc@337
PS10, Line 337: if (mem_tracker_ != nullptr) {
              :     mem_tracker_ = nullptr;
              :   }
nit: one line

if (mem_tracker_ != nullptr) mem_tracker_ = nullptr;


http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/parquet-column-readers.h@356
PS10, Line 356: if (dict_decoder_base_ != nullptr) dict_decoder_base_->Close();
instead of dict_decoder_base_ we can just use the virtual function GetDictionaryDecoder() to get the dict if it exists


http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/util/dict-encoding.h@206
PS10, Line 206: 
nit: extra line


http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/util/dict-encoding.h@324
PS10, Line 324: 
nit: extra line



-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 10
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 19 Oct 2017 21:41:48 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Hello Joe McDonnell, Tim Armstrong, Bikramjeet Vig, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#16).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch uses the memory tracker of HdfsScanner to track the memory used
by dictionary in DictDecoder class. Similary it uses memory tracker of
HdfsTableSink to track the memory used by dictionary in DictEncoder class.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
-------
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
5 files changed, 142 insertions(+), 17 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/16
-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 16
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@412
PS3, Line 412: ConsumeBytes(sizeof(value));
> using num_values * size of type might not work when dealing with variable s
For strings, the StringValue is a pointer and a length. In the case of the dictionary, the pointer is pointing into the dictionary (allocated from dictionary_pool_ in InitDictionary()). Since the memory for the dictionary page is already tracked, we only need to track these StringValue's, which are fixed-size. Either way, we should aggregate the Consume() calls here.



-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 3
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Comment-Date: Mon, 02 Oct 2017 21:48:41 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 17:

> This is still on my radar, need to get back to it.

Thanks Tim


-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 17
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 23 Nov 2017 02:23:58 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Hello Joe McDonnell, Bikramjeet Vig, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#8).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch introduces memory tracker to track the memory used
by dictionary used in DictDecoder class and DictEncoder class
when a parquet file is read from or written to.

Memory tracker is introduced at the class HdfsParquetScanner level to
track all the memory used by dictonary in DictDecoders. Similarly
memory tracker is introduced in class HdfsParquetTableWriter to
track the memory used by dictionary in DictEncoders.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
-------
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
8 files changed, 157 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/8
-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 8
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 18:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1593/


-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 18
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 08 Dec 2017 00:57:24 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 15:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@77
PS15, Line 77:   void ClearIndices() {
> We generally use the more concise one-line formatting for very short single
To add to what Tim said, this code didn't actually change, so it is nice to avoid touching code if you can. It makes it easier to track down what code change last impacted a piece of code.


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@152
PS15, Line 152:   /// Destructor, release the bytes used by Memtracker.
              :   ~DictEncoder() {
              :     ReleaseBytes();
              :     DCHECK(dict_bytes_cnt_ == 0);
              :   }
I think we can omit this destructor, since DictEncoderBase does the same thing.


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@296
PS15, Line 296:   /// Destructor, release the bytes used by Memtracker and close.
              :   ~DictDecoder() {
              :     ReleaseBytes();
              :     DCHECK(dict_bytes_cnt_ == 0);
              :    }
I think we can omit this destructor, as DictDecoderBase does this.



-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 15
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 03 Nov 2017 23:14:19 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Hello Taras Bobrovytsky, Joe McDonnell, Tim Armstrong, Bikramjeet Vig, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#19).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch uses the memory tracker of HdfsScanner to track the memory used
by dictionary in DictDecoder class. Similary it uses memory tracker of
HdfsTableSink to track the memory used by dictionary in DictEncoder class.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
-------
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
5 files changed, 173 insertions(+), 24 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/19
-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 19
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch uses the memory tracker of HdfsScanner to track the memory used
by dictionary in DictDecoder class. Similary it uses memory tracker of
HdfsTableSink to track the memory used by dictionary in DictEncoder class.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
-------
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Reviewed-on: http://gerrit.cloudera.org:8080/8034
Reviewed-by: Tim Armstrong <ta...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
5 files changed, 173 insertions(+), 24 deletions(-)

Approvals:
  Tim Armstrong: Looks good to me, approved
  Impala Public Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 22
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 17:

This is still on my radar, need to get back to it.


-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 17
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 23 Nov 2017 00:29:19 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 3:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/exec/data-sink.h
File be/src/exec/data-sink.h:

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/exec/data-sink.h@126
PS3, Line 126:   /// This memtracker tracks all the allocations done by parquet dictonary encoder.
             :   boost::scoped_ptr<MemTracker> dict_mem_tracker_;
Can we move this down to HdfsParquetTableWriter?


http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/exec/exec-node.h
File be/src/exec/exec-node.h:

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/exec/exec-node.h@325
PS3, Line 325:   /// Tracks all the memory allocation by parquet dictonary decoders
             :   boost::scoped_ptr<MemTracker> dict_mem_tracker_;
Can this move down to HdfsParquetScanner? Other ExecNodes don't need this.


http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@113
PS3, Line 113:  
Remove trailing spaces. This also applies to lines that don't have any code (there are a couple others in this file). These are easily visible in the Gerrit UI.

If using emacs, the following will highlight trailing whitespace:
(setq-default show-trailing-whitespace t)


http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@164
PS3, Line 164: mtrackp
For class fields, the convention is to end the field name with an underscore. Also, look at other code that uses MemTrackers and see what variable name that code uses. It is good to harmonize these things across the codebase.


http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@412
PS3, Line 412: ConsumeBytes(sizeof(value));
The parquet DictionaryPageHeader contains a num_values field. Look where we call CreateDictionaryDecoder and you'll see that we reference to it on the next line. I think we should consider doing a single ConsumeBytes using num_values * size of type.



-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 3
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Comment-Date: Thu, 28 Sep 2017 16:55:24 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 18:

@pranay - looks like compilation failed https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/775/console


-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 18
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Sat, 09 Dec 2017 00:24:05 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 6:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-scanner.h@485
PS6, Line 485:   /// Tracks all the memory allocation used by dictionary in DictDecoder.
             :   struct DictMemTrack {
             :     std::unique_ptr<MemTracker> mem_tracker;
             : 
             :     DictMemTrack(MemTracker *parent_memtrack):mem_tracker(new MemTracker(-1,
             :         "Decoder parquet dict", parent_memtrack, false)) { };
             : 
             :     MemTracker*  GetMemTracker() {
             :       return  mem_tracker.get();
             :     }
             : 
             :     ~DictMemTrack() {
             :       mem_tracker->CloseAndUnregisterFromParent();
             :     }
             :   };
> I dont think we need to add a new struct for this, you can just call CloseA
I had added a new structure to have the creation /destruction of memtracker in one place

and also the CloseAndUnregisterFromParent() has to be called after
releasing all bytes owned by MemTracker in the DictDecoder class (which happens in it's destructor), since the destructor of member variables are called later than the owner making a separate struct/class for DictMemTrack  solved the problem.


http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-table-writer.cc@113
PS6, Line 113:  // Bytes used for memory tracking by dictionary in DictEncoder is decremented.
             :   virtual void Cleanup() {
             :     if (dict_encoder_base_ != nullptr) {
             :       dict_encoder_base_->Cleanup();
             :     }
             :   }
> you can move this to the BaseColumnWriter::Close() method
This can't be done because I'm not tracking the bytes at the BaseColumnWriter


http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/exec/hdfs-parquet-table-writer.cc@767
PS6, Line 767:   // Below code will decrement the bytes used for memory tracking by dictionary in
             :   // DictEncoder class for each ColumnWriter.
             :   for (int i = 0; i < columns_.size(); i++) {
             :      if (columns_[i] != nullptr) {
             :        columns_[i]->Cleanup();
             :      }
             :   }
> similarly you can move this to the HdfsParquetTableWriter::Close() method
I'll try moving the functionality in Close() for HdfsParquetTableWriter/Reader


http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/util/dict-encoding.h@138
PS6, Line 138: SizeofDict
> I think we can use a more precise name here, SizeOfDict might be confused w
I've changed it to be DictByteSize


http://gerrit.cloudera.org:8080/#/c/8034/6/be/src/util/dict-encoding.h@438
PS6, Line 438: bytes_alloc += sizeof(value);
> can you switch to what joe suggested earlier, that is, instead of bytes_all
This looks fine what is the issue with this?



-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 6
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Comment-Date: Thu, 05 Oct 2017 01:58:59 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Hello Joe McDonnell, Tim Armstrong, Bikramjeet Vig, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#12).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch uses the memory tracker of HdfsScanner to track the memory used
by dictionary in DictDecoder class. Similary it uses memory tracker of
HdfsTableSink to track the memory used by dictionary in DictEncoder class.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
-------
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
7 files changed, 146 insertions(+), 15 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/12
-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 12
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Pranay Singh has uploaded a new patch set (#2).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch uses existing dictionary_pool_ to track the memory used
by std::vector in DictDecoder class, whereas an existing memory pool
for DictEncoder class is used when allocating memory for std::vector
in DictEncoder class.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/parquet-column-readers.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
3 files changed, 42 insertions(+), 18 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/2
-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Hello Joe McDonnell, Bikramjeet Vig, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#5).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch introduces memory tracker to track the memory used
by dictionary used in DictDecoder class and DictEncoder class
when a parquet file is read from or written to.

Memory tracker is introduced at the class HdfsParquetScanner level to
track all the memory used by dictonary in DictDecoders. Similarly
memory tracker is introduced in class HdfsParquetTableWriter to
track the memory used by dictionary in DictEncoders.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
-------
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-readers.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
7 files changed, 171 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/5
-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 5
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Hello Joe McDonnell, Tim Armstrong, Bikramjeet Vig, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#15).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch uses the memory tracker of HdfsScanner to track the memory used
by dictionary in DictDecoder class. Similary it uses memory tracker of
HdfsTableSink to track the memory used by dictionary in DictEncoder class.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
-------
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
5 files changed, 140 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/15
-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 15
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Hello Joe McDonnell, Tim Armstrong, Bikramjeet Vig, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#11).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch uses the memory tracker of HdfsScanner to track the memory used
by dictionary in DictDecoder class. Similary it uses memory tracker of
HdfsTableSink to track the memory used by dictionary in DictEncoder class.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
-------
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
8 files changed, 160 insertions(+), 21 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/11
-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 11
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 18: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 18
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 08 Dec 2017 00:57:15 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 14:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/12/be/src/util/dict-encoding.h@63
PS12, Line 63:   /// dict_encoded_size() bytes.
             :   virtual void Writ
> implemented the logic of Close() inside ClearIndices()
Close() should call ClearIndices() as one of its steps, ClearIndices() still needs to exist separately and cannot incorporate the logic of Close().



-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 14
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 01 Nov 2017 01:16:49 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Hello Joe McDonnell, Tim Armstrong, Bikramjeet Vig, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#10).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch uses the memory tracker of HdfsScanner to track the memory used
by dictionary in DictDecoder class. Similary it uses memory tracker of
HdfsTableSink to track the memory used by dictionary in DictEncoder class.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
-------
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
8 files changed, 166 insertions(+), 21 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/10
-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 10
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 11:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.h@352
PS10, Line 352: return mem_tracker_;
> I tried that earlier but I was getting the below error for parquet table wr
I see, that probably happens because HdfsTableSink class only has a forward declaration in hdfs-table-write.h

If you would like to give this another try:

You can try including the hdfs-table-sink.h in the hdfs-parquet-table-writer.h and it should work. 
OR 
you can move the implementation of dict_mem_tracker() to hdfs-parquet-table-writers.cc and add the include there instead in order to minimize the includes in the header file.


http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/parquet-column-readers.h@356
PS10, Line 356: if (dict_decoder_base_ != nullptr) dict_decoder_base_->Close();
> GetDictionaryDecoder() returns dict_decoder_base_ it should be the same cal
Sorry, I should have been more clearer about what I was suggesting. What I meant was, instead of changing the method GetDictionaryDecoder() in parquet-column-readers.h where you return dict_decoder_base_, we can instead get rid of dict_decoder_base_ member var altogether and use GetDictionaryDecoder() because the overridden method in ScalarColumnReader would return the actual dict_decoder_ object that we can use directly and in all other cases the default implementation will return a nullptr.



-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 11
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Sat, 21 Oct 2017 00:25:32 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8034/1/be/src/exec/exec-node.h
File be/src/exec/exec-node.h:

PS1, Line 215:  MemTracker* decoder_mem_tracker() { return decoder_mem_tracker_.get(); }
The dictionary is only used for Parquet, so I think the MemPool should not be in the ExecNode, as that is used for a large number of other things. Look into moving this down to HdfsParquetScanner (or reuse dictionary_pool_).


http://gerrit.cloudera.org:8080/#/c/8034/1/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

PS1, Line 213: dict_decoder_(new MemPool(parent->scan_node_->decoder_mem_tracker())),
It is important not to create objects on a per-column basis unless truly necessary. I think it makes more sense to have a single MemPool up at the HdfsParquetScanner level. Look at how dictionary_pool_ works. I think it might be better to reuse that pool.


http://gerrit.cloudera.org:8080/#/c/8034/1/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

PS1, Line 234: *val_ptr = *dict_[index];
dict_ needs to return T's directly rather than T*'s. This is a performance critical path, and an extra dereference is too expensive.


-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Bikramjeet Vig (Code Review)" <ge...@cloudera.org>.
Bikramjeet Vig has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@412
PS3, Line 412: ConsumeBytes(sizeof(value));
> For strings, the StringValue is a pointer and a length. In the case of the 
yes I agree



-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 3
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Comment-Date: Tue, 03 Oct 2017 18:59:59 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8034/2/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

PS2, Line 213: dict_decoder_(parent_->dictionary_pool_.get()),
We should think about having a separate MemPool for parts of the dictionary that will not be referenced by the RowBatch and can be freed at the end of the row group.

Since some datatypes are copied by value rather than pointing into the dictionary, this might be used for those as well.


http://gerrit.cloudera.org:8080/#/c/8034/2/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

PS2, Line 229:     memcpy(val_ptr, dict_[index], sizeof(T));
The memcpy is doing a dereference here and is not different from *val_ptr = *dict_[index];


PS2, Line 238:   std::vector<T*> dict_;
Using a T* means that GetValue() must do a dereference to obtain the value. This is a very performance sensitive codepath, so we need to avoid this extra dereference. This should maintain the value directly in the vector.


-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 10:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.h@352
PS10, Line 352: return mem_tracker_;
> if this is the same mem tracker in scan_node_ perhaps we can just do 
I tried that earlier but I was getting the below error for parquet table writer, so in order to have similar change in scanner and writer went this route

Error:
-----
hdfs-parquet-table-writer.h:83:57: error: invalid use of incomplete type ‘class impala::HdfsTableSink’


http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/hdfs-parquet-scanner.cc@337
PS10, Line 337: if (mem_tracker_ != nullptr) {
              :     mem_tracker_ = nullptr;
              :   }
> nit: one line
Done


http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/exec/parquet-column-readers.h@356
PS10, Line 356: if (dict_decoder_base_ != nullptr) dict_decoder_base_->Close();
> instead of dict_decoder_base_ we can just use the virtual function GetDicti
GetDictionaryDecoder() returns dict_decoder_base_ it should be the same calling the function GetDictionaryDecoder() or accessing member variable  dict_decoder_base_ directly.


http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/util/dict-encoding.h@206
PS10, Line 206: 
> nit: extra line
Done


http://gerrit.cloudera.org:8080/#/c/8034/10/be/src/util/dict-encoding.h@324
PS10, Line 324: 
> nit: extra line
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 10
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 20 Oct 2017 23:04:41 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@113
PS3, Line 113:  
> Remove trailing spaces. This also applies to lines that don't have any code
Done


http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@164
PS3, Line 164: mtrackp
> For class fields, the convention is to end the field name with an underscor
Done


http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@412
PS3, Line 412: ConsumeBytes(sizeof(value));
> using num_values * size of type might not work when dealing with variable s
Done


http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@412
PS3, Line 412: ConsumeBytes(sizeof(value));
> The parquet DictionaryPageHeader contains a num_values field. Look where we
Ok I'll allocate a local variable to track the bytes consumed.


http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-encoding.h@412
PS3, Line 412: ConsumeBytes(sizeof(value));
> For strings, the StringValue is a pointer and a length. In the case of the 
Done


http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-test.cc
File be/src/util/dict-test.cc:

http://gerrit.cloudera.org:8080/#/c/8034/3/be/src/util/dict-test.cc@39
PS3, Line 39: tracker
> can you add test cases that verify that the encoder/decoder is keeping trac
I've added the checks now



-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 3
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Comment-Date: Tue, 03 Oct 2017 21:45:51 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 8:

(4 comments)

I took a look at it and just wanted to point out some cases where the new code is diverging from best practices in our codebase.

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/exec/hdfs-parquet-scanner.cc@348
PS8, Line 348: CloseAndUnregisterFromParent
> I found that MemTracker::Close() is not unregistering with the parent MemTr
We should be using Close() here instead of unregistering every tracker (this is documented in the comment for CloseAndUnregisterFromParent()).

That may be moot since I don't think we need a special MemTracker for this - it's fine if we just track this against the scan node tracker. There are pros/cons of tracking things at a finer granularity but generally we don't track memory at a very fine granularity. Looking at the JIRA description, I can see it being read as "track the memory against a new MemTracker" but that wasn't what I intended.

As-is the behaviour added by this patch is particularly weird because we don't have per-scanner MemTrackers, so you'll end up with tens of these "dict decoder" trackers hanging off the scan node.


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@133
PS8, Line 133: inline
nit: inline here and nearby isn't necessary for a couple of reason. First, the compiler can generally figure out that trivial functions can be inlined with or without a hint. Second, functions defined in C++ class bodies have an implicit inline hint implied anyway: https://isocpp.org/wiki/faq/inline-functions#inline-member-fns-more


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@184
PS8, Line 184:     if (dict_bytes_cnt_ != 0) {
Nit: check isn't necessary - Release() does this for you.


http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@194
PS8, Line 194:     // TODO: AnyLimitExceeded() can be called to check if memory limit has been exceeded.
The TODO should be to use TryConsume(). The asynchronous checks are not the direction we want to take things in. Michael Ho had a nice comment here explaining the direction we're trying to take things: https://issues.apache.org/jira/browse/IMPALA-2399



-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 8
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 13 Oct 2017 22:49:17 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Pranay Singh has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 16:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@54
PS15, Line 54:     DCHECK(buffered_indices_.empty());
> Ok to leave for now but we've generally been moving towards using the recen
Done


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@59
PS15, Line 59: / This
> DCHECK_EQ(dict_bytes_cnt_, 0);
Done


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@77
PS15, Line 77:   /// indices. Used to size the buffer passed to WriteData().
> To add to what Tim said, this code didn't actually change, so it is nice to
Done


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@77
PS15, Line 77:   /// indices. Used to size the buffer passed to WriteData().
> We generally use the more concise one-line formatting for very short single
Done


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@123
PS15, Line 123: 
> formatting nit: * goes on the left with the type name.
Done


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@152
PS15, Line 152:       }
              :       num_call_track_++;
              :     }
              :   }
              : };
> I think we can omit this destructor, since DictEncoderBase does the same th
Done


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@168
PS15, Line 168: /// th
> nit: inline isn't necessary. It isn't harmful but can be confusing to have 
Done


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@269
PS15, Line 269: 
> nit: * should be on left.
Done


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@296
PS15, Line 296: 
              : template<typename T>
              : class DictDecoder : public DictDecoderBase {
              :  public:
              :   //
> I think we can omit this destructor, as DictDecoderBase does this.
Done


http://gerrit.cloudera.org:8080/#/c/8034/15/be/src/util/dict-encoding.h@330
PS15, Line 330: /// It
> nit: unnecessary inline
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 16
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 10 Nov 2017 06:22:29 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Pranay Singh has posted comments on this change.

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8034/1/be/src/exec/exec-node.h
File be/src/exec/exec-node.h:

PS1, Line 215:  MemTracker* decoder_mem_tracker() { return decoder_mem_tracker_.get(); }
> The dictionary is only used for Parquet, so I think the MemPool should not 
I also think using existing memory pool, dictionary_pool_  is a better option, I missed it. I'll use dictionary_pool_ instead


http://gerrit.cloudera.org:8080/#/c/8034/1/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

PS1, Line 213: dict_decoder_(new MemPool(parent->scan_node_->decoder_mem_tracker())),
> It is important not to create objects on a per-column basis unless truly ne
I'll use dictionary_pool_ instead of creating a new memory pool for every column reads


http://gerrit.cloudera.org:8080/#/c/8034/1/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

PS1, Line 234: *val_ptr = *dict_[index];
> dict_ needs to return T's directly rather than T*'s. This is a performance 
I'll use memcpy to copy the contents of dictionary to the buffer passed.


-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 21: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 21
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Sat, 06 Jan 2018 01:30:35 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 21:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1681/


-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 21
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 05 Jan 2018 21:52:30 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 21: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 21
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 05 Jan 2018 21:52:39 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/8034/8/be/src/util/dict-encoding.h@251
PS8, Line 251: ReleaseBytes();
> Checking the dict_bytes_cnt_ to be zero seems to be tricky say even if it i
I agree with Joe's comment. All codepaths should call Close(), so we can just add a DCHECK here to enforce that they did so.



-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 8
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 13 Oct 2017 22:50:19 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Pranay Singh (Code Review)" <ge...@cloudera.org>.
Hello Joe McDonnell, Bikramjeet Vig, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8034

to look at the new patch set (#9).

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................

IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Currently DictDecoder class and DictEncoder class uses std::vector
to store the tables mapping codeword to value and vice-versa. It is
hard to detect the memory usage by these tables when they becomes
very large, since this memory is not accounted by Impala's memory
mangement infrastructure.

This patch introduces memory tracker to track the memory used
by dictionary used in DictDecoder class and DictEncoder class
when a parquet file is read from or written to.

Memory tracker is introduced at the class HdfsParquetScanner level to
track all the memory used by dictonary in DictDecoders. Similarly
memory tracker is introduced in class HdfsParquetTableWriter to
track the memory used by dictionary in DictEncoders.

Memory for the dictionary, stored as std::vector is still allocated
from std:allocator but the amount allocated is accounted by
introducing a counter which is incremented and decremented as the
memory is consumed and released by vector.

Testing
-------
Ran all the backend and end-end tests with no failures.

Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
8 files changed, 170 insertions(+), 21 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/34/8034/9
-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 9
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh

[Impala-ASF-CR] IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8034 )

Change subject: IMPALA-5522:Use tracked memory for DictDecoder and DictEncoder
......................................................................


Patch Set 17: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/8034
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I02a3b54f6c107d19b62ad9e1c49df94175964299
Gerrit-Change-Number: 8034
Gerrit-PatchSet: 17
Gerrit-Owner: Pranay Singh
Gerrit-Reviewer: Bikramjeet Vig <bi...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Pranay Singh
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Fri, 08 Dec 2017 00:57:09 +0000
Gerrit-HasComments: No