You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Joe McDonnell (Code Review)" <ge...@cloudera.org> on 2017/02/28 21:53:41 UTC

[Impala-ASF-CR] Fix parquet table writer dictionary leak

Joe McDonnell has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/6181

Change subject: Fix parquet table writer dictionary leak
......................................................................

Fix parquet table writer dictionary leak

Currently, in HdfsTableSink, OutputPartitions are added to the RuntimeState
object pool to be freed at the end of the query. However, for clustered inserts
into a partitioned table, the OutputPartitions are only used one at a time.
They can be immediately freed once done writing to that partition.

In addition, the HdfsParquetTableWriter's ColumnWriters are also added to
this object pool. These constitute a significant amount of memory, as they
contain the dictionaries for Parquet encoding.

This change makes HdfsParquetTableWriter's ColumnWriters use unique_ptrs so
that they are cleaned up when the HdfsParquetTableWriter is deleted. It also
explicitly cleans up the OutputPartition rather than leaving it to the object
pool.

Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/hdfs-table-sink.cc
3 files changed, 14 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/81/6181/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>

[Impala-ASF-CR] Fix parquet table writer dictionary leak

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has uploaded a new patch set (#2).

Change subject: Fix parquet table writer dictionary leak
......................................................................

Fix parquet table writer dictionary leak

Currently, in HdfsTableSink, OutputPartitions are added to the RuntimeState
object pool to be freed at the end of the query. However, for clustered inserts
into a partitioned table, the OutputPartitions are only used one at a time.
They can be immediately freed once done writing to that partition.

In addition, the HdfsParquetTableWriter's ColumnWriters are also added to
this object pool. These constitute a significant amount of memory, as they
contain the dictionaries for Parquet encoding.

This change makes HdfsParquetTableWriter's ColumnWriters use unique_ptrs so
that they are cleaned up when the HdfsParquetTableWriter is deleted. It also
explicitly cleans up the OutputPartition rather than leaving it to the object
pool.

Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/hdfs-table-sink.cc
3 files changed, 13 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/81/6181/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-4899: Fix parquet table writer dictionary leak

Posted by "Marcel Kornacker (Code Review)" <ge...@cloudera.org>.
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4899: Fix parquet table writer dictionary leak
......................................................................


Patch Set 6: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4899: Fix parquet table writer dictionary leak

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-4899: Fix parquet table writer dictionary leak
......................................................................


Patch Set 6: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] Fix parquet table writer dictionary leak

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: Fix parquet table writer dictionary leak
......................................................................


Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/6181/1/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 334:       OutputPartition* deletable_partition = current_clustered_partition_->first;
Why not just delete it here and save the temp variable?


Line 336:       delete(deletable_partition);
delete deletable_partition; - delete is a keyword instead of a function


Line 579:     OutputPartition* partition = new OutputPartition();
Could we use a unique_ptr here and in PartitionPair to make the ownership explicit and cleanup automatic? I think this would be a unique_ptr, and it would be move()d into the map .


Line 587:       delete(partition);
delete partition;


-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4899: Fix parquet table writer dictionary leak

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4899: Fix parquet table writer dictionary leak
......................................................................


Patch Set 4: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4899: Fix parquet table writer dictionary leak

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6181

to look at the new patch set (#5).

Change subject: IMPALA-4899: Fix parquet table writer dictionary leak
......................................................................

IMPALA-4899: Fix parquet table writer dictionary leak

Currently, in HdfsTableSink, OutputPartitions are added to the RuntimeState
object pool to be freed at the end of the query. However, for clustered inserts
into a partitioned table, the OutputPartitions are only used one at a time.
They can be immediately freed once done writing to that partition.

In addition, the HdfsParquetTableWriter's ColumnWriters are also added to
this object pool. These constitute a significant amount of memory, as they
contain the dictionaries for Parquet encoding.

This change makes HdfsParquetTableWriter's ColumnWriters use unique_ptrs so
that they are cleaned up when the HdfsParquetTableWriter is deleted. It also
uses a unique_ptr on the PartitionPair for the OutputPartition.

The table writers maintain a pointer to the OutputPartition. This remains a
raw pointer. This is safe, because OutputPartition has a scoped_ptr to the
table writer. The table writer will never outlive the OutputPartition.

Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/hdfs-table-writer.h
5 files changed, 26 insertions(+), 14 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/81/6181/5
-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-4899: Fix parquet table writer dictionary leak

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-4899: Fix parquet table writer dictionary leak
......................................................................


Patch Set 4:

> is there a way to ascertain the leak-free behavior in a test?

I don't see a difference in the runtime profile with this patch vs without. The way I have been testing is by hand. I do a clustered insert into a partitioned table and look at the memory usage in top. It is a very clear difference, but I'm not sure how to write a test around it. I tried setting a mem_limit and testing without my patch, but the mem_limit is not enforced.

-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4899: Fix parquet table writer dictionary leak

Posted by "Marcel Kornacker (Code Review)" <ge...@cloudera.org>.
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4899: Fix parquet table writer dictionary leak
......................................................................


Patch Set 4:

is there a way to ascertain the leak-free behavior in a test?

-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4899: Fix parquet table writer dictionary leak

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-4899: Fix parquet table writer dictionary leak
......................................................................


Patch Set 3: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6181/3/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

Line 172:           const std::unique_ptr<OutputPartition>& output_partition,
Just passing the raw pointer seems ok to me - I think you can do everything with the const unique_ptr that you can do with a raw pointer, so the type doesn't convey much additional info.

I don't feel that strongly about it though


-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

Re: [Impala-ASF-CR] IMPALA-4899: Fix parquet table writer dictionary leak

Posted by Tim Armstrong <ta...@cloudera.com>.
The problem is that the memory is untracked so it won't be accounted
against the query.

You can look at process RSS or the process memory consumption on the /memz
to see the total memory consumed including untracked allocations.

The last time I ran into something like this I used a script from
stackoverflow to graph the memory consumption:
https://issues.cloudera.org/browse/IMPALA-2940
http://stackoverflow.com/questions/7998302/graphing-a-processs-memory-usage

On Wed, Mar 1, 2017 at 4:21 PM, Marcel Kornacker (Code Review) <
gerrit@cloudera.org> wrote:

> Marcel Kornacker has posted comments on this change.
>
> Change subject: IMPALA-4899: Fix parquet table writer dictionary leak
> ......................................................................
>
>
> Patch Set 4:
>
> (1 comment)
>
> http://gerrit.cloudera.org:8080/#/c/6181/4/be/src/exec/hdfs-table-sink.h
> File be/src/exec/hdfs-table-sink.h:
>
> Line 184:   typedef std::pair<std::unique_ptr<OutputPartition>,
> std::vector<int32_t>> PartitionPair;
> > please include updates to the class comments (of the affected classes)
> that
> to expand on that: your commit message explains the intention behind the
> unique_ptrs. however, that's not visible from the perspective of the next
> person looking at this code, so it's better to have this directly in the
> code.
>
>
> --
> To view, visit http://gerrit.cloudera.org:8080/6181
> To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
>
> Gerrit-MessageType: comment
> Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
> Gerrit-PatchSet: 4
> Gerrit-Project: Impala-ASF
> Gerrit-Branch: master
> Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
> Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
> Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
> Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
> Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
> Gerrit-HasComments: Yes
>

[Impala-ASF-CR] IMPALA-4899: Fix parquet table writer dictionary leak

Posted by "Marcel Kornacker (Code Review)" <ge...@cloudera.org>.
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4899: Fix parquet table writer dictionary leak
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6181/4/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

Line 184:   typedef std::pair<std::unique_ptr<OutputPartition>, std::vector<int32_t>> PartitionPair;
> please include updates to the class comments (of the affected classes) that
to expand on that: your commit message explains the intention behind the unique_ptrs. however, that's not visible from the perspective of the next person looking at this code, so it's better to have this directly in the code.


-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4899: Fix parquet table writer dictionary leak

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged.

Change subject: IMPALA-4899: Fix parquet table writer dictionary leak
......................................................................


IMPALA-4899: Fix parquet table writer dictionary leak

Currently, in HdfsTableSink, OutputPartitions are added to the RuntimeState
object pool to be freed at the end of the query. However, for clustered inserts
into a partitioned table, the OutputPartitions are only used one at a time.
They can be immediately freed once done writing to that partition.

In addition, the HdfsParquetTableWriter's ColumnWriters are also added to
this object pool. These constitute a significant amount of memory, as they
contain the dictionaries for Parquet encoding.

This change makes HdfsParquetTableWriter's ColumnWriters use unique_ptrs so
that they are cleaned up when the HdfsParquetTableWriter is deleted. It also
uses a unique_ptr on the PartitionPair for the OutputPartition.

The table writers maintain a pointer to the OutputPartition. This remains a
raw pointer. This is safe, because OutputPartition has a scoped_ptr to the
table writer. The table writer will never outlive the OutputPartition.

Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Reviewed-on: http://gerrit.cloudera.org:8080/6181
Reviewed-by: Marcel Kornacker <ma...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/hdfs-table-writer.h
5 files changed, 26 insertions(+), 14 deletions(-)

Approvals:
  Marcel Kornacker: Looks good to me, approved
  Impala Public Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] Fix parquet table writer dictionary leak

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: Fix parquet table writer dictionary leak
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6181/2//COMMIT_MSG
Commit Message:

Line 7: Fix parquet table writer dictionary leak
JIRA?


http://gerrit.cloudera.org:8080/#/c/6181/1/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 579:     OutputPartition* partition = new OutputPartition();
> The issue is that the OutputPartition is passed down to the table writers, 
The OutputPartition owns the HdfsTableWriter via a scoped_ptr so we can safely use a raw OutputPartition* pointer in HdfsTableWriter. This pattern of an owning unique_ptr/scoped_ptr in one direction and a raw pointer for the inverse shows up elsewhere in the code (e.g. for "parent" pointers). 

I'd probably add a comment on HdfsTableWriter.output_ to note that ownership relationship but I think that would be clear and safe.


-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4899: Fix parquet table writer dictionary leak

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-4899: Fix parquet table writer dictionary leak
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6181/4/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

Line 184:   typedef std::pair<std::unique_ptr<OutputPartition>, std::vector<int32_t>> PartitionPair;
> to expand on that: your commit message explains the intention behind the un
Improved comments here and in hdfs-parquet-table-writer.h.


-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4899: Fix parquet table writer dictionary leak

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-4899: Fix parquet table writer dictionary leak
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6181/2//COMMIT_MSG
Commit Message:

Line 7: Fix parquet table writer dictionary leak
> JIRA?
Done


http://gerrit.cloudera.org:8080/#/c/6181/1/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 579:     OutputPartition* partition = new OutputPartition();
> The OutputPartition owns the HdfsTableWriter via a scoped_ptr so we can saf
Changed to use the unique_ptr with a raw pointer in HdfsTableWriter.output_.


-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4899: Fix parquet table writer dictionary leak

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-4899: Fix parquet table writer dictionary leak
......................................................................


Patch Set 6:

Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/329/

-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4899: Fix parquet table writer dictionary leak

Posted by "Marcel Kornacker (Code Review)" <ge...@cloudera.org>.
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4899: Fix parquet table writer dictionary leak
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6181/4/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

Line 184:   typedef std::pair<std::unique_ptr<OutputPartition>, std::vector<int32_t>> PartitionPair;
please include updates to the class comments (of the affected classes) that briefly describe the memory management intentions. since we got this wrong last time around it appears to be subtle/non-obvious enough to warrant a description.


-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4899: Fix parquet table writer dictionary leak

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has uploaded a new patch set (#3).

Change subject: IMPALA-4899: Fix parquet table writer dictionary leak
......................................................................

IMPALA-4899: Fix parquet table writer dictionary leak

Currently, in HdfsTableSink, OutputPartitions are added to the RuntimeState
object pool to be freed at the end of the query. However, for clustered inserts
into a partitioned table, the OutputPartitions are only used one at a time.
They can be immediately freed once done writing to that partition.

In addition, the HdfsParquetTableWriter's ColumnWriters are also added to
this object pool. These constitute a significant amount of memory, as they
contain the dictionaries for Parquet encoding.

This change makes HdfsParquetTableWriter's ColumnWriters use unique_ptrs so
that they are cleaned up when the HdfsParquetTableWriter is deleted. It also
uses a unique_ptr on the PartitionPair for the OutputPartition.

The table writers maintain a pointer to the OutputPartition. This remains a
raw pointer. This is safe, because OutputPartition has a scoped_ptr to the
table writer. The table writer will never outlive the OutputPartition.

Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/hdfs-table-writer.h
5 files changed, 31 insertions(+), 22 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/81/6181/3
-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-4899: Fix parquet table writer dictionary leak

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6181

to look at the new patch set (#4).

Change subject: IMPALA-4899: Fix parquet table writer dictionary leak
......................................................................

IMPALA-4899: Fix parquet table writer dictionary leak

Currently, in HdfsTableSink, OutputPartitions are added to the RuntimeState
object pool to be freed at the end of the query. However, for clustered inserts
into a partitioned table, the OutputPartitions are only used one at a time.
They can be immediately freed once done writing to that partition.

In addition, the HdfsParquetTableWriter's ColumnWriters are also added to
this object pool. These constitute a significant amount of memory, as they
contain the dictionaries for Parquet encoding.

This change makes HdfsParquetTableWriter's ColumnWriters use unique_ptrs so
that they are cleaned up when the HdfsParquetTableWriter is deleted. It also
uses a unique_ptr on the PartitionPair for the OutputPartition.

The table writers maintain a pointer to the OutputPartition. This remains a
raw pointer. This is safe, because OutputPartition has a scoped_ptr to the
table writer. The table writer will never outlive the OutputPartition.

Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
---
M be/src/exec/hdfs-parquet-table-writer.cc
M be/src/exec/hdfs-parquet-table-writer.h
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-sink.h
M be/src/exec/hdfs-table-writer.h
5 files changed, 16 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/81/6181/4
-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-4899: Fix parquet table writer dictionary leak

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: IMPALA-4899: Fix parquet table writer dictionary leak
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/6181/3/be/src/exec/hdfs-table-sink.h
File be/src/exec/hdfs-table-sink.h:

Line 172:           const std::unique_ptr<OutputPartition>& output_partition,
> Just passing the raw pointer seems ok to me - I think you can do everything
You're right. Now that I think about it, there isn't much reason to use a const unique_ptr reference. I've switched it back to a raw pointer.


-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] Fix parquet table writer dictionary leak

Posted by "Joe McDonnell (Code Review)" <ge...@cloudera.org>.
Joe McDonnell has posted comments on this change.

Change subject: Fix parquet table writer dictionary leak
......................................................................


Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/6181/1/be/src/exec/hdfs-table-sink.cc
File be/src/exec/hdfs-table-sink.cc:

Line 334:       OutputPartition* deletable_partition = current_clustered_partition_->first;
> Why not just delete it here and save the temp variable?
Done


Line 336:       delete(deletable_partition);
> delete deletable_partition; - delete is a keyword instead of a function
Done


Line 579:     OutputPartition* partition = new OutputPartition();
> Could we use a unique_ptr here and in PartitionPair to make the ownership e
The issue is that the OutputPartition is passed down to the table writers, which store it and access it later (see HdfsTableWriter.output_). Other smart pointer options:

1. PartitionPair has shared_ptr to OutputPartition. HdfsTableWriter has a shared_ptr to OutputPartition. When ready to free the OutputPartition, the writer needs to release its shared_ptr (or the OutputPartition needs to reset its scoped_ptr to the writer), etc. But we have to explicitly break the circle.
2. PartitionPair has a shared_ptr to OutputPartition. HdfsTableWriter has a weak_ptr to OutputPartition. The weak_ptr must be upgraded to shared_ptr to access the OutputPartition. When ready to free the OutputPartition, the HdfsTableWriter is in the weak_ptr state, so the free goes through normally.

I'm new to smart pointers, so there might be something I'm missing.


Line 587:       delete(partition);
> delete partition;
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4899: Fix parquet table writer dictionary leak

Posted by "Marcel Kornacker (Code Review)" <ge...@cloudera.org>.
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-4899: Fix parquet table writer dictionary leak
......................................................................


Patch Set 5: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/6181
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I06e354086ad24071d4fbf823f25f5df23933688f
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Joe McDonnell <jo...@cloudera.com>
Gerrit-Reviewer: Lars Volker <lv...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No