You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@impala.apache.org by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org> on 2017/08/10 02:54:23 UTC

[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

Taras Bobrovytsky has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7639

Change subject: MPALA-5776: Write partial tuple to the correct mempool
......................................................................

MPALA-5776: Write partial tuple to the correct mempool

In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by writing the varlen partial tuple data to
boundary_pool_, which is where the constant length partial tuple data is
written.

Testing:
- Ran some tests locally on ASAN build.
- No new tests were added, because it is difficult to construct test
  cases due to the issue being non-deterministic.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
---
M be/src/exec/hdfs-text-scanner.cc
1 file changed, 11 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>

[Impala-ASF-CR] IMPALA-5776: Write partial tuple to the correct mempool

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Taras Bobrovytsky has uploaded a new patch set (#3).

Change subject: IMPALA-5776: Write partial tuple to the correct mempool
......................................................................

IMPALA-5776: Write partial tuple to the correct mempool

In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by creating a new memory pool for the partial tuple
will contains partial tuple contant length and variable length data.

Testing:
- Updated test_scanners_fuzz.py to make slightly more significant
  changes to the data files.
- Ran scanner tests locally on ASAN build.
- No new tests were added, because it is difficult to construct test
  cases due to the issue being non-deterministic.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
---
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-scanner.h
M tests/query_test/test_scanners_fuzz.py
3 files changed, 42 insertions(+), 37 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5776: Write partial tuple to the correct mempool

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Taras Bobrovytsky has uploaded a new patch set (#3).

Change subject: IMPALA-5776: Write partial tuple to the correct mempool
......................................................................

IMPALA-5776: Write partial tuple to the correct mempool

In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by creating a new memory pool for the partial tuple
will contains partial tuple contant length and variable length data.

Testing:
- Updated test_scanners_fuzz.py to make slightly more significant
  changes to the data files.
- Ran scanner tests locally on ASAN build.
- No new tests were added, because it is difficult to construct test
  cases due to the issue being non-deterministic.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
---
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-scanner.h
M tests/query_test/test_scanners_fuzz.py
3 files changed, 43 insertions(+), 37 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Taras Bobrovytsky has uploaded a new patch set (#2).

Change subject: MPALA-5776: Write partial tuple to the correct mempool
......................................................................

MPALA-5776: Write partial tuple to the correct mempool

In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by writing the varlen partial tuple data to
boundary_pool_, which is where the constant length partial tuple data is
written.

Testing:
- Ran some tests locally on ASAN build.
- No new tests were added, because it is difficult to construct test
  cases due to the issue being non-deterministic.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
---
M be/src/exec/hdfs-text-scanner.cc
1 file changed, 13 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.

Tim Armstrong has posted comments on this change.

Change subject: MPALA-5776: Write partial tuple to the correct mempool
......................................................................


Patch Set 5: Code-Review+1

Thanks for your patience, this is a lot easier to understand. Will let Alex +2 once he's had a look.

-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5776: Write partial tuple to the correct mempool

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.

Impala Public Jenkins has submitted this change and it was merged.

Change subject: IMPALA-5776: Write partial tuple to the correct mempool
......................................................................


IMPALA-5776: Write partial tuple to the correct mempool

In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by writing the varlen partial tuple data to
boundary_pool, which is where the constant length partial tuple data is
written. We also make sure that boundary pool does not hold any tuple
data of returned batches by always deep copying it to output batches.

Testing:
- Ran some tests locally on ASAN build.
- Updated test_scanners_fuzz.py to make slightly more significant
  changes to the data files. This change was helpful for finding
  issues while developing this patch.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Reviewed-on: http://gerrit.cloudera.org:8080/7639
Reviewed-by: Taras Bobrovytsky <tb...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-scanner.h
M tests/query_test/test_scanners_fuzz.py
3 files changed, 65 insertions(+), 51 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Taras Bobrovytsky: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5776: Write partial tuple to the correct mempool

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.

Alex Behm has posted comments on this change.

Change subject: IMPALA-5776: Write partial tuple to the correct mempool
......................................................................


Patch Set 4: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7639/4/tests/query_test/test_scanners_fuzz.py
File tests/query_test/test_scanners_fuzz.py:

Line 217:     num_corruptions = rng.randint(0, int(math.log(len(data))))
Do these changes reproduce the bug?


-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5776: Write partial tuple to the correct mempool

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Taras Bobrovytsky has uploaded a new patch set (#4).

Change subject: IMPALA-5776: Write partial tuple to the correct mempool
......................................................................

IMPALA-5776: Write partial tuple to the correct mempool

In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by creating a new memory pool for the partial tuple
will contains partial tuple contant length and variable length data.

Testing:
- Updated test_scanners_fuzz.py to make slightly more significant
  changes to the data files.
- Ran scanner tests locally on ASAN build.
- No new tests were added, because it is difficult to construct test
  cases due to the issue being non-deterministic.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
---
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-scanner.h
M tests/query_test/test_scanners_fuzz.py
3 files changed, 46 insertions(+), 40 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.

Alex Behm has posted comments on this change.

Change subject: MPALA-5776: Write partial tuple to the correct mempool
......................................................................


Patch Set 5: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7639/5/be/src/exec/hdfs-text-scanner.cc
File be/src/exec/hdfs-text-scanner.cc:

Line 297:         boundary_column_.Clear();
> I tried it and it doesn't quite work (without clearing the boundary field)
Ahh you are right, I missed that one memcpy(). Thanks for investigating, I agree the current solution is least confusing.


-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5776: Write partial tuple to the correct mempool

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.

Alex Behm has posted comments on this change.

Change subject: IMPALA-5776: Write partial tuple to the correct mempool
......................................................................

Patch Set 3:

(7 comments)

I like the new solution, it's easier to understand.

http://gerrit.cloudera.org:8080/#/c/7639/2/be/src/exec/hdfs-text-scanner.cc
File be/src/exec/hdfs-text-scanner.cc:

Line 269: while (true) {
> It's used on line 277. Are you suggesting to remove it from the signature o
Oh I missed that sorry. Ok to leave.

http://gerrit.cloudera.org:8080/#/c/7639/3/be/src/exec/hdfs-text-scanner.cc
File be/src/exec/hdfs-text-scanner.cc:

Line 804: // Copy over all materialized slots in the partial tuple.
Only describes what the code does. How about something like:

// Deep copy to simplify memory ownership.

Line 808: // prevent the accumulation of variable length data in it. We also recreate the
The partial tuple isn't re-created here

http://gerrit.cloudera.org:8080/#/c/7639/3/be/src/exec/hdfs-text-scanner.h
File be/src/exec/hdfs-text-scanner.h:

Line 186: /// Utility function to write out 'num_fields' to 'tuple_'. This is used to parse
Comment is weird, I suggest rephrasing to something like:

Utility function to parse 'num_fields' and materialize the resulting slots into 'partial_tuple_'. The data of var-len fields is copied into 'partial_tuple_pool_'.

Line 227: /// Mem pool for the partial tuple.
Mention that it does not hold tuple data of returned batches because the data is always deep-copied into the output batch

Line 230: /// Memory to store partial tuples split across buffers. Memory comes from
Mention that this tuple is always deep copied into the output batch

Line 231: /// partial_tuple_pool_. There is only one tuple allocated for this object and reused
The 2nd sentence does not seem to apply anymore because we realloc the partial_tuple_

--
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Taras Bobrovytsky has posted comments on this change.

Change subject: MPALA-5776: Write partial tuple to the correct mempool
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7639/5//COMMIT_MSG
Commit Message:

PS5, Line 7: MPALA
Oops.


-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5776: Write partial tuple to the correct mempool

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Taras Bobrovytsky has uploaded a new patch set (#3).

Change subject: IMPALA-5776: Write partial tuple to the correct mempool
......................................................................

IMPALA-5776: Write partial tuple to the correct mempool

In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by creating a new memory pool for the partial tuple
will contains partial tuple contant length and variable length data.

Testing:
- Ran some tests locally on ASAN build.
- No new tests were added, because it is difficult to construct test
  cases due to the issue being non-deterministic.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
---
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-scanner.h
2 files changed, 29 insertions(+), 27 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5776: Write partial tuple to the correct mempool

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Taras Bobrovytsky has posted comments on this change.

Change subject: IMPALA-5776: Write partial tuple to the correct mempool
......................................................................


Patch Set 4:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7639/4/be/src/exec/hdfs-text-scanner.h
File be/src/exec/hdfs-text-scanner.h:

Line 194:   boost::scoped_ptr<MemPool> boundary_pool_;
> I still don't understand the memory lifetime for boundary_row_, boundary_co
Data is not always copied out of boundary_col. It can end up here: http://github.mtv.cloudera.com/CDH/Impala/blob/e5e444a89008a22f987c517ef1ecaa9f1693b060/be/src/exec/text-converter.inline.h#L71

This happens if reuse_data is true, which it always is if codegen is enabled. So it's not safe to free the boundary pool. So the life time of the boundary pool is different than the partial tuple pool.

Maybe this can be addressed in a different patch?


http://gerrit.cloudera.org:8080/#/c/7639/4/tests/query_test/test_scanners_fuzz.py
File tests/query_test/test_scanners_fuzz.py:

Line 217:     num_corruptions = rng.randint(0, int(math.log(len(data))))
> Do these changes reproduce the bug?
No, I don't think they reproduce the bug, but I think these changes are good to make in general.


-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.

Alex Behm has posted comments on this change.

Change subject: MPALA-5776: Write partial tuple to the correct mempool
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7639/5/be/src/exec/hdfs-text-scanner.cc
File be/src/exec/hdfs-text-scanner.cc:

Line 297:         boundary_column_.Clear();
Can we use the existing CopyBoundaryField() function for consistency? Otherwise, a future reader might be confused why we are not using CopyBoundaryField() here.

I believe we could call that after delimited_text_parser_->FillColumns(), instead of coping before calling FillColumns().


-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Taras Bobrovytsky has uploaded a new patch set (#5).

Change subject: MPALA-5776: Write partial tuple to the correct mempool
......................................................................

MPALA-5776: Write partial tuple to the correct mempool

In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by writing the varlen partial tuple data to
boundary_pool_, which is where the constant length partial tuple data is
written. We also make sure that boundary pool does not hold any tuple
data of returned batches by always deep copying it to output batches.

Testing:
- Ran some tests locally on ASAN build.
- Updated test_scanners_fuzz.py to make slightly more significant
  changes to the data files. This change was helpful for finding
  issues while developing this patch.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
---
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-scanner.h
M tests/query_test/test_scanners_fuzz.py
3 files changed, 65 insertions(+), 51 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.

Alex Behm has posted comments on this change.

Change subject: MPALA-5776: Write partial tuple to the correct mempool
......................................................................


Patch Set 2:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/7639/2/be/src/exec/hdfs-text-scanner.cc
File be/src/exec/hdfs-text-scanner.cc:

Line 269:     bool eosr = true;
Remove. This not needed/used.


Line 299:         boundary_column_.Clear();
> This is subtle because the contents of boundary_column_ are needed until Wr
What's the point of clearing the boundary column here? I don't think the scanner is going to touch it after this point.


Line 808:       boundary_row_.Reset();
> It might be clearer if we factored out this reset logic for boundary_ and p
Another idea: Add a separate function for copying the partial tuple into the output batch. All this logic can go inside there and we can document/DCHECK pre- and post-conditions.


Line 896:     int num_fields, bool copy_strings) {
Can you remove the 'copy_strings' parameter? It's not used in this function and not copying seems too complicated to reason about.


-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5776: Write partial tuple to the correct mempool

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Taras Bobrovytsky has uploaded a new patch set (#6).

Change subject: IMPALA-5776: Write partial tuple to the correct mempool
......................................................................

IMPALA-5776: Write partial tuple to the correct mempool

In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by writing the varlen partial tuple data to
boundary_pool, which is where the constant length partial tuple data is
written. We also make sure that boundary pool does not hold any tuple
data of returned batches by always deep copying it to output batches.

Testing:
- Ran some tests locally on ASAN build.
- Updated test_scanners_fuzz.py to make slightly more significant
  changes to the data files. This change was helpful for finding
  issues while developing this patch.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
---
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-scanner.h
M tests/query_test/test_scanners_fuzz.py
3 files changed, 65 insertions(+), 51 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/6
-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.

Tim Armstrong has posted comments on this change.

Change subject: MPALA-5776: Write partial tuple to the correct mempool
......................................................................

Patch Set 2:

(10 comments)

This is very subtle. I think the solution works but maybe there are some ways we can make it clearer why it works.

http://gerrit.cloudera.org:8080/#/c/7639/2//COMMIT_MSG
Commit Message:

Line 7: MPALA-5776: Write partial tuple to the correct mempool
Missing I.

http://gerrit.cloudera.org:8080/#/c/7639/2/be/src/exec/hdfs-text-scanner.cc
File be/src/exec/hdfs-text-scanner.cc:

Line 169: row_batch->tuple_data_pool()->AcquireData(boundary_pool_.get(), false);
Does this even make sense? are there any cases where it's valid for the row batch to reference data in the boundary pool directly?

It seems like either this is unnecessary or we should be transferring the contents of boundary_pool_ below instead of clearing it.

Line 299: boundary_column_.Clear();
This is subtle because the contents of boundary_column_ are needed until WriteFields() below, I think. Maybe it would be clearer to document this or even move it below to the point where the data is no longer needed.

PS2, Line 765: batches
I'm not sure that I understand the reference to batches. Do they mean blocks?

Line 806: partial_tuple_->DeepCopy(tuple_, *scan_node_->tuple_desc(), pool);
Can you add a column clarifying what the intent is. E.g. "Copy over all materialized slots in the partial tuple".

Line 808: boundary_row_.Reset();
It might be clearer if we factored out this reset logic for boundary_ and partial_tuple_ into a separate function and documented the pre-conditions for it. I.e. the row has been fully materialized and all memory has been copied into the RowBatch pool (I think that's the precondition but not sure).

PS2, Line 810: emptied
cleared? The comment makes me thing that all the memory has been freed, but Clear() just enables recycling of the chunks.

Line 814: partial_tuple_ = Tuple::Create(tuple_byte_size_, boundary_pool_.get());
Is it possible to just initialize partial_tuple_ when it's needed? I.e. before we call InitTuple(template_tuple_, partial_tuple_). Then maybe we can get rid of partial_tuple_empty_ and represent that as partition_tuple_ == nullptr.

http://gerrit.cloudera.org:8080/#/c/7639/2/be/src/exec/hdfs-text-scanner.h
File be/src/exec/hdfs-text-scanner.h:

Line 188: /// the boundary pool.
Hah! If only the code had matched the comment...

Line 194: /// Mem pool for boundary_row_ and boundary_column_.
Needs updating since it also backs partial_tuple_.

--
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5776: Write partial tuple to the correct mempool

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5776: Write partial tuple to the correct mempool
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7639/4/be/src/exec/hdfs-text-scanner.h
File be/src/exec/hdfs-text-scanner.h:

Line 194:   boost::scoped_ptr<MemPool> boundary_pool_;
> So WriteSlot() doesn't copy when processing it's input string, but why are 
So my question is really whether there's a good reason why the lifetime needs to be different. If it's just that we're missing a copy in some place then we should fix that instead of trying to work around it.


-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Taras Bobrovytsky has uploaded a new patch set (#2).

Change subject: MPALA-5776: Write partial tuple to the correct mempool
......................................................................

MPALA-5776: Write partial tuple to the correct mempool

In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by writing the varlen partial tuple data to
boundary_pool_, which is where the constant length partial tuple data is
written.

Testing:
- Ran some tests locally on ASAN build.
- No new tests were added, because it is difficult to construct test
  cases due to the issue being non-deterministic.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
---
M be/src/exec/hdfs-text-scanner.cc
1 file changed, 13 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5776: Write partial tuple to the correct mempool

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Taras Bobrovytsky has uploaded a new patch set (#4).

Change subject: IMPALA-5776: Write partial tuple to the correct mempool
......................................................................

IMPALA-5776: Write partial tuple to the correct mempool

In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by creating a new memory pool for the partial tuple
constant and variable length data. The new memory pool does not hold
tuple data of returned batches because the data is always deep-copied
into the output batch.

Testing:
- Updated test_scanners_fuzz.py to make slightly more significant
  changes to the data files.
- Ran scanner tests locally on ASAN build.
- No new tests were added, because it is difficult to construct test
  cases due to the issue being non-deterministic.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
---
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-scanner.h
M tests/query_test/test_scanners_fuzz.py
3 files changed, 46 insertions(+), 40 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5776: Write partial tuple to the correct mempool

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5776: Write partial tuple to the correct mempool
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7639/4/be/src/exec/hdfs-text-scanner.h
File be/src/exec/hdfs-text-scanner.h:

Line 194:   boost::scoped_ptr<MemPool> boundary_pool_;
> Data is not always copied out of boundary_col. It can end up here: http://g
So WriteSlot() doesn't copy when processing it's input string, but why are we feeding WriteSlot() a string that's backed by boundary_column_? It seems like in other places we make efforts to avoid this with CopyBoundaryField().


-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5776: Write partial tuple to the correct mempool

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Taras Bobrovytsky has posted comments on this change.

Change subject: IMPALA-5776: Write partial tuple to the correct mempool
......................................................................


Patch Set 6: Code-Review+2

Rebased and fixed the commit message. Forwarding the +2.

-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5776: Write partial tuple to the correct mempool

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.

Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5776: Write partial tuple to the correct mempool
......................................................................


Patch Set 6:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1096/

-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Taras Bobrovytsky has uploaded a new patch set (#2).

Change subject: MPALA-5776: Write partial tuple to the correct mempool
......................................................................

MPALA-5776: Write partial tuple to the correct mempool

In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by writing the varlen partial tuple data to
boundary_pool_, which is where the constant length partial tuple data is
written.

Testing:
- Ran some tests locally on ASAN build.
- No new tests were added, because it is difficult to construct test
  cases due to the issue being non-deterministic.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
---
M be/src/exec/hdfs-text-scanner.cc
1 file changed, 13 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5776: Write partial tuple to the correct mempool

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Taras Bobrovytsky has posted comments on this change.

Change subject: IMPALA-5776: Write partial tuple to the correct mempool
......................................................................


Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/7639/3/be/src/exec/hdfs-text-scanner.cc
File be/src/exec/hdfs-text-scanner.cc:

Line 804:       // Copy over all materialized slots in the partial tuple.
> Only describes what the code does. How about something like:
Done


Line 808:       // prevent the accumulation of variable length data in it. We also recreate the
> The partial tuple isn't re-created here
Done


http://gerrit.cloudera.org:8080/#/c/7639/3/be/src/exec/hdfs-text-scanner.h
File be/src/exec/hdfs-text-scanner.h:

Line 186:   /// Utility function to write out 'num_fields' to 'tuple_'.  This is used to parse
> Comment is weird, I suggest rephrasing to something like:
Done


Line 227:   /// Mem pool for the partial tuple.
> Mention that it does not hold tuple data of returned batches because the da
Done


Line 230:   /// Memory to store partial tuples split across buffers.  Memory comes from
> Mention that this tuple is always deep copied into the output batch
Done


Line 231:   /// partial_tuple_pool_.  There is only one tuple allocated for this object and reused
> The 2nd sentence does not seem to apply anymore because we realloc the part
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5776: Write partial tuple to the correct mempool

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5776: Write partial tuple to the correct mempool
......................................................................

Patch Set 2:

(6 comments)

This is already easier to understand but I'm hoping we can avoid adding yet another MemPool to the scanners, particularly when the difference from boundary_pool_ isn't clear.

http://gerrit.cloudera.org:8080/#/c/7639/2/be/src/exec/hdfs-text-scanner.cc
File be/src/exec/hdfs-text-scanner.cc:

Line 169: row_batch->tuple_data_pool()->AcquireData(boundary_pool_.get(), false);
> Actually, after running some tests with ASAN, it turns out that it's not ok
It seemed like the way data in boundary_pool_ is handled is inconsistent - the code is inconsistent whether it should be transferred or not, particularly since it makes some effort to copy out data from boundary_column_ already. I made some comments on the latest PS.

http://gerrit.cloudera.org:8080/#/c/7639/4/be/src/exec/hdfs-text-scanner.cc
File be/src/exec/hdfs-text-scanner.cc:

PS4, Line 317: still need to handl
Comment needs update

http://gerrit.cloudera.org:8080/#/c/7639/4/be/src/exec/hdfs-text-scanner.h
File be/src/exec/hdfs-text-scanner.h:

Line 194: /// Mem pool for boundary_row_ and boundary_column_.
I still don't understand the memory lifetime for boundary_row_, boundary_col_ and this pool. It seems like it should be the same as partial_tuple_pool_. I.e memory is allocated from it only when processing a row straddling blocks, and the memory can be safely freed after we've finished materializing that row.

Line 199: /// the line as erroneous in case of parsing errors.
Can you clarify whether this data will be referenced by the output batches.

Line 202: /// Helper string for dealing with columns that span file blocks.
Same here. It looks like some attempt is made to copy data out of this column in CopyBoundaryField(), maybe it's just missing a case where it isn't copied out. This case looks suspicious:

// Missing columns or row delimiter at end of the file is ok, fill the row in.
char* col = boundary_column_.buffer();
int num_fields = 0;
RETURN_IF_ERROR(delimited_text_parser_->FillColumns<true>(boundary_column_.len(),
&col, &num_fields, field_locations_.data()));

Line 230: /// for boundary tuples.
Can you explain the lifetime of the memory? I.e. we allocate memory from this pool

--
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Taras Bobrovytsky has posted comments on this change.

Change subject: MPALA-5776: Write partial tuple to the correct mempool
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7639/2/be/src/exec/hdfs-text-scanner.cc
File be/src/exec/hdfs-text-scanner.cc:

Line 169:     row_batch->tuple_data_pool()->AcquireData(boundary_pool_.get(), false);
> I don't think that data from the boundary pool is ever used outside of this
Actually, after running some tests with ASAN, it turns out that it's not ok to clear the boundary pool. The boundary col contains some var len data that is used after Close() is called.

Also, what do you mean we should be transferring the contents of boundary pool below instead of clearing it? Do you mean in the else clause? If row batch is null, where would would we transfer the contents to?


-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5776: Write partial tuple to the correct mempool

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.

Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5776: Write partial tuple to the correct mempool
......................................................................


Patch Set 6: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5776: Write partial tuple to the correct mempool

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Taras Bobrovytsky has uploaded a new patch set (#3).

Change subject: IMPALA-5776: Write partial tuple to the correct mempool
......................................................................

IMPALA-5776: Write partial tuple to the correct mempool

In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by creating a new memory pool for the partial tuple
will contains partial tuple contant length and variable length data.

Testing:
- Ran some tests locally on ASAN build.
- No new tests were added, because it is difficult to construct test
  cases due to the issue being non-deterministic.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
---
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-scanner.h
2 files changed, 29 insertions(+), 27 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Taras Bobrovytsky has uploaded a new patch set (#2).

Change subject: MPALA-5776: Write partial tuple to the correct mempool
......................................................................

MPALA-5776: Write partial tuple to the correct mempool

In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by writing the varlen partial tuple data to
boundary_pool_, which is where the constant length partial tuple data is
written.

Testing:
- Ran some tests locally on ASAN build.
- No new tests were added, because it is difficult to construct test
  cases due to the issue being non-deterministic.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
---
M be/src/exec/hdfs-text-scanner.cc
1 file changed, 14 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Hello Alex Behm,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7639

to look at the new patch set (#5).

Change subject: MPALA-5776: Write partial tuple to the correct mempool
......................................................................

MPALA-5776: Write partial tuple to the correct mempool

In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by writing the varlen partial tuple data to
boundary_pool_, which is where the constant length partial tuple data is
written. We also make sure that boundary pool does not hold any tuple
data of returned batches by always deep copying it to output batches.

Testing:
- Ran some tests locally on ASAN build.
- Updated test_scanners_fuzz.py to make slightly more significant
  changes to the data files. This change was helpful for finding
  issues while developing this patch.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
---
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-scanner.h
M tests/query_test/test_scanners_fuzz.py
3 files changed, 65 insertions(+), 51 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5776: Write partial tuple to the correct mempool

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Taras Bobrovytsky has uploaded a new patch set (#4).

Change subject: IMPALA-5776: Write partial tuple to the correct mempool
......................................................................

IMPALA-5776: Write partial tuple to the correct mempool

In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by creating a new memory pool for the partial tuple
constant and variable length data. The new memory pool does not hold
tuple data of returned batches because the data is always deep-copied
into the output batch.

Testing:
- Updated test_scanners_fuzz.py to make slightly more significant
  changes to the data files.
- Ran scanner tests locally on ASAN build.
- No new tests were added, because it is difficult to construct test
  cases due to the issue being non-deterministic.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
---
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-scanner.h
M tests/query_test/test_scanners_fuzz.py
3 files changed, 46 insertions(+), 40 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5776: Write partial tuple to the correct mempool

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Hello Alex Behm, Tim Armstrong,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7639

to look at the new patch set (#6).

Change subject: IMPALA-5776: Write partial tuple to the correct mempool
......................................................................

IMPALA-5776: Write partial tuple to the correct mempool

In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by writing the varlen partial tuple data to
boundary_pool, which is where the constant length partial tuple data is
written. We also make sure that boundary pool does not hold any tuple
data of returned batches by always deep copying it to output batches.

Testing:
- Ran some tests locally on ASAN build.
- Updated test_scanners_fuzz.py to make slightly more significant
  changes to the data files. This change was helpful for finding
  issues while developing this patch.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
---
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-scanner.h
M tests/query_test/test_scanners_fuzz.py
3 files changed, 65 insertions(+), 51 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/6
-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5776: Write partial tuple to the correct mempool

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Taras Bobrovytsky has uploaded a new patch set (#3).

Change subject: IMPALA-5776: Write partial tuple to the correct mempool
......................................................................

IMPALA-5776: Write partial tuple to the correct mempool

In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by creating a new memory pool for the partial tuple
will contains partial tuple contant length and variable length data.

Testing:
- Updated test_scanners_fuzz.py to make slightly more significant
  changes to the data files.
- Ran scanner tests locally on ASAN build.
- No new tests were added, because it is difficult to construct test
  cases due to the issue being non-deterministic.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
---
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-scanner.h
M tests/query_test/test_scanners_fuzz.py
3 files changed, 42 insertions(+), 37 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/39/7639/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Taras Bobrovytsky has posted comments on this change.

Change subject: MPALA-5776: Write partial tuple to the correct mempool
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7639/5/be/src/exec/hdfs-text-scanner.cc
File be/src/exec/hdfs-text-scanner.cc:

Line 297:         boundary_column_.Clear();
> Can we use the existing CopyBoundaryField() function for consistency? Other
I tried it and it doesn't quite work (without clearing the boundary field)

This is a slightly different case. CopyBoundaryField() copies the contents of the field followed by the boundary column to memory allocated from the row batch.

In this case, after FillColumns(), field_locations is pointing at the data in the boundary column. We can trick CopyBoundaryField() by clearing the boundary column right before CopyBoundaryField(), but it's such a weird thing to do. (It looks weird because we just cleared the boundary column, why are we trying to copy out of it?)

Another thing we can do, is check that the field is pointing to a different memory location than the boundary column, but I also think that's weird.

I suggest we leave it as is (for me, the way it is now is the least confusing way). What do you think?


-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5776: Write partial tuple to the correct mempool

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Taras Bobrovytsky has posted comments on this change.

Change subject: IMPALA-5776: Write partial tuple to the correct mempool
......................................................................


Patch Set 4:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/7639/2/be/src/exec/hdfs-text-scanner.cc
File be/src/exec/hdfs-text-scanner.cc:

Line 808:       // The partial tuple pool is cleared after the data has been copied out of it to
> Another idea: Add a separate function for copying the partial tuple into th
Created a function for copying the partial tuple into output batch.


http://gerrit.cloudera.org:8080/#/c/7639/4/be/src/exec/hdfs-text-scanner.cc
File be/src/exec/hdfs-text-scanner.cc:

PS4, Line 317: partial_tuple_empty_
> Comment needs update
Done


http://gerrit.cloudera.org:8080/#/c/7639/4/be/src/exec/hdfs-text-scanner.h
File be/src/exec/hdfs-text-scanner.h:

Line 194:   boost::scoped_ptr<MemPool> boundary_pool_;
> So my question is really whether there's a good reason why the lifetime nee
Fixed the place where we forget to copy the data out. Removed the new memory pool that I added.


Line 199:   StringBuffer boundary_row_;
> Can you clarify whether this data will be referenced by the output batches.
Done


Line 202:   StringBuffer boundary_column_;
> Same here. It looks like some attempt is made to copy data out of this colu
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] MPALA-5776: Write partial tuple to the correct mempool

Posted by "Taras Bobrovytsky (Code Review)" <ge...@cloudera.org>.

Taras Bobrovytsky has posted comments on this change.

Change subject: MPALA-5776: Write partial tuple to the correct mempool
......................................................................


Patch Set 2:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/7639/2//COMMIT_MSG
Commit Message:

Line 7: MPALA-5776: Write partial tuple to the correct mempool
> Missing I.
Done


http://gerrit.cloudera.org:8080/#/c/7639/2/be/src/exec/hdfs-text-scanner.cc
File be/src/exec/hdfs-text-scanner.cc:

Line 169:     row_batch->tuple_data_pool()->AcquireData(boundary_pool_.get(), false);
> Does this even make sense? are there any cases where it's valid for the row
I don't think that data from the boundary pool is ever used outside of this class, so it makes sense to clear it.

I analyzed the code, and it doesn't look to me like data in boundary pool is ever used, but how can we be 100% sure?


Line 269:     bool eosr = true;
> Remove. This not needed/used.
It's used on line 277. Are you suggesting to remove it from the signature of FillByteBuffer as well?


Line 299:         boundary_column_.Clear();
> What's the point of clearing the boundary column here? I don't think the sc
Removed this.


PS2, Line 765: batches
> I'm not sure that I understand the reference to batches. Do they mean block
Yeah, "batches" doesn't make sense here. Changed it to blocks.


Line 806:       partial_tuple_->DeepCopy(tuple_, *scan_node_->tuple_desc(), pool);
> Can you add a column clarifying what the intent is. E.g. "Copy over all mat
Done


Line 808:       boundary_row_.Reset();
> Another idea: Add a separate function for copying the partial tuple into th
I added a new mem pool, so boundary row and columns are not affected any more. Alex, do you guys think it's still worth creating a separate function?


PS2, Line 810: emptied
> cleared? The comment makes me thing that all the memory has been freed, but
Done


Line 814:       partial_tuple_ = Tuple::Create(tuple_byte_size_, boundary_pool_.get());
> Is it possible to just initialize partial_tuple_ when it's needed? I.e. bef
Done


Line 896:     int num_fields, bool copy_strings) {
> Can you remove the 'copy_strings' parameter? It's not used in this function
Done


http://gerrit.cloudera.org:8080/#/c/7639/2/be/src/exec/hdfs-text-scanner.h
File be/src/exec/hdfs-text-scanner.h:

Line 188:   /// the boundary pool.
> Hah! If only the code had matched the comment...
totally agree


Line 194:   /// Mem pool for boundary_row_ and boundary_column_.
> Needs updating since it also backs partial_tuple_.
I made changes in the next patch so that this comment is now true.


-- 
To view, visit http://gerrit.cloudera.org:8080/7639
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tb...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes