You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@impala.apache.org by "Henry Robinson (Code Review)" <ge...@cloudera.org> on 2016/06/28 21:09:58 UTC

[Impala-CR](cdh5-trunk) IMPALA-3798: Disable per-split filtering for sequence-based scanners

Henry Robinson has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/3524

Change subject: IMPALA-3798: Disable per-split filtering for sequence-based scanners
......................................................................

IMPALA-3798: Disable per-split filtering for sequence-based scanners

If a runtime filter rejects a sequence-based format's header split (but
not the entire file, which may happen if the filter has not arrived in
time), the scanner will never mark all splits for that file
complete. This is because BaseSequenceScanner issues scan ranges after
parsing the header splits, and until those ranges are processed,
RangeComplete() and AddDiskIoRanges() will not be called - those methods
update progress_ and num_files_queued_
respectively. HdfsScanNode::ScannerThread() reads those variables to
decide whether to exit, and as a result will spin forever.

This bug therefore only shows up when there is >1 scan range per file.

This patch disables per-split filtering for Avro, RC and sequence files
in lieu of a permanent fix which marks all scan ranges for a file as
done as soon as one range is filtered out.

Testing:

A custom cluster test is added which disables file filtering, emulating
the race condition that leads to the hang when a query that filters
scan ranges is run. Without the fix, this test hangs, with the fix the
query completes as expected. MAX_SCAN_RANGE_LENGTH is used to ensure >1
scan range per file.

Change-Id: I4770dd77fd4258c24115d72b572c727b770bd75d
---
M be/src/common/global-flags.cc
M be/src/exec/hdfs-scan-node.cc
A tests/custom_cluster/test_seq_file_filtering.py
3 files changed, 86 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/24/3524/1
-- 
To view, visit http://gerrit.cloudera.org:8080/3524
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I4770dd77fd4258c24115d72b572c727b770bd75d
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Henry Robinson <he...@cloudera.com>

[Impala-CR](cdh5-trunk) IMPALA-3798: Disable per-split filtering for sequence-based scanners

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Hello Dan Hecht,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/3524

to look at the new patch set (#2).

Change subject: IMPALA-3798: Disable per-split filtering for sequence-based scanners
......................................................................

IMPALA-3798: Disable per-split filtering for sequence-based scanners

If a runtime filter rejects a sequence-based format's header split (but
not the entire file, which may happen if the filter has not arrived in
time), the scanner will never mark all splits for that file
complete. This is because BaseSequenceScanner issues scan ranges after
parsing the header splits, and until those ranges are processed,
RangeComplete() and AddDiskIoRanges() will not be called - those methods
update progress_ and num_unqueued_files_
respectively. HdfsScanNode::ScannerThread() reads those variables to
decide whether to exit, and as a result will spin forever.

This bug therefore only shows up when there is >1 scan range per file.

This patch disables per-split filtering for Avro, RC and sequence files
in lieu of a permanent fix which marks all scan ranges for a file as
done as soon as one range is filtered out.

Testing:

A custom cluster test is added which disables file filtering, emulating
the race condition that leads to the hang when a query that filters
scan ranges is run. Without the fix, this test hangs, with the fix the
query completes as expected. MAX_SCAN_RANGE_LENGTH is used to ensure >1
scan range per file.

Change-Id: I4770dd77fd4258c24115d72b572c727b770bd75d
---
M be/src/common/global-flags.cc
M be/src/exec/hdfs-scan-node.cc
A tests/custom_cluster/test_seq_file_filtering.py
3 files changed, 86 insertions(+), 10 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/24/3524/2
-- 
To view, visit http://gerrit.cloudera.org:8080/3524
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4770dd77fd4258c24115d72b572c727b770bd75d
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>

[Impala-CR](cdh5-trunk) IMPALA-3798: Disable per-split filtering for sequence-based scanners

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-3798: Disable per-split filtering for sequence-based scanners
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/3524/1//COMMIT_MSG
Commit Message:

PS1, Line 15: num_files_queued_
> num_unqueued_files_
Done


PS1, Line 19: This bug therefore only shows up when there is >1 scan range per file.
> I'm not sure that's true. Seems it can happen with 1 scan range per file as
The scan-range-filter path will do most of the work of RangeComplete() for the header split, which means that progress_ will get updated, which is sufficient to exit the scanner thread loop.


-- 
To view, visit http://gerrit.cloudera.org:8080/3524
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4770dd77fd4258c24115d72b572c727b770bd75d
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) IMPALA-3798: Disable per-split filtering for sequence-based scanners

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has submitted this change and it was merged.

Change subject: IMPALA-3798: Disable per-split filtering for sequence-based scanners
......................................................................


IMPALA-3798: Disable per-split filtering for sequence-based scanners

If a runtime filter rejects a sequence-based format's header split (but
not the entire file, which may happen if the filter has not arrived in
time), the scanner will never mark all splits for that file
complete. This is because BaseSequenceScanner issues scan ranges after
parsing the header splits, and until those ranges are processed,
RangeComplete() and AddDiskIoRanges() will not be called - those methods
update progress_ and num_unqueued_files_
respectively. HdfsScanNode::ScannerThread() reads those variables to
decide whether to exit, and as a result will spin forever.

This bug therefore only shows up when there is >1 scan range per file.

This patch disables per-split filtering for Avro, RC and sequence files
in lieu of a permanent fix which marks all scan ranges for a file as
done as soon as one range is filtered out.

Testing:

A custom cluster test is added which disables file filtering, emulating
the race condition that leads to the hang when a query that filters
scan ranges is run. Without the fix, this test hangs, with the fix the
query completes as expected. MAX_SCAN_RANGE_LENGTH is used to ensure >1
scan range per file.

Change-Id: I4770dd77fd4258c24115d72b572c727b770bd75d
Reviewed-on: http://gerrit.cloudera.org:8080/3524
Tested-by: Henry Robinson <he...@cloudera.com>
Reviewed-by: Henry Robinson <he...@cloudera.com>
---
M be/src/common/global-flags.cc
M be/src/exec/hdfs-scan-node.cc
A tests/custom_cluster/test_seq_file_filtering.py
3 files changed, 86 insertions(+), 10 deletions(-)

Approvals:
  Henry Robinson: Looks good to me, approved; Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/3524
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I4770dd77fd4258c24115d72b572c727b770bd75d
Gerrit-PatchSet: 3
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>

[Impala-CR](cdh5-trunk) IMPALA-3798: Disable per-split filtering for sequence-based scanners

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-3798: Disable per-split filtering for sequence-based scanners
......................................................................


Patch Set 2: Code-Review+2

Carry forward Dan's +2

-- 
To view, visit http://gerrit.cloudera.org:8080/3524
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4770dd77fd4258c24115d72b572c727b770bd75d
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) IMPALA-3798: Disable per-split filtering for sequence-based scanners

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-3798: Disable per-split filtering for sequence-based scanners
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/3524/1//COMMIT_MSG
Commit Message:

PS1, Line 19: This bug therefore only shows up when there is >1 scan range per file.
> The scan-range-filter path will do most of the work of RangeComplete() for 
I see. So we won't get stuck forever. But we will still busy-loop while the scan is still in progress (once all scan ranges have been picked up by scanner threads, remaining scanner threads will busy loop which can cause things to slow down drastically).  But, okay, I see your point.


-- 
To view, visit http://gerrit.cloudera.org:8080/3524
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4770dd77fd4258c24115d72b572c727b770bd75d
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) IMPALA-3798: Disable per-split filtering for sequence-based scanners

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-3798: Disable per-split filtering for sequence-based scanners
......................................................................


Patch Set 1: Code-Review+2

(2 comments)

Please complete the parquet testing/investigation but otherwise looks okay as a temporary workaround. Plan to put it in 5.8.0 branch though, right?

http://gerrit.cloudera.org:8080/#/c/3524/1//COMMIT_MSG
Commit Message:

PS1, Line 15: num_files_queued_
num_unqueued_files_


PS1, Line 19: This bug therefore only shows up when there is >1 scan range per file.
I'm not sure that's true. Seems it can happen with 1 scan range per file as well since even in that case we won't decrement num_unqueued_files_ properly.


-- 
To view, visit http://gerrit.cloudera.org:8080/3524
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4770dd77fd4258c24115d72b572c727b770bd75d
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) IMPALA-3798: Disable per-split filtering for sequence-based scanners

Posted by "Henry Robinson (Code Review)" <ge...@cloudera.org>.
Henry Robinson has posted comments on this change.

Change subject: IMPALA-3798: Disable per-split filtering for sequence-based scanners
......................................................................


Patch Set 2: Verified+1

Passed a private build: http://sandbox.jenkins.cloudera.com/job/impala-umbrella-build-and-test/2149/

-- 
To view, visit http://gerrit.cloudera.org:8080/3524
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I4770dd77fd4258c24115d72b572c727b770bd75d
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-HasComments: No