You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org> on 2017/11/29 14:08:57 UTC

[Impala-ASF-CR] IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

Zoltan Borok-Nagy has uploaded this change for review. ( http://gerrit.cloudera.org:8080/8684


Change subject: IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners
......................................................................

IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

IMPALA-3798 disabled per-scan filtering for sequence-
based scanners due to a race between runtime filter
arrival and header splits processing.

This commit enables per-scan filtering again by calling
PartitionPassesFilters() in BaseSequenceScanner::
GetNextInternal(). PartitionPassesFilters() is only called
after the header splits are already processed and all the
scan ranges are issued.

This commit doesn't affect the non-sequence-based file types.

Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
---
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scanner.cc
3 files changed, 14 insertions(+), 4 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/84/8684/1
-- 
To view, visit http://gerrit.cloudera.org:8080/8684
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
Gerrit-Change-Number: 8684
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8684 )

Change subject: IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners
......................................................................


Patch Set 4: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/8684
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
Gerrit-Change-Number: 8684
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 05 Dec 2017 18:09:17 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8684 )

Change subject: IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners
......................................................................

IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

IMPALA-3798 disabled per-scan filtering for sequence-
based scanners due to a race between runtime filter
arrival and header splits processing.

This commit enables per-scan filtering again for the
sequence based files. In HdfsScanNode::ProcessSplit()
we check if the current range is the header of a
sequence file. If so, and the filters reject the file,
the whole file skipped.

If it is not a sequence header, but the filters reject
the partition, we call RangeComplete() on the current
scan range.

Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
Reviewed-on: http://gerrit.cloudera.org:8080/8684
Reviewed-by: Tim Armstrong <ta...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scanner.cc
M tests/custom_cluster/test_always_false_filter.py
6 files changed, 51 insertions(+), 33 deletions(-)

Approvals:
  Tim Armstrong: Looks good to me, approved
  Impala Public Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/8684
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
Gerrit-Change-Number: 8684
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8684 )

Change subject: IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners
......................................................................


Patch Set 1:

(2 comments)

Thanks for the review, your suggestion lead to a better solution indeed.

http://gerrit.cloudera.org:8080/#/c/8684/1/be/src/exec/base-sequence-scanner.cc
File be/src/exec/base-sequence-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8684/1/be/src/exec/base-sequence-scanner.cc@174
PS1, Line 174:   if (!scan_node_->PartitionPassesFilters(context_->partition_descriptor()->id(),
> This is a bit inconsistent with the other file types because it checks once
Yeah, it also worried me a bit, I removed the filtering from here.


http://gerrit.cloudera.org:8080/#/c/8684/1/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

http://gerrit.cloudera.org:8080/#/c/8684/1/be/src/exec/hdfs-scan-node.cc@a497
PS1, Line 497: 
             : 
> After looking at the code I'm wondering if this idea would lead to a more e
Currently I added the header checking functionality to BaseSequenceScanner, other than that I think I've implemented this idea more or less.
It shouldn't be too hard to extend ScanRangeMetadata with an is_header_range field, but would require more code modifications. Should I do that as well?



-- 
To view, visit http://gerrit.cloudera.org:8080/8684
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
Gerrit-Change-Number: 8684
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 01 Dec 2017 13:56:54 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8684 )

Change subject: IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners
......................................................................


Patch Set 4:

(1 comment)

Thanks!

http://gerrit.cloudera.org:8080/#/c/8684/3/be/src/exec/base-sequence-scanner.cc
File be/src/exec/base-sequence-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8684/3/be/src/exec/base-sequence-scanner.cc@56
PS3, Line 56:     ScanRangeMetadata* header_metadata =
> We need to add this to an ObjectPool so it's freed when the query is torn d
Nice catch, Done!



-- 
To view, visit http://gerrit.cloudera.org:8080/8684
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
Gerrit-Change-Number: 8684
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 05 Dec 2017 14:26:57 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Henry Robinson, Tim Armstrong, Dan Hecht, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8684

to look at the new patch set (#6).

Change subject: IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners
......................................................................

IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

IMPALA-3798 disabled per-scan filtering for sequence-
based scanners due to a race between runtime filter
arrival and header splits processing.

This commit enables per-scan filtering again for the
sequence based files. In HdfsScanNode::ProcessSplit()
we check if the current range is the header of a
sequence file. If so, and the filters reject the file,
the whole file skipped.

If it is not a sequence header, but the filters reject
the partition, we call RangeComplete() on the current
scan range.

Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
---
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scanner.cc
M tests/custom_cluster/test_always_false_filter.py
6 files changed, 51 insertions(+), 33 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/84/8684/6
-- 
To view, visit http://gerrit.cloudera.org:8080/8684
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
Gerrit-Change-Number: 8684
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8684 )

Change subject: IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners
......................................................................


Patch Set 6:

Sorry about the failed build, now all custom_cluster test passed for PS6 on my computer. test_runtime_filters.py also passed.
I started to run all the e2e tests on my local PC but I can only tell the results tomorrow.


-- 
To view, visit http://gerrit.cloudera.org:8080/8684
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
Gerrit-Change-Number: 8684
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 06 Dec 2017 17:58:24 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8684 )

Change subject: IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners
......................................................................


Patch Set 6:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1584/


-- 
To view, visit http://gerrit.cloudera.org:8080/8684
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
Gerrit-Change-Number: 8684
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 07 Dec 2017 03:30:44 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8684 )

Change subject: IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners
......................................................................


Patch Set 6: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/8684
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
Gerrit-Change-Number: 8684
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 07 Dec 2017 07:13:27 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8684 )

Change subject: IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners
......................................................................


Patch Set 6:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1580/


-- 
To view, visit http://gerrit.cloudera.org:8080/8684
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
Gerrit-Change-Number: 8684
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 07 Dec 2017 00:02:02 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8684 )

Change subject: IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners
......................................................................


Patch Set 5:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1570/


-- 
To view, visit http://gerrit.cloudera.org:8080/8684
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
Gerrit-Change-Number: 8684
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 05 Dec 2017 18:13:58 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8684 )

Change subject: IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners
......................................................................


Patch Set 5: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/8684
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
Gerrit-Change-Number: 8684
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 05 Dec 2017 18:13:46 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8684 )

Change subject: IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners
......................................................................


Patch Set 1:

(2 comments)

The code looks correct to me, but I had some ideas about how a cleaner solution - let me know if they make sense or if I'm missing something.

http://gerrit.cloudera.org:8080/#/c/8684/1/be/src/exec/base-sequence-scanner.cc
File be/src/exec/base-sequence-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8684/1/be/src/exec/base-sequence-scanner.cc@174
PS1, Line 174:   if (!scan_node_->PartitionPassesFilters(context_->partition_descriptor()->id(),
This is a bit inconsistent with the other file types because it checks once per 1024-row batch (GetNextInternal() is called many times per file) instead of once per scan range. Not the biggest deal but it adds another special case to this code (which unfortunately already has a lot of them).


http://gerrit.cloudera.org:8080/#/c/8684/1/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

http://gerrit.cloudera.org:8080/#/c/8684/1/be/src/exec/hdfs-scan-node.cc@a497
PS1, Line 497: 
             : 
After looking at the code I'm wondering if this idea would lead to a more elegant solution. If I understand correctly, the idea is that, instead of marking the scan range done by incrementing progress_, for sequence file header ranges, we should instead mark all the ranges in the file as done by calling RangeComplete() with all of the ranges in the file.

I think then we could maybe remove the special casing for FileFormatIsSequenceBased() and instead check something like 

  if (metadata->is_header_range()) 

Which seems to express the logic more directly.



-- 
To view, visit http://gerrit.cloudera.org:8080/8684
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
Gerrit-Change-Number: 8684
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Wed, 29 Nov 2017 17:20:15 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8684 )

Change subject: IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8684/3/be/src/exec/base-sequence-scanner.cc
File be/src/exec/base-sequence-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8684/3/be/src/exec/base-sequence-scanner.cc@56
PS3, Line 56:     ScanRangeMetadata* header_metadata = new ScanRangeMetadata(*metadata);
We need to add this to an ObjectPool so it's freed when the query is torn down (see what is done for the other place where these are allocated).



-- 
To view, visit http://gerrit.cloudera.org:8080/8684
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
Gerrit-Change-Number: 8684
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Dec 2017 19:28:59 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8684 )

Change subject: IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners
......................................................................


Patch Set 6: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/8684
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
Gerrit-Change-Number: 8684
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 06 Dec 2017 18:52:50 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8684 )

Change subject: IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners
......................................................................


Patch Set 6:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1578/


-- 
To view, visit http://gerrit.cloudera.org:8080/8684
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
Gerrit-Change-Number: 8684
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 06 Dec 2017 18:52:59 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Henry Robinson, Tim Armstrong, Dan Hecht, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8684

to look at the new patch set (#3).

Change subject: IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners
......................................................................

IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

IMPALA-3798 disabled per-scan filtering for sequence-
based scanners due to a race between runtime filter
arrival and header splits processing.

This commit enables per-scan filtering again for the
sequence based files. In HdfsScanNode::ProcessSplit()
we check if the current range is the header of a
sequence file. If so, and the filters reject the file,
the whole file skipped.

If it is not a sequence header, but the filters reject
the partition, we call RangeComplete() on the current
scan range.

Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
---
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scanner.cc
5 files changed, 48 insertions(+), 23 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/84/8684/3
-- 
To view, visit http://gerrit.cloudera.org:8080/8684
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
Gerrit-Change-Number: 8684
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Henry Robinson, Tim Armstrong, Dan Hecht, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8684

to look at the new patch set (#2).

Change subject: IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners
......................................................................

IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

IMPALA-3798 disabled per-scan filtering for sequence-
based scanners due to a race between runtime filter
arrival and header splits processing.

This commit enables per-scan filtering again for the
sequence based files. In HdfsScanNode::ProcessSplit()
we check if the current range is the header of a
sequence file. If so, and the filters reject the file,
the whole file skipped.

If it is not a sequence header, but the filters reject
the partition, we call RangeComplete() on the current
scan range.

Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
---
M be/src/exec/base-sequence-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scanner.cc
5 files changed, 39 insertions(+), 22 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/84/8684/2
-- 
To view, visit http://gerrit.cloudera.org:8080/8684
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
Gerrit-Change-Number: 8684
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8684 )

Change subject: IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners
......................................................................


Patch Set 2:

(5 comments)

Thanks!

http://gerrit.cloudera.org:8080/#/c/8684/2/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/8684/2/be/src/exec/hdfs-scan-node-base.cc@691
PS2, Line 691:   for (int j = 0; j < file->splits.size(); ++j) {
> nit: i instead of j since we unnested the loop
Done


http://gerrit.cloudera.org:8080/#/c/8684/1/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

http://gerrit.cloudera.org:8080/#/c/8684/1/be/src/exec/hdfs-scan-node.cc@a497
PS1, Line 497: 
             : 
> The approaches seem pretty comparable in readability. I guess the major adv
OK, I did that in the new commit. Turned out it doesn't require much more code modifications.


http://gerrit.cloudera.org:8080/#/c/8684/2/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

http://gerrit.cloudera.org:8080/#/c/8684/2/be/src/exec/hdfs-scan-node.cc@512
PS2, Line 512:   bool range_is_filtered = !PartitionPassesFilters(
> nit: we could probably just put this expression inline in the if() right? N
Done


http://gerrit.cloudera.org:8080/#/c/8684/2/be/src/exec/hdfs-scan-node.cc@518
PS2, Line 518: bs_scanner
> nit: bs_scanner != nullptr, just to be explicit.
Not applicable with the ScanRangeMetadata::is_sequence_header approach


http://gerrit.cloudera.org:8080/#/c/8684/2/be/src/exec/hdfs-scan-node.cc@519
PS2, Line 519: 
> nit: unnecessary vertical whitespace
The new code is restructured, now these checks happen in their original place.



-- 
To view, visit http://gerrit.cloudera.org:8080/8684
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
Gerrit-Change-Number: 8684
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 04 Dec 2017 13:57:22 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Henry Robinson, Tim Armstrong, Dan Hecht, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8684

to look at the new patch set (#4).

Change subject: IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners
......................................................................

IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

IMPALA-3798 disabled per-scan filtering for sequence-
based scanners due to a race between runtime filter
arrival and header splits processing.

This commit enables per-scan filtering again for the
sequence based files. In HdfsScanNode::ProcessSplit()
we check if the current range is the header of a
sequence file. If so, and the filters reject the file,
the whole file skipped.

If it is not a sequence header, but the filters reject
the partition, we call RangeComplete() on the current
scan range.

Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
---
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/hdfs-scan-node.cc
M be/src/exec/hdfs-scanner.cc
5 files changed, 49 insertions(+), 23 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/84/8684/4
-- 
To view, visit http://gerrit.cloudera.org:8080/8684
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
Gerrit-Change-Number: 8684
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8684 )

Change subject: IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners
......................................................................


Patch Set 1:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8684/2/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/8684/2/be/src/exec/hdfs-scan-node-base.cc@691
PS2, Line 691: 
nit: i instead of j since we unnested the loop


http://gerrit.cloudera.org:8080/#/c/8684/1/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

http://gerrit.cloudera.org:8080/#/c/8684/1/be/src/exec/hdfs-scan-node.cc@a497
PS1, Line 497: 
             : 
> Currently I added the header checking functionality to BaseSequenceScanner,
The approaches seem pretty comparable in readability. I guess the major advantage of putting it in ScanRangeMetadata is that we don't need to instantiate the scanner before checking the filters, which could be moderately expensive (although not a show stopper). So I think I have a slight preference for that approach.


http://gerrit.cloudera.org:8080/#/c/8684/2/be/src/exec/hdfs-scan-node.cc
File be/src/exec/hdfs-scan-node.cc:

http://gerrit.cloudera.org:8080/#/c/8684/2/be/src/exec/hdfs-scan-node.cc@512
PS2, Line 512:       runtime_state_, this, partition, scan_range, filter_ctxs, expr_results_pool);
nit: we could probably just put this expression inline in the if() right? Not sure that pulling it out makes it more readable.


http://gerrit.cloudera.org:8080/#/c/8684/2/be/src/exec/hdfs-scan-node.cc@518
PS2, Line 518: 
nit: bs_scanner != nullptr, just to be explicit.


http://gerrit.cloudera.org:8080/#/c/8684/2/be/src/exec/hdfs-scan-node.cc@519
PS2, Line 519:     if (VLOG_QUERY_IS_ON) {
nit: unnecessary vertical whitespace



-- 
To view, visit http://gerrit.cloudera.org:8080/8684
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
Gerrit-Change-Number: 8684
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 01 Dec 2017 19:37:51 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8684 )

Change subject: IMPALA-3804: Re-enable per-scan filtering for sequence-based scanners
......................................................................


Patch Set 5: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/1570/


-- 
To view, visit http://gerrit.cloudera.org:8080/8684
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I4b38c26bcbe67f83efcc65a1723d766626ae3d3e
Gerrit-Change-Number: 8684
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Henry Robinson <he...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 05 Dec 2017 21:49:52 +0000
Gerrit-HasComments: No