Posted to issues@hbase.apache.org by "Kannan Muthukkaruppan (Created) (JIRA)" <ji...@apache.org> on 2011/11/19 03:16:53 UTC
[jira] [Created] (HBASE-4823) long running scan lose benefit of
bloomfilters and timerange hints
long running scan lose benefit of bloomfilters and timerange hints
------------------------------------------------------------------
Key: HBASE-4823
URL: https://issues.apache.org/jira/browse/HBASE-4823
Project: HBase
Issue Type: Bug
Reporter: Kannan Muthukkaruppan
Assignee: Kannan Muthukkaruppan
When you have a long-running scan, say due to an MR job, you can lose the benefit of timerange hints and bloom filters midway if your scanner gets reset. [Note: scanners can get reset due to, say, a flush or compaction.]
In one of our workloads, we periodically want to do rollups on the most recent 15 minutes of data in a column family, but the timerange hint benefit is lost midway when this resetScannerStack (shown below) happens. The end result: we end up reading all the old HFiles rather than just the recent HFiles.
{code}
private void resetScannerStack(KeyValue lastTopKey) throws IOException {
  if (heap != null) {
    throw new RuntimeException("StoreScanner.reseek run on an existing heap!");
  }
  /* When we have the scan object, should we not pass it to getScanners()
   * to get a limited set of scanners? We did so in the constructor and we
   * could have done it now by storing the scan object from the constructor */
  List<KeyValueScanner> scanners = getScanners();
{code}
The comment in the code seems to be aware of this issue and even has the suggested fix!
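To make the effect concrete, here is a minimal toy model of the problem described above. It is only a sketch under stated assumptions: FileInfo, ScanSpec, selectAll and selectForScan are hypothetical names invented for illustration and are not HBase classes or methods, and the scan's time range is assumed to be half-open. It contrasts a reset that ignores the scan with one that reuses the scan's time range.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these classes are NOT the HBase StoreFile/StoreScanner API.
class ScannerResetSketch {

    /** Hypothetical stand-in for an HFile that tracks the timestamps it holds. */
    static final class FileInfo {
        final String name;
        final long minTs;
        final long maxTs;

        FileInfo(String name, long minTs, long maxTs) {
            this.name = name;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }

        /** True if this file may hold cells in the half-open range [from, to). */
        boolean overlaps(long from, long to) {
            return minTs < to && maxTs >= from;
        }
    }

    /** Hypothetical stand-in for the time range carried by the scan object. */
    static final class ScanSpec {
        final long from;
        final long to;

        ScanSpec(long from, long to) {
            this.from = from;
            this.to = to;
        }
    }

    /** Unfiltered selection: what a reset that ignores the scan ends up reading. */
    static List<String> selectAll(List<FileInfo> files) {
        List<String> names = new ArrayList<>();
        for (FileInfo f : files) {
            names.add(f.name);
        }
        return names;
    }

    /** Filtered selection: reuses the scan's time range to skip non-overlapping files. */
    static List<String> selectForScan(List<FileInfo> files, ScanSpec scan) {
        List<String> names = new ArrayList<>();
        for (FileInfo f : files) {
            if (f.overlaps(scan.from, scan.to)) {
                names.add(f.name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<FileInfo> files = new ArrayList<>();
        files.add(new FileInfo("old-1", 0L, 1_000L));
        files.add(new FileInfo("old-2", 1_001L, 2_000L));
        files.add(new FileInfo("recent", 2_001L, 3_000L));
        ScanSpec scan = new ScanSpec(2_500L, 3_001L);

        System.out.println("reset without scan: " + selectAll(files));
        System.out.println("reset with scan:    " + selectForScan(files, scan));
    }
}
```

In this model the fix amounts to keeping the scan around after construction so that the reset path can call the filtered selection instead of the unfiltered one.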
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (HBASE-4823) long running scans lose benefit of
bloomfilters and timerange hints
Posted by "Kannan Muthukkaruppan (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kannan Muthukkaruppan updated HBASE-4823:
-----------------------------------------
Assignee: Amitanand Aiyer (was: Kannan Muthukkaruppan)
Amitanand will be helping on this issue.
[jira] [Updated] (HBASE-4823) long running scans lose benefit of
bloomfilters and timerange hints
Posted by "Phabricator (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Phabricator updated HBASE-4823:
-------------------------------
Attachment: HBASE-4823.D519.1.patch
aaiyer requested code review of "HBASE-4823 [jira] long running scans lose benefit of bloomfilters and timerange hints".
Reviewers: JIRA
Changes to the StoreScanner so that whenever we do a resetScannerStack
we use the same getScanner() method as done in the constructor to ignore
files that are not going to be touched by the scan.
Includes a test to ensure correctness.
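The other hint named in the issue is the bloom filter. As a rough illustration of why it lets a scanner ignore a file, here is a toy Bloom filter. This is a sketch only: RowBloomSketch and its two hash functions are hypothetical and do not reflect HBase's actual Bloom filter implementation or file format. The point is just that a "no" answer is definite, so the file need not be opened.

```java
import java.util.BitSet;

// Toy Bloom filter for illustration; NOT HBase's Bloom filter implementation.
class RowBloomSketch {
    private final BitSet bits;
    private final int size;

    RowBloomSketch(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Two simple, hypothetical hash functions derived from String.hashCode().
    private int h1(String key) {
        return Math.floorMod(key.hashCode(), size);
    }

    private int h2(String key) {
        return Math.floorMod(key.hashCode() * 31 + 17, size);
    }

    /** Records a row key by setting both of its bit positions. */
    void add(String rowKey) {
        bits.set(h1(rowKey));
        bits.set(h2(rowKey));
    }

    /** False means the row is definitely absent, so the file can be skipped. */
    boolean mightContain(String rowKey) {
        return bits.get(h1(rowKey)) && bits.get(h2(rowKey));
    }

    public static void main(String[] args) {
        RowBloomSketch oldFile = new RowBloomSketch(1024); // no rows added
        RowBloomSketch recentFile = new RowBloomSketch(1024);
        recentFile.add("row-42");

        System.out.println("open old file?    " + oldFile.mightContain("row-42"));
        System.out.println("open recent file? " + recentFile.mightContain("row-42"));
    }
}
```

A "true" answer can be a false positive, so it only means the file has to be checked; the saving comes entirely from the definite "false" answers.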
TEST PLAN
EMPTY
REVISION DETAIL
https://reviews.facebook.net/D519
AFFECTED FILES
src/test/java/org/apache/hadoop/hbase/regionserver/TestScannerResets.java
src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java
[jira] [Commented] (HBASE-4823) long running scans lose benefit of
bloomfilters and timerange hints
Posted by "Prakash Khemani (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13153655#comment-13153655 ]
Prakash Khemani commented on HBASE-4823:
----------------------------------------
https://issues.apache.org/jira/browse/HBASE-3415 is also related
[jira] [Commented] (HBASE-4823) long running scans lose benefit of
bloomfilters and timerange hints
Posted by "Phabricator (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13156204#comment-13156204 ]
Phabricator commented on HBASE-4823:
------------------------------------
Kannan has accepted the revision "HBASE-4823 [jira] long running scans lose benefit of bloomfilters and timerange hints".
Super!
+1 for commit.
REVISION DETAIL
https://reviews.facebook.net/D519
[jira] [Updated] (HBASE-4823) long running scans lose benefit of
bloomfilters and timerange hints
Posted by "Kannan Muthukkaruppan (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kannan Muthukkaruppan updated HBASE-4823:
-----------------------------------------
Summary: long running scans lose benefit of bloomfilters and timerange hints (was: long running scan lose benefit of bloomfilters and timerange hints)
[jira] [Commented] (HBASE-4823) long running scans lose benefit of
bloomfilters and timerange hints
Posted by "Phabricator (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157975#comment-13157975 ]
Phabricator commented on HBASE-4823:
------------------------------------
khemani has commented on the revision "HBASE-4823 [jira] long running scans lose benefit of bloomfilters and timerange hints".
+1
Looks good. This same change should also apply to trunk, even though it has files-only and memstore-only scans.
REVISION DETAIL
https://reviews.facebook.net/D519
[jira] [Resolved] (HBASE-4823) long running scans lose benefit of
bloomfilters and timerange hints
Posted by "Amitanand Aiyer (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Amitanand Aiyer resolved HBASE-4823.
------------------------------------
Resolution: Fixed
Fix Version/s: 0.89-fb
Hadoop Flags: Reviewed
[jira] [Commented] (HBASE-4823) long running scans lose benefit of
bloomfilters and timerange hints
Posted by "Phabricator (Commented) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158105#comment-13158105 ]
Phabricator commented on HBASE-4823:
------------------------------------
lhofhansl has commented on the revision "HBASE-4823 [jira] long running scans lose benefit of bloomfilters and timerange hints".
+1 lgtm
INLINE COMMENTS
src/test/java/org/apache/hadoop/hbase/regionserver/TestScannerResets.java:1 Wanna add a Copyright notice?
REVISION DETAIL
https://reviews.facebook.net/D519
[jira] [Updated] (HBASE-4823) long running scans lose benefit of
bloomfilters and timerange hints
Posted by "Kannan Muthukkaruppan (Updated) (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kannan Muthukkaruppan updated HBASE-4823:
-----------------------------------------
Attachment: TestScannerResets-89fb.txt
Unit test demonstrating the inefficiency that creeps in if, say, a flush were to happen in the middle of a scan. [I wrote the test against the 0.89-fb branch, but it should be more or less reproducible against trunk as well.]