Posted to dev@hbase.apache.org by "Ben Maurer (JIRA)" <ji...@apache.org> on 2009/02/20 05:36:04 UTC

[jira] Created: (HBASE-1206) Scanner spins when there are concurrent inserts to column family

Scanner spins when there are concurrent inserts to column family
----------------------------------------------------------------

                 Key: HBASE-1206
                 URL: https://issues.apache.org/jira/browse/HBASE-1206
             Project: Hadoop HBase
          Issue Type: Bug
    Affects Versions: 0.19.0
            Reporter: Ben Maurer
             Fix For: 0.19.1


I had an MR job that would launch multiple scanners on a region and that made updates to the same column family the scanners were reading (but not the same column). As a result, there were lots of processes that had to grep through all of the irrelevant inserts many times as flushes occurred.

However, if I put the column that I was outputting to in the list of columns to scan for, everything worked quickly.

The code that's causing this is:


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (HBASE-1206) Scanner spins when there are concurrent inserts to column family

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-1206:
-------------------------

    Fix Version/s:     (was: 0.19.1)
                   0.19.2

Moving out of 0.19.1 with Ben's permission.

> Scanner spins when there are concurrent inserts to column family
> ----------------------------------------------------------------
>
>                 Key: HBASE-1206
>                 URL: https://issues.apache.org/jira/browse/HBASE-1206
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: Ben Maurer
>             Fix For: 0.19.2
>
>
> I had an MR job that would launch multiple scanners on a region and that made updates to the same column family the scanners were reading (but not the same column). As a result, there were lots of processes that had to grep through all of the irrelevant inserts many times as flushes occurred.
> However, if I put the column that I was outputting to in the list of columns to scan for, everything worked quickly.
> The code that's causing this is:
>       keys[i] = new HStoreKey(HConstants.EMPTY_BYTE_ARRAY, this.store.getHRegionInfo());
>       if (firstRow != null && firstRow.length != 0) {
>         if (findFirstRow(i, firstRow)) {
>           continue;
>         }
>       }
>       while (getNext(i)) {
>         if (columnMatch(i)) {
>           break;
>         }
>       }
> columnMatch() on the stuff that just got flushed out never returns true. This caused lots of problems to build up.
> The fix for this is:
> (10:58:30 PM) BenM: IMHO, this is a somewhat easier issue to fix
> (10:58:38 PM) BenM: i think it could be done in a way that cleans up the code
> (10:58:50 PM) BenM: right now, the code just scans through each of the map files
> (10:59:02 PM) BenM: without regard to the relative key positions
> (10:59:12 PM) BenM: i think it could use a priority queue so that it only works on the relevant files
> (11:01:22 PM) St^Ack_: BenM: please expand, I don't follow exactly
> (11:01:50 PM) BenM: let's say we have two map files
> (11:02:09 PM) BenM: one with 1/foo:bar 2/foo:bar 3/foo:bar
> (11:02:17 PM) BenM: (row/family:col)
> (11:02:31 PM) BenM: and the other with 1000/blah:blah 1001/blah:blah
> (11:02:39 PM) BenM: the current logic is
> (11:02:44 PM) BenM: for each map file:
> (11:02:56 PM) BenM:    find the first potential row in this file
> (11:03:08 PM) BenM: look at min(all potential rows)
> (11:03:34 PM) BenM: the algorithm should be:
> (11:03:43 PM) BenM: q = new PriorityQueue()
> (11:04:05 PM) BenM: for each map file: insert the HStoreKey of the first key in the file
> (11:04:17 PM) BenM: while(k = q.pop()) {
> (11:04:37 PM) BenM:   if (k is interesting) break;
> (11:04:37 PM) BenM:   advance k
> (11:04:37 PM) BenM:   q.push(k)
> (11:04:38 PM) BenM: }
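
The pseudocode in the chat log above could be sketched in Java roughly as follows. This is only an illustration of the idea, not the HBase scanner code: the class PriorityQueueScanSketch, the Cursor helper, and the use of plain strings in place of HStoreKey are all assumptions made for the example, and each "map file" is modelled as an already sorted in-memory list. Row keys are zero-padded here because keys compare lexicographically.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Predicate;

public class PriorityQueueScanSketch {

    /** Hypothetical cursor over one sorted file: the current key plus the rest of the file. */
    static final class Cursor implements Comparable<Cursor> {
        String current;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.current = it.next(); // caller guarantees the file is non-empty
        }

        /** Moves to the next key; returns false once the file is exhausted. */
        boolean advance() {
            if (!rest.hasNext()) {
                return false;
            }
            current = rest.next();
            return true;
        }

        @Override
        public int compareTo(Cursor other) {
            return current.compareTo(other.current);
        }
    }

    /** Returns the smallest key across all files that satisfies the predicate, or null if none does. */
    public static String firstInteresting(List<List<String>> files, Predicate<String> interesting) {
        PriorityQueue<Cursor> q = new PriorityQueue<>();
        // for each map file: insert a cursor positioned on the first key in the file
        for (List<String> file : files) {
            if (!file.isEmpty()) {
                q.add(new Cursor(file.iterator()));
            }
        }
        Cursor k;
        while ((k = q.poll()) != null) {
            if (interesting.test(k.current)) {
                return k.current;
            }
            // advance k and push it back only if its file still has keys
            if (k.advance()) {
                q.add(k);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<List<String>> files = Arrays.asList(
            Arrays.asList("0001/foo:bar", "0002/foo:bar", "0003/foo:bar"),
            Arrays.asList("1000/blah:blah", "1001/blah:blah"));
        System.out.println(firstInteresting(files, key -> key.endsWith("/blah:blah")));
    }
}
```

The queue always yields the cursor with the smallest current key, so the files are consumed in global key order rather than one file at a time.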

[jira] Commented: (HBASE-1206) Scanner spins when there are concurrent inserts to column family

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680616#action_12680616 ] 

stack commented on HBASE-1206:
------------------------------

Can I push this refactor out to 0.19.2, Ben? ColumnMatch should run faster after HBASE-1253, if that'll help any.

> Scanner spins when there are concurrent inserts to column family
> ----------------------------------------------------------------
>
>                 Key: HBASE-1206
>                 URL: https://issues.apache.org/jira/browse/HBASE-1206
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: Ben Maurer
>             Fix For: 0.19.1
>
>

[jira] Commented: (HBASE-1206) Scanner spins when there are concurrent inserts to column family

Posted by "Ben Maurer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680698#action_12680698 ] 

Ben Maurer commented on HBASE-1206:
-----------------------------------

Yeah, this can be pushed back.

> Scanner spins when there are concurrent inserts to column family
> ----------------------------------------------------------------
>
>                 Key: HBASE-1206
>                 URL: https://issues.apache.org/jira/browse/HBASE-1206
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: Ben Maurer
>             Fix For: 0.19.1
>
>

[jira] Updated: (HBASE-1206) Scanner spins when there are concurrent inserts to column family

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-1206:
-------------------------

    Fix Version/s:     (was: 0.19.2)

Moving out of 0.19.2.

> Scanner spins when there are concurrent inserts to column family
> ----------------------------------------------------------------
>
>                 Key: HBASE-1206
>                 URL: https://issues.apache.org/jira/browse/HBASE-1206
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: Ben Maurer
>             Fix For: 0.20.0
>
>

[jira] Commented: (HBASE-1206) Scanner spins when there are concurrent inserts to column family

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12680674#action_12680674 ] 

Jonathan Gray commented on HBASE-1206:
--------------------------------------

This issue seems to be in line with the discussion in HBASE-1249. Is something still broken related to this issue, or is it just inefficient? If it's an inefficiency, we will definitely be looking closely at this for 0.20. Up to you, Ben, if you want to attempt this for 0.19.x. If it's not a bug, let's move it out of 0.19.1 so we can get the RC out.

> Scanner spins when there are concurrent inserts to column family
> ----------------------------------------------------------------
>
>                 Key: HBASE-1206
>                 URL: https://issues.apache.org/jira/browse/HBASE-1206
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: Ben Maurer
>             Fix For: 0.19.1
>
>

[jira] Updated: (HBASE-1206) Scanner spins when there are concurrent inserts to column family

Posted by "Ben Maurer (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ben Maurer updated HBASE-1206:
------------------------------

    Description: 
I had an MR job that would launch multiple scanners on a region and that made updates to the same column family the scanners were reading (but not the same column). As a result, there were lots of processes that had to grep through all of the irrelevant inserts many times as flushes occurred.

However, if I put the column that I was outputting to in the list of columns to scan for, everything worked quickly.

The code that's causing this is:
      keys[i] = new HStoreKey(HConstants.EMPTY_BYTE_ARRAY, this.store.getHRegionInfo());
      if (firstRow != null && firstRow.length != 0) {
        if (findFirstRow(i, firstRow)) {
          continue;
        }
      }
      while (getNext(i)) {
        if (columnMatch(i)) {
          break;
        }
      }

columnMatch() on the stuff that just got flushed out never returns true. This caused lots of problems to build up.

The fix for this is:

(10:58:30 PM) BenM: IMHO, this is a somewhat easier issue to fix
(10:58:38 PM) BenM: i think it could be done in a way that cleans up the code
(10:58:50 PM) BenM: right now, the code just scans through each of the map files
(10:59:02 PM) BenM: without regard to the relative key positions
(10:59:12 PM) BenM: i think it could use a priority queue so that it only works on the relevant files
(11:01:22 PM) St^Ack_: BenM: please expand, I don't follow exactly
(11:01:50 PM) BenM: let's say we have two map files
(11:02:09 PM) BenM: one with 1/foo:bar 2/foo:bar 3/foo:bar
(11:02:17 PM) BenM: (row/family:col)
(11:02:31 PM) BenM: and the other with 1000/blah:blah 1001/blah:blah
(11:02:39 PM) BenM: the current logic is
(11:02:44 PM) BenM: for each map file:
(11:02:56 PM) BenM:    find the first potential row in this file
(11:03:08 PM) BenM: look at min(all potential rows)
(11:03:34 PM) BenM: the algorithm should be:
(11:03:43 PM) BenM: q = new PriorityQueue()
(11:04:05 PM) BenM: for each map file: insert the HStoreKey of the first key in the file
(11:04:17 PM) BenM: while(k = q.pop()) {
(11:04:37 PM) BenM:   if (k is interesting) break;
(11:04:37 PM) BenM:   advance k
(11:04:37 PM) BenM:   q.push(k)
(11:04:38 PM) BenM: }

  was:
I had an MR job that would launch multiple scanners on a region and that made updates to the same column family the scanners were reading (but not the same column). As a result, there were lots of processes that had to grep through all of the irrelevant inserts many times as flushes occurred.

However, if I put the column that I was outputting to in the list of columns to scan for, everything worked quickly.

The code that's causing this is:



> Scanner spins when there are concurrent inserts to column family
> ----------------------------------------------------------------
>
>                 Key: HBASE-1206
>                 URL: https://issues.apache.org/jira/browse/HBASE-1206
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: Ben Maurer
>             Fix For: 0.19.1
>
>

[jira] Resolved: (HBASE-1206) Scanner spins when there are concurrent inserts to column family

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack resolved HBASE-1206.
--------------------------

    Resolution: Fixed

Resolving as implemented.

> Scanner spins when there are concurrent inserts to column family
> ----------------------------------------------------------------
>
>                 Key: HBASE-1206
>                 URL: https://issues.apache.org/jira/browse/HBASE-1206
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: Ben Maurer
>             Fix For: 0.20.0
>
>

[jira] Commented: (HBASE-1206) Scanner spins when there are concurrent inserts to column family

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720208#action_12720208 ] 

Jonathan Gray commented on HBASE-1206:
--------------------------------------

This issue is resolved by HBASE-1503 and all changes in scanning from HBASE-1304.

The KeyValueHeap is now implemented using a PriorityQueue, as described in the original description of this issue.

StoreScanner, which is the object notified of changed readers, is used on a per-row basis now, so inserts to a row will not cause the interruption described.

We're also retaining our previous position when we do change the readers, so we're guaranteed to start where we left off last.
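
The idea of keeping the previous position across a reader change might look something like the sketch below. It is an illustration under stated assumptions, not the StoreScanner code: the class ResumableScanSketch and its methods are hypothetical names, keys are plain strings, and the "readers" are modelled as a single sorted set that is swapped out after a flush.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NavigableSet;
import java.util.TreeSet;

public class ResumableScanSketch {
    private Iterator<String> it;
    private String lastKey; // last key handed to the caller; null before the first next()

    public ResumableScanSketch(NavigableSet<String> source) {
        this.it = source.iterator();
    }

    /** Hypothetical hook called when the set of readers changes, for example after a flush. */
    public void updateReaders(NavigableSet<String> newSource) {
        // Re-seek strictly past the last returned key instead of restarting from the first key.
        this.it = (lastKey == null)
            ? newSource.iterator()
            : newSource.tailSet(lastKey, false).iterator();
    }

    /** Returns the next key in order, or null when the scan is exhausted. */
    public String next() {
        if (!it.hasNext()) {
            return null;
        }
        lastKey = it.next();
        return lastKey;
    }

    public static void main(String[] args) {
        ResumableScanSketch scan =
            new ResumableScanSketch(new TreeSet<>(Arrays.asList("a", "b", "c")));
        scan.next(); // "a"
        scan.next(); // "b"
        // A flush adds new keys; the scan resumes after "b" rather than from "a".
        scan.updateReaders(new TreeSet<>(Arrays.asList("a", "b", "bb", "c", "d")));
        System.out.println(scan.next());
    }
}
```

Because the scanner seeks past the last key it returned, keys inserted behind the current position do not cause it to revisit earlier entries.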

> Scanner spins when there are concurrent inserts to column family
> ----------------------------------------------------------------
>
>                 Key: HBASE-1206
>                 URL: https://issues.apache.org/jira/browse/HBASE-1206
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: Ben Maurer
>             Fix For: 0.20.0
>
>

[jira] Commented: (HBASE-1206) Scanner spins when there are concurrent inserts to column family

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720211#action_12720211 ] 

stack commented on HBASE-1206:
------------------------------

We can open a new issue if the same thing happens in the new context.

Added this issue to CHANGES.txt.

> Scanner spins when there are concurrent inserts to column family
> ----------------------------------------------------------------
>
>                 Key: HBASE-1206
>                 URL: https://issues.apache.org/jira/browse/HBASE-1206
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: Ben Maurer
>             Fix For: 0.20.0
>
>
> I had a MR job that would launch multiple scanners on a region that made updates to the same column family as they were scanning on (but not the same column). As a result, there were lots of processes that had to grep through all of the irrelevent inserts many times as flushes occurred.
> However, if I put the column that I was outputting to in the list of columns to scan for, everything worked quickly.
> The code that's causing this is:
>       keys[i] = new HStoreKey(HConstants.EMPTY_BYTE_ARRAY, this.store.getHRegionInfo());
>       if (firstRow != null && firstRow.length != 0) {
>         if (findFirstRow(i, firstRow)) {
>           continue;
>         }
>       }
>       while (getNext(i)) {
>         if (columnMatch(i)) {
>           break;
>         }
>       }
> columnMatch() on the stuff that just got flushed out never returns true. This caused lots of problems to build up.
> The fix for this is:
> (10:58:30 PM) BenM: IMHO, this is a somewhat easier issue to fix
> (10:58:38 PM) BenM: i think it could be done in a way that cleans up the code
> (10:58:50 PM) BenM: right now, the code just scans through each of the map files
> (10:59:02 PM) BenM: without regard to the relative key positions
> (10:59:12 PM) BenM: i think it could use a priority queue so that it only works on the relevant files
> (11:01:22 PM) St^Ack_: BenM: please expand, I don't follow exactly
> (11:01:50 PM) BenM: let's say we have two map files
> (11:02:09 PM) BenM: one with 1/foo:bar 2/foo:bar 3/foo:bar
> (11:02:17 PM) BenM: (row/family:col)
> (11:02:31 PM) BenM: and the other with 1000/blah:blah 1001/blah:blah
> (11:02:39 PM) BenM: the current logic is
> (11:02:44 PM) BenM: for each map file:
> (11:02:56 PM) BenM:    find the first potential row in this file
> (11:03:08 PM) BenM: look at min(all potential rows)
> (11:03:34 PM) BenM: the algorithm should be:
> (11:03:43 PM) BenM: q = new PriorityQueue()
> (11:04:05 PM) BenM: for each map file: insert the HStoreKey of the first key in the file
> (11:04:17 PM) BenM: while(k = q.pop()) {
> (11:04:37 PM) BenM:   if (k is interesting) break;
> (11:04:37 PM) BenM:   advance k
> (11:04:37 PM) BenM:   q.push(k)
> (11:04:38 PM) BenM: }
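
The priority-queue idea in the chat above can be sketched roughly as follows. This is a minimal illustration, not the actual HBase scanner code or the patch that was committed: the class MergingScanSketch, its Cursor helper and the firstInteresting method are all hypothetical names, each "map file" is assumed to be an in-memory sorted list, and keys are plain strings compared lexicographically as a stand-in for HStoreKey ordering.

```java
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.function.Predicate;

/** Hypothetical sketch (not HBase code): find the first interesting key across sorted files. */
class MergingScanSketch {

    /** Cursor over one sorted "map file"; cursors are ordered by their current key. */
    static final class Cursor implements Comparable<Cursor> {
        private final Iterator<String> rest;
        private String current;

        Cursor(List<String> sortedKeys) {
            this.rest = sortedKeys.iterator();
            advance();
        }

        /** Moves to the next key; returns false once the file is exhausted. */
        boolean advance() {
            current = rest.hasNext() ? rest.next() : null;
            return current != null;
        }

        String current() {
            return current;
        }

        @Override
        public int compareTo(Cursor other) {
            // Only cursors with a non-null current key are ever placed in the queue.
            return current.compareTo(other.current);
        }
    }

    /** Returns the smallest key in any file that satisfies the predicate, or null if none does. */
    static String firstInteresting(List<List<String>> files, Predicate<String> interesting) {
        PriorityQueue<Cursor> q = new PriorityQueue<>();
        // Seed the queue with the first key of every non-empty file.
        for (List<String> file : files) {
            Cursor c = new Cursor(file);
            if (c.current() != null) {
                q.add(c);
            }
        }
        Cursor c;
        while ((c = q.poll()) != null) {
            if (interesting.test(c.current())) {
                return c.current();
            }
            // Not interesting: step this file forward and re-queue it under its new key.
            if (c.advance()) {
                q.add(c);
            }
        }
        return null;
    }
}
```

With the two example files from the chat and a predicate that accepts only blah: columns, the sketch examines 1/foo:bar, re-queues the first file at 2/foo:bar, and then returns 1000/blah:blah, because that key sorts before 2/foo:bar as a string. The file holding the non-matching entries is therefore not read to the end before a match is found.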


[jira] Updated: (HBASE-1206) Scanner spins when there are concurrent inserts to column family

Posted by "Jonathan Gray (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-1206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Gray updated HBASE-1206:
---------------------------------

    Fix Version/s: 0.20.0

Pulling in to 0.20

> Scanner spins when there are concurrent inserts to column family
> ----------------------------------------------------------------
>
>                 Key: HBASE-1206
>                 URL: https://issues.apache.org/jira/browse/HBASE-1206
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.19.0
>            Reporter: Ben Maurer
>             Fix For: 0.19.2, 0.20.0
>
>
