Posted to issues@hbase.apache.org by "Sean Sechrist (JIRA)" <ji...@apache.org> on 2011/03/22 20:00:05 UTC

[jira] [Created] (HBASE-3686) Scanner timeout on RegionServer but Client won't know what happened

Scanner timeout on RegionServer but Client won't know what happened
-------------------------------------------------------------------

                 Key: HBASE-3686
                 URL: https://issues.apache.org/jira/browse/HBASE-3686
             Project: HBase
          Issue Type: Bug
          Components: client
    Affects Versions: 0.89.20100924
            Reporter: Sean Sechrist
            Priority: Minor


This can cause rows to be lost from a scan.

See this thread where the issue was brought up: http://search-hadoop.com/m/xITBQ136xGJ1

If hbase.regionserver.lease.period is higher on the client than the server we can get this series of events: 

1. Client is scanning along happily, and does something slow.
2. Scanner times out on region server
3. Client calls HTable.ClientScanner.next()
4. The region server throws an UnknownScannerException
5. Client catches the exception and sees that the time since its last next() call is not longer than its own hbase.regionserver.lease.period config, so it doesn't throw a ScannerTimeoutException. Instead, it treats it like an NSRE (NotServingRegionException).
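
The check in step 5 can be sketched as below. This is a simplified, self-contained model and not the actual HTable.ClientScanner code; the class, enum, and method names and the timing values are made up for illustration.

```java
public class LeaseCheckSketch {
    // The two ways the client can react to an UnknownScannerException in this model.
    enum Outcome { SCANNER_TIMEOUT, RETRY_LIKE_NSRE }

    // Hypothetical helper: compares the time since the last next() call
    // against the lease period the CLIENT has configured.
    static Outcome classify(long lastNextMillis, long nowMillis, long clientLeasePeriodMillis) {
        long elapsed = nowMillis - lastNextMillis;
        return elapsed > clientLeasePeriodMillis
                ? Outcome.SCANNER_TIMEOUT
                : Outcome.RETRY_LIKE_NSRE;
    }

    public static void main(String[] args) {
        long serverLease = 60_000;   // assumed server-side lease period (ms)
        long clientLease = 120_000;  // assumed stale, higher client-side value (ms)
        long elapsed = 90_000;       // client did something slow between next() calls

        // The server has already expired the scanner (90s > 60s), but the
        // client compares against its own 120s and does not see a timeout.
        System.out.println(classify(0, elapsed, clientLease)); // RETRY_LIKE_NSRE

        // With consistent configs the client would report the timeout.
        System.out.println(classify(0, elapsed, serverLease)); // SCANNER_TIMEOUT
    }
}
```

With matching values on both sides, the client's own comparison agrees with what the server did, which is why keeping the configs consistent works around the problem.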

Right now the workaround is to make sure the configs are consistent. 

A possible fix would be to use whatever the region server's scanner timeout is, rather than the local one.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-3686) ClientScanner skips too many rows on recovery if using scanner caching

Posted by "Hudson (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011484#comment-13011484 ] 

Hudson commented on HBASE-3686:
-------------------------------

Integrated in HBase-TRUNK #1814 (See [https://hudson.apache.org/hudson/job/HBase-TRUNK/1814/])
    HBASE-3686 ClientScanner skips too many rows on recovery if using scanner caching


[jira] [Updated] (HBASE-3686) ClientScanner skips too many rows on recovery if using scanner caching

Posted by "Sean Sechrist (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Sechrist updated HBASE-3686:
---------------------------------

    Affects Version/s: 0.90.1
              Summary: ClientScanner skips too many rows on recovery if using scanner caching  (was: Scanner timeout on RegionServer but Client won't know what happened)

Updated title to be more accurate.

[jira] [Commented] (HBASE-3686) Scanner timeout on RegionServer but Client won't know what happened

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13010158#comment-13010158 ] 

stack commented on HBASE-3686:
------------------------------

@Sean Does this happen with default configs?  Or was it the result of local customizations?  Thanks.  Maybe we should return a better message in the UnknownScannerException too?  Or, should Scan tell the server the lease period to use -- as you suggest.  That seems like the best fix.

[jira] [Commented] (HBASE-3686) Scanner timeout on RegionServer but Client won't know what happened

Posted by "Sean Sechrist (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13010169#comment-13010169 ] 

Sean Sechrist commented on HBASE-3686:
--------------------------------------

It was an old customization that had been reverted on the server but not in the client's config. 

It would be hard to return a better message in the UnknownScannerException, because the region server doesn't know whether it was a scanner whose lease expired or if it is a genuine unknown scanner.

So, I think the scan telling the server the lease period does seem like the best bet.
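
A rough sketch of why the server cannot tell the two cases apart, assuming that an expired lease simply removes the scanner from the server's bookkeeping. The class and its methods are hypothetical and are not the region server's real lease code.

```java
import java.util.HashMap;
import java.util.Map;

public class ScannerLeaseSketch {
    // scanner id -> time (ms) at which its lease expires
    private final Map<Long, Long> expiryByScannerId = new HashMap<>();
    private final long leasePeriodMillis;
    private long nextId = 1;

    public ScannerLeaseSketch(long leasePeriodMillis) {
        this.leasePeriodMillis = leasePeriodMillis;
    }

    // Registers a new scanner and returns its id.
    public long open(long nowMillis) {
        long id = nextId++;
        expiryByScannerId.put(id, nowMillis + leasePeriodMillis);
        return id;
    }

    // Returns true if the scanner is still known (and renews its lease),
    // false where a real server would throw UnknownScannerException.
    public boolean next(long scannerId, long nowMillis) {
        // Expired scanners are dropped without keeping any record of them.
        expiryByScannerId.values().removeIf(expiry -> expiry <= nowMillis);
        if (!expiryByScannerId.containsKey(scannerId)) {
            return false; // expired and never-existed look identical here
        }
        expiryByScannerId.put(scannerId, nowMillis + leasePeriodMillis);
        return true;
    }

    public static void main(String[] args) {
        ScannerLeaseSketch server = new ScannerLeaseSketch(60_000);
        long id = server.open(0);
        System.out.println(server.next(id, 30_000));    // true, lease renewed
        System.out.println(server.next(id, 100_000));   // false, lease expired
        System.out.println(server.next(9999L, 100_000)); // false, never existed
    }
}
```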

[jira] [Updated] (HBASE-3686) ClientScanner skips too many rows on recovery if using scanner caching

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-3686:
-------------------------

      Resolution: Fixed
        Assignee: Sean Sechrist
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

Committed branch and trunk.  Nice one Sean (Made you a contributor and assigned you this issue)

[jira] [Commented] (HBASE-3686) Scanner timeout on RegionServer but Client won't know what happened

Posted by "Sean Sechrist (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011203#comment-13011203 ] 

Sean Sechrist commented on HBASE-3686:
--------------------------------------

I did a little more testing and it turns out this problem isn't limited to the misconfiguration.

You'll also lose rows if you kill -9 a region server in the middle of a scan. In HTable.ClientScanner.next(), there's this skipFirst boolean that is supposed to skip the first row that was "already let out on a previous invocation". But instead of just skipping the first row, getConnection().getRegionServerWithRetries(callable) is called an extra time, which will skip [caching] rows.

So I think fixing it to only skip 1 row will also fix the problem if there's a misconfiguration, so sending the timeout to the server won't be needed.
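
A toy model of the recovery path described above, assuming the client reopens the scan at the last row it already handed out and must discard exactly that one row. FakeScanner and recover are hypothetical names, not HBase APIs; the fixed flag mimics fetching the row to be skipped with a caching of 1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkipFirstSketch {
    // Hypothetical stand-in for a server-side scanner: returns rows in
    // order, at most `caching` rows per call.
    static final class FakeScanner {
        private final List<String> rows;
        private int pos;

        FakeScanner(List<String> rows, int startIndex) {
            this.rows = rows;
            this.pos = startIndex;
        }

        List<String> next(int caching) {
            int end = Math.min(rows.size(), pos + caching);
            List<String> batch = new ArrayList<>(rows.subList(pos, end));
            pos = end;
            return batch;
        }
    }

    // Reopens the scan at the last row already returned to the caller,
    // discards one fetch, then returns the first batch of "new" rows.
    static List<String> recover(List<String> rows, int lastReturnedIndex,
                                int caching, boolean fixed) {
        FakeScanner scanner = new FakeScanner(rows, lastReturnedIndex);
        // Buggy: the discarded fetch uses the full caching value.
        // Fixed: the discarded fetch asks for a single row.
        scanner.next(fixed ? 1 : caching);
        return scanner.next(caching);
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9");
        // The caller has already seen r0..r2; caching is 3.
        System.out.println(recover(rows, 2, 3, false)); // [r5, r6, r7], r3 and r4 are lost
        System.out.println(recover(rows, 2, 3, true));  // [r3, r4, r5]
    }
}
```

With caching set to 1 the two variants behave the same, which matches the observation that the loss only shows up when scanner caching is greater than 1.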

[jira] [Updated] (HBASE-3686) ClientScanner skips too many rows on recovery if using scanner caching

Posted by "Sean Sechrist (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Sechrist updated HBASE-3686:
---------------------------------

    Release Note: Fixed bug where rows would be skipped if region server dies during scan and scanner caching > 1
          Status: Patch Available  (was: Open)

[jira] [Commented] (HBASE-3686) Scanner timeout on RegionServer but Client won't know what happened

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13010441#comment-13010441 ] 

stack commented on HBASE-3686:
------------------------------

Agreed Sean.  We should do this for more of our configs, let client set them.

[jira] [Updated] (HBASE-3686) ClientScanner skips too many rows on recovery if using scanner caching

Posted by "Sean Sechrist (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-3686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Sechrist updated HBASE-3686:
---------------------------------

    Attachment: 3686.patch

Added patch that will set caching to 1 before getting the last row that should be skipped during recovery. Also added 2 unit tests to reproduce both situations (RS death and mismatched scanner timeouts).
