Posted to issues@hbase.apache.org by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org> on 2011/06/16 02:38:47 UTC

[jira] [Created] (HBASE-3994) SplitTransaction has a window where clients can get RegionOfflineException

SplitTransaction has a window where clients can get RegionOfflineException
--------------------------------------------------------------------------

                 Key: HBASE-3994
                 URL: https://issues.apache.org/jira/browse/HBASE-3994
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.90.3
            Reporter: Jean-Daniel Cryans
            Priority: Critical
             Fix For: 0.90.4


I just witnessed a job with failed tasks caused by RegionOfflineException. Normally this should only happen when the table is disabled, but it can also happen when the parent region of a split is offline. Probably 99.999% of the time users don't hit it because SplitTransaction is able to offline the parent and add the first daughter quickly enough, but in my case the cluster was so slow that I was able to see it.

Maybe we should check in HCM not only if the region is offline but also if it's split, in which case we should retry?
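
To make the suggestion concrete, here is a minimal sketch of the check being proposed. It is not the HBase client code: `SplitAwareLookup`, `RegionState`, and `Action` are hypothetical names invented for illustration, and the two booleans simply stand in for the "offline" and "split" flags a client would read for a region. The only assumption taken from the description above is that a split parent is offline as well as split, so the split flag has to be tested first.

```java
// Sketch only: every type here is hypothetical, not part of HBase.
final class SplitAwareLookup {

    /** Stand-in for the per-region flags a client sees during a lookup. */
    static final class RegionState {
        final String name;
        final boolean offline;
        final boolean split;

        RegionState(String name, boolean offline, boolean split) {
            this.name = name;
            this.offline = offline;
            this.split = split;
        }
    }

    /** What the client should do with the region it just looked up. */
    enum Action { USE, RETRY, FAIL }

    static Action classify(RegionState r) {
        // A split parent is also offline, so test the split flag first.
        if (r.split) {
            return Action.RETRY; // daughters should show up shortly
        }
        if (r.offline) {
            return Action.FAIL;  // offline and not split, e.g. table disabled
        }
        return Action.USE;
    }

    /** A message that tells the two offline cases apart. */
    static String describe(RegionState r) {
        switch (classify(r)) {
            case RETRY:
                return "Region " + r.name + " is a split parent, daughters not online yet";
            case FAIL:
                return "Region " + r.name + " is offline, table may be disabled";
            default:
                return "Region " + r.name + " is online";
        }
    }
}
```

With this shape, a caller would loop on `RETRY` until its retry budget is used up and surface the `describe` text when it gives up, so a slow split is distinguishable from a disabled table.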

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-3994) SplitTransaction has a window where clients can get RegionOfflineException

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050665#comment-13050665 ] 

Jean-Daniel Cryans commented on HBASE-3994:
-------------------------------------------

I'm still digging through my logs, but it appears that the region server took 40 seconds to open a single file from one of the daughters, and that's why the clients eventually ran out of retries. At first it seemed that the client didn't retry at all, but now I think we should just have a better error message.
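
As a rough illustration of how a client can retry and still fail, the sketch below compares a retry budget with the length of an outage. The class `RetryBudget` and the numbers in the usage note are hypothetical, and it assumes a fixed pause between attempts; a real client may back off with growing pauses, which this does not model.

```java
// Sketch only: hypothetical helper, assuming a fixed pause between retries.
final class RetryBudget {

    /** Total time the client spends waiting across all retries. */
    static long totalWaitMillis(int retries, long pauseMillis) {
        return (long) retries * pauseMillis;
    }

    /** True if the client would still be retrying when the outage ends. */
    static boolean covers(int retries, long pauseMillis, long outageMillis) {
        return totalWaitMillis(retries, pauseMillis) >= outageMillis;
    }
}
```

For example, with an assumed 10 retries and a 1000 ms pause, `covers(10, 1000L, 40000L)` is false: the client gives up after about 10 seconds of waiting, well before a 40 second daughter open completes.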


        

[jira] [Commented] (HBASE-3994) SplitTransaction has a window where clients can get RegionOfflineException

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056855#comment-13056855 ] 

Hudson commented on HBASE-3994:
-------------------------------

Integrated in HBase-TRUNK #1995 (See [https://builds.apache.org/job/HBase-TRUNK/1995/])
    


        

[jira] [Updated] (HBASE-3994) SplitTransaction has a window where clients can get RegionOfflineException

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated HBASE-3994:
--------------------------------------

      Priority: Minor  (was: Critical)
    Issue Type: Improvement  (was: Bug)


        

[jira] [Commented] (HBASE-3994) SplitTransaction has a window where clients can get RegionOfflineException

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050670#comment-13050670 ] 

stack commented on HBASE-3994:
------------------------------

So it did retry? And we just ran out of retries? Yeah, better message!


        

[jira] [Resolved] (HBASE-3994) SplitTransaction has a window where clients can get RegionOfflineException

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans resolved HBASE-3994.
---------------------------------------

      Resolution: Fixed
        Assignee: Jean-Daniel Cryans
    Release Note: Added better error messages for regions that are offline or split parents

Committed the new error messages to trunk and branch. This doesn't change any behavior.
