Posted to issues@hbase.apache.org by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org> on 2011/06/16 02:38:47 UTC
[jira] [Created] (HBASE-3994) SplitTransaction has a window where
clients can get RegionOfflineException
SplitTransaction has a window where clients can get RegionOfflineException
--------------------------------------------------------------------------
Key: HBASE-3994
URL: https://issues.apache.org/jira/browse/HBASE-3994
Project: HBase
Issue Type: Bug
Affects Versions: 0.90.3
Reporter: Jean-Daniel Cryans
Priority: Critical
Fix For: 0.90.4
I just saw a job fail tasks with RegionOfflineException. Normally this happens only because the table is disabled, but it can also happen because a split parent is offline. Probably 99.999% of the time users don't hit it because SplitTransaction is able to offline the parent and add the first daughter quickly enough, but in my case the cluster was so slow that I was able to see it.
Maybe we should check in HCM (HConnectionManager) not only whether the region is offline but also whether it's split, in which case we should retry?
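The proposal above could be sketched roughly as follows. This is a minimal, language-neutral illustration in Python, not the HBase client code: `RegionInfo`, `locate_region`, `RegionOfflineError`, and the retry parameters are hypothetical names invented for this sketch. It assumes that a catalog row can be flagged offline, split, or both, that an offline split parent is a transient state worth retrying, and that a plain offline region (for example, a disabled table) is not.

```python
import time
from dataclasses import dataclass
from typing import Callable


class RegionOfflineError(Exception):
    """Raised when a region is offline and retrying did not (or would not) help."""


@dataclass
class RegionInfo:
    # Hypothetical stand-in for a catalog row; not the HBase HRegionInfo class.
    name: str
    offline: bool = False
    split: bool = False


def locate_region(lookup: Callable[[], RegionInfo],
                  max_tries: int = 5,
                  pause_s: float = 0.0) -> RegionInfo:
    """Look a region up, retrying while the lookup returns an offline split parent."""
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    for attempt in range(1, max_tries + 1):
        info = lookup()
        if not info.offline:
            # Usable region, for example a daughter that is now online.
            return info
        if not info.split:
            # Offline but not split: treated here as a disabled table, so fail fast.
            raise RegionOfflineError(f"region {info.name} is offline")
        # Offline split parent: the daughters should appear shortly, so wait and retry.
        if attempt < max_tries:
            time.sleep(pause_s)
    raise RegionOfflineError(
        f"region {info.name} is still an offline split parent after {max_tries} tries")


if __name__ == "__main__":
    # First lookup sees the offline parent, second lookup sees a daughter.
    rows = iter([RegionInfo("parent", offline=True, split=True),
                 RegionInfo("daughter-a")])
    print(locate_region(lambda: next(rows)).name)  # prints "daughter-a"
```

The point of the split check is only to separate a transient state from a permanent one. The client still gives up once the retry budget is spent, so a very slow daughter open can exhaust it.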
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-3994) SplitTransaction has a window where
clients can get RegionOfflineException
Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050665#comment-13050665 ]
Jean-Daniel Cryans commented on HBASE-3994:
-------------------------------------------
I'm still digging through my logs, but it appears that the region server took 40 seconds to open a single file from one of the daughters, and that's why the clients eventually ran out of retries. At first it looked like the client didn't retry at all, but now I think we should just have a better error message.
[jira] [Commented] (HBASE-3994) SplitTransaction has a window where
clients can get RegionOfflineException
Posted by "Hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056855#comment-13056855 ]
Hudson commented on HBASE-3994:
-------------------------------
Integrated in HBase-TRUNK #1995 (See [https://builds.apache.org/job/HBase-TRUNK/1995/])
[jira] [Updated] (HBASE-3994) SplitTransaction has a window where
clients can get RegionOfflineException
Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jean-Daniel Cryans updated HBASE-3994:
--------------------------------------
Priority: Minor (was: Critical)
Issue Type: Improvement (was: Bug)
[jira] [Commented] (HBASE-3994) SplitTransaction has a window where
clients can get RegionOfflineException
Posted by "stack (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13050670#comment-13050670 ]
stack commented on HBASE-3994:
------------------------------
So it did retry, and we just ran out of retries? Yeah, better message!
[jira] [Resolved] (HBASE-3994) SplitTransaction has a window where
clients can get RegionOfflineException
Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/HBASE-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jean-Daniel Cryans resolved HBASE-3994.
---------------------------------------
Resolution: Fixed
Assignee: Jean-Daniel Cryans
Release Note: Added better error messages for regions that are offline or split parents
Committed the new error messages to trunk and branch. This doesn't change any behavior.
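To illustrate what messages of this kind could look like, the sketch below builds a different message for a plain offline region and for an offline split parent, without changing whether an error is reported. The function name and the message wording are assumptions made for this example; the text actually committed to HBase is not quoted in this thread.

```python
from typing import Optional


def describe_offline_region(region_name: str, offline: bool, split: bool) -> Optional[str]:
    """Return an error message for an unusable region, or None if it is usable."""
    if not offline:
        return None
    if split:
        # Hypothetical wording: tells the user the state is expected to be transient.
        return (f"the region {region_name} is an offline split parent; "
                "its daughters may not be online yet")
    # Hypothetical wording: points at the usual cause of an offline region.
    return f"the region {region_name} is offline, possibly because the table is disabled"


if __name__ == "__main__":
    print(describe_offline_region("t1,,123", offline=True, split=True))
    print(describe_offline_region("t1,,123", offline=True, split=False))
```

Both cases still produce an error. Only the explanation differs, which matches the note that behavior is unchanged.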