Posted to dev@hbase.apache.org by "Evgeny Ryabitskiy (JIRA)" <ji...@apache.org> on 2009/03/01 01:56:12 UTC

[jira] Updated: (HBASE-1084) Reinitializable DFS client

     [ https://issues.apache.org/jira/browse/HBASE-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Evgeny Ryabitskiy updated HBASE-1084:
-------------------------------------

    Attachment: HBASE-1084_HRegionServer.java.patch

Change in protected boolean checkFileSystem(): try to reinitialize the DFS client first and, if that fails, shut down.
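
The attached patch is not inlined here; purely as an illustration of that change, the sketch below shows the general shape of such a check: probe the filesystem, try to obtain a fresh DFS client if the probe fails, and report failure (so the caller can shut down) only when the retry also fails. The class and helper names are invented for the sketch and are not taken from HRegionServer or from the patch.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Illustrative stand-in for the regionserver's filesystem check.
    public class FileSystemCheck {

      private final Configuration conf;
      private FileSystem fs;

      public FileSystemCheck(FileSystem fs, Configuration conf) {
        this.fs = fs;
        this.conf = conf;
      }

      // Returns true if the fs is usable, possibly after reinitializing the
      // client; a false return is the caller's cue to shut down.
      protected boolean checkFileSystem() {
        if (probe()) {
          return true;
        }
        try {
          // closeAll() clears the FileSystem cache so get() hands back a fresh client.
          FileSystem.closeAll();
          this.fs = FileSystem.get(conf);
        } catch (IOException e) {
          return false;   // reinitialization itself failed
        }
        return probe();   // one more look before giving up
      }

      // A cheap round trip to the namenode to see whether the fs answers at all.
      private boolean probe() {
        try {
          fs.exists(new Path("/"));
          return true;
        } catch (IOException e) {
          return false;
        }
      }
    }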

> Reinitializable DFS client
> --------------------------
>
>                 Key: HBASE-1084
>                 URL: https://issues.apache.org/jira/browse/HBASE-1084
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: io, master, regionserver
>            Reporter: Andrew Purtell
>            Assignee: Evgeny Ryabitskiy
>             Fix For: 0.20.0
>
>         Attachments: HBASE-1084_HRegionServer.java.patch
>
>
> HBase is the only long-lived DFS client. Tasks handle DFS errors by dying. HBase daemons do not, and instead depend on the dfsclient's error recovery capability, but that is not sufficiently developed or tested. Several issues are a result:
> * HBASE-846: hbase looses its mind when hdfs fills
> * HBASE-879: When dfs restarts or moves blocks around, hbase regionservers don't notice
> * HBASE-932: Regionserver restart
> * HBASE-1078: "java.io.IOException: Could not obtain block": allthough file is there and accessible through the dfs client
> * hlog indefinitely hung on getting new blocks from dfs on apurtell cluster
> * regions closed due to transient DFS problems during loaded cluster restart
> These issues might also be related:
> * HBASE-15: Could not complete hdfs write out to flush file forcing regionserver restart
> * HBASE-667: Hung regionserver; hung on hdfs: writeChunk, DFSClient.java:2126, DataStreamer socketWrite
> HBase should reinitialize the fs a few times upon catching fs exceptions, with backoff, to compensate. This can be done by making a wrapper around all fs operations that releases references to the old fs instance and creates and initializes a new instance for the retry (a sketch follows below). All fs users would need to be fixed up to handle loss of state around fs wrapper invocations: hlog, memcache flusher, hstore, etc.
> Cases of clear unrecoverable failure (are there any?) should be excepted.
> Once the fs wrapper is in place, error recovery scenarios can be tested by forcing reinitialization of the fs during PE or other test cases.
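
The wrapper proposed in the description above does not exist yet; the following is only a rough sketch of the retry-with-backoff idea, and the names RecoverableFileSystem and FsOperation are invented for it.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    // Hypothetical wrapper that retries fs operations, reinitializing the
    // FileSystem between attempts.
    public class RecoverableFileSystem {

      // One filesystem operation, expressed as a callback so it can be retried.
      public interface FsOperation<T> {
        T run(FileSystem fs) throws IOException;
      }

      private final Configuration conf;
      private final int maxRetries;
      private final long backoffMillis;
      private FileSystem fs;

      public RecoverableFileSystem(Configuration conf, int maxRetries,
          long backoffMillis) throws IOException {
        this.conf = conf;
        this.maxRetries = maxRetries;
        this.backoffMillis = backoffMillis;
        this.fs = FileSystem.get(conf);
      }

      // Runs the operation, reinitializing the FileSystem and retrying with
      // increasing backoff whenever it throws an IOException.
      public synchronized <T> T execute(FsOperation<T> op) throws IOException {
        IOException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
          try {
            return op.run(fs);
          } catch (IOException e) {
            last = e;
            sleep(backoffMillis * (attempt + 1));   // simple linear backoff
            // Release the old client and create a fresh one for the next attempt.
            FileSystem.closeAll();
            this.fs = FileSystem.get(conf);
          }
        }
        throw last;   // retries exhausted; the caller decides whether this is fatal
      }

      private static void sleep(long millis) throws IOException {
        try {
          Thread.sleep(millis);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          throw new IOException("Interrupted while waiting to retry fs operation");
        }
      }
    }

Callers such as the hlog or the memcache flusher would route their fs calls through execute() and, as noted above, would still have to rebuild any per-caller state (open writers and so on) after a reinitialization.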

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Updated: (HBASE-1084) Reinitializable DFS client

Posted by Ryan Rawson <ry...@gmail.com>.
Wait it out... forever if necessary.

I recall a time when our datacenter got re-wired.  HDFS came back,
map-reduce came back, but every hbase process quit.

I'm thinking of avoiding a similar scenario for someone else.

One issue, however, is determining the difference between 'incorrectly
configured' and 'DFS just down right now'. It's not yet clear to me exactly
how this can be resolved.

-ryan

On Sat, Feb 28, 2009 at 11:23 PM, Jim Kellerman (POWERSET) <Jim.Kellerman@microsoft.com> wrote:

> Ryan,
>
> If the DFS is down, what should we do?
>
> Once we have Zookeeper doing a lot of the current master duties,
> then the master going down is not a cluster-killing event.
>
> If we lose Zookeeper quorum, how do we recover?
>
> ---
> Jim Kellerman, Powerset (Live Search, Microsoft Corporation)
>

RE: [jira] Updated: (HBASE-1084) Reinitializable DFS client

Posted by "Jim Kellerman (POWERSET)" <Ji...@microsoft.com>.
Ryan,

If the DFS is down, what should we do?

Once we have Zookeeper doing a lot of the current master duties,
then the master going down is not a cluster-killing event.

If we lose Zookeeper quorum, how do we recover?

---
Jim Kellerman, Powerset (Live Search, Microsoft Corporation)


> -----Original Message-----
> From: Ryan Rawson [mailto:ryanobjc@gmail.com]
> Sent: Saturday, February 28, 2009 5:12 PM
> To: hbase-dev@hadoop.apache.org
> Subject: Re: [jira] Updated: (HBASE-1084) Reinitializable DFS client
> 
> I don't think it's appropriate to die if the DFS is down.  This forces the
> admin into an active recovery because all the regionservers went away.
> 
> I think regionservers should only die if they will never be able to continue
> on in the future - DFS being down, the master being down, or zookeeper being
> down are not unrecoverable errors.

Re: [jira] Updated: (HBASE-1084) Reinitializable DFS client

Posted by Ryan Rawson <ry...@gmail.com>.
I don't think it's appropriate to die if the DFS is down.  This forces the
admin into an active recovery because all the regionservers went away.

I think regionservers should only die if they will never be able to continue
on in the future - DFS being down, the master being down, or zookeeper being
down are not unrecoverable errors.


