Posted to user@hbase.apache.org by Dan Harvey <da...@mendeley.com> on 2010/06/29 15:43:26 UTC

Rolling out Hadoop/HBase updates

Hey,

I've been thinking about how we do our configuration and code updates for
Hadoop and HBase, and was wondering what others do and what the best
practice is to avoid errors with HBase.

Currently we do a rolling update where we restart the services on one node
at a time: shutting down the region server, then restarting the datanode
and task trackers, depending on what we are updating and what has changed.
But with this I have occasionally found errors with the HBase cluster
afterwards due to a corrupt META table, which I think could have been caused
by restarting the datanode, or by not waiting long enough for the cluster to
sort out losing a region server before moving on to the next.
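
Concretely, the per-node sequence is roughly the following. This is only a
dry-run sketch (the "run" helper echoes each command instead of executing
it), and the daemon script names, the rsync path, and the wait time are
illustrative assumptions about a typical 0.20-era install, not our exact
scripts:

```shell
#!/bin/sh
# Dry-run sketch of the per-node rolling sequence described above.
# "run" prints each command rather than executing it; script names,
# the /staging/conf path, and the 120s wait are hypothetical.
run() { echo "$*"; }

update_node() {
  host="$1"
  run ssh "$host" hbase-daemon.sh stop regionserver
  run sleep 120   # let the master notice and reassign this node's regions
  run ssh "$host" hadoop-daemon.sh stop tasktracker
  run ssh "$host" hadoop-daemon.sh stop datanode
  run ssh "$host" rsync -a /staging/conf/ /etc/hadoop/conf/  # push the update
  run ssh "$host" hadoop-daemon.sh start datanode
  run ssh "$host" hadoop-daemon.sh start tasktracker
  run ssh "$host" hbase-daemon.sh start regionserver
}

update_node node1.example.com
```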

The most recent error upon restarting a node was :-

2010-06-29 10:46:44,970 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: Error closing
files,3822b1ea8ae015f3ec932cafaa282dd211d768ad,1275145898366
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:230)

2010-06-29 10:46:44,970 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: Shutting down
HRegionServer: file system not available
java.io.IOException: File system is not available
        at
org.apache.hadoop.hbase.util.FSUtils.checkFileSystemAvailable(FSUtils.java:129)


Followed by this for every region being served :-

2010-06-29 10:46:44,996 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: Error closing
documents,082595c0-6d01-11df-936c-0026b95e484c,1275676410202
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:230)


After updating all the nodes, all the region servers shut down after a
few minutes, reporting the following :-

2010-06-29 11:21:59,508 WARN org.apache.hadoop.hdfs.DFSClient: Error
Recovery for block blk_-1437671530216085093_2565663 bad datanode[0]
10.0.11.4:50010

2010-06-29 11:22:09,481 FATAL org.apache.hadoop.hbase.regionserver.HLog:
Could not append. Requesting close of hlog
java.io.IOException: All datanodes 10.0.11.4:50010 are bad. Aborting...
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2542)


2010-06-29 11:22:09,482 FATAL
org.apache.hadoop.hbase.regionserver.LogRoller: Log rolling failed with
ioe:
java.io.IOException: All datanodes 10.0.11.4:50010 are bad. Aborting...
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2542)

2010-06-29 11:22:10,344 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to close log in
abort
java.io.IOException: All datanodes 10.0.11.4:50010 are bad. Aborting...
        at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2542)


This was fixed by restarting the master and starting the region servers
again, but it would be nice to know how to roll out changes more cleanly.

How do other people here roll out updates to HBase / Hadoop? What order do
you restart services in and how long do you wait before moving to the next
node?

Just so you know, we currently have 5 nodes and will be adding another 10
soon.

Thanks,

-- 
Dan Harvey | Datamining Engineer
www.mendeley.com/profiles/dan-harvey

Mendeley Limited | London, UK | www.mendeley.com
Registered in England and Wales | Company Number 6419015

Re: Rolling out Hadoop/HBase updates

Posted by Stack <sa...@gmail.com>.
Hey Dan 

Are you using raw Apache Hadoop? If so, any patches? Do you have HDFS-630?

Looking at the errors below, the regionserver thinks the filesystem is gone. Is there nothing in the log before the exception pasted below?

In general you want to let the regionservers finish their shutdown. Any chance you are not letting this happen?

For a rolling restart you should do the master first, then the regionservers. Best results if the cluster is quiet at the time; otherwise regions in transition can be "lost" over a master restart (to be fixed in HBase 0.90.0).

Stack


Re: Rolling out Hadoop/HBase updates

Posted by Stack <st...@duboce.net>.
On Sun, Jul 4, 2010 at 10:36 AM, Dan Harvey <da...@mendeley.com> wrote:
> Just looked into HDFS-630 and it looks like it was added in
> CDH2 0.20.1+169.89, and we're currently on 0.20.1+169.68. Would updating
> to that release, so we have the patch, help prevent some of these issues?
>

For sure Dan.  HDFS-630 will help at a minimum.
St.Ack



Re: Rolling out Hadoop/HBase updates

Posted by Dan Harvey <da...@mendeley.com>.
Just looked into HDFS-630 and it looks like it was added in
CDH2 0.20.1+169.89, and we're currently on 0.20.1+169.68. Would updating
to that release, so we have the patch, help prevent some of these issues?

Thanks,




-- 
Dan Harvey | Datamining Engineer
www.mendeley.com/profiles/dan-harvey

Mendeley Limited | London, UK | www.mendeley.com
Registered in England and Wales | Company Number 6419015

Re: Rolling out Hadoop/HBase updates

Posted by Dan Harvey <da...@mendeley.com>.
Hey,

We're using stock CDH2 without any patches, so I'm not sure if we have
HDFS-630 or not. For HBase we're currently on 0.20.3, and will be testing
and moving to 0.20.5 soon.

What I did with this rollout of just config changes was take one region
server down at a time and restart the datanode on the same server. So from
what I gather, I should have shut down all the region servers before
restarting any of the datanodes?

I guess if I split it into different parts it would be :-

- HBase: rolling updates for point/config releases are supported
  - Update the masters first
  - Then update the region servers in turn

- HDFS: datanodes don't support rolling updates? (Maybe better asked on the
hdfs list, I guess)
  - Take down HBase
  - Take down the datanodes
  - Update all the datanodes' code/configs
  - Start the datanodes
  - Start HBase
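
Put as a dry-run sketch, I imagine the HDFS part would be something like the
following ("run" just echoes each command; the stop/start script names
assume the stock Hadoop/HBase 0.20 scripts, and the per-host update step is
a placeholder):

```shell
#!/bin/sh
# Dry-run sketch of the stop-everything HDFS update: HBase down first,
# then HDFS, update every node, then bring things back in reverse order.
# "run" echoes each command instead of executing it.
run() { echo "$*"; }

full_hdfs_update() {
  run stop-hbase.sh                 # HBase first, so nothing writes to HDFS
  run stop-dfs.sh                   # then the namenode and all datanodes
  for host in "$@"; do
    run ssh "$host" "push new datanode code/configs here"  # placeholder
  done
  run start-dfs.sh                  # bring HDFS back, wait until healthy
  run start-hbase.sh                # HBase last
}

full_hdfs_update node1 node2 node3
```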

Would you be able to let me know which of these I've got right/wrong?

Thanks,




-- 
Dan Harvey | Datamining Engineer
www.mendeley.com/profiles/dan-harvey

Mendeley Limited | London, UK | www.mendeley.com
Registered in England and Wales | Company Number 6419015

RE: Rolling out Hadoop/HBase updates

Posted by Michael Segel <mi...@hotmail.com>.
Dan,

I don't think you can do that, because your 'new/updated' node will clash
with the rest of the cloud. (We're talking code here, not just cloud tuning
parameters, i.e. different jars.)

If you're going to push an update out, then it has to be an 'all or nothing' push.

Since we're using Cloudera's release, moving from CDH2 to CDH3 means a full backup, taking the cloud down, removing the software completely, and then installing the new CDH3. Outside of that major switch, if we were going from one sub-release to another, it would be just a $> yum update hadoop-0.20 call on each node.
Again, you have to take the cloud down to do that.

So the bottom line... if you're going to do upgrades, you'll need to plan for some down time.
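
As a sketch, a sub-release update under that constraint might look like the
following (dry-run: "run" echoes each command; hostnames and script names
are illustrative):

```shell
#!/bin/sh
# Dry-run sketch of an all-or-nothing sub-release update: the whole
# cloud comes down before any node is touched, every node gets the same
# yum update, and everything comes back up together.
run() { echo "$*"; }

sub_release_update() {
  run stop-hbase.sh
  run stop-all.sh                       # HDFS and MapReduce down everywhere
  for host in "$@"; do
    run ssh "$host" yum -y update hadoop-0.20
  done
  run start-all.sh
  run start-hbase.sh
}

sub_release_update node1 node2 node3 node4 node5
```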

HTH

-Mike
