Posted to user@hbase.apache.org by Venkatesh <vr...@aol.com> on 2011/02/10 19:42:57 UTC

region servers shutdown

 

 Hi
I've had this before ..but not to 70% of the cluster..region servers all dying..Any insight is helpful.
Using hbase-0.20.6, hadoop-0.20.2
I don't see any error in the datanode or the namenode
many thanks


Here's the relevant log entries

..in master...
Got while writing region XXXXXXlog java.io.IOException: Bad connect ack with firstBadLink YYYYYYY

2011-02-10 01:31:26,052 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Waiting for hlog writers to terminate, iteration #9
2011-02-10 01:31:28,974 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2845)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)

2011-02-10 01:31:28,975 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_1053173551314261780_21097871 bad datanode[2] nodes == null
2011-02-10 01:31:28,975 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase_data/AAAA/1560386868/oldlogfile.log" - Aborting...


in region server..(one of them)

2011-02-10 01:29:41,028 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2845)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)

2011-02-10 01:29:41,028 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2549916783344080232_21096412 bad datanode[0] nodes == null
2011-02-10 01:29:41,029 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase_data/user_activity/1710495506/activities/8613593457794008999" - Aborting...
2011-02-10 01:29:41,029 FATAL org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog required. Forcing server shutdown
org.apache.hadoop.hbase.DroppedSnapshotException: region: AAAAAAAA,1297217998178
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1041)
        at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:896)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:258)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:231)
        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:154)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readByte(DataInputStream.java:250)
        at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
        at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
        at org.apache.hadoop.io.Text.readString(Text.java:400)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2901)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)






Re: region servers shutdown

Posted by Venkatesh <vr...@aol.com>.
Keys are randomized..Requests are nearly equally distributed across region servers (used to have sequential keys, but that created
region hot spots)..However, the current scheme requires our map reduce job to look for events in all regions (using the hbase time stamp)..
which hurts the map-reduce performance..but it did help the real-time puts
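A minimal sketch of the kind of randomized-key put being described, against a 0.90-era Java client API; the table name "events", family "f", and the MD5-prefix scheme are illustrative assumptions, not the poster's actual code:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.hbase.util.MD5Hash;

    public class RandomizedKeyPut {
      public static void main(String[] args) throws IOException {
        String eventId = args[0];
        long eventTime = System.currentTimeMillis();

        // Hash prefix spreads writes across regions instead of hammering the
        // "latest" region; the trade-off is that time-ordered scans are gone,
        // so a nightly job has to rely on the HBase cell timestamp instead.
        String rowKey = MD5Hash.getMD5AsHex(Bytes.toBytes(eventId)).substring(0, 8) + "-" + eventId;

        HTable table = new HTable(HBaseConfiguration.create(), "events");
        Put put = new Put(Bytes.toBytes(rowKey), eventTime);  // explicit timestamp on the put
        put.add(Bytes.toBytes("f"), Bytes.toBytes("payload"), Bytes.toBytes("..."));
        table.put(put);
        table.close();
      }
    }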

 

 


 

 

-----Original Message-----
From: Ted Dunning <td...@maprtech.com>
To: user@hbase.apache.org
Sent: Thu, Feb 10, 2011 3:45 pm
Subject: Re: region servers shutdown


Are your keys sequential or randomized?

On Thu, Feb 10, 2011 at 12:35 PM, Venkatesh <vr...@aol.com> wrote:

>  iii) Processing about 600 million events per day (real-time put) - 200
> bytes per put. Each event is a row in a hbase table.
> so 600 mill records, 1 column family, 6-10 columns
>
> iv) About 50,000 regions so far.
>
> v) we run map reduce job every nite that takes the 600 mil records &
> updates/creates aggregate data (1 get per record)
> aggregate data translates to 25 mill..x 3 puts
>


 

Re: region servers shutdown

Posted by Ted Dunning <td...@maprtech.com>.
Are your keys sequential or randomized?

On Thu, Feb 10, 2011 at 12:35 PM, Venkatesh <vr...@aol.com> wrote:

>  iii) Processing about 600 million events per day (real-time put) - 200
> bytes per put. Each event is a row in a hbase table.
> so 600 mill records, 1 column family, 6-10 columns
>
> iv) About 50,000 regions so far.
>
> v) we run map reduce job every nite that takes the 600 mil records &
> updates/creates aggregate data (1 get per record)
> aggregate data translates to 25 mill..x 3 puts
>

Re: region servers shutdown

Posted by Venkatesh <vr...@aol.com>.
Thanks J-D
will increase MAX_FILESIZE as you suggested...I could truncate one of the tables, which constitutes
80% of the regions

will try compression after that


 

Re: region servers shutdown

Posted by Jean-Daniel Cryans <jd...@apache.org>.
2500 regions per region server can be a lot of files to keep open,
which is probably one of the main reasons for your instability (as your
regions were growing, it started poking into those dark corners of
xcievers and eventually ulimits).

You need to set your regions to be bigger, and use LZO compression to
further lower the cost of storing those events and at the same time
improve performance across the board. Check the MAX_FILESIZE config
for your table in the shell; I would recommend 1GB instead of the
default 256MB.

Then, follow this wiki to set up LZO:
http://wiki.apache.org/hadoop/UsingLzoCompression
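For concreteness, a rough example of what that could look like in a 0.90-era HBase shell; the table name 'events' and family 'f' are made up, and the table has to be disabled before altering:

    hbase> disable 'events'
    hbase> alter 'events', {METHOD => 'table_att', MAX_FILESIZE => '1073741824'}
    hbase> alter 'events', {NAME => 'f', COMPRESSION => 'LZO'}
    hbase> enable 'events'

Raising MAX_FILESIZE only changes when future splits happen; it does not shrink the existing region count, which is the point about copying to a new table below.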

Finally, you cannot merge regions (as was said in other threads
this week) to bring the count back down, so one option you might
consider is copying all the content from the first table to a second,
better-configured table. It's probably going to be a pain to do in
0.20.6 because you cannot create a table with multiple regions, so
maybe that would be another reason to upgrade :)

Oh, and one other thing: if your zk servers are of the same class of
hardware as the region servers and you're not using them for anything
other than HBase, then you should only use 1 zk server and collocate it
with the master and the namenode, then use those 3 machines as region
servers to help spread the region load.

J-D

On Thu, Feb 10, 2011 at 12:35 PM, Venkatesh <vr...@aol.com> wrote:
> Thanks J-D..
> I was quite happy in the 1st 3 months..Last month or so, lots of instabilities..
>
> i) It's good to know that 0.90.x fixes lots of instabilities..will consider upgrading..It is not listed as "stable production release"
>  hence the hesitation :)
>
>
>  ii) Our cluster is 20 - node (20 data nodes + 20 region servers) (data/region server on every box)..besides that
>   1 name node, 1 hmaster, 3 zookeper all on diff physical machines
>
> iii) hardware pentium .., 36 gig memory on each node
>
>
>  iii) Processing about 600 million events per day (real-time put) - 200 bytes per put. Each event is a row in a hbase table.
> so 600 mill records, 1 column family, 6-10 columns
>
> iv) About 50,000 regions so far.
>
> v) we run map reduce job every nite that takes the 600 mil records & updates/creates aggregate data (1 get per record)
> aggregate data translates to 25 mill..x 3 puts
>
>
>
> vi) region splits occur quite frequently..every 5 minutes or so
>
> How big are the tables?
> - have n't run a count on tables lately..
> - events table we keep for 90 days - 600 mill record per day..we process each days data
> - 3 additional tables for aggregate.
>
> How many region servers
> - 20
> and how many regions do they serve?
> - 50,000 regions..x-new regions get created every day..(don't have that #)
>
> Are you using lots of families per table?
> - No..just 1 family in all tables...# of columns < 20
>
> Are you using LZO compression?
> - NO
>
>
> thanks again for your help

Re: region servers shutdown

Posted by Venkatesh <vr...@aol.com>.
Thanks J-D..
I was quite happy in the 1st 3 months..Last month or so, lots of instabilities..

i) It's good to know that 0.90.x fixes lots of instabilities..will consider upgrading..It is not listed as "stable production release"
  hence the hesitation :)


 ii) Our cluster is 20-node (20 data nodes + 20 region servers) (data/region server on every box)..besides that
   1 name node, 1 hmaster, 3 zookeeper all on diff physical machines

iii) hardware pentium .., 36 gig memory on each node


 iii) Processing about 600 million events per day (real-time put) - 200 bytes per put. Each event is a row in a hbase table.
so 600 mill records, 1 column family, 6-10 columns

iv) About 50,000 regions so far.

v) we run a map reduce job every night that takes the 600 mil records & updates/creates aggregate data (1 get per record)
aggregate data translates to 25 mill..x 3 puts (see the sketch at the end of this message)


 
vi) region splits occur quite frequently..every 5 minutes or so

How big are the tables? 
- haven't run a count on the tables lately..
- events table we keep for 90 days - 600 mill records per day..we process each day's data
- 3 additional tables for aggregate.

How many region servers 
- 20
and how many regions do they serve? 
- 50,000 regions..x-new regions get created every day..(don't have that #)

Are you using lots of families per table? 
- No..just 1 family in all tables...# of columns < 20 

Are you using LZO compression?
- NO


thanks again for your help
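A rough sketch of what a nightly pass like the one in (v) could look like with the TableMapReduceUtil API, restricting the scan to one day's worth of cells via the HBase timestamp; the table names "events" and "aggregates", family "f", and the trivial aggregation are illustrative assumptions, not the actual job:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.IdentityTableReducer;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.Job;

    public class DailyAggregateJob {

      // Turns each event row into a Put against the aggregate table.
      static class EventMapper extends TableMapper<ImmutableBytesWritable, Put> {
        @Override
        protected void map(ImmutableBytesWritable row, Result event, Context ctx)
            throws IOException, InterruptedException {
          Put p = new Put(row.get());  // real aggregate-key derivation omitted
          p.add(Bytes.toBytes("f"), Bytes.toBytes("count"), Bytes.toBytes(1L));
          ctx.write(row, p);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        long dayStart = Long.parseLong(args[0]);  // epoch millis, start of day
        long dayEnd = Long.parseLong(args[1]);    // epoch millis, end of day

        Scan scan = new Scan();
        scan.setTimeRange(dayStart, dayEnd);  // only cells written during that day
        scan.setCaching(500);                 // bigger scanner batches for MR
        scan.setCacheBlocks(false);           // don't churn the block cache

        Job job = new Job(conf, "daily-aggregate");
        job.setJarByClass(DailyAggregateJob.class);
        TableMapReduceUtil.initTableMapperJob("events", scan, EventMapper.class,
            ImmutableBytesWritable.class, Put.class, job);
        TableMapReduceUtil.initTableReducerJob("aggregates", IdentityTableReducer.class, job);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }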




 



 

Re: region servers shutdown

Posted by Jean-Daniel Cryans <jd...@apache.org>.
I see you are running on a very old version of hbase, and under that
you have a version of hadoop that doesn't support appends, so you are
bound to have data loss on machine failure and when a region server
needs to abort like it just did.

I suggest you upgrade to 0.90.0, or even consider the release
candidate of 0.90.1, which can be found at
http://people.apache.org/~todd/hbase-0.90.1.rc0/; this is going to
help solve a lot of stability problems.

Also, if you were able to reach 4097 xceivers on your datanodes, it
means that you are keeping a LOT of files open. This suggests that
you either have a very small cluster or way too many files. Can you
tell us more about your cluster? How big are the tables? How many
region servers and how many regions do they serve? Are you using lots
of families per table? Are you using LZO compression?

Thanks for helping us help you :)

J-D

On Thu, Feb 10, 2011 at 11:32 AM, Venkatesh <vr...@aol.com> wrote:
> Thanks J-D..
> Can't believe i missed that..I have had it before ..i did look for it..(not hard/carefull enough, i guess)
> this time deflt that's the reason
>
> ...xceiverCount 4097 exceeds the limit of concurrent xcievers 4096...
>
> ..thinking of doubling this..
>
> I've had had so many issues in the last month..holes in meta, data node hung,..etc..this time it
> was enmass

Re: region servers shutdown

Posted by Venkatesh <vr...@aol.com>.
Thanks J-D..
Can't believe I missed that..I have had it before..I did look for it..(not hard/carefully enough, I guess)
this time that's definitely the reason

...xceiverCount 4097 exceeds the limit of concurrent xcievers 4096...

..thinking of doubling this..

I've had so many issues in the last month..holes in meta, data node hung, etc...this time it
was en masse
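For reference, the limit being hit is the datanode-side dfs.datanode.max.xcievers property (note the historical misspelling) in hdfs-site.xml; doubling it as described would look roughly like this, followed by a datanode restart:

    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>8192</value>
    </property>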

 

 


 

 



 

Re: region servers shutdown

Posted by Jean-Daniel Cryans <jd...@apache.org>.
The first thing to do would be to look at the datanode logs at the time
of the outage. Very often it's caused by either ulimit or xcievers
that weren't properly configured; check out
http://hbase.apache.org/notsoquick.html#ulimit

J-D
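As a companion illustration for the ulimit side (the user name "hadoop" and the value 32768 are just commonly recommended defaults, not taken from this thread): raise the open-file limit for whatever user runs the datanode and region server, e.g. in /etc/security/limits.conf, then log in again and verify:

    hadoop  -  nofile  32768

    $ ulimit -n
    32768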

On Thu, Feb 10, 2011 at 10:42 AM, Venkatesh <vr...@aol.com> wrote:
>
>
>
>  Hi
> I've had this before ..but not to 70% of the cluster..region servers all dying..Any insight is helpful.
> Using hbase-0.20.6, hadoop-0.20.2
> I don't see any error in the datanode or the namenode
> many thanks
>
>
> Here's the relevant log entires
>
> ..in master...
> Got while writing region XXXXXXlog java.io.IOException: Bad connect ack with firstBadLink YYYYYYY
>
> 2011-02-10 01:31:26,052 DEBUG org.apache.hadoop.hbase.regionserver.HLog: Waiting for hlog writers to terminate, iteration #9
> 2011-02-10 01:31:28,974 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2845)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)
>
> 2011-02-10 01:31:28,975 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_1053173551314261780_21097871 bad datanode[2] nodes == null
> 2011-02-10 01:31:28,975 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase_data/AAAA/1560386868/oldlogfile.log" - Aborting...
>
>
> in region server..(one of them)
>
> 2011-02-10 01:29:41,028 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2845)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)
>
> 2011-02-10 01:29:41,028 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_2549916783344080232_21096412 bad datanode[0] nodes == null
> 2011-02-10 01:29:41,029 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/hbase_data/user_activity/1710495506/activities/8613593457794008999" - Aborting...
> 2011-02-10 01:29:41,029 FATAL org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Replay of hlog required. Forcing server shutdown
> org.apache.hadoop.hbase.DroppedSnapshotException: region: AAAAAAAA,1297217998178
>        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1041)
>        at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:896)
>        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:258)
>        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:231)
>        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:154)
> Caused by: java.io.EOFException
>        at java.io.DataInputStream.readByte(DataInputStream.java:250)
>        at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
>        at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
>        at org.apache.hadoop.io.Text.readString(Text.java:400)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2901)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288)
>
>
>
>
>
>