Posted to user@hbase.apache.org by Chris Tarnas <cf...@email.com> on 2011/03/02 10:03:32 UTC

Errors in regionserver logs

Under heavy load I've seen a few EOFException errors in my regionserver logs:

2011-03-02 02:27:03,669 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening sequence,h7BpVjo07UDYrkBZBLwWfg\x09fc00fc97be11e00d731605f8e061462c-A2610001-1\x09,1298335975607.8a5d1e4a300792d74f516ba26de869c8.
java.io.EOFException: hdfs://lxbt006-pvt:8020/hbase/sequence/8a5d1e4a300792d74f516ba26de869c8/recovered.edits/0000000000054475364, entryStart=2336278916, pos=2336278916, end=4672557832, edit=13370
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)

Checking the same timeframe in the namenode logs on lxbt006-pvt reveals no ominous messages (no warnings or errors), just the same file being opened by a different node:

2011-03-02 02:27:05,466 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=hadoop      ip=/10.56.24.13 cmd=open        src=/hbase/sequence/8a5d1e4a300792d74f516ba26de869c8/recovered.edits/0000000000054475364        dst=null        perm=null
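
When scanning audit logs for a single file like this, it can help to split each line into its key=value fields. The following is only a rough sketch: the class name AuditLineFields is hypothetical, and it assumes values contain no whitespace, which holds for the line above but may not hold for every audit entry.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hadoop or HBase.
class AuditLineFields {
    // Matches key=value pairs; assumes values contain no whitespace.
    private static final Pattern KEY_VALUE = Pattern.compile("(\\w+)=(\\S+)");

    // Returns the key=value fields of one audit line, in order of appearance.
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = KEY_VALUE.matcher(line);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        // The audit line quoted in this message.
        String line = "2011-03-02 02:27:05,466 INFO "
            + "org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: "
            + "ugi=hadoop      ip=/10.56.24.13 cmd=open        "
            + "src=/hbase/sequence/8a5d1e4a300792d74f516ba26de869c8/recovered.edits/0000000000054475364"
            + "        dst=null        perm=null";
        Map<String, String> f = parse(line);
        System.out.println(f.get("ip") + " " + f.get("cmd") + " " + f.get("src"));
    }
}
```

Filtering on the src field makes it easy to see which client addresses opened the recovered.edits file and when.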


The Troubleshooting Wiki says this is related to swapping, but none of the nodes are swapping; they all have plenty of RAM. Are there other common causes? Is this something I should be worried about, or are these just "normal" exceptions? Is there anything else I should look for? I'm on cdh3b3 and will be moving to b4 once I get a chance to run it through a test cluster.

thank you,
-chris

Re: Errors in regionserver logs

Posted by Jean-Daniel Cryans <jd...@apache.org>.

Oh, well that's very nice to hear, Chris!

Let us know if there's anything we can improve in this new release,

J-D

On Wed, Mar 2, 2011 at 4:23 PM, Chris Tarnas <cf...@email.com> wrote:
> Thanks for your help. I pushed up my upgrade plans and just finished installing 0.90.1 (cdh3b4). That fixed the EOF error, and my initial testing also shows a general performance boost.
>
> -chris

Re: Errors in regionserver logs

Posted by Chris Tarnas <cf...@email.com>.

Thanks for your help. I pushed up my upgrade plans and just finished installing 0.90.1 (cdh3b4). That fixed the EOF error, and my initial testing also shows a general performance boost.

-chris

On Mar 2, 2011, at 9:18 AM, Jean-Daniel Cryans wrote:

> I think you could instead try applying both patches to whatever
> you're running right now; they are pretty small.
> 
> Another option is using the version of 0.89 we're using here in
> production, which is already patched: https://github.com/stumbleupon/hbase
> 
> J-D

Re: Errors in regionserver logs

Posted by Jean-Daniel Cryans <jd...@apache.org>.

I think you could instead try applying both patches to whatever
you're running right now; they are pretty small.

Another option is using the version of 0.89 we're using here in
production, which is already patched: https://github.com/stumbleupon/hbase

J-D

On Wed, Mar 2, 2011 at 8:55 AM, Chris Tarnas <cf...@email.com> wrote:
> If HBASE-3038 is the problem, is there anything I should be aware of when upgrading while this region is in this state?
>
> thanks,
> -chris

Re: Errors in regionserver logs

Posted by Chris Tarnas <cf...@email.com>.

If HBASE-3038 is the problem, is there anything I should be aware of when upgrading while this region is in this state?

thanks,
-chris

On Mar 2, 2011, at 8:22 AM, Chris Tarnas wrote:

> I'm pretty sure I hit HBASE-3038: the recovered.edits file is over 2 GB.
> 
> I'll push up my upgrade plans.
> 
> -chris

Re: Errors in regionserver logs

Posted by Chris Tarnas <cf...@email.com>.

I'm pretty sure I hit HBASE-3038: the recovered.edits file is over 2 GB.

I'll push up my upgrade plans.

-chris
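
To illustrate why a file over 2 GB can be a problem, here is a minimal sketch using the entryStart offset from the EOFException message. It is not the HBase code and does not claim to reproduce the actual bug in HBASE-3038; the class and method names are hypothetical. It only shows that the logged offset no longer fits in a 32-bit signed int, so any code path that narrows such a position to int would get a nonsensical negative value.

```java
// Hypothetical demo class, not taken from HBase.
class OffsetOverflowDemo {
    // Offset copied from the EOFException message in this thread.
    static final long ENTRY_START = 2336278916L;

    // True when the offset is larger than the biggest signed 32-bit value.
    static boolean exceedsIntRange(long offset) {
        return offset > Integer.MAX_VALUE;
    }

    // Narrowing a long to int keeps only the low 32 bits.
    static int truncateToInt(long offset) {
        return (int) offset;
    }

    public static void main(String[] args) {
        System.out.println("Integer.MAX_VALUE = " + Integer.MAX_VALUE);
        System.out.println("exceeds int range = " + exceedsIntRange(ENTRY_START));
        System.out.println("truncated to int  = " + truncateToInt(ENTRY_START));
    }
}
```

Integer.MAX_VALUE is 2147483647, so the logged position of 2336278916 is past the 2 GB mark, and truncating it to int gives -1958688380.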

On Mar 2, 2011, at 2:44 AM, Chris Tarnas wrote:

> Actually, I see now that this EOFException is keeping a region offline. Is there any way around this error to bring the region back online? I don't have the regionserver logs from when it went offline, but here is the section of the master log from then:
> 
> http://pastebin.com/4ZBKGbnZ
> 
> thanks again
> -chris

Re: Errors in regionserver logs

Posted by Chris Tarnas <cf...@email.com>.

Actually, I see now that this EOFException is keeping a region offline. Is there any way around this error to bring the region back online? I don't have the regionserver logs from when it went offline, but here is the section of the master log from then:

http://pastebin.com/4ZBKGbnZ

thanks again
-chris

On Mar 2, 2011, at 1:03 AM, Chris Tarnas wrote:

> Under heavy load I've seen a few EOFException errors in my regionserver logs:
> 
> 2011-03-02 02:27:03,669 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening sequence,h7BpVjo07UDYrkBZBLwWfg\x09fc00fc97be11e00d731605f8e061462c-A2610001-1\x09,1298335975607.8a5d1e4a300792d74f516ba26de869c8.
> java.io.EOFException: hdfs://lxbt006-pvt:8020/hbase/sequence/8a5d1e4a300792d74f516ba26de869c8/recovered.edits/0000000000054475364, entryStart=2336278916, pos=2336278916, end=4672557832, edit=13370
> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> 
> Checking the same timeframe in the namenode logs on lxbt006-pvt reveals no ominous messages (no warnings or errors), just the same file being opened by a different node:
> 
> 2011-03-02 02:27:05,466 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=hadoop      ip=/10.56.24.13 cmd=open        src=/hbase/sequence/8a5d1e4a300792d74f516ba26de869c8/recovered.edits/0000000000054475364        dst=null        perm=null
> 
> 
> The Troubleshooting Wiki says this is related to swapping, but none of the nodes are swapping; they all have plenty of RAM. Are there other common causes? Is this something I should be worried about, or are these just "normal" exceptions? Is there anything else I should look for? I'm on cdh3b3 and will be moving to b4 once I get a chance to run it through a test cluster.
> 
> thank you,
> -chris