You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hbase.apache.org by Jack Levin <ma...@gmail.com> on 2010/09/25 00:07:27 UTC

Regions loading too fast

My regions are 1gb in size and when I cold start the cluster I oversaturate my network links (1000 mbps) and get client dfs timeouts , anyway to slow the m down?

-Jack

Re: Regions loading too fast

Posted by Stack <st...@duboce.net>.

Jack:  You might want to try applying hbase-3038 (there's two patches
up there.  you'll need both).  Thought is that it might be cause of
the EOFE you were running into (even though your files seemed less
than the 2G that hbase-3038 is about).
St.Ack

On Tue, Sep 28, 2010 at 12:12 PM, Stack <st...@duboce.net> wrote:
> I made https://issues.apache.org/jira/browse/HBASE-3046 for looking
> into this.  We though we're repro'd it here but it seems like we were
> running into hbase-3038... which was not your case, at least, not for
> the two files you made available to me.
>
> St.Ack
>
> On Fri, Sep 24, 2010 at 4:52 PM, Jack Levin <ma...@gmail.com> wrote:
>> It was EOF exception, but now that I deleted edits files:
>>
>> Moved to trash:
>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1062260343/recovered.edits/0000000000617305532
>> Moved to trash:
>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1321772129/recovered.edits/0000000000617328530
>> Moved to trash:
>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/257974055/recovered.edits/0000000000617238642
>> Moved to trash:
>> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/117679080/recovered.edits/0000000000617306059
>> Moved to trash:
>> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/221569766/recovered.edits/0000000000617242019
>>
>> Like these.  All of the regions have loaded... What could that have
>> been?   I assume I lost some writes, but this is not a big deal to
>> me... question is how to avoid something like that, is that a bug?
>>
>> -Jack
>>
>>
>> On Fri, Sep 24, 2010 at 4:44 PM, Stack <st...@duboce.net> wrote:
>>> What is the complaint in regionserver log when region load fails?
>>> St.Ack
>>>
>>> On Fri, Sep 24, 2010 at 4:40 PM, Jack Levin <ma...@gmail.com> wrote:
>>>> so, datanode log shows no errors whatsoever, however I do see same
>>>> blocks fetched repeatedly, and the network speed is quite high, but I
>>>> am unable to load _some_ regions, what could it be?
>>>>
>>>> 2010-09-24 16:38:42,729 INFO
>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>> /10.101.6.2:50010, dest: /10.101.6.2:53038, bytes: 914, op: HDFS_READ,
>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>> offset: 13803520, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>> blockid: blk_5556468858269577961_1550101, duration: 127413
>>>> 2010-09-24 16:38:44,317 INFO
>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>> /10.101.6.2:50010, dest: /10.101.6.2:53048, bytes: 110, op: HDFS_READ,
>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>> offset: 32723968, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>> blockid: blk_364673737339632029_1347910, duration: 1140653
>>>> 2010-09-24 16:38:44,318 INFO
>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>> /10.101.6.2:50010, dest: /10.101.6.2:53049, bytes: 38294, op:
>>>> HDFS_READ, cliID:
>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>> offset: 32686080, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>> blockid: blk_364673737339632029_1347910, duration: 691929
>>>> 2010-09-24 16:38:44,510 INFO
>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>> /10.101.6.2:50010, dest: /10.101.6.2:53054, bytes: 18021300, op:
>>>> HDFS_READ, cliID:
>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>> offset: 0, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>> blockid: blk_-3781179144642915580_1571141, duration: 173548261
>>>> 2010-09-24 16:38:44,525 INFO
>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>> /10.101.6.2:50010, dest: /10.101.6.2:53055, bytes: 506, op: HDFS_READ,
>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>> offset: 48700928, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>> blockid: blk_-176750251227749356_1535293, duration: 77045
>>>> 2010-09-24 16:38:44,526 INFO
>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>> /10.101.6.2:50010, dest: /10.101.6.2:53056, bytes: 6182, op:
>>>> HDFS_READ, cliID:
>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>> offset: 48695296, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>> blockid: blk_-176750251227749356_1535293, duration: 128270
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Sep 24, 2010 at 4:32 PM, Stack <st...@duboce.net> wrote:
>>>>> (Good one Ryan)
>>>>>
>>>>> Master is doing the assigning.  It needs to be restarted to see the
>>>>> config change.
>>>>>
>>>>> St.Ack
>>>>>
>>>>> On Fri, Sep 24, 2010 at 4:28 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>> Only regionserver, do I need to restart both?
>>>>>>
>>>>>> -jack
>>>>>>
>>>>>> On Fri, Sep 24, 2010 at 4:22 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>> Did you restart the master and the regionserver? Or just one or the other?
>>>>>>>
>>>>>>> -ryan
>>>>>>>
>>>>>>> On Fri, Sep 24, 2010 at 4:21 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>> Also, even with '1' value, I see:
>>>>>>>>
>>>>>>>> 2010-09-24 16:20:29,983 INFO
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>> img834,1000351n.jpg,1285251664421.d09510a16c0cfd0d8a251a36229125e0.
>>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>> img651,pict1408.jpg,1285018965749.110871465
>>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>> img806,sam0084a.jpg,1285324613056.82a1e8ba8d2a37a591a847fb36803c45.
>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>> img535,screenshot1bt.png,1285323376435.fae5f3ab474196c99f10b8a936fb9ead.
>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>> img838,123468.jpg,1285223690165.a2903008621d1a6b6ca02441bf3b68ea.
>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>> img839,yug.jpg,1285230318537.c09323dbaf54130671df2a14d671fe25.
>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>> img821,vlcsnap78737.png,1285283076812.ea4973ce6e43d7f974613c5989647278.
>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>> img805,njt30scbkdmb.gif,1285322429401.f9aacdafd8064bfbcc8cd4f6930febbe.
>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>> img94,img1711m.jpg,1285016850260.1424182007
>>>>>>>> 2010-09-24 16:20:29,986 DEBUG
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegion: Creating region
>>>>>>>> img840,kitbarca2.png,1285189312696.1ce170ec09384fca51297a5fe7aeb4af.
>>>>>>>>
>>>>>>>> Which is pretty close to concurrent.
>>>>>>>>
>>>>>>>> -Jack
>>>>>>>>
>>>>>>>> On Fri, Sep 24, 2010 at 4:16 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>> Still having a problem:
>>>>>>>>>
>>>>>>>>> 2010-09-24 16:15:02,572 ERROR
>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
>>>>>>>>> img695,p1908101232.jpg,1285288492084.d451f05024b42f71a115951c62cdcccf.
>>>>>>>>> java.io.EOFException
>>>>>>>>>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>>>>>>>        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>>>>>>>>>        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>>>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
>>>>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I changed the value to '1', and restarted the regionserver... Note
>>>>>>>>> that my hdfs is not having a problem.
>>>>>>>>>
>>>>>>>>> -Jack
>>>>>>>>>
>>>>>>>>> On Fri, Sep 24, 2010 at 4:01 PM, Stack <st...@duboce.net> wrote:
>>>>>>>>>> Try
>>>>>>>>>>
>>>>>>>>>>  <property>
>>>>>>>>>>    <name>hbase.regions.percheckin</name>
>>>>>>>>>>    <value>10</value>
>>>>>>>>>>    <description>Maximum number of regions that can be assigned in a single go
>>>>>>>>>>    to a region server.
>>>>>>>>>>    </description>
>>>>>>>>>>  </property>
>>>>>>>>>>
>>>>>>>>>> What do you have now?  Whatever it is, go down from there.
>>>>>>>>>>
>>>>>>>>>> St.Ack
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 24, 2010 at 3:07 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>> My regions are 1gb in size and when I cold start the cluster I oversaturate my network links (1000 mbps) and get client dfs timeouts , anyway to slow the m down?
>>>>>>>>>>>
>>>>>>>>>>> -Jack
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Regions loading too fast

Posted by Stack <st...@duboce.net>.

I made https://issues.apache.org/jira/browse/HBASE-3046 for looking
into this.  We though we're repro'd it here but it seems like we were
running into hbase-3038... which was not your case, at least, not for
the two files you made available to me.

St.Ack

On Fri, Sep 24, 2010 at 4:52 PM, Jack Levin <ma...@gmail.com> wrote:
> It was EOF exception, but now that I deleted edits files:
>
> Moved to trash:
> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1062260343/recovered.edits/0000000000617305532
> Moved to trash:
> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1321772129/recovered.edits/0000000000617328530
> Moved to trash:
> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/257974055/recovered.edits/0000000000617238642
> Moved to trash:
> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/117679080/recovered.edits/0000000000617306059
> Moved to trash:
> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/221569766/recovered.edits/0000000000617242019
>
> Like these.  All of the regions have loaded... What could that have
> been?   I assume I lost some writes, but this is not a big deal to
> me... question is how to avoid something like that, is that a bug?
>
> -Jack
>
>
> On Fri, Sep 24, 2010 at 4:44 PM, Stack <st...@duboce.net> wrote:
>> What is the complaint in regionserver log when region load fails?
>> St.Ack
>>
>> On Fri, Sep 24, 2010 at 4:40 PM, Jack Levin <ma...@gmail.com> wrote:
>>> so, datanode log shows no errors whatsoever, however I do see same
>>> blocks fetched repeatedly, and the network speed is quite high, but I
>>> am unable to load _some_ regions, what could it be?
>>>
>>> 2010-09-24 16:38:42,729 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>> /10.101.6.2:50010, dest: /10.101.6.2:53038, bytes: 914, op: HDFS_READ,
>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>> offset: 13803520, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>> blockid: blk_5556468858269577961_1550101, duration: 127413
>>> 2010-09-24 16:38:44,317 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>> /10.101.6.2:50010, dest: /10.101.6.2:53048, bytes: 110, op: HDFS_READ,
>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>> offset: 32723968, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>> blockid: blk_364673737339632029_1347910, duration: 1140653
>>> 2010-09-24 16:38:44,318 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>> /10.101.6.2:50010, dest: /10.101.6.2:53049, bytes: 38294, op:
>>> HDFS_READ, cliID:
>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>> offset: 32686080, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>> blockid: blk_364673737339632029_1347910, duration: 691929
>>> 2010-09-24 16:38:44,510 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>> /10.101.6.2:50010, dest: /10.101.6.2:53054, bytes: 18021300, op:
>>> HDFS_READ, cliID:
>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>> offset: 0, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>> blockid: blk_-3781179144642915580_1571141, duration: 173548261
>>> 2010-09-24 16:38:44,525 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>> /10.101.6.2:50010, dest: /10.101.6.2:53055, bytes: 506, op: HDFS_READ,
>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>> offset: 48700928, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>> blockid: blk_-176750251227749356_1535293, duration: 77045
>>> 2010-09-24 16:38:44,526 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>> /10.101.6.2:50010, dest: /10.101.6.2:53056, bytes: 6182, op:
>>> HDFS_READ, cliID:
>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>> offset: 48695296, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>> blockid: blk_-176750251227749356_1535293, duration: 128270
>>>
>>>
>>>
>>>
>>> On Fri, Sep 24, 2010 at 4:32 PM, Stack <st...@duboce.net> wrote:
>>>> (Good one Ryan)
>>>>
>>>> Master is doing the assigning.  It needs to be restarted to see the
>>>> config change.
>>>>
>>>> St.Ack
>>>>
>>>> On Fri, Sep 24, 2010 at 4:28 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>> Only regionserver, do I need to restart both?
>>>>>
>>>>> -jack
>>>>>
>>>>> On Fri, Sep 24, 2010 at 4:22 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>> Did you restart the master and the regionserver? Or just one or the other?
>>>>>>
>>>>>> -ryan
>>>>>>
>>>>>> On Fri, Sep 24, 2010 at 4:21 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>> Also, even with '1' value, I see:
>>>>>>>
>>>>>>> 2010-09-24 16:20:29,983 INFO
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>> img834,1000351n.jpg,1285251664421.d09510a16c0cfd0d8a251a36229125e0.
>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>> img651,pict1408.jpg,1285018965749.110871465
>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>> img806,sam0084a.jpg,1285324613056.82a1e8ba8d2a37a591a847fb36803c45.
>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>> img535,screenshot1bt.png,1285323376435.fae5f3ab474196c99f10b8a936fb9ead.
>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>> img838,123468.jpg,1285223690165.a2903008621d1a6b6ca02441bf3b68ea.
>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>> img839,yug.jpg,1285230318537.c09323dbaf54130671df2a14d671fe25.
>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>> img821,vlcsnap78737.png,1285283076812.ea4973ce6e43d7f974613c5989647278.
>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>> img805,njt30scbkdmb.gif,1285322429401.f9aacdafd8064bfbcc8cd4f6930febbe.
>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>> img94,img1711m.jpg,1285016850260.1424182007
>>>>>>> 2010-09-24 16:20:29,986 DEBUG
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegion: Creating region
>>>>>>> img840,kitbarca2.png,1285189312696.1ce170ec09384fca51297a5fe7aeb4af.
>>>>>>>
>>>>>>> Which is pretty close to concurrent.
>>>>>>>
>>>>>>> -Jack
>>>>>>>
>>>>>>> On Fri, Sep 24, 2010 at 4:16 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>> Still having a problem:
>>>>>>>>
>>>>>>>> 2010-09-24 16:15:02,572 ERROR
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
>>>>>>>> img695,p1908101232.jpg,1285288492084.d451f05024b42f71a115951c62cdcccf.
>>>>>>>> java.io.EOFException
>>>>>>>>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>>>>>>        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>>>>>>>>        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
>>>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
>>>>>>>>
>>>>>>>>
>>>>>>>> I changed the value to '1', and restarted the regionserver... Note
>>>>>>>> that my hdfs is not having a problem.
>>>>>>>>
>>>>>>>> -Jack
>>>>>>>>
>>>>>>>> On Fri, Sep 24, 2010 at 4:01 PM, Stack <st...@duboce.net> wrote:
>>>>>>>>> Try
>>>>>>>>>
>>>>>>>>>  <property>
>>>>>>>>>    <name>hbase.regions.percheckin</name>
>>>>>>>>>    <value>10</value>
>>>>>>>>>    <description>Maximum number of regions that can be assigned in a single go
>>>>>>>>>    to a region server.
>>>>>>>>>    </description>
>>>>>>>>>  </property>
>>>>>>>>>
>>>>>>>>> What do you have now?  Whatever it is, go down from there.
>>>>>>>>>
>>>>>>>>> St.Ack
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Sep 24, 2010 at 3:07 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>> My regions are 1gb in size and when I cold start the cluster I oversaturate my network links (1000 mbps) and get client dfs timeouts , anyway to slow the m down?
>>>>>>>>>>
>>>>>>>>>> -Jack
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Regions loading too fast

Posted by Stack <st...@duboce.net>.

Oh, yeah, could be the OOME... But is OOME in regionserver or in
Master?  If in regionserver, I'd think that splitting we'd skip the
incomplete record.  Something is going on here.  I'd like to figure
it.  The flag should get you going in that you'll recover all up to
the last edit.

St.Ack


On Mon, Sep 27, 2010 at 11:48 AM, Jack Levin <ma...@gmail.com> wrote:
> Yay, my first bug.  I have not tried the flag yet (thanks), I will do
> that shortly.   I was not able to find those files are created.  Its
> likely the last addition not flushed properly as the server runs out
> of memory, so may as well call it lost data (in our case this is fine,
> we have source data permanently stored).
>
> -Jack
>
> On Mon, Sep 27, 2010 at 11:02 AM, Stack <st...@duboce.net> wrote:
>> I took a look.  When we go to read the last edit in each file, we
>> overshoot.  Its as though the last addition is not properly flushed.
>> I looked at the writing of the recovered.edits file contents and we're
>> using straight sequencefile#close so flush should be going on. You
>> pasted again from regionserver log.  Could you look for these files in
>> master log, where they were created, and see if any exceptions around
>> their creation? (I've been playing over here w/ crashing servers and I
>> don't have the issues you are running into over here on replay of
>> recovered edits -- I'll play some more....)
>>
>> Did you try setting the flag suggested in previous email?
>>
>> Thanks Jack,
>> St.Ack
>>
>>
>> On Sun, Sep 26, 2010 at 5:36 PM, Jack Levin <ma...@gmail.com> wrote:
>>> They were about 64MB, I will put them somewhere... like here:
>>> http://img2.imageshack.us/0000000000618601094
>>> http://img2.imageshack.us/0000000000618601136
>>>
>>> They are not gzippable, sorry... full of jpeg data I think.
>>> Here is an error snipper from master: http://pastebin.com/TdbYbDyy
>>>
>>> -Jack
>>>
>>> On Sun, Sep 26, 2010 at 4:04 PM, Stack <st...@duboce.net> wrote:
>>>> On Sun, Sep 26, 2010 at 1:53 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>> I had the same issue this morning, some of the regions
>>>>> 'recovered.edits' was corrupt and no single region server was able to
>>>>> load them.  I saved them if someone is interested to see why they can
>>>>> not be processed.
>>>>
>>>>
>>>> Are they zero-length?
>>>>
>>>> If not, please put them somewhere I can pull them to take a look.
>>>>
>>>> You pasted the snippet from regionserver where we were unable to load
>>>> the recovered edits.   Were there any exceptions in the master log
>>>> when it processing the crash of the regionserver?
>>>>
>>>>> I think here is what happens:
>>>>>
>>>>> 1. I am writing data to hbase, and it hits the regionserver
>>>>> 2. It runs out of memory and kills itself.
>>>>
>>>>
>>>> Server dumps heap when it crashes (by defaut).  If you put this
>>>> somewhere that I can pull, I'll take a look at it.
>>>>
>>>>
>>>>> 3. WAL HLOG produces recovered.edits file thats corrupt.
>>>>> 4. On region reassignment no single region server can load the
>>>>> affected region, and tries over and over again, causing undo load on
>>>>> HDFS (network traffic spikes between datanodes).
>>>>>
>>>>> Is there a patch maybe that can detect this condition and throw an
>>>>> exception that deletes the recovered.edits file, we can't load it
>>>>> anyway, and this stops hbase from being able to auto-recover after one
>>>>> or more region servers die.
>>>>
>>>> I could make you something that kept going through an EOFException but
>>>> it looks like there is a flag you could set.  See
>>>> http://hbase.apache.org/docs/r0.89.20100726/xref/org/apache/hadoop/hbase/regionserver/HRegion.html#1917
>>>>
>>>> If hbase.skip.errors = true, it should keep going rather than fail.
>>>>
>>>> St.Ack
>>>>
>>>> Now of course, one could say, just don't
>>>>> let region servers die, this even being careful is not enough here, we
>>>>> should expect RS to die cleanly.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> Thanks,
>>>>> -Jack
>>>>>
>>>>> PS. here is http://pastebin.com/81iV243r regionserver repeatedly
>>>>> unable to load a region.
>>>>>
>>>>> On Fri, Sep 24, 2010 at 4:59 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>> http://pastebin.com/bD3JJ0sD
>>>>>>
>>>>>> The logs were 17MB in size max, and variable sizes like that.
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>> On Fri, Sep 24, 2010 at 4:56 PM, Stack <st...@duboce.net> wrote:
>>>>>>> Please paste the section from regionserver where you were getting the
>>>>>>> EOF to pastebin.  I'd like to see exactly where (but yeah, you get the
>>>>>>> idea moving the files aside).  Check the files too.  Are they
>>>>>>> zero-length?  If so, please look for them in the master log and paste
>>>>>>> me the section where we are splitting.
>>>>>>>
>>>>>>> Thanks Jack,
>>>>>>> St.Ack
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Sep 24, 2010 at 4:52 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>> It was EOF exception, but now that I deleted edits files:
>>>>>>>>
>>>>>>>> Moved to trash:
>>>>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1062260343/recovered.edits/0000000000617305532
>>>>>>>> Moved to trash:
>>>>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1321772129/recovered.edits/0000000000617328530
>>>>>>>> Moved to trash:
>>>>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/257974055/recovered.edits/0000000000617238642
>>>>>>>> Moved to trash:
>>>>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/117679080/recovered.edits/0000000000617306059
>>>>>>>> Moved to trash:
>>>>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/221569766/recovered.edits/0000000000617242019
>>>>>>>>
>>>>>>>> Like these.  All of the regions have loaded... What could that have
>>>>>>>> been?   I assume I lost some writes, but this is not a big deal to
>>>>>>>> me... question is how to avoid something like that, is that a bug?
>>>>>>>>
>>>>>>>> -Jack
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Sep 24, 2010 at 4:44 PM, Stack <st...@duboce.net> wrote:
>>>>>>>>> What is the complaint in regionserver log when region load fails?
>>>>>>>>> St.Ack
>>>>>>>>>
>>>>>>>>> On Fri, Sep 24, 2010 at 4:40 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>> so, datanode log shows no errors whatsoever, however I do see same
>>>>>>>>>> blocks fetched repeatedly, and the network speed is quite high, but I
>>>>>>>>>> am unable to load _some_ regions, what could it be?
>>>>>>>>>>
>>>>>>>>>> 2010-09-24 16:38:42,729 INFO
>>>>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53038, bytes: 914, op: HDFS_READ,
>>>>>>>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>>>>> offset: 13803520, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>>>>> blockid: blk_5556468858269577961_1550101, duration: 127413
>>>>>>>>>> 2010-09-24 16:38:44,317 INFO
>>>>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53048, bytes: 110, op: HDFS_READ,
>>>>>>>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>>>>> offset: 32723968, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>>>>> blockid: blk_364673737339632029_1347910, duration: 1140653
>>>>>>>>>> 2010-09-24 16:38:44,318 INFO
>>>>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53049, bytes: 38294, op:
>>>>>>>>>> HDFS_READ, cliID:
>>>>>>>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>>>>> offset: 32686080, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>>>>> blockid: blk_364673737339632029_1347910, duration: 691929
>>>>>>>>>> 2010-09-24 16:38:44,510 INFO
>>>>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53054, bytes: 18021300, op:
>>>>>>>>>> HDFS_READ, cliID:
>>>>>>>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>>>>> offset: 0, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>>>>> blockid: blk_-3781179144642915580_1571141, duration: 173548261
>>>>>>>>>> 2010-09-24 16:38:44,525 INFO
>>>>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53055, bytes: 506, op: HDFS_READ,
>>>>>>>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>>>>> offset: 48700928, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>>>>> blockid: blk_-176750251227749356_1535293, duration: 77045
>>>>>>>>>> 2010-09-24 16:38:44,526 INFO
>>>>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53056, bytes: 6182, op:
>>>>>>>>>> HDFS_READ, cliID:
>>>>>>>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>>>>> offset: 48695296, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>>>>> blockid: blk_-176750251227749356_1535293, duration: 128270
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 24, 2010 at 4:32 PM, Stack <st...@duboce.net> wrote:
>>>>>>>>>>> (Good one Ryan)
>>>>>>>>>>>
>>>>>>>>>>> Master is doing the assigning.  It needs to be restarted to see the
>>>>>>>>>>> config change.
>>>>>>>>>>>
>>>>>>>>>>> St.Ack
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 24, 2010 at 4:28 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>> Only regionserver, do I need to restart both?
>>>>>>>>>>>>
>>>>>>>>>>>> -jack
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 24, 2010 at 4:22 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>>>>>>> Did you restart the master and the regionserver? Or just one or the other?
>>>>>>>>>>>>>
>>>>>>>>>>>>> -ryan
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Sep 24, 2010 at 4:21 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>>> Also, even with '1' value, I see:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2010-09-24 16:20:29,983 INFO
>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>>>> img834,1000351n.jpg,1285251664421.d09510a16c0cfd0d8a251a36229125e0.
>>>>>>>>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>>>> img651,pict1408.jpg,1285018965749.110871465
>>>>>>>>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>>>> img806,sam0084a.jpg,1285324613056.82a1e8ba8d2a37a591a847fb36803c45.
>>>>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>>>> img535,screenshot1bt.png,1285323376435.fae5f3ab474196c99f10b8a936fb9ead.
>>>>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>>>> img838,123468.jpg,1285223690165.a2903008621d1a6b6ca02441bf3b68ea.
>>>>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>>>> img839,yug.jpg,1285230318537.c09323dbaf54130671df2a14d671fe25.
>>>>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>>>> img821,vlcsnap78737.png,1285283076812.ea4973ce6e43d7f974613c5989647278.
>>>>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>>>> img805,njt30scbkdmb.gif,1285322429401.f9aacdafd8064bfbcc8cd4f6930febbe.
>>>>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>>>> img94,img1711m.jpg,1285016850260.1424182007
>>>>>>>>>>>>>> 2010-09-24 16:20:29,986 DEBUG
>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegion: Creating region
>>>>>>>>>>>>>> img840,kitbarca2.png,1285189312696.1ce170ec09384fca51297a5fe7aeb4af.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Which is pretty close to concurrent.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Sep 24, 2010 at 4:16 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>>>> Still having a problem:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 2010-09-24 16:15:02,572 ERROR
>>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
>>>>>>>>>>>>>>> img695,p1908101232.jpg,1285288492084.d451f05024b42f71a115951c62cdcccf.
>>>>>>>>>>>>>>> java.io.EOFException
>>>>>>>>>>>>>>>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>>>>>>>>>>>>>        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>>>>>>>>>>>>>>>        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>>>>>>>>>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
>>>>>>>>>>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I changed the value to '1', and restarted the regionserver... Note
>>>>>>>>>>>>>>> that my hdfs is not having a problem.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Sep 24, 2010 at 4:01 PM, Stack <st...@duboce.net> wrote:
>>>>>>>>>>>>>>>> Try
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  <property>
>>>>>>>>>>>>>>>>    <name>hbase.regions.percheckin</name>
>>>>>>>>>>>>>>>>    <value>10</value>
>>>>>>>>>>>>>>>>    <description>Maximum number of regions that can be assigned in a single go
>>>>>>>>>>>>>>>>    to a region server.
>>>>>>>>>>>>>>>>    </description>
>>>>>>>>>>>>>>>>  </property>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> What do you have now?  Whatever it is, go down from there.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Sep 24, 2010 at 3:07 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>>>>>> My regions are 1gb in size and when I cold start the cluster I oversaturate my network links (1000 mbps) and get client dfs timeouts , anyway to slow the m down?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Regions loading too fast

Posted by Jack Levin <ma...@gmail.com>.

Yay, my first bug.  I have not tried the flag yet (thanks), I will do
that shortly.   I was not able to find those files are created.  Its
likely the last addition not flushed properly as the server runs out
of memory, so may as well call it lost data (in our case this is fine,
we have source data permanently stored).

-Jack

On Mon, Sep 27, 2010 at 11:02 AM, Stack <st...@duboce.net> wrote:
> I took a look.  When we go to read the last edit in each file, we
> overshoot.  Its as though the last addition is not properly flushed.
> I looked at the writing of the recovered.edits file contents and we're
> using straight sequencefile#close so flush should be going on. You
> pasted again from regionserver log.  Could you look for these files in
> master log, where they were created, and see if any exceptions around
> their creation? (I've been playing over here w/ crashing servers and I
> don't have the issues you are running into over here on replay of
> recovered edits -- I'll play some more....)
>
> Did you try setting the flag suggested in previous email?
>
> Thanks Jack,
> St.Ack
>
>
> On Sun, Sep 26, 2010 at 5:36 PM, Jack Levin <ma...@gmail.com> wrote:
>> They were about 64MB, I will put them somewhere... like here:
>> http://img2.imageshack.us/0000000000618601094
>> http://img2.imageshack.us/0000000000618601136
>>
>> They are not gzippable, sorry... full of jpeg data I think.
>> Here is an error snipper from master: http://pastebin.com/TdbYbDyy
>>
>> -Jack
>>
>> On Sun, Sep 26, 2010 at 4:04 PM, Stack <st...@duboce.net> wrote:
>>> On Sun, Sep 26, 2010 at 1:53 PM, Jack Levin <ma...@gmail.com> wrote:
>>>> I had the same issue this morning, some of the regions
>>>> 'recovered.edits' was corrupt and no single region server was able to
>>>> load them.  I saved them if someone is interested to see why they can
>>>> not be processed.
>>>
>>>
>>> Are they zero-length?
>>>
>>> If not, please put them somewhere I can pull them to take a look.
>>>
>>> You pasted the snippet from regionserver where we were unable to load
>>> the recovered edits.   Were there any exceptions in the master log
>>> when it processing the crash of the regionserver?
>>>
>>>> I think here is what happens:
>>>>
>>>> 1. I am writing data to hbase, and it hits the regionserver
>>>> 2. It runs out of memory and kills itself.
>>>
>>>
>>> Server dumps heap when it crashes (by defaut).  If you put this
>>> somewhere that I can pull, I'll take a look at it.
>>>
>>>
>>>> 3. WAL HLOG produces recovered.edits file thats corrupt.
>>>> 4. On region reassignment no single region server can load the
>>>> affected region, and tries over and over again, causing undo load on
>>>> HDFS (network traffic spikes between datanodes).
>>>>
>>>> Is there a patch maybe that can detect this condition and throw an
>>>> exception that deletes the recovered.edits file, we can't load it
>>>> anyway, and this stops hbase from being able to auto-recover after one
>>>> or more region servers die.
>>>
>>> I could make you something that kept going through an EOFException but
>>> it looks like there is a flag you could set.  See
>>> http://hbase.apache.org/docs/r0.89.20100726/xref/org/apache/hadoop/hbase/regionserver/HRegion.html#1917
>>>
>>> If hbase.skip.errors = true, it should keep going rather than fail.
>>>
>>> St.Ack
>>>
>>> Now of course, one could say, just don't
>>>> let region servers die, this even being careful is not enough here, we
>>>> should expect RS to die cleanly.
>>>>
>>>> Thoughts?
>>>>
>>>> Thanks,
>>>> -Jack
>>>>
>>>> PS. here is http://pastebin.com/81iV243r regionserver repeatedly
>>>> unable to load a region.
>>>>
>>>> On Fri, Sep 24, 2010 at 4:59 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>> http://pastebin.com/bD3JJ0sD
>>>>>
>>>>> The logs were 17MB in size max, and variable sizes like that.
>>>>>
>>>>> -Jack
>>>>>
>>>>> On Fri, Sep 24, 2010 at 4:56 PM, Stack <st...@duboce.net> wrote:
>>>>>> Please paste the section from regionserver where you were getting the
>>>>>> EOF to pastebin.  I'd like to see exactly where (but yeah, you get the
>>>>>> idea moving the files aside).  Check the files too.  Are they
>>>>>> zero-length?  If so, please look for them in the master log and paste
>>>>>> me the section where we are splitting.
>>>>>>
>>>>>> Thanks Jack,
>>>>>> St.Ack
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 24, 2010 at 4:52 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>> It was EOF exception, but now that I deleted edits files:
>>>>>>>
>>>>>>> Moved to trash:
>>>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1062260343/recovered.edits/0000000000617305532
>>>>>>> Moved to trash:
>>>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1321772129/recovered.edits/0000000000617328530
>>>>>>> Moved to trash:
>>>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/257974055/recovered.edits/0000000000617238642
>>>>>>> Moved to trash:
>>>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/117679080/recovered.edits/0000000000617306059
>>>>>>> Moved to trash:
>>>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/221569766/recovered.edits/0000000000617242019
>>>>>>>
>>>>>>> Like these.  All of the regions have loaded... What could that have
>>>>>>> been?   I assume I lost some writes, but this is not a big deal to
>>>>>>> me... question is how to avoid something like that, is that a bug?
>>>>>>>
>>>>>>> -Jack
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Sep 24, 2010 at 4:44 PM, Stack <st...@duboce.net> wrote:
>>>>>>>> What is the complaint in regionserver log when region load fails?
>>>>>>>> St.Ack
>>>>>>>>
>>>>>>>> On Fri, Sep 24, 2010 at 4:40 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>> so, datanode log shows no errors whatsoever, however I do see same
>>>>>>>>> blocks fetched repeatedly, and the network speed is quite high, but I
>>>>>>>>> am unable to load _some_ regions, what could it be?
>>>>>>>>>
>>>>>>>>> 2010-09-24 16:38:42,729 INFO
>>>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53038, bytes: 914, op: HDFS_READ,
>>>>>>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>>>> offset: 13803520, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>>>> blockid: blk_5556468858269577961_1550101, duration: 127413
>>>>>>>>> 2010-09-24 16:38:44,317 INFO
>>>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53048, bytes: 110, op: HDFS_READ,
>>>>>>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>>>> offset: 32723968, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>>>> blockid: blk_364673737339632029_1347910, duration: 1140653
>>>>>>>>> 2010-09-24 16:38:44,318 INFO
>>>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53049, bytes: 38294, op:
>>>>>>>>> HDFS_READ, cliID:
>>>>>>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>>>> offset: 32686080, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>>>> blockid: blk_364673737339632029_1347910, duration: 691929
>>>>>>>>> 2010-09-24 16:38:44,510 INFO
>>>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53054, bytes: 18021300, op:
>>>>>>>>> HDFS_READ, cliID:
>>>>>>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>>>> offset: 0, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>>>> blockid: blk_-3781179144642915580_1571141, duration: 173548261
>>>>>>>>> 2010-09-24 16:38:44,525 INFO
>>>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53055, bytes: 506, op: HDFS_READ,
>>>>>>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>>>> offset: 48700928, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>>>> blockid: blk_-176750251227749356_1535293, duration: 77045
>>>>>>>>> 2010-09-24 16:38:44,526 INFO
>>>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53056, bytes: 6182, op:
>>>>>>>>> HDFS_READ, cliID:
>>>>>>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>>>> offset: 48695296, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>>>> blockid: blk_-176750251227749356_1535293, duration: 128270
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Sep 24, 2010 at 4:32 PM, Stack <st...@duboce.net> wrote:
>>>>>>>>>> (Good one Ryan)
>>>>>>>>>>
>>>>>>>>>> Master is doing the assigning.  It needs to be restarted to see the
>>>>>>>>>> config change.
>>>>>>>>>>
>>>>>>>>>> St.Ack
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 24, 2010 at 4:28 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>> Only regionserver, do I need to restart both?
>>>>>>>>>>>
>>>>>>>>>>> -jack
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 24, 2010 at 4:22 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>>>>>> Did you restart the master and the regionserver? Or just one or the other?
>>>>>>>>>>>>
>>>>>>>>>>>> -ryan
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 24, 2010 at 4:21 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>> Also, even with '1' value, I see:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2010-09-24 16:20:29,983 INFO
>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>>> img834,1000351n.jpg,1285251664421.d09510a16c0cfd0d8a251a36229125e0.
>>>>>>>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>>> img651,pict1408.jpg,1285018965749.110871465
>>>>>>>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>>> img806,sam0084a.jpg,1285324613056.82a1e8ba8d2a37a591a847fb36803c45.
>>>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>>> img535,screenshot1bt.png,1285323376435.fae5f3ab474196c99f10b8a936fb9ead.
>>>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>>> img838,123468.jpg,1285223690165.a2903008621d1a6b6ca02441bf3b68ea.
>>>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>>> img839,yug.jpg,1285230318537.c09323dbaf54130671df2a14d671fe25.
>>>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>>> img821,vlcsnap78737.png,1285283076812.ea4973ce6e43d7f974613c5989647278.
>>>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>>> img805,njt30scbkdmb.gif,1285322429401.f9aacdafd8064bfbcc8cd4f6930febbe.
>>>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>>> img94,img1711m.jpg,1285016850260.1424182007
>>>>>>>>>>>>> 2010-09-24 16:20:29,986 DEBUG
>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegion: Creating region
>>>>>>>>>>>>> img840,kitbarca2.png,1285189312696.1ce170ec09384fca51297a5fe7aeb4af.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Which is pretty close to concurrent.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Sep 24, 2010 at 4:16 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>>> Still having a problem:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2010-09-24 16:15:02,572 ERROR
>>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
>>>>>>>>>>>>>> img695,p1908101232.jpg,1285288492084.d451f05024b42f71a115951c62cdcccf.
>>>>>>>>>>>>>> java.io.EOFException
>>>>>>>>>>>>>>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>>>>>>>>>>>>        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>>>>>>>>>>>>>>        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>>>>>>>>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
>>>>>>>>>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I changed the value to '1', and restarted the regionserver... Note
>>>>>>>>>>>>>> that my hdfs is not having a problem.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Sep 24, 2010 at 4:01 PM, Stack <st...@duboce.net> wrote:
>>>>>>>>>>>>>>> Try
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  <property>
>>>>>>>>>>>>>>>    <name>hbase.regions.percheckin</name>
>>>>>>>>>>>>>>>    <value>10</value>
>>>>>>>>>>>>>>>    <description>Maximum number of regions that can be assigned in a single go
>>>>>>>>>>>>>>>    to a region server.
>>>>>>>>>>>>>>>    </description>
>>>>>>>>>>>>>>>  </property>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What do you have now?  Whatever it is, go down from there.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Sep 24, 2010 at 3:07 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>>>>> My regions are 1gb in size and when I cold start the cluster I oversaturate my network links (1000 mbps) and get client dfs timeouts , anyway to slow the m down?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Regions loading too fast

Posted by Stack <st...@duboce.net>.

I took a look.  When we go to read the last edit in each file, we
overshoot.  Its as though the last addition is not properly flushed.
I looked at the writing of the recovered.edits file contents and we're
using straight sequencefile#close so flush should be going on. You
pasted again from regionserver log.  Could you look for these files in
master log, where they were created, and see if any exceptions around
their creation? (I've been playing over here w/ crashing servers and I
don't have the issues you are running into over here on replay of
recovered edits -- I'll play some more....)

Did you try setting the flag suggested in previous email?

Thanks Jack,
St.Ack


On Sun, Sep 26, 2010 at 5:36 PM, Jack Levin <ma...@gmail.com> wrote:
> They were about 64MB, I will put them somewhere... like here:
> http://img2.imageshack.us/0000000000618601094
> http://img2.imageshack.us/0000000000618601136
>
> They are not gzippable, sorry... full of jpeg data I think.
> Here is an error snipper from master: http://pastebin.com/TdbYbDyy
>
> -Jack
>
> On Sun, Sep 26, 2010 at 4:04 PM, Stack <st...@duboce.net> wrote:
>> On Sun, Sep 26, 2010 at 1:53 PM, Jack Levin <ma...@gmail.com> wrote:
>>> I had the same issue this morning, some of the regions
>>> 'recovered.edits' was corrupt and no single region server was able to
>>> load them.  I saved them if someone is interested to see why they can
>>> not be processed.
>>
>>
>> Are they zero-length?
>>
>> If not, please put them somewhere I can pull them to take a look.
>>
>> You pasted the snippet from regionserver where we were unable to load
>> the recovered edits.   Were there any exceptions in the master log
>> when it processing the crash of the regionserver?
>>
>>> I think here is what happens:
>>>
>>> 1. I am writing data to hbase, and it hits the regionserver
>>> 2. It runs out of memory and kills itself.
>>
>>
>> Server dumps heap when it crashes (by defaut).  If you put this
>> somewhere that I can pull, I'll take a look at it.
>>
>>
>>> 3. WAL HLOG produces recovered.edits file thats corrupt.
>>> 4. On region reassignment no single region server can load the
>>> affected region, and tries over and over again, causing undo load on
>>> HDFS (network traffic spikes between datanodes).
>>>
>>> Is there a patch maybe that can detect this condition and throw an
>>> exception that deletes the recovered.edits file, we can't load it
>>> anyway, and this stops hbase from being able to auto-recover after one
>>> or more region servers die.
>>
>> I could make you something that kept going through an EOFException but
>> it looks like there is a flag you could set.  See
>> http://hbase.apache.org/docs/r0.89.20100726/xref/org/apache/hadoop/hbase/regionserver/HRegion.html#1917
>>
>> If hbase.skip.errors = true, it should keep going rather than fail.
>>
>> St.Ack
>>
>> Now of course, one could say, just don't
>>> let region servers die, this even being careful is not enough here, we
>>> should expect RS to die cleanly.
>>>
>>> Thoughts?
>>>
>>> Thanks,
>>> -Jack
>>>
>>> PS. here is http://pastebin.com/81iV243r regionserver repeatedly
>>> unable to load a region.
>>>
>>> On Fri, Sep 24, 2010 at 4:59 PM, Jack Levin <ma...@gmail.com> wrote:
>>>> http://pastebin.com/bD3JJ0sD
>>>>
>>>> The logs were 17MB in size max, and variable sizes like that.
>>>>
>>>> -Jack
>>>>
>>>> On Fri, Sep 24, 2010 at 4:56 PM, Stack <st...@duboce.net> wrote:
>>>>> Please paste the section from regionserver where you were getting the
>>>>> EOF to pastebin.  I'd like to see exactly where (but yeah, you get the
>>>>> idea moving the files aside).  Check the files too.  Are they
>>>>> zero-length?  If so, please look for them in the master log and paste
>>>>> me the section where we are splitting.
>>>>>
>>>>> Thanks Jack,
>>>>> St.Ack
>>>>>
>>>>>
>>>>> On Fri, Sep 24, 2010 at 4:52 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>> It was EOF exception, but now that I deleted edits files:
>>>>>>
>>>>>> Moved to trash:
>>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1062260343/recovered.edits/0000000000617305532
>>>>>> Moved to trash:
>>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1321772129/recovered.edits/0000000000617328530
>>>>>> Moved to trash:
>>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/257974055/recovered.edits/0000000000617238642
>>>>>> Moved to trash:
>>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/117679080/recovered.edits/0000000000617306059
>>>>>> Moved to trash:
>>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/221569766/recovered.edits/0000000000617242019
>>>>>>
>>>>>> Like these.  All of the regions have loaded... What could that have
>>>>>> been?   I assume I lost some writes, but this is not a big deal to
>>>>>> me... question is how to avoid something like that, is that a bug?
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 24, 2010 at 4:44 PM, Stack <st...@duboce.net> wrote:
>>>>>>> What is the complaint in regionserver log when region load fails?
>>>>>>> St.Ack
>>>>>>>
>>>>>>> On Fri, Sep 24, 2010 at 4:40 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>> so, datanode log shows no errors whatsoever, however I do see same
>>>>>>>> blocks fetched repeatedly, and the network speed is quite high, but I
>>>>>>>> am unable to load _some_ regions, what could it be?
>>>>>>>>
>>>>>>>> 2010-09-24 16:38:42,729 INFO
>>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53038, bytes: 914, op: HDFS_READ,
>>>>>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>>> offset: 13803520, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>>> blockid: blk_5556468858269577961_1550101, duration: 127413
>>>>>>>> 2010-09-24 16:38:44,317 INFO
>>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53048, bytes: 110, op: HDFS_READ,
>>>>>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>>> offset: 32723968, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>>> blockid: blk_364673737339632029_1347910, duration: 1140653
>>>>>>>> 2010-09-24 16:38:44,318 INFO
>>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53049, bytes: 38294, op:
>>>>>>>> HDFS_READ, cliID:
>>>>>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>>> offset: 32686080, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>>> blockid: blk_364673737339632029_1347910, duration: 691929
>>>>>>>> 2010-09-24 16:38:44,510 INFO
>>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53054, bytes: 18021300, op:
>>>>>>>> HDFS_READ, cliID:
>>>>>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>>> offset: 0, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>>> blockid: blk_-3781179144642915580_1571141, duration: 173548261
>>>>>>>> 2010-09-24 16:38:44,525 INFO
>>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53055, bytes: 506, op: HDFS_READ,
>>>>>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>>> offset: 48700928, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>>> blockid: blk_-176750251227749356_1535293, duration: 77045
>>>>>>>> 2010-09-24 16:38:44,526 INFO
>>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53056, bytes: 6182, op:
>>>>>>>> HDFS_READ, cliID:
>>>>>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>>> offset: 48695296, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>>> blockid: blk_-176750251227749356_1535293, duration: 128270
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Sep 24, 2010 at 4:32 PM, Stack <st...@duboce.net> wrote:
>>>>>>>>> (Good one Ryan)
>>>>>>>>>
>>>>>>>>> Master is doing the assigning.  It needs to be restarted to see the
>>>>>>>>> config change.
>>>>>>>>>
>>>>>>>>> St.Ack
>>>>>>>>>
>>>>>>>>> On Fri, Sep 24, 2010 at 4:28 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>> Only regionserver, do I need to restart both?
>>>>>>>>>>
>>>>>>>>>> -jack
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 24, 2010 at 4:22 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>>>>> Did you restart the master and the regionserver? Or just one or the other?
>>>>>>>>>>>
>>>>>>>>>>> -ryan
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 24, 2010 at 4:21 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>> Also, even with '1' value, I see:
>>>>>>>>>>>>
>>>>>>>>>>>> 2010-09-24 16:20:29,983 INFO
>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>> img834,1000351n.jpg,1285251664421.d09510a16c0cfd0d8a251a36229125e0.
>>>>>>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>> img651,pict1408.jpg,1285018965749.110871465
>>>>>>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>> img806,sam0084a.jpg,1285324613056.82a1e8ba8d2a37a591a847fb36803c45.
>>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>> img535,screenshot1bt.png,1285323376435.fae5f3ab474196c99f10b8a936fb9ead.
>>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>> img838,123468.jpg,1285223690165.a2903008621d1a6b6ca02441bf3b68ea.
>>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>> img839,yug.jpg,1285230318537.c09323dbaf54130671df2a14d671fe25.
>>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>> img821,vlcsnap78737.png,1285283076812.ea4973ce6e43d7f974613c5989647278.
>>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>> img805,njt30scbkdmb.gif,1285322429401.f9aacdafd8064bfbcc8cd4f6930febbe.
>>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>>> img94,img1711m.jpg,1285016850260.1424182007
>>>>>>>>>>>> 2010-09-24 16:20:29,986 DEBUG
>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegion: Creating region
>>>>>>>>>>>> img840,kitbarca2.png,1285189312696.1ce170ec09384fca51297a5fe7aeb4af.
>>>>>>>>>>>>
>>>>>>>>>>>> Which is pretty close to concurrent.
>>>>>>>>>>>>
>>>>>>>>>>>> -Jack
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 24, 2010 at 4:16 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>> Still having a problem:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2010-09-24 16:15:02,572 ERROR
>>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
>>>>>>>>>>>>> img695,p1908101232.jpg,1285288492084.d451f05024b42f71a115951c62cdcccf.
>>>>>>>>>>>>> java.io.EOFException
>>>>>>>>>>>>>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>>>>>>>>>>>        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>>>>>>>>>>>>>        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>>>>>>>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
>>>>>>>>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I changed the value to '1', and restarted the regionserver... Note
>>>>>>>>>>>>> that my hdfs is not having a problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Sep 24, 2010 at 4:01 PM, Stack <st...@duboce.net> wrote:
>>>>>>>>>>>>>> Try
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  <property>
>>>>>>>>>>>>>>    <name>hbase.regions.percheckin</name>
>>>>>>>>>>>>>>    <value>10</value>
>>>>>>>>>>>>>>    <description>Maximum number of regions that can be assigned in a single go
>>>>>>>>>>>>>>    to a region server.
>>>>>>>>>>>>>>    </description>
>>>>>>>>>>>>>>  </property>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What do you have now?  Whatever it is, go down from there.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Sep 24, 2010 at 3:07 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>>>> My regions are 1gb in size and when I cold start the cluster I oversaturate my network links (1000 mbps) and get client dfs timeouts , anyway to slow the m down?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Regions loading too fast

Posted by Jack Levin <ma...@gmail.com>.

They were about 64MB, I will put them somewhere... like here:
http://img2.imageshack.us/0000000000618601094
http://img2.imageshack.us/0000000000618601136

They are not gzippable, sorry... full of jpeg data I think.
Here is an error snipper from master: http://pastebin.com/TdbYbDyy

-Jack

On Sun, Sep 26, 2010 at 4:04 PM, Stack <st...@duboce.net> wrote:
> On Sun, Sep 26, 2010 at 1:53 PM, Jack Levin <ma...@gmail.com> wrote:
>> I had the same issue this morning, some of the regions
>> 'recovered.edits' was corrupt and no single region server was able to
>> load them.  I saved them if someone is interested to see why they can
>> not be processed.
>
>
> Are they zero-length?
>
> If not, please put them somewhere I can pull them to take a look.
>
> You pasted the snippet from regionserver where we were unable to load
> the recovered edits.   Were there any exceptions in the master log
> when it processing the crash of the regionserver?
>
>> I think here is what happens:
>>
>> 1. I am writing data to hbase, and it hits the regionserver
>> 2. It runs out of memory and kills itself.
>
>
> Server dumps heap when it crashes (by defaut).  If you put this
> somewhere that I can pull, I'll take a look at it.
>
>
>> 3. WAL HLOG produces recovered.edits file thats corrupt.
>> 4. On region reassignment no single region server can load the
>> affected region, and tries over and over again, causing undo load on
>> HDFS (network traffic spikes between datanodes).
>>
>> Is there a patch maybe that can detect this condition and throw an
>> exception that deletes the recovered.edits file, we can't load it
>> anyway, and this stops hbase from being able to auto-recover after one
>> or more region servers die.
>
> I could make you something that kept going through an EOFException but
> it looks like there is a flag you could set.  See
> http://hbase.apache.org/docs/r0.89.20100726/xref/org/apache/hadoop/hbase/regionserver/HRegion.html#1917
>
> If hbase.skip.errors = true, it should keep going rather than fail.
>
> St.Ack
>
> Now of course, one could say, just don't
>> let region servers die, this even being careful is not enough here, we
>> should expect RS to die cleanly.
>>
>> Thoughts?
>>
>> Thanks,
>> -Jack
>>
>> PS. here is http://pastebin.com/81iV243r regionserver repeatedly
>> unable to load a region.
>>
>> On Fri, Sep 24, 2010 at 4:59 PM, Jack Levin <ma...@gmail.com> wrote:
>>> http://pastebin.com/bD3JJ0sD
>>>
>>> The logs were 17MB in size max, and variable sizes like that.
>>>
>>> -Jack
>>>
>>> On Fri, Sep 24, 2010 at 4:56 PM, Stack <st...@duboce.net> wrote:
>>>> Please paste the section from regionserver where you were getting the
>>>> EOF to pastebin.  I'd like to see exactly where (but yeah, you get the
>>>> idea moving the files aside).  Check the files too.  Are they
>>>> zero-length?  If so, please look for them in the master log and paste
>>>> me the section where we are splitting.
>>>>
>>>> Thanks Jack,
>>>> St.Ack
>>>>
>>>>
>>>> On Fri, Sep 24, 2010 at 4:52 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>> It was EOF exception, but now that I deleted edits files:
>>>>>
>>>>> Moved to trash:
>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1062260343/recovered.edits/0000000000617305532
>>>>> Moved to trash:
>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1321772129/recovered.edits/0000000000617328530
>>>>> Moved to trash:
>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/257974055/recovered.edits/0000000000617238642
>>>>> Moved to trash:
>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/117679080/recovered.edits/0000000000617306059
>>>>> Moved to trash:
>>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/221569766/recovered.edits/0000000000617242019
>>>>>
>>>>> Like these.  All of the regions have loaded... What could that have
>>>>> been?   I assume I lost some writes, but this is not a big deal to
>>>>> me... question is how to avoid something like that, is that a bug?
>>>>>
>>>>> -Jack
>>>>>
>>>>>
>>>>> On Fri, Sep 24, 2010 at 4:44 PM, Stack <st...@duboce.net> wrote:
>>>>>> What is the complaint in regionserver log when region load fails?
>>>>>> St.Ack
>>>>>>
>>>>>> On Fri, Sep 24, 2010 at 4:40 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>> so, datanode log shows no errors whatsoever, however I do see same
>>>>>>> blocks fetched repeatedly, and the network speed is quite high, but I
>>>>>>> am unable to load _some_ regions, what could it be?
>>>>>>>
>>>>>>> 2010-09-24 16:38:42,729 INFO
>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53038, bytes: 914, op: HDFS_READ,
>>>>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>> offset: 13803520, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>> blockid: blk_5556468858269577961_1550101, duration: 127413
>>>>>>> 2010-09-24 16:38:44,317 INFO
>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53048, bytes: 110, op: HDFS_READ,
>>>>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>> offset: 32723968, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>> blockid: blk_364673737339632029_1347910, duration: 1140653
>>>>>>> 2010-09-24 16:38:44,318 INFO
>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53049, bytes: 38294, op:
>>>>>>> HDFS_READ, cliID:
>>>>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>> offset: 32686080, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>> blockid: blk_364673737339632029_1347910, duration: 691929
>>>>>>> 2010-09-24 16:38:44,510 INFO
>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53054, bytes: 18021300, op:
>>>>>>> HDFS_READ, cliID:
>>>>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>> offset: 0, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>> blockid: blk_-3781179144642915580_1571141, duration: 173548261
>>>>>>> 2010-09-24 16:38:44,525 INFO
>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53055, bytes: 506, op: HDFS_READ,
>>>>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>> offset: 48700928, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>> blockid: blk_-176750251227749356_1535293, duration: 77045
>>>>>>> 2010-09-24 16:38:44,526 INFO
>>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53056, bytes: 6182, op:
>>>>>>> HDFS_READ, cliID:
>>>>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>>> offset: 48695296, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>>> blockid: blk_-176750251227749356_1535293, duration: 128270
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Sep 24, 2010 at 4:32 PM, Stack <st...@duboce.net> wrote:
>>>>>>>> (Good one Ryan)
>>>>>>>>
>>>>>>>> Master is doing the assigning.  It needs to be restarted to see the
>>>>>>>> config change.
>>>>>>>>
>>>>>>>> St.Ack
>>>>>>>>
>>>>>>>> On Fri, Sep 24, 2010 at 4:28 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>> Only regionserver, do I need to restart both?
>>>>>>>>>
>>>>>>>>> -jack
>>>>>>>>>
>>>>>>>>> On Fri, Sep 24, 2010 at 4:22 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>>>> Did you restart the master and the regionserver? Or just one or the other?
>>>>>>>>>>
>>>>>>>>>> -ryan
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 24, 2010 at 4:21 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>> Also, even with '1' value, I see:
>>>>>>>>>>>
>>>>>>>>>>> 2010-09-24 16:20:29,983 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img834,1000351n.jpg,1285251664421.d09510a16c0cfd0d8a251a36229125e0.
>>>>>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img651,pict1408.jpg,1285018965749.110871465
>>>>>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img806,sam0084a.jpg,1285324613056.82a1e8ba8d2a37a591a847fb36803c45.
>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img535,screenshot1bt.png,1285323376435.fae5f3ab474196c99f10b8a936fb9ead.
>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img838,123468.jpg,1285223690165.a2903008621d1a6b6ca02441bf3b68ea.
>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img839,yug.jpg,1285230318537.c09323dbaf54130671df2a14d671fe25.
>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img821,vlcsnap78737.png,1285283076812.ea4973ce6e43d7f974613c5989647278.
>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img805,njt30scbkdmb.gif,1285322429401.f9aacdafd8064bfbcc8cd4f6930febbe.
>>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>>> img94,img1711m.jpg,1285016850260.1424182007
>>>>>>>>>>> 2010-09-24 16:20:29,986 DEBUG
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegion: Creating region
>>>>>>>>>>> img840,kitbarca2.png,1285189312696.1ce170ec09384fca51297a5fe7aeb4af.
>>>>>>>>>>>
>>>>>>>>>>> Which is pretty close to concurrent.
>>>>>>>>>>>
>>>>>>>>>>> -Jack
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 24, 2010 at 4:16 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>> Still having a problem:
>>>>>>>>>>>>
>>>>>>>>>>>> 2010-09-24 16:15:02,572 ERROR
>>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
>>>>>>>>>>>> img695,p1908101232.jpg,1285288492084.d451f05024b42f71a115951c62cdcccf.
>>>>>>>>>>>> java.io.EOFException
>>>>>>>>>>>>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>>>>>>>>>>        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>>>>>>>>>>>>        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>>>>>>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
>>>>>>>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I changed the value to '1', and restarted the regionserver... Note
>>>>>>>>>>>> that my hdfs is not having a problem.
>>>>>>>>>>>>
>>>>>>>>>>>> -Jack
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 24, 2010 at 4:01 PM, Stack <st...@duboce.net> wrote:
>>>>>>>>>>>>> Try
>>>>>>>>>>>>>
>>>>>>>>>>>>>  <property>
>>>>>>>>>>>>>    <name>hbase.regions.percheckin</name>
>>>>>>>>>>>>>    <value>10</value>
>>>>>>>>>>>>>    <description>Maximum number of regions that can be assigned in a single go
>>>>>>>>>>>>>    to a region server.
>>>>>>>>>>>>>    </description>
>>>>>>>>>>>>>  </property>
>>>>>>>>>>>>>
>>>>>>>>>>>>> What do you have now?  Whatever it is, go down from there.
>>>>>>>>>>>>>
>>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Sep 24, 2010 at 3:07 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>>> My regions are 1gb in size and when I cold start the cluster I oversaturate my network links (1000 mbps) and get client dfs timeouts , anyway to slow the m down?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Regions loading too fast

Posted by Stack <st...@duboce.net>.

On Sun, Sep 26, 2010 at 1:53 PM, Jack Levin <ma...@gmail.com> wrote:
> I had the same issue this morning, some of the regions
> 'recovered.edits' was corrupt and no single region server was able to
> load them.  I saved them if someone is interested to see why they can
> not be processed.


Are they zero-length?

If not, please put them somewhere I can pull them to take a look.

You pasted the snippet from regionserver where we were unable to load
the recovered edits.   Were there any exceptions in the master log
when it processing the crash of the regionserver?

> I think here is what happens:
>
> 1. I am writing data to hbase, and it hits the regionserver
> 2. It runs out of memory and kills itself.


Server dumps heap when it crashes (by defaut).  If you put this
somewhere that I can pull, I'll take a look at it.


> 3. WAL HLOG produces recovered.edits file thats corrupt.
> 4. On region reassignment no single region server can load the
> affected region, and tries over and over again, causing undo load on
> HDFS (network traffic spikes between datanodes).
>
> Is there a patch maybe that can detect this condition and throw an
> exception that deletes the recovered.edits file, we can't load it
> anyway, and this stops hbase from being able to auto-recover after one
> or more region servers die.

I could make you something that kept going through an EOFException but
it looks like there is a flag you could set.  See
http://hbase.apache.org/docs/r0.89.20100726/xref/org/apache/hadoop/hbase/regionserver/HRegion.html#1917

If hbase.skip.errors = true, it should keep going rather than fail.

St.Ack

Now of course, one could say, just don't
> let region servers die, this even being careful is not enough here, we
> should expect RS to die cleanly.
>
> Thoughts?
>
> Thanks,
> -Jack
>
> PS. here is http://pastebin.com/81iV243r regionserver repeatedly
> unable to load a region.
>
> On Fri, Sep 24, 2010 at 4:59 PM, Jack Levin <ma...@gmail.com> wrote:
>> http://pastebin.com/bD3JJ0sD
>>
>> The logs were 17MB in size max, and variable sizes like that.
>>
>> -Jack
>>
>> On Fri, Sep 24, 2010 at 4:56 PM, Stack <st...@duboce.net> wrote:
>>> Please paste the section from regionserver where you were getting the
>>> EOF to pastebin.  I'd like to see exactly where (but yeah, you get the
>>> idea moving the files aside).  Check the files too.  Are they
>>> zero-length?  If so, please look for them in the master log and paste
>>> me the section where we are splitting.
>>>
>>> Thanks Jack,
>>> St.Ack
>>>
>>>
>>> On Fri, Sep 24, 2010 at 4:52 PM, Jack Levin <ma...@gmail.com> wrote:
>>>> It was EOF exception, but now that I deleted edits files:
>>>>
>>>> Moved to trash:
>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1062260343/recovered.edits/0000000000617305532
>>>> Moved to trash:
>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1321772129/recovered.edits/0000000000617328530
>>>> Moved to trash:
>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/257974055/recovered.edits/0000000000617238642
>>>> Moved to trash:
>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/117679080/recovered.edits/0000000000617306059
>>>> Moved to trash:
>>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/221569766/recovered.edits/0000000000617242019
>>>>
>>>> Like these.  All of the regions have loaded... What could that have
>>>> been?   I assume I lost some writes, but this is not a big deal to
>>>> me... question is how to avoid something like that, is that a bug?
>>>>
>>>> -Jack
>>>>
>>>>
>>>> On Fri, Sep 24, 2010 at 4:44 PM, Stack <st...@duboce.net> wrote:
>>>>> What is the complaint in regionserver log when region load fails?
>>>>> St.Ack
>>>>>
>>>>> On Fri, Sep 24, 2010 at 4:40 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>> so, datanode log shows no errors whatsoever, however I do see same
>>>>>> blocks fetched repeatedly, and the network speed is quite high, but I
>>>>>> am unable to load _some_ regions, what could it be?
>>>>>>
>>>>>> 2010-09-24 16:38:42,729 INFO
>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53038, bytes: 914, op: HDFS_READ,
>>>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>> offset: 13803520, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>> blockid: blk_5556468858269577961_1550101, duration: 127413
>>>>>> 2010-09-24 16:38:44,317 INFO
>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53048, bytes: 110, op: HDFS_READ,
>>>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>> offset: 32723968, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>> blockid: blk_364673737339632029_1347910, duration: 1140653
>>>>>> 2010-09-24 16:38:44,318 INFO
>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53049, bytes: 38294, op:
>>>>>> HDFS_READ, cliID:
>>>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>> offset: 32686080, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>> blockid: blk_364673737339632029_1347910, duration: 691929
>>>>>> 2010-09-24 16:38:44,510 INFO
>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53054, bytes: 18021300, op:
>>>>>> HDFS_READ, cliID:
>>>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>> offset: 0, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>> blockid: blk_-3781179144642915580_1571141, duration: 173548261
>>>>>> 2010-09-24 16:38:44,525 INFO
>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53055, bytes: 506, op: HDFS_READ,
>>>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>> offset: 48700928, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>> blockid: blk_-176750251227749356_1535293, duration: 77045
>>>>>> 2010-09-24 16:38:44,526 INFO
>>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53056, bytes: 6182, op:
>>>>>> HDFS_READ, cliID:
>>>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>>> offset: 48695296, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>>> blockid: blk_-176750251227749356_1535293, duration: 128270
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 24, 2010 at 4:32 PM, Stack <st...@duboce.net> wrote:
>>>>>>> (Good one Ryan)
>>>>>>>
>>>>>>> Master is doing the assigning.  It needs to be restarted to see the
>>>>>>> config change.
>>>>>>>
>>>>>>> St.Ack
>>>>>>>
>>>>>>> On Fri, Sep 24, 2010 at 4:28 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>> Only regionserver, do I need to restart both?
>>>>>>>>
>>>>>>>> -jack
>>>>>>>>
>>>>>>>> On Fri, Sep 24, 2010 at 4:22 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>>> Did you restart the master and the regionserver? Or just one or the other?
>>>>>>>>>
>>>>>>>>> -ryan
>>>>>>>>>
>>>>>>>>> On Fri, Sep 24, 2010 at 4:21 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>> Also, even with '1' value, I see:
>>>>>>>>>>
>>>>>>>>>> 2010-09-24 16:20:29,983 INFO
>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>> img834,1000351n.jpg,1285251664421.d09510a16c0cfd0d8a251a36229125e0.
>>>>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>> img651,pict1408.jpg,1285018965749.110871465
>>>>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>> img806,sam0084a.jpg,1285324613056.82a1e8ba8d2a37a591a847fb36803c45.
>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>> img535,screenshot1bt.png,1285323376435.fae5f3ab474196c99f10b8a936fb9ead.
>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>> img838,123468.jpg,1285223690165.a2903008621d1a6b6ca02441bf3b68ea.
>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>> img839,yug.jpg,1285230318537.c09323dbaf54130671df2a14d671fe25.
>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>> img821,vlcsnap78737.png,1285283076812.ea4973ce6e43d7f974613c5989647278.
>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>> img805,njt30scbkdmb.gif,1285322429401.f9aacdafd8064bfbcc8cd4f6930febbe.
>>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>>> img94,img1711m.jpg,1285016850260.1424182007
>>>>>>>>>> 2010-09-24 16:20:29,986 DEBUG
>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegion: Creating region
>>>>>>>>>> img840,kitbarca2.png,1285189312696.1ce170ec09384fca51297a5fe7aeb4af.
>>>>>>>>>>
>>>>>>>>>> Which is pretty close to concurrent.
>>>>>>>>>>
>>>>>>>>>> -Jack
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 24, 2010 at 4:16 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>> Still having a problem:
>>>>>>>>>>>
>>>>>>>>>>> 2010-09-24 16:15:02,572 ERROR
>>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
>>>>>>>>>>> img695,p1908101232.jpg,1285288492084.d451f05024b42f71a115951c62cdcccf.
>>>>>>>>>>> java.io.EOFException
>>>>>>>>>>>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>>>>>>>>>        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>>>>>>>>>>>        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>>>>>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
>>>>>>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I changed the value to '1', and restarted the regionserver... Note
>>>>>>>>>>> that my hdfs is not having a problem.
>>>>>>>>>>>
>>>>>>>>>>> -Jack
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 24, 2010 at 4:01 PM, Stack <st...@duboce.net> wrote:
>>>>>>>>>>>> Try
>>>>>>>>>>>>
>>>>>>>>>>>>  <property>
>>>>>>>>>>>>    <name>hbase.regions.percheckin</name>
>>>>>>>>>>>>    <value>10</value>
>>>>>>>>>>>>    <description>Maximum number of regions that can be assigned in a single go
>>>>>>>>>>>>    to a region server.
>>>>>>>>>>>>    </description>
>>>>>>>>>>>>  </property>
>>>>>>>>>>>>
>>>>>>>>>>>> What do you have now?  Whatever it is, go down from there.
>>>>>>>>>>>>
>>>>>>>>>>>> St.Ack
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Sep 24, 2010 at 3:07 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>>> My regions are 1gb in size and when I cold start the cluster I oversaturate my network links (1000 mbps) and get client dfs timeouts , anyway to slow the m down?
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Jack
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Regions loading too fast

Posted by Jack Levin <ma...@gmail.com>.

I had the same issue this morning, some of the regions
'recovered.edits' was corrupt and no single region server was able to
load them.  I saved them if someone is interested to see why they can
not be processed.  I think here is what happens:

1. I am writing data to hbase, and it hits the regionserver
2. It runs out of memory and kills itself.
3. WAL HLOG produces recovered.edits file thats corrupt.
4. On region reassignment no single region server can load the
affected region, and tries over and over again, causing undo load on
HDFS (network traffic spikes between datanodes).

Is there a patch maybe that can detect this condition and throw an
exception that deletes the recovered.edits file, we can't load it
anyway, and this stops hbase from being able to auto-recover after one
or more region servers die.   Now of course, one could say, just don't
let region servers die, this even being careful is not enough here, we
should expect RS to die cleanly.

Thoughts?

Thanks,
-Jack

PS. here is http://pastebin.com/81iV243r regionserver repeatedly
unable to load a region.

On Fri, Sep 24, 2010 at 4:59 PM, Jack Levin <ma...@gmail.com> wrote:
> http://pastebin.com/bD3JJ0sD
>
> The logs were 17MB in size max, and variable sizes like that.
>
> -Jack
>
> On Fri, Sep 24, 2010 at 4:56 PM, Stack <st...@duboce.net> wrote:
>> Please paste the section from regionserver where you were getting the
>> EOF to pastebin.  I'd like to see exactly where (but yeah, you get the
>> idea moving the files aside).  Check the files too.  Are they
>> zero-length?  If so, please look for them in the master log and paste
>> me the section where we are splitting.
>>
>> Thanks Jack,
>> St.Ack
>>
>>
>> On Fri, Sep 24, 2010 at 4:52 PM, Jack Levin <ma...@gmail.com> wrote:
>>> It was EOF exception, but now that I deleted edits files:
>>>
>>> Moved to trash:
>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1062260343/recovered.edits/0000000000617305532
>>> Moved to trash:
>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1321772129/recovered.edits/0000000000617328530
>>> Moved to trash:
>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/257974055/recovered.edits/0000000000617238642
>>> Moved to trash:
>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/117679080/recovered.edits/0000000000617306059
>>> Moved to trash:
>>> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/221569766/recovered.edits/0000000000617242019
>>>
>>> Like these.  All of the regions have loaded... What could that have
>>> been?   I assume I lost some writes, but this is not a big deal to
>>> me... question is how to avoid something like that, is that a bug?
>>>
>>> -Jack
>>>
>>>
>>> On Fri, Sep 24, 2010 at 4:44 PM, Stack <st...@duboce.net> wrote:
>>>> What is the complaint in regionserver log when region load fails?
>>>> St.Ack
>>>>
>>>> On Fri, Sep 24, 2010 at 4:40 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>> so, datanode log shows no errors whatsoever, however I do see same
>>>>> blocks fetched repeatedly, and the network speed is quite high, but I
>>>>> am unable to load _some_ regions, what could it be?
>>>>>
>>>>> 2010-09-24 16:38:42,729 INFO
>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53038, bytes: 914, op: HDFS_READ,
>>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>> offset: 13803520, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>> blockid: blk_5556468858269577961_1550101, duration: 127413
>>>>> 2010-09-24 16:38:44,317 INFO
>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53048, bytes: 110, op: HDFS_READ,
>>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>> offset: 32723968, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>> blockid: blk_364673737339632029_1347910, duration: 1140653
>>>>> 2010-09-24 16:38:44,318 INFO
>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53049, bytes: 38294, op:
>>>>> HDFS_READ, cliID:
>>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>> offset: 32686080, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>> blockid: blk_364673737339632029_1347910, duration: 691929
>>>>> 2010-09-24 16:38:44,510 INFO
>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53054, bytes: 18021300, op:
>>>>> HDFS_READ, cliID:
>>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>> offset: 0, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>> blockid: blk_-3781179144642915580_1571141, duration: 173548261
>>>>> 2010-09-24 16:38:44,525 INFO
>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53055, bytes: 506, op: HDFS_READ,
>>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>> offset: 48700928, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>> blockid: blk_-176750251227749356_1535293, duration: 77045
>>>>> 2010-09-24 16:38:44,526 INFO
>>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>>> /10.101.6.2:50010, dest: /10.101.6.2:53056, bytes: 6182, op:
>>>>> HDFS_READ, cliID:
>>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>>> offset: 48695296, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>>> blockid: blk_-176750251227749356_1535293, duration: 128270
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Sep 24, 2010 at 4:32 PM, Stack <st...@duboce.net> wrote:
>>>>>> (Good one Ryan)
>>>>>>
>>>>>> Master is doing the assigning.  It needs to be restarted to see the
>>>>>> config change.
>>>>>>
>>>>>> St.Ack
>>>>>>
>>>>>> On Fri, Sep 24, 2010 at 4:28 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>> Only regionserver, do I need to restart both?
>>>>>>>
>>>>>>> -jack
>>>>>>>
>>>>>>> On Fri, Sep 24, 2010 at 4:22 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>>> Did you restart the master and the regionserver? Or just one or the other?
>>>>>>>>
>>>>>>>> -ryan
>>>>>>>>
>>>>>>>> On Fri, Sep 24, 2010 at 4:21 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>> Also, even with '1' value, I see:
>>>>>>>>>
>>>>>>>>> 2010-09-24 16:20:29,983 INFO
>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>> img834,1000351n.jpg,1285251664421.d09510a16c0cfd0d8a251a36229125e0.
>>>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>> img651,pict1408.jpg,1285018965749.110871465
>>>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>> img806,sam0084a.jpg,1285324613056.82a1e8ba8d2a37a591a847fb36803c45.
>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>> img535,screenshot1bt.png,1285323376435.fae5f3ab474196c99f10b8a936fb9ead.
>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>> img838,123468.jpg,1285223690165.a2903008621d1a6b6ca02441bf3b68ea.
>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>> img839,yug.jpg,1285230318537.c09323dbaf54130671df2a14d671fe25.
>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>> img821,vlcsnap78737.png,1285283076812.ea4973ce6e43d7f974613c5989647278.
>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>> img805,njt30scbkdmb.gif,1285322429401.f9aacdafd8064bfbcc8cd4f6930febbe.
>>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>>> img94,img1711m.jpg,1285016850260.1424182007
>>>>>>>>> 2010-09-24 16:20:29,986 DEBUG
>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegion: Creating region
>>>>>>>>> img840,kitbarca2.png,1285189312696.1ce170ec09384fca51297a5fe7aeb4af.
>>>>>>>>>
>>>>>>>>> Which is pretty close to concurrent.
>>>>>>>>>
>>>>>>>>> -Jack
>>>>>>>>>
>>>>>>>>> On Fri, Sep 24, 2010 at 4:16 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>> Still having a problem:
>>>>>>>>>>
>>>>>>>>>> 2010-09-24 16:15:02,572 ERROR
>>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
>>>>>>>>>> img695,p1908101232.jpg,1285288492084.d451f05024b42f71a115951c62cdcccf.
>>>>>>>>>> java.io.EOFException
>>>>>>>>>>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>>>>>>>>        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>>>>>>>>>>        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>>>>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
>>>>>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I changed the value to '1', and restarted the regionserver... Note
>>>>>>>>>> that my hdfs is not having a problem.
>>>>>>>>>>
>>>>>>>>>> -Jack
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 24, 2010 at 4:01 PM, Stack <st...@duboce.net> wrote:
>>>>>>>>>>> Try
>>>>>>>>>>>
>>>>>>>>>>>  <property>
>>>>>>>>>>>    <name>hbase.regions.percheckin</name>
>>>>>>>>>>>    <value>10</value>
>>>>>>>>>>>    <description>Maximum number of regions that can be assigned in a single go
>>>>>>>>>>>    to a region server.
>>>>>>>>>>>    </description>
>>>>>>>>>>>  </property>
>>>>>>>>>>>
>>>>>>>>>>> What do you have now?  Whatever it is, go down from there.
>>>>>>>>>>>
>>>>>>>>>>> St.Ack
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Sep 24, 2010 at 3:07 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>>> My regions are 1gb in size and when I cold start the cluster I oversaturate my network links (1000 mbps) and get client dfs timeouts , anyway to slow the m down?
>>>>>>>>>>>>
>>>>>>>>>>>> -Jack
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Regions loading too fast

Posted by Jack Levin <ma...@gmail.com>.

http://pastebin.com/bD3JJ0sD

The logs were 17MB in size max, and variable sizes like that.

-Jack

On Fri, Sep 24, 2010 at 4:56 PM, Stack <st...@duboce.net> wrote:
> Please paste the section from regionserver where you were getting the
> EOF to pastebin.  I'd like to see exactly where (but yeah, you get the
> idea moving the files aside).  Check the files too.  Are they
> zero-length?  If so, please look for them in the master log and paste
> me the section where we are splitting.
>
> Thanks Jack,
> St.Ack
>
>
> On Fri, Sep 24, 2010 at 4:52 PM, Jack Levin <ma...@gmail.com> wrote:
>> It was EOF exception, but now that I deleted edits files:
>>
>> Moved to trash:
>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1062260343/recovered.edits/0000000000617305532
>> Moved to trash:
>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1321772129/recovered.edits/0000000000617328530
>> Moved to trash:
>> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/257974055/recovered.edits/0000000000617238642
>> Moved to trash:
>> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/117679080/recovered.edits/0000000000617306059
>> Moved to trash:
>> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/221569766/recovered.edits/0000000000617242019
>>
>> Like these.  All of the regions have loaded... What could that have
>> been?   I assume I lost some writes, but this is not a big deal to
>> me... question is how to avoid something like that, is that a bug?
>>
>> -Jack
>>
>>
>> On Fri, Sep 24, 2010 at 4:44 PM, Stack <st...@duboce.net> wrote:
>>> What is the complaint in regionserver log when region load fails?
>>> St.Ack
>>>
>>> On Fri, Sep 24, 2010 at 4:40 PM, Jack Levin <ma...@gmail.com> wrote:
>>>> so, datanode log shows no errors whatsoever, however I do see same
>>>> blocks fetched repeatedly, and the network speed is quite high, but I
>>>> am unable to load _some_ regions, what could it be?
>>>>
>>>> 2010-09-24 16:38:42,729 INFO
>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>> /10.101.6.2:50010, dest: /10.101.6.2:53038, bytes: 914, op: HDFS_READ,
>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>> offset: 13803520, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>> blockid: blk_5556468858269577961_1550101, duration: 127413
>>>> 2010-09-24 16:38:44,317 INFO
>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>> /10.101.6.2:50010, dest: /10.101.6.2:53048, bytes: 110, op: HDFS_READ,
>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>> offset: 32723968, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>> blockid: blk_364673737339632029_1347910, duration: 1140653
>>>> 2010-09-24 16:38:44,318 INFO
>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>> /10.101.6.2:50010, dest: /10.101.6.2:53049, bytes: 38294, op:
>>>> HDFS_READ, cliID:
>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>> offset: 32686080, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>> blockid: blk_364673737339632029_1347910, duration: 691929
>>>> 2010-09-24 16:38:44,510 INFO
>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>> /10.101.6.2:50010, dest: /10.101.6.2:53054, bytes: 18021300, op:
>>>> HDFS_READ, cliID:
>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>> offset: 0, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>> blockid: blk_-3781179144642915580_1571141, duration: 173548261
>>>> 2010-09-24 16:38:44,525 INFO
>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>> /10.101.6.2:50010, dest: /10.101.6.2:53055, bytes: 506, op: HDFS_READ,
>>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>> offset: 48700928, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>> blockid: blk_-176750251227749356_1535293, duration: 77045
>>>> 2010-09-24 16:38:44,526 INFO
>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>>> /10.101.6.2:50010, dest: /10.101.6.2:53056, bytes: 6182, op:
>>>> HDFS_READ, cliID:
>>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>>> offset: 48695296, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>>> blockid: blk_-176750251227749356_1535293, duration: 128270
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Sep 24, 2010 at 4:32 PM, Stack <st...@duboce.net> wrote:
>>>>> (Good one Ryan)
>>>>>
>>>>> Master is doing the assigning.  It needs to be restarted to see the
>>>>> config change.
>>>>>
>>>>> St.Ack
>>>>>
>>>>> On Fri, Sep 24, 2010 at 4:28 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>> Only regionserver, do I need to restart both?
>>>>>>
>>>>>> -jack
>>>>>>
>>>>>> On Fri, Sep 24, 2010 at 4:22 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>>> Did you restart the master and the regionserver? Or just one or the other?
>>>>>>>
>>>>>>> -ryan
>>>>>>>
>>>>>>> On Fri, Sep 24, 2010 at 4:21 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>> Also, even with '1' value, I see:
>>>>>>>>
>>>>>>>> 2010-09-24 16:20:29,983 INFO
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>> img834,1000351n.jpg,1285251664421.d09510a16c0cfd0d8a251a36229125e0.
>>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>> img651,pict1408.jpg,1285018965749.110871465
>>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>> img806,sam0084a.jpg,1285324613056.82a1e8ba8d2a37a591a847fb36803c45.
>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>> img535,screenshot1bt.png,1285323376435.fae5f3ab474196c99f10b8a936fb9ead.
>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>> img838,123468.jpg,1285223690165.a2903008621d1a6b6ca02441bf3b68ea.
>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>> img839,yug.jpg,1285230318537.c09323dbaf54130671df2a14d671fe25.
>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>> img821,vlcsnap78737.png,1285283076812.ea4973ce6e43d7f974613c5989647278.
>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>> img805,njt30scbkdmb.gif,1285322429401.f9aacdafd8064bfbcc8cd4f6930febbe.
>>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>>> img94,img1711m.jpg,1285016850260.1424182007
>>>>>>>> 2010-09-24 16:20:29,986 DEBUG
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegion: Creating region
>>>>>>>> img840,kitbarca2.png,1285189312696.1ce170ec09384fca51297a5fe7aeb4af.
>>>>>>>>
>>>>>>>> Which is pretty close to concurrent.
>>>>>>>>
>>>>>>>> -Jack
>>>>>>>>
>>>>>>>> On Fri, Sep 24, 2010 at 4:16 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>> Still having a problem:
>>>>>>>>>
>>>>>>>>> 2010-09-24 16:15:02,572 ERROR
>>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
>>>>>>>>> img695,p1908101232.jpg,1285288492084.d451f05024b42f71a115951c62cdcccf.
>>>>>>>>> java.io.EOFException
>>>>>>>>>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>>>>>>>        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>>>>>>>>>        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>>>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
>>>>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I changed the value to '1', and restarted the regionserver... Note
>>>>>>>>> that my hdfs is not having a problem.
>>>>>>>>>
>>>>>>>>> -Jack
>>>>>>>>>
>>>>>>>>> On Fri, Sep 24, 2010 at 4:01 PM, Stack <st...@duboce.net> wrote:
>>>>>>>>>> Try
>>>>>>>>>>
>>>>>>>>>>  <property>
>>>>>>>>>>    <name>hbase.regions.percheckin</name>
>>>>>>>>>>    <value>10</value>
>>>>>>>>>>    <description>Maximum number of regions that can be assigned in a single go
>>>>>>>>>>    to a region server.
>>>>>>>>>>    </description>
>>>>>>>>>>  </property>
>>>>>>>>>>
>>>>>>>>>> What do you have now?  Whatever it is, go down from there.
>>>>>>>>>>
>>>>>>>>>> St.Ack
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Sep 24, 2010 at 3:07 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>>> My regions are 1gb in size and when I cold start the cluster I oversaturate my network links (1000 mbps) and get client dfs timeouts , anyway to slow the m down?
>>>>>>>>>>>
>>>>>>>>>>> -Jack
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Regions loading too fast

Posted by Stack <st...@duboce.net>.

Please paste the section from regionserver where you were getting the
EOF to pastebin.  I'd like to see exactly where (but yeah, you get the
idea moving the files aside).  Check the files too.  Are they
zero-length?  If so, please look for them in the master log and paste
me the section where we are splitting.

Thanks Jack,
St.Ack


On Fri, Sep 24, 2010 at 4:52 PM, Jack Levin <ma...@gmail.com> wrote:
> It was EOF exception, but now that I deleted edits files:
>
> Moved to trash:
> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1062260343/recovered.edits/0000000000617305532
> Moved to trash:
> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1321772129/recovered.edits/0000000000617328530
> Moved to trash:
> hdfs://namenode-rd.imageshack.us:9000/hbase/img96/257974055/recovered.edits/0000000000617238642
> Moved to trash:
> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/117679080/recovered.edits/0000000000617306059
> Moved to trash:
> hdfs://namenode-rd.imageshack.us:9000/hbase/img97/221569766/recovered.edits/0000000000617242019
>
> Like these.  All of the regions have loaded... What could that have
> been?   I assume I lost some writes, but this is not a big deal to
> me... question is how to avoid something like that, is that a bug?
>
> -Jack
>
>
> On Fri, Sep 24, 2010 at 4:44 PM, Stack <st...@duboce.net> wrote:
>> What is the complaint in regionserver log when region load fails?
>> St.Ack
>>
>> On Fri, Sep 24, 2010 at 4:40 PM, Jack Levin <ma...@gmail.com> wrote:
>>> so, datanode log shows no errors whatsoever, however I do see same
>>> blocks fetched repeatedly, and the network speed is quite high, but I
>>> am unable to load _some_ regions, what could it be?
>>>
>>> 2010-09-24 16:38:42,729 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>> /10.101.6.2:50010, dest: /10.101.6.2:53038, bytes: 914, op: HDFS_READ,
>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>> offset: 13803520, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>> blockid: blk_5556468858269577961_1550101, duration: 127413
>>> 2010-09-24 16:38:44,317 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>> /10.101.6.2:50010, dest: /10.101.6.2:53048, bytes: 110, op: HDFS_READ,
>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>> offset: 32723968, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>> blockid: blk_364673737339632029_1347910, duration: 1140653
>>> 2010-09-24 16:38:44,318 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>> /10.101.6.2:50010, dest: /10.101.6.2:53049, bytes: 38294, op:
>>> HDFS_READ, cliID:
>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>> offset: 32686080, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>> blockid: blk_364673737339632029_1347910, duration: 691929
>>> 2010-09-24 16:38:44,510 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>> /10.101.6.2:50010, dest: /10.101.6.2:53054, bytes: 18021300, op:
>>> HDFS_READ, cliID:
>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>> offset: 0, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>> blockid: blk_-3781179144642915580_1571141, duration: 173548261
>>> 2010-09-24 16:38:44,525 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>> /10.101.6.2:50010, dest: /10.101.6.2:53055, bytes: 506, op: HDFS_READ,
>>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>> offset: 48700928, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>> blockid: blk_-176750251227749356_1535293, duration: 77045
>>> 2010-09-24 16:38:44,526 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>>> /10.101.6.2:50010, dest: /10.101.6.2:53056, bytes: 6182, op:
>>> HDFS_READ, cliID:
>>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>>> offset: 48695296, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>>> blockid: blk_-176750251227749356_1535293, duration: 128270
>>>
>>>
>>>
>>>
>>> On Fri, Sep 24, 2010 at 4:32 PM, Stack <st...@duboce.net> wrote:
>>>> (Good one Ryan)
>>>>
>>>> Master is doing the assigning.  It needs to be restarted to see the
>>>> config change.
>>>>
>>>> St.Ack
>>>>
>>>> On Fri, Sep 24, 2010 at 4:28 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>> Only regionserver, do I need to restart both?
>>>>>
>>>>> -jack
>>>>>
>>>>> On Fri, Sep 24, 2010 at 4:22 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>>> Did you restart the master and the regionserver? Or just one or the other?
>>>>>>
>>>>>> -ryan
>>>>>>
>>>>>> On Fri, Sep 24, 2010 at 4:21 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>> Also, even with '1' value, I see:
>>>>>>>
>>>>>>> 2010-09-24 16:20:29,983 INFO
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>> img834,1000351n.jpg,1285251664421.d09510a16c0cfd0d8a251a36229125e0.
>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>> img651,pict1408.jpg,1285018965749.110871465
>>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>> img806,sam0084a.jpg,1285324613056.82a1e8ba8d2a37a591a847fb36803c45.
>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>> img535,screenshot1bt.png,1285323376435.fae5f3ab474196c99f10b8a936fb9ead.
>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>> img838,123468.jpg,1285223690165.a2903008621d1a6b6ca02441bf3b68ea.
>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>> img839,yug.jpg,1285230318537.c09323dbaf54130671df2a14d671fe25.
>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>> img821,vlcsnap78737.png,1285283076812.ea4973ce6e43d7f974613c5989647278.
>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>> img805,njt30scbkdmb.gif,1285322429401.f9aacdafd8064bfbcc8cd4f6930febbe.
>>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>>> img94,img1711m.jpg,1285016850260.1424182007
>>>>>>> 2010-09-24 16:20:29,986 DEBUG
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegion: Creating region
>>>>>>> img840,kitbarca2.png,1285189312696.1ce170ec09384fca51297a5fe7aeb4af.
>>>>>>>
>>>>>>> Which is pretty close to concurrent.
>>>>>>>
>>>>>>> -Jack
>>>>>>>
>>>>>>> On Fri, Sep 24, 2010 at 4:16 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>> Still having a problem:
>>>>>>>>
>>>>>>>> 2010-09-24 16:15:02,572 ERROR
>>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
>>>>>>>> img695,p1908101232.jpg,1285288492084.d451f05024b42f71a115951c62cdcccf.
>>>>>>>> java.io.EOFException
>>>>>>>>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>>>>>>        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>>>>>>>>        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
>>>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
>>>>>>>>
>>>>>>>>
>>>>>>>> I changed the value to '1', and restarted the regionserver... Note
>>>>>>>> that my hdfs is not having a problem.
>>>>>>>>
>>>>>>>> -Jack
>>>>>>>>
>>>>>>>> On Fri, Sep 24, 2010 at 4:01 PM, Stack <st...@duboce.net> wrote:
>>>>>>>>> Try
>>>>>>>>>
>>>>>>>>>  <property>
>>>>>>>>>    <name>hbase.regions.percheckin</name>
>>>>>>>>>    <value>10</value>
>>>>>>>>>    <description>Maximum number of regions that can be assigned in a single go
>>>>>>>>>    to a region server.
>>>>>>>>>    </description>
>>>>>>>>>  </property>
>>>>>>>>>
>>>>>>>>> What do you have now?  Whatever it is, go down from there.
>>>>>>>>>
>>>>>>>>> St.Ack
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Sep 24, 2010 at 3:07 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>>> My regions are 1gb in size and when I cold start the cluster I oversaturate my network links (1000 mbps) and get client dfs timeouts , anyway to slow the m down?
>>>>>>>>>>
>>>>>>>>>> -Jack
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Regions loading too fast

Posted by Jack Levin <ma...@gmail.com>.

It was EOF exception, but now that I deleted edits files:

Moved to trash:
hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1062260343/recovered.edits/0000000000617305532
Moved to trash:
hdfs://namenode-rd.imageshack.us:9000/hbase/img96/1321772129/recovered.edits/0000000000617328530
Moved to trash:
hdfs://namenode-rd.imageshack.us:9000/hbase/img96/257974055/recovered.edits/0000000000617238642
Moved to trash:
hdfs://namenode-rd.imageshack.us:9000/hbase/img97/117679080/recovered.edits/0000000000617306059
Moved to trash:
hdfs://namenode-rd.imageshack.us:9000/hbase/img97/221569766/recovered.edits/0000000000617242019

Like these.  All of the regions have loaded... What could that have
been?   I assume I lost some writes, but this is not a big deal to
me... question is how to avoid something like that, is that a bug?

-Jack


On Fri, Sep 24, 2010 at 4:44 PM, Stack <st...@duboce.net> wrote:
> What is the complaint in regionserver log when region load fails?
> St.Ack
>
> On Fri, Sep 24, 2010 at 4:40 PM, Jack Levin <ma...@gmail.com> wrote:
>> so, datanode log shows no errors whatsoever, however I do see same
>> blocks fetched repeatedly, and the network speed is quite high, but I
>> am unable to load _some_ regions, what could it be?
>>
>> 2010-09-24 16:38:42,729 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>> /10.101.6.2:50010, dest: /10.101.6.2:53038, bytes: 914, op: HDFS_READ,
>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>> offset: 13803520, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>> blockid: blk_5556468858269577961_1550101, duration: 127413
>> 2010-09-24 16:38:44,317 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>> /10.101.6.2:50010, dest: /10.101.6.2:53048, bytes: 110, op: HDFS_READ,
>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>> offset: 32723968, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>> blockid: blk_364673737339632029_1347910, duration: 1140653
>> 2010-09-24 16:38:44,318 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>> /10.101.6.2:50010, dest: /10.101.6.2:53049, bytes: 38294, op:
>> HDFS_READ, cliID:
>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>> offset: 32686080, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>> blockid: blk_364673737339632029_1347910, duration: 691929
>> 2010-09-24 16:38:44,510 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>> /10.101.6.2:50010, dest: /10.101.6.2:53054, bytes: 18021300, op:
>> HDFS_READ, cliID:
>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>> offset: 0, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>> blockid: blk_-3781179144642915580_1571141, duration: 173548261
>> 2010-09-24 16:38:44,525 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>> /10.101.6.2:50010, dest: /10.101.6.2:53055, bytes: 506, op: HDFS_READ,
>> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>> offset: 48700928, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>> blockid: blk_-176750251227749356_1535293, duration: 77045
>> 2010-09-24 16:38:44,526 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
>> /10.101.6.2:50010, dest: /10.101.6.2:53056, bytes: 6182, op:
>> HDFS_READ, cliID:
>> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
>> offset: 48695296, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
>> blockid: blk_-176750251227749356_1535293, duration: 128270
>>
>>
>>
>>
>> On Fri, Sep 24, 2010 at 4:32 PM, Stack <st...@duboce.net> wrote:
>>> (Good one Ryan)
>>>
>>> Master is doing the assigning.  It needs to be restarted to see the
>>> config change.
>>>
>>> St.Ack
>>>
>>> On Fri, Sep 24, 2010 at 4:28 PM, Jack Levin <ma...@gmail.com> wrote:
>>>> Only regionserver, do I need to restart both?
>>>>
>>>> -jack
>>>>
>>>> On Fri, Sep 24, 2010 at 4:22 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>>> Did you restart the master and the regionserver? Or just one or the other?
>>>>>
>>>>> -ryan
>>>>>
>>>>> On Fri, Sep 24, 2010 at 4:21 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>> Also, even with '1' value, I see:
>>>>>>
>>>>>> 2010-09-24 16:20:29,983 INFO
>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>> img834,1000351n.jpg,1285251664421.d09510a16c0cfd0d8a251a36229125e0.
>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>> img651,pict1408.jpg,1285018965749.110871465
>>>>>> 2010-09-24 16:20:29,984 INFO
>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>> img806,sam0084a.jpg,1285324613056.82a1e8ba8d2a37a591a847fb36803c45.
>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>> img535,screenshot1bt.png,1285323376435.fae5f3ab474196c99f10b8a936fb9ead.
>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>> img838,123468.jpg,1285223690165.a2903008621d1a6b6ca02441bf3b68ea.
>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>> img839,yug.jpg,1285230318537.c09323dbaf54130671df2a14d671fe25.
>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>> img821,vlcsnap78737.png,1285283076812.ea4973ce6e43d7f974613c5989647278.
>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>> img805,njt30scbkdmb.gif,1285322429401.f9aacdafd8064bfbcc8cd4f6930febbe.
>>>>>> 2010-09-24 16:20:29,985 INFO
>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>>> img94,img1711m.jpg,1285016850260.1424182007
>>>>>> 2010-09-24 16:20:29,986 DEBUG
>>>>>> org.apache.hadoop.hbase.regionserver.HRegion: Creating region
>>>>>> img840,kitbarca2.png,1285189312696.1ce170ec09384fca51297a5fe7aeb4af.
>>>>>>
>>>>>> Which is pretty close to concurrent.
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>> On Fri, Sep 24, 2010 at 4:16 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>> Still having a problem:
>>>>>>>
>>>>>>> 2010-09-24 16:15:02,572 ERROR
>>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
>>>>>>> img695,p1908101232.jpg,1285288492084.d451f05024b42f71a115951c62cdcccf.
>>>>>>> java.io.EOFException
>>>>>>>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>>>>>        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>>>>>>>        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
>>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
>>>>>>>
>>>>>>>
>>>>>>> I changed the value to '1', and restarted the regionserver... Note
>>>>>>> that my hdfs is not having a problem.
>>>>>>>
>>>>>>> -Jack
>>>>>>>
>>>>>>> On Fri, Sep 24, 2010 at 4:01 PM, Stack <st...@duboce.net> wrote:
>>>>>>>> Try
>>>>>>>>
>>>>>>>>  <property>
>>>>>>>>    <name>hbase.regions.percheckin</name>
>>>>>>>>    <value>10</value>
>>>>>>>>    <description>Maximum number of regions that can be assigned in a single go
>>>>>>>>    to a region server.
>>>>>>>>    </description>
>>>>>>>>  </property>
>>>>>>>>
>>>>>>>> What do you have now?  Whatever it is, go down from there.
>>>>>>>>
>>>>>>>> St.Ack
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Sep 24, 2010 at 3:07 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>>> My regions are 1gb in size and when I cold start the cluster I oversaturate my network links (1000 mbps) and get client dfs timeouts , anyway to slow the m down?
>>>>>>>>>
>>>>>>>>> -Jack
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Regions loading too fast

Posted by Stack <st...@duboce.net>.

What is the complaint in regionserver log when region load fails?
St.Ack

On Fri, Sep 24, 2010 at 4:40 PM, Jack Levin <ma...@gmail.com> wrote:
> so, datanode log shows no errors whatsoever, however I do see same
> blocks fetched repeatedly, and the network speed is quite high, but I
> am unable to load _some_ regions, what could it be?
>
> 2010-09-24 16:38:42,729 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /10.101.6.2:50010, dest: /10.101.6.2:53038, bytes: 914, op: HDFS_READ,
> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
> offset: 13803520, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
> blockid: blk_5556468858269577961_1550101, duration: 127413
> 2010-09-24 16:38:44,317 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /10.101.6.2:50010, dest: /10.101.6.2:53048, bytes: 110, op: HDFS_READ,
> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
> offset: 32723968, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
> blockid: blk_364673737339632029_1347910, duration: 1140653
> 2010-09-24 16:38:44,318 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /10.101.6.2:50010, dest: /10.101.6.2:53049, bytes: 38294, op:
> HDFS_READ, cliID:
> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
> offset: 32686080, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
> blockid: blk_364673737339632029_1347910, duration: 691929
> 2010-09-24 16:38:44,510 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /10.101.6.2:50010, dest: /10.101.6.2:53054, bytes: 18021300, op:
> HDFS_READ, cliID:
> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
> offset: 0, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
> blockid: blk_-3781179144642915580_1571141, duration: 173548261
> 2010-09-24 16:38:44,525 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /10.101.6.2:50010, dest: /10.101.6.2:53055, bytes: 506, op: HDFS_READ,
> cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
> offset: 48700928, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
> blockid: blk_-176750251227749356_1535293, duration: 77045
> 2010-09-24 16:38:44,526 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> /10.101.6.2:50010, dest: /10.101.6.2:53056, bytes: 6182, op:
> HDFS_READ, cliID:
> DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
> offset: 48695296, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
> blockid: blk_-176750251227749356_1535293, duration: 128270
>
>
>
>
> On Fri, Sep 24, 2010 at 4:32 PM, Stack <st...@duboce.net> wrote:
>> (Good one Ryan)
>>
>> Master is doing the assigning.  It needs to be restarted to see the
>> config change.
>>
>> St.Ack
>>
>> On Fri, Sep 24, 2010 at 4:28 PM, Jack Levin <ma...@gmail.com> wrote:
>>> Only regionserver, do I need to restart both?
>>>
>>> -jack
>>>
>>> On Fri, Sep 24, 2010 at 4:22 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>>> Did you restart the master and the regionserver? Or just one or the other?
>>>>
>>>> -ryan
>>>>
>>>> On Fri, Sep 24, 2010 at 4:21 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>> Also, even with '1' value, I see:
>>>>>
>>>>> 2010-09-24 16:20:29,983 INFO
>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>> img834,1000351n.jpg,1285251664421.d09510a16c0cfd0d8a251a36229125e0.
>>>>> 2010-09-24 16:20:29,984 INFO
>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>> img651,pict1408.jpg,1285018965749.110871465
>>>>> 2010-09-24 16:20:29,984 INFO
>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>> img806,sam0084a.jpg,1285324613056.82a1e8ba8d2a37a591a847fb36803c45.
>>>>> 2010-09-24 16:20:29,985 INFO
>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>> img535,screenshot1bt.png,1285323376435.fae5f3ab474196c99f10b8a936fb9ead.
>>>>> 2010-09-24 16:20:29,985 INFO
>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>> img838,123468.jpg,1285223690165.a2903008621d1a6b6ca02441bf3b68ea.
>>>>> 2010-09-24 16:20:29,985 INFO
>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>> img839,yug.jpg,1285230318537.c09323dbaf54130671df2a14d671fe25.
>>>>> 2010-09-24 16:20:29,985 INFO
>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>> img821,vlcsnap78737.png,1285283076812.ea4973ce6e43d7f974613c5989647278.
>>>>> 2010-09-24 16:20:29,985 INFO
>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>> img805,njt30scbkdmb.gif,1285322429401.f9aacdafd8064bfbcc8cd4f6930febbe.
>>>>> 2010-09-24 16:20:29,985 INFO
>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>>> img94,img1711m.jpg,1285016850260.1424182007
>>>>> 2010-09-24 16:20:29,986 DEBUG
>>>>> org.apache.hadoop.hbase.regionserver.HRegion: Creating region
>>>>> img840,kitbarca2.png,1285189312696.1ce170ec09384fca51297a5fe7aeb4af.
>>>>>
>>>>> Which is pretty close to concurrent.
>>>>>
>>>>> -Jack
>>>>>
>>>>> On Fri, Sep 24, 2010 at 4:16 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>> Still having a problem:
>>>>>>
>>>>>> 2010-09-24 16:15:02,572 ERROR
>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
>>>>>> img695,p1908101232.jpg,1285288492084.d451f05024b42f71a115951c62cdcccf.
>>>>>> java.io.EOFException
>>>>>>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>>>>        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>>>>>>        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
>>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
>>>>>>
>>>>>>
>>>>>> I changed the value to '1', and restarted the regionserver... Note
>>>>>> that my hdfs is not having a problem.
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>> On Fri, Sep 24, 2010 at 4:01 PM, Stack <st...@duboce.net> wrote:
>>>>>>> Try
>>>>>>>
>>>>>>>  <property>
>>>>>>>    <name>hbase.regions.percheckin</name>
>>>>>>>    <value>10</value>
>>>>>>>    <description>Maximum number of regions that can be assigned in a single go
>>>>>>>    to a region server.
>>>>>>>    </description>
>>>>>>>  </property>
>>>>>>>
>>>>>>> What do you have now?  Whatever it is, go down from there.
>>>>>>>
>>>>>>> St.Ack
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Sep 24, 2010 at 3:07 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>>> My regions are 1gb in size and when I cold start the cluster I oversaturate my network links (1000 mbps) and get client dfs timeouts , anyway to slow the m down?
>>>>>>>>
>>>>>>>> -Jack
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Regions loading too fast

Posted by Jack Levin <ma...@gmail.com>.

so, datanode log shows no errors whatsoever, however I do see same
blocks fetched repeatedly, and the network speed is quite high, but I
am unable to load _some_ regions, what could it be?

2010-09-24 16:38:42,729 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/10.101.6.2:50010, dest: /10.101.6.2:53038, bytes: 914, op: HDFS_READ,
cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
offset: 13803520, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
blockid: blk_5556468858269577961_1550101, duration: 127413
2010-09-24 16:38:44,317 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/10.101.6.2:50010, dest: /10.101.6.2:53048, bytes: 110, op: HDFS_READ,
cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
offset: 32723968, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
blockid: blk_364673737339632029_1347910, duration: 1140653
2010-09-24 16:38:44,318 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/10.101.6.2:50010, dest: /10.101.6.2:53049, bytes: 38294, op:
HDFS_READ, cliID:
DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
offset: 32686080, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
blockid: blk_364673737339632029_1347910, duration: 691929
2010-09-24 16:38:44,510 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/10.101.6.2:50010, dest: /10.101.6.2:53054, bytes: 18021300, op:
HDFS_READ, cliID:
DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
offset: 0, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
blockid: blk_-3781179144642915580_1571141, duration: 173548261
2010-09-24 16:38:44,525 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/10.101.6.2:50010, dest: /10.101.6.2:53055, bytes: 506, op: HDFS_READ,
cliID: DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
offset: 48700928, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
blockid: blk_-176750251227749356_1535293, duration: 77045
2010-09-24 16:38:44,526 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
/10.101.6.2:50010, dest: /10.101.6.2:53056, bytes: 6182, op:
HDFS_READ, cliID:
DFSClient_hb_rs_rdaf2.prod.imageshack.com,60020,1285371202189_1285371202237,
offset: 48695296, srvID: DS-1363732508-10.101.6.2-50010-1284520709569,
blockid: blk_-176750251227749356_1535293, duration: 128270




On Fri, Sep 24, 2010 at 4:32 PM, Stack <st...@duboce.net> wrote:
> (Good one Ryan)
>
> Master is doing the assigning.  It needs to be restarted to see the
> config change.
>
> St.Ack
>
> On Fri, Sep 24, 2010 at 4:28 PM, Jack Levin <ma...@gmail.com> wrote:
>> Only regionserver, do I need to restart both?
>>
>> -jack
>>
>> On Fri, Sep 24, 2010 at 4:22 PM, Ryan Rawson <ry...@gmail.com> wrote:
>>> Did you restart the master and the regionserver? Or just one or the other?
>>>
>>> -ryan
>>>
>>> On Fri, Sep 24, 2010 at 4:21 PM, Jack Levin <ma...@gmail.com> wrote:
>>>> Also, even with '1' value, I see:
>>>>
>>>> 2010-09-24 16:20:29,983 INFO
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>> img834,1000351n.jpg,1285251664421.d09510a16c0cfd0d8a251a36229125e0.
>>>> 2010-09-24 16:20:29,984 INFO
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>> img651,pict1408.jpg,1285018965749.110871465
>>>> 2010-09-24 16:20:29,984 INFO
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>> img806,sam0084a.jpg,1285324613056.82a1e8ba8d2a37a591a847fb36803c45.
>>>> 2010-09-24 16:20:29,985 INFO
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>> img535,screenshot1bt.png,1285323376435.fae5f3ab474196c99f10b8a936fb9ead.
>>>> 2010-09-24 16:20:29,985 INFO
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>> img838,123468.jpg,1285223690165.a2903008621d1a6b6ca02441bf3b68ea.
>>>> 2010-09-24 16:20:29,985 INFO
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>> img839,yug.jpg,1285230318537.c09323dbaf54130671df2a14d671fe25.
>>>> 2010-09-24 16:20:29,985 INFO
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>> img821,vlcsnap78737.png,1285283076812.ea4973ce6e43d7f974613c5989647278.
>>>> 2010-09-24 16:20:29,985 INFO
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>> img805,njt30scbkdmb.gif,1285322429401.f9aacdafd8064bfbcc8cd4f6930febbe.
>>>> 2010-09-24 16:20:29,985 INFO
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>>> img94,img1711m.jpg,1285016850260.1424182007
>>>> 2010-09-24 16:20:29,986 DEBUG
>>>> org.apache.hadoop.hbase.regionserver.HRegion: Creating region
>>>> img840,kitbarca2.png,1285189312696.1ce170ec09384fca51297a5fe7aeb4af.
>>>>
>>>> Which is pretty close to concurrent.
>>>>
>>>> -Jack
>>>>
>>>> On Fri, Sep 24, 2010 at 4:16 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>> Still having a problem:
>>>>>
>>>>> 2010-09-24 16:15:02,572 ERROR
>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
>>>>> img695,p1908101232.jpg,1285288492084.d451f05024b42f71a115951c62cdcccf.
>>>>> java.io.EOFException
>>>>>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>>>        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>>>>>        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
>>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
>>>>>
>>>>>
>>>>> I changed the value to '1', and restarted the regionserver... Note
>>>>> that my hdfs is not having a problem.
>>>>>
>>>>> -Jack
>>>>>
>>>>> On Fri, Sep 24, 2010 at 4:01 PM, Stack <st...@duboce.net> wrote:
>>>>>> Try
>>>>>>
>>>>>>  <property>
>>>>>>    <name>hbase.regions.percheckin</name>
>>>>>>    <value>10</value>
>>>>>>    <description>Maximum number of regions that can be assigned in a single go
>>>>>>    to a region server.
>>>>>>    </description>
>>>>>>  </property>
>>>>>>
>>>>>> What do you have now?  Whatever it is, go down from there.
>>>>>>
>>>>>> St.Ack
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 24, 2010 at 3:07 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>>> My regions are 1gb in size and when I cold start the cluster I oversaturate my network links (1000 mbps) and get client dfs timeouts , anyway to slow the m down?
>>>>>>>
>>>>>>> -Jack
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Regions loading too fast

Posted by Stack <st...@duboce.net>.

(Good one Ryan)

Master is doing the assigning.  It needs to be restarted to see the
config change.

St.Ack

On Fri, Sep 24, 2010 at 4:28 PM, Jack Levin <ma...@gmail.com> wrote:
> Only regionserver, do I need to restart both?
>
> -jack
>
> On Fri, Sep 24, 2010 at 4:22 PM, Ryan Rawson <ry...@gmail.com> wrote:
>> Did you restart the master and the regionserver? Or just one or the other?
>>
>> -ryan
>>
>> On Fri, Sep 24, 2010 at 4:21 PM, Jack Levin <ma...@gmail.com> wrote:
>>> Also, even with '1' value, I see:
>>>
>>> 2010-09-24 16:20:29,983 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>> img834,1000351n.jpg,1285251664421.d09510a16c0cfd0d8a251a36229125e0.
>>> 2010-09-24 16:20:29,984 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>> img651,pict1408.jpg,1285018965749.110871465
>>> 2010-09-24 16:20:29,984 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>> img806,sam0084a.jpg,1285324613056.82a1e8ba8d2a37a591a847fb36803c45.
>>> 2010-09-24 16:20:29,985 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>> img535,screenshot1bt.png,1285323376435.fae5f3ab474196c99f10b8a936fb9ead.
>>> 2010-09-24 16:20:29,985 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>> img838,123468.jpg,1285223690165.a2903008621d1a6b6ca02441bf3b68ea.
>>> 2010-09-24 16:20:29,985 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>> img839,yug.jpg,1285230318537.c09323dbaf54130671df2a14d671fe25.
>>> 2010-09-24 16:20:29,985 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>> img821,vlcsnap78737.png,1285283076812.ea4973ce6e43d7f974613c5989647278.
>>> 2010-09-24 16:20:29,985 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>> img805,njt30scbkdmb.gif,1285322429401.f9aacdafd8064bfbcc8cd4f6930febbe.
>>> 2010-09-24 16:20:29,985 INFO
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>>> img94,img1711m.jpg,1285016850260.1424182007
>>> 2010-09-24 16:20:29,986 DEBUG
>>> org.apache.hadoop.hbase.regionserver.HRegion: Creating region
>>> img840,kitbarca2.png,1285189312696.1ce170ec09384fca51297a5fe7aeb4af.
>>>
>>> Which is pretty close to concurrent.
>>>
>>> -Jack
>>>
>>> On Fri, Sep 24, 2010 at 4:16 PM, Jack Levin <ma...@gmail.com> wrote:
>>>> Still having a problem:
>>>>
>>>> 2010-09-24 16:15:02,572 ERROR
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
>>>> img695,p1908101232.jpg,1285288492084.d451f05024b42f71a115951c62cdcccf.
>>>> java.io.EOFException
>>>>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>>        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>>>>        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
>>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
>>>>
>>>>
>>>> I changed the value to '1', and restarted the regionserver... Note
>>>> that my hdfs is not having a problem.
>>>>
>>>> -Jack
>>>>
>>>> On Fri, Sep 24, 2010 at 4:01 PM, Stack <st...@duboce.net> wrote:
>>>>> Try
>>>>>
>>>>>  <property>
>>>>>    <name>hbase.regions.percheckin</name>
>>>>>    <value>10</value>
>>>>>    <description>Maximum number of regions that can be assigned in a single go
>>>>>    to a region server.
>>>>>    </description>
>>>>>  </property>
>>>>>
>>>>> What do you have now?  Whatever it is, go down from there.
>>>>>
>>>>> St.Ack
>>>>>
>>>>>
>>>>> On Fri, Sep 24, 2010 at 3:07 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>>> My regions are 1gb in size and when I cold start the cluster I oversaturate my network links (1000 mbps) and get client dfs timeouts , anyway to slow the m down?
>>>>>>
>>>>>> -Jack
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Regions loading too fast

Posted by Jack Levin <ma...@gmail.com>.

Only regionserver, do I need to restart both?

-jack

On Fri, Sep 24, 2010 at 4:22 PM, Ryan Rawson <ry...@gmail.com> wrote:
> Did you restart the master and the regionserver? Or just one or the other?
>
> -ryan
>
> On Fri, Sep 24, 2010 at 4:21 PM, Jack Levin <ma...@gmail.com> wrote:
>> Also, even with '1' value, I see:
>>
>> 2010-09-24 16:20:29,983 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>> img834,1000351n.jpg,1285251664421.d09510a16c0cfd0d8a251a36229125e0.
>> 2010-09-24 16:20:29,984 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>> img651,pict1408.jpg,1285018965749.110871465
>> 2010-09-24 16:20:29,984 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>> img806,sam0084a.jpg,1285324613056.82a1e8ba8d2a37a591a847fb36803c45.
>> 2010-09-24 16:20:29,985 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>> img535,screenshot1bt.png,1285323376435.fae5f3ab474196c99f10b8a936fb9ead.
>> 2010-09-24 16:20:29,985 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>> img838,123468.jpg,1285223690165.a2903008621d1a6b6ca02441bf3b68ea.
>> 2010-09-24 16:20:29,985 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>> img839,yug.jpg,1285230318537.c09323dbaf54130671df2a14d671fe25.
>> 2010-09-24 16:20:29,985 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>> img821,vlcsnap78737.png,1285283076812.ea4973ce6e43d7f974613c5989647278.
>> 2010-09-24 16:20:29,985 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>> img805,njt30scbkdmb.gif,1285322429401.f9aacdafd8064bfbcc8cd4f6930febbe.
>> 2010-09-24 16:20:29,985 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
>> img94,img1711m.jpg,1285016850260.1424182007
>> 2010-09-24 16:20:29,986 DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegion: Creating region
>> img840,kitbarca2.png,1285189312696.1ce170ec09384fca51297a5fe7aeb4af.
>>
>> Which is pretty close to concurrent.
>>
>> -Jack
>>
>> On Fri, Sep 24, 2010 at 4:16 PM, Jack Levin <ma...@gmail.com> wrote:
>>> Still having a problem:
>>>
>>> 2010-09-24 16:15:02,572 ERROR
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
>>> img695,p1908101232.jpg,1285288492084.d451f05024b42f71a115951c62cdcccf.
>>> java.io.EOFException
>>>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>>        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>>>        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
>>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
>>>
>>>
>>> I changed the value to '1', and restarted the regionserver... Note
>>> that my hdfs is not having a problem.
>>>
>>> -Jack
>>>
>>> On Fri, Sep 24, 2010 at 4:01 PM, Stack <st...@duboce.net> wrote:
>>>> Try
>>>>
>>>>  <property>
>>>>    <name>hbase.regions.percheckin</name>
>>>>    <value>10</value>
>>>>    <description>Maximum number of regions that can be assigned in a single go
>>>>    to a region server.
>>>>    </description>
>>>>  </property>
>>>>
>>>> What do you have now?  Whatever it is, go down from there.
>>>>
>>>> St.Ack
>>>>
>>>>
>>>> On Fri, Sep 24, 2010 at 3:07 PM, Jack Levin <ma...@gmail.com> wrote:
>>>>> My regions are 1gb in size and when I cold start the cluster I oversaturate my network links (1000 mbps) and get client dfs timeouts , anyway to slow the m down?
>>>>>
>>>>> -Jack
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Regions loading too fast

Posted by Ryan Rawson <ry...@gmail.com>.

Did you restart the master and the regionserver? Or just one or the other?

-ryan

On Fri, Sep 24, 2010 at 4:21 PM, Jack Levin <ma...@gmail.com> wrote:
> Also, even with '1' value, I see:
>
> 2010-09-24 16:20:29,983 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
> img834,1000351n.jpg,1285251664421.d09510a16c0cfd0d8a251a36229125e0.
> 2010-09-24 16:20:29,984 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
> img651,pict1408.jpg,1285018965749.110871465
> 2010-09-24 16:20:29,984 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
> img806,sam0084a.jpg,1285324613056.82a1e8ba8d2a37a591a847fb36803c45.
> 2010-09-24 16:20:29,985 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
> img535,screenshot1bt.png,1285323376435.fae5f3ab474196c99f10b8a936fb9ead.
> 2010-09-24 16:20:29,985 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
> img838,123468.jpg,1285223690165.a2903008621d1a6b6ca02441bf3b68ea.
> 2010-09-24 16:20:29,985 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
> img839,yug.jpg,1285230318537.c09323dbaf54130671df2a14d671fe25.
> 2010-09-24 16:20:29,985 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
> img821,vlcsnap78737.png,1285283076812.ea4973ce6e43d7f974613c5989647278.
> 2010-09-24 16:20:29,985 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
> img805,njt30scbkdmb.gif,1285322429401.f9aacdafd8064bfbcc8cd4f6930febbe.
> 2010-09-24 16:20:29,985 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
> img94,img1711m.jpg,1285016850260.1424182007
> 2010-09-24 16:20:29,986 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Creating region
> img840,kitbarca2.png,1285189312696.1ce170ec09384fca51297a5fe7aeb4af.
>
> Which is pretty close to concurrent.
>
> -Jack
>
> On Fri, Sep 24, 2010 at 4:16 PM, Jack Levin <ma...@gmail.com> wrote:
>> Still having a problem:
>>
>> 2010-09-24 16:15:02,572 ERROR
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
>> img695,p1908101232.jpg,1285288492084.d451f05024b42f71a115951c62cdcccf.
>> java.io.EOFException
>>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>>        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
>>
>>
>> I changed the value to '1', and restarted the regionserver... Note
>> that my hdfs is not having a problem.
>>
>> -Jack
>>
>> On Fri, Sep 24, 2010 at 4:01 PM, Stack <st...@duboce.net> wrote:
>>> Try
>>>
>>>  <property>
>>>    <name>hbase.regions.percheckin</name>
>>>    <value>10</value>
>>>    <description>Maximum number of regions that can be assigned in a single go
>>>    to a region server.
>>>    </description>
>>>  </property>
>>>
>>> What do you have now?  Whatever it is, go down from there.
>>>
>>> St.Ack
>>>
>>>
>>> On Fri, Sep 24, 2010 at 3:07 PM, Jack Levin <ma...@gmail.com> wrote:
>>>> My regions are 1gb in size and when I cold start the cluster I oversaturate my network links (1000 mbps) and get client dfs timeouts , anyway to slow the m down?
>>>>
>>>> -Jack
>>>>
>>>>
>>>
>>
>

Re: Regions loading too fast

Posted by Stack <st...@duboce.net>.

The '1' is give a regionserver '1' region to open each time it checks
in.  Servers check in every second by default.  It looks like you have
a good few servers coming in.

Change this if you'd have them come in less frequently:


  <property>
    <name>hbase.regionserver.msginterval</name>
    <value>1000</value>
    <description>Interval between messages from the RegionServer to HMaster
    in milliseconds. Use a high value like 3000 for clusters with more than 10
    nodes. Default is 1 second so that HBase seems more 'live'.
    </description>
  </property>

The java.io.EOFException is something else, a bad storefile or an
issue with hdfs.  Check your hdfs logs for xceiver complaint.

St.Ack


On Fri, Sep 24, 2010 at 4:21 PM, Jack Levin <ma...@gmail.com> wrote:
> Also, even with '1' value, I see:
>
> 2010-09-24 16:20:29,983 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
> img834,1000351n.jpg,1285251664421.d09510a16c0cfd0d8a251a36229125e0.
> 2010-09-24 16:20:29,984 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
> img651,pict1408.jpg,1285018965749.110871465
> 2010-09-24 16:20:29,984 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
> img806,sam0084a.jpg,1285324613056.82a1e8ba8d2a37a591a847fb36803c45.
> 2010-09-24 16:20:29,985 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
> img535,screenshot1bt.png,1285323376435.fae5f3ab474196c99f10b8a936fb9ead.
> 2010-09-24 16:20:29,985 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
> img838,123468.jpg,1285223690165.a2903008621d1a6b6ca02441bf3b68ea.
> 2010-09-24 16:20:29,985 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
> img839,yug.jpg,1285230318537.c09323dbaf54130671df2a14d671fe25.
> 2010-09-24 16:20:29,985 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
> img821,vlcsnap78737.png,1285283076812.ea4973ce6e43d7f974613c5989647278.
> 2010-09-24 16:20:29,985 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
> img805,njt30scbkdmb.gif,1285322429401.f9aacdafd8064bfbcc8cd4f6930febbe.
> 2010-09-24 16:20:29,985 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
> img94,img1711m.jpg,1285016850260.1424182007
> 2010-09-24 16:20:29,986 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegion: Creating region
> img840,kitbarca2.png,1285189312696.1ce170ec09384fca51297a5fe7aeb4af.
>
> Which is pretty close to concurrent.
>
> -Jack
>
> On Fri, Sep 24, 2010 at 4:16 PM, Jack Levin <ma...@gmail.com> wrote:
>> Still having a problem:
>>
>> 2010-09-24 16:15:02,572 ERROR
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
>> img695,p1908101232.jpg,1285288492084.d451f05024b42f71a115951c62cdcccf.
>> java.io.EOFException
>>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>>        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
>>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
>>
>>
>> I changed the value to '1', and restarted the regionserver... Note
>> that my hdfs is not having a problem.
>>
>> -Jack
>>
>> On Fri, Sep 24, 2010 at 4:01 PM, Stack <st...@duboce.net> wrote:
>>> Try
>>>
>>>  <property>
>>>    <name>hbase.regions.percheckin</name>
>>>    <value>10</value>
>>>    <description>Maximum number of regions that can be assigned in a single go
>>>    to a region server.
>>>    </description>
>>>  </property>
>>>
>>> What do you have now?  Whatever it is, go down from there.
>>>
>>> St.Ack
>>>
>>>
>>> On Fri, Sep 24, 2010 at 3:07 PM, Jack Levin <ma...@gmail.com> wrote:
>>>> My regions are 1gb in size and when I cold start the cluster I oversaturate my network links (1000 mbps) and get client dfs timeouts , anyway to slow the m down?
>>>>
>>>> -Jack
>>>>
>>>>
>>>
>>
>

Re: Regions loading too fast

Posted by Jack Levin <ma...@gmail.com>.

Also, even with '1' value, I see:

2010-09-24 16:20:29,983 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
img834,1000351n.jpg,1285251664421.d09510a16c0cfd0d8a251a36229125e0.
2010-09-24 16:20:29,984 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
img651,pict1408.jpg,1285018965749.110871465
2010-09-24 16:20:29,984 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
img806,sam0084a.jpg,1285324613056.82a1e8ba8d2a37a591a847fb36803c45.
2010-09-24 16:20:29,985 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
img535,screenshot1bt.png,1285323376435.fae5f3ab474196c99f10b8a936fb9ead.
2010-09-24 16:20:29,985 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
img838,123468.jpg,1285223690165.a2903008621d1a6b6ca02441bf3b68ea.
2010-09-24 16:20:29,985 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
img839,yug.jpg,1285230318537.c09323dbaf54130671df2a14d671fe25.
2010-09-24 16:20:29,985 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
img821,vlcsnap78737.png,1285283076812.ea4973ce6e43d7f974613c5989647278.
2010-09-24 16:20:29,985 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
img805,njt30scbkdmb.gif,1285322429401.f9aacdafd8064bfbcc8cd4f6930febbe.
2010-09-24 16:20:29,985 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: MSG_REGION_OPEN:
img94,img1711m.jpg,1285016850260.1424182007
2010-09-24 16:20:29,986 DEBUG
org.apache.hadoop.hbase.regionserver.HRegion: Creating region
img840,kitbarca2.png,1285189312696.1ce170ec09384fca51297a5fe7aeb4af.

Which is pretty close to concurrent.

-Jack

On Fri, Sep 24, 2010 at 4:16 PM, Jack Levin <ma...@gmail.com> wrote:
> Still having a problem:
>
> 2010-09-24 16:15:02,572 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
> img695,p1908101232.jpg,1285288492084.d451f05024b42f71a115951c62cdcccf.
> java.io.EOFException
>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
>        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)
>
>
> I changed the value to '1', and restarted the regionserver... Note
> that my hdfs is not having a problem.
>
> -Jack
>
> On Fri, Sep 24, 2010 at 4:01 PM, Stack <st...@duboce.net> wrote:
>> Try
>>
>>  <property>
>>    <name>hbase.regions.percheckin</name>
>>    <value>10</value>
>>    <description>Maximum number of regions that can be assigned in a single go
>>    to a region server.
>>    </description>
>>  </property>
>>
>> What do you have now?  Whatever it is, go down from there.
>>
>> St.Ack
>>
>>
>> On Fri, Sep 24, 2010 at 3:07 PM, Jack Levin <ma...@gmail.com> wrote:
>>> My regions are 1gb in size and when I cold start the cluster I oversaturate my network links (1000 mbps) and get client dfs timeouts , anyway to slow the m down?
>>>
>>> -Jack
>>>
>>>
>>
>

Re: Regions loading too fast

Posted by Jack Levin <ma...@gmail.com>.

Still having a problem:

2010-09-24 16:15:02,572 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer: Error opening
img695,p1908101232.jpg,1285288492084.d451f05024b42f71a115951c62cdcccf.
java.io.EOFException
	at java.io.DataInputStream.readFully(DataInputStream.java:180)
	at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
	at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1937)
	at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1837)


I changed the value to '1', and restarted the regionserver... Note
that my hdfs is not having a problem.

-Jack

On Fri, Sep 24, 2010 at 4:01 PM, Stack <st...@duboce.net> wrote:
> Try
>
>  <property>
>    <name>hbase.regions.percheckin</name>
>    <value>10</value>
>    <description>Maximum number of regions that can be assigned in a single go
>    to a region server.
>    </description>
>  </property>
>
> What do you have now?  Whatever it is, go down from there.
>
> St.Ack
>
>
> On Fri, Sep 24, 2010 at 3:07 PM, Jack Levin <ma...@gmail.com> wrote:
>> My regions are 1gb in size and when I cold start the cluster I oversaturate my network links (1000 mbps) and get client dfs timeouts , anyway to slow the m down?
>>
>> -Jack
>>
>>
>

Re: Regions loading too fast

Posted by Stack <st...@duboce.net>.

Try

  <property>
    <name>hbase.regions.percheckin</name>
    <value>10</value>
    <description>Maximum number of regions that can be assigned in a single go
    to a region server.
    </description>
  </property>

What do you have now?  Whatever it is, go down from there.

St.Ack


On Fri, Sep 24, 2010 at 3:07 PM, Jack Levin <ma...@gmail.com> wrote:
> My regions are 1gb in size and when I cold start the cluster I oversaturate my network links (1000 mbps) and get client dfs timeouts , anyway to slow the m down?
>
> -Jack
>
>