Posted to user@hbase.apache.org by Robert Gonzalez <Ro...@maxpointinteractive.com> on 2011/05/13 20:57:45 UTC

wrong region exception

Anyone ever see one of these?

org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 25 actions: WrongRegionException: 25 times, servers with issues: c1-s49.atxd.maxpointinteractive.com:60020, c1-s03.atxd.maxpointinteractive.com:60020,
                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1220)
                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchOfPuts(HConnectionManager.java:1234)
                at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:819)
                at org.apache.hadoop.hbase.client.HTable.close(HTable.java:831)
                at com.maxpoint.crawl.crawlmgr.SelectThumbs$SelTReducer.cleanup(SelectThumbs.java:453)
                at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:178)
                at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
                at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
                at org.apache.hadoop.mapred.Child.main(Child.java:170)
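
The trace shows the failure surfacing at HTable.close(), which runs flushCommits() on the client-side write buffer. A minimal sketch of that write path against the 0.90 client API (the table, row key, and column names are assumptions for illustration, not taken from SelectThumbs):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BufferedWriteSketch {
      public static void main(String[] args) throws Exception {
        // Table, row key, family, and qualifier here are assumed for illustration.
        HTable table = new HTable(HBaseConfiguration.create(), "urlhashv2");
        table.setAutoFlush(false); // buffer Puts client-side, as an MR writer typically does
        Put put = new Put(Bytes.toBytes("80116D7E506D87ED39EAFFE784B5B590"));
        put.add(Bytes.toBytes("url"), Bytes.toBytes("raw"), Bytes.toBytes("http://example.com"));
        table.put(put);  // queued in the client write buffer, not yet sent
        // close() calls flushCommits(); if buffered actions still fail after all
        // retries, it throws RetriesExhaustedWithDetailsException listing each
        // failed action and the servers involved, as in the trace above.
        table.close();
      }
    }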

thanks,

Gonz





RE: wrong region exception

Posted by Robert Gonzalez <Ro...@maxpointinteractive.com>.
Nope, nothing in the logs with that string.

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Wednesday, May 25, 2011 3:30 PM
To: user@hbase.apache.org
Subject: Re: wrong region exception

Can you find this region in the filesystem?  Look under the urlhashv2 table directory for a directory named 80116D7E506D87ED39EAFFE784B5B590.  Grep your master log to see if you can figure out the history of this region.
St.Ack
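
Both checks can also be scripted against the 0.90 client API. A rough sketch (it assumes the default /hbase root directory and that the name above is the region's directory name; verify class names against your jars):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HConstants;
    import org.apache.hadoop.hbase.HRegionInfo;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.hbase.util.Writables;

    public class RegionChainCheck {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // 1. Does the region directory exist under the table dir?
        FileSystem fs = FileSystem.get(conf);
        Path dir = new Path("/hbase/urlhashv2/80116D7E506D87ED39EAFFE784B5B590");
        System.out.println(dir + " exists: " + fs.exists(dir));

        // 2. Walk .META. and look for holes in the table's region chain; a
        //    "broken chain" shows up as an end key that doesn't match the
        //    next region's start key.
        HTable meta = new HTable(conf, ".META.");
        ResultScanner rs = meta.getScanner(new Scan());
        byte[] prevEnd = HConstants.EMPTY_BYTE_ARRAY;
        for (Result r : rs) {
          HRegionInfo info = Writables.getHRegionInfo(
              r.getValue(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER));
          if (!info.getTableDesc().getNameAsString().equals("urlhashv2")) continue;
          if (info.isOffline() || info.isSplit()) continue; // skip split parents
          if (!Bytes.equals(prevEnd, info.getStartKey())) {
            System.out.println("HOLE before " + info.getRegionNameAsString());
          }
          prevEnd = info.getEndKey();
        }
        rs.close();
        meta.close();
      }
    }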

On Wed, May 25, 2011 at 1:21 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
> The detailed error is :
>
> Chain of regions in table urlhashv2 is broken; edges does not contain 80116D7E506D87ED39EAFFE784B5B590
> Table urlhashv2 is inconsistent.
>
> How does one fix this?
>
> Thanks,
>
> Robert
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
> Stack
> Sent: Monday, May 16, 2011 2:35 PM
> To: user@hbase.apache.org
> Subject: Re: wrong region exception
>
> Says you have an inconsistency in your table.  Add -details and try to figure out where the inconsistency is.  Grep master logs to try to figure out what happened to the problematic regions.  See if adding -fix to hbck will clean up your prob.
>
> St.Ack
>
> On Mon, May 16, 2011 at 12:04 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>> attached
>>
>>
>> -----Original Message-----
>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
>> Stack
>> Sent: Monday, May 16, 2011 12:57 PM
>> To: user@hbase.apache.org
>> Subject: Re: wrong region exception
>>
>> See the rest of my email.
>> St.Ack
>>
>> On Mon, May 16, 2011 at 8:18 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>>> 0.90.0
>>>
>>> -----Original Message-----
>>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
>>> Stack
>>> Sent: Friday, May 13, 2011 2:21 PM
>>> To: user@hbase.apache.org
>>> Subject: Re: wrong region exception
>>>
>>> What version of hbase?  We used to see those from time to time in old
>>> 0.20 hbase but haven't seen one recently.  Usually the .META. table is 'off'.  If 0.90.x, try running ./bin/hbase hbck.  See what it says.
>>>
>>> St.Ack
>>>
>>> On Fri, May 13, 2011 at 11:57 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>>>> Anyone ever see one of these?
>>>>
>>>> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 25 actions: WrongRegionException: 25 times, servers with issues: c1-s49.atxd.maxpointinteractive.com:60020, c1-s03.atxd.maxpointinteractive.com:60020,
>>>>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1220)
>>>>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchOfPuts(HConnectionManager.java:1234)
>>>>                at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:819)
>>>>                at org.apache.hadoop.hbase.client.HTable.close(HTable.java:831)
>>>>                at com.maxpoint.crawl.crawlmgr.SelectThumbs$SelTReducer.cleanup(SelectThumbs.java:453)
>>>>                at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:178)
>>>>                at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
>>>>                at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
>>>>                at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>>
>>>> thanks,
>>>>
>>>> Gonz
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

RE: wrong region exception

Posted by Robert Gonzalez <Ro...@maxpointinteractive.com>.
I'm getting a lot of this on the slave that is doing the latest adds:

2011-06-02 00:33:05,231 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region urlhashv4,7F8537883DDF5230B10AA2CB13182505,1306992752074.71ab4c4527ce76d777f78943a86009d2. has too many store files; delaying flush up to 90000ms
2011-06-02 00:33:17,395 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: IPC Server handler 7 on 60020 took 1008 ms appending an edit to hlog; editcount=1, len~=34.1m
2011-06-02 00:33:53,626 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: IPC Server handler 2 on 60020 took 1093 ms appending an edit to hlog; editcount=1, len~=34.1m
2011-06-02 00:34:02,001 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region urlhashv4,7F9567FC7E75F6F219D704791212B1F5,1306992806225.0d427324855e7d0dd043566d18e8d5c4. has too many store files; delaying flush up to 90000ms
2011-06-02 00:34:10,107 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region urlhashv4,7F9567FC7E75F6F219D704791212B1F5,1306992806225.0d427324855e7d0dd043566d18e8d5c4. has too many store files; delaying flush up to 90000ms
2011-06-02 00:34:16,989 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Region urlhashv4,7F9567FC7E75F6F219D704791212B1F5,1306992806225.0d427324855e7d0dd043566d18e8d5c4. has too many store files; delaying flush up to 90000ms
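
(The 90000ms in these messages is the default hbase.hstore.blockingWaitTime; the file count that triggers the delay is hbase.hstore.blockingStoreFiles. When an import outruns compaction, these are the hbase-site.xml knobs involved; the values below are illustrative, not recommendations:)

    <property>
      <name>hbase.hstore.blockingStoreFiles</name>
      <value>15</value>  <!-- flushes are delayed once a store has this many files (default 7) -->
    </property>
    <property>
      <name>hbase.hstore.blockingWaitTime</name>
      <value>90000</value>  <!-- max ms to delay a flush while waiting for compaction (default 90000) -->
    </property>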

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Wednesday, June 01, 2011 5:34 PM
To: user@hbase.apache.org
Subject: Re: wrong region exception

We can't reach the server carrying .META. within 60 seconds.  What's going on on that server?  The next time the CatalogJanitor below runs, does it succeed, or does it always fail?

St.Ack
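
One quick client-side probe for that question: scan .META. directly and see whether it stalls the same way. A sketch against the 0.90 API:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    public class MetaProbe {
      public static void main(String[] args) throws Exception {
        // If the CatalogJanitor's 60s timeouts are environmental, a plain scan
        // against the server carrying .META. should stall the same way.
        HTable meta = new HTable(HBaseConfiguration.create(), ".META.");
        long start = System.currentTimeMillis();
        ResultScanner rs = meta.getScanner(new Scan());
        int rows = 0;
        for (Result r : rs) rows++;   // just count rows to exercise the RPC path
        rs.close();
        meta.close();
        System.out.println(rows + " rows in " + (System.currentTimeMillis() - start) + "ms");
      }
    }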

On Wed, Jun 1, 2011 at 2:27 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
> This is basically it (for the first time it died while copying), we have it at warn level and above:
>
> 2011-05-27 16:44:27,565 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Failed scan of catalog table
> java.net.SocketTimeoutException: Call to /10.100.2.6:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.39:37717 remote=/10.100.2.6:60020]
>        at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:784)
>        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:757)
>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>        at $Proxy6.delete(Unknown Source)
>        at org.apache.hadoop.hbase.catalog.MetaEditor.deleteDaughterReferenceInParent(MetaEditor.java:201)
>        at org.apache.hadoop.hbase.master.CatalogJanitor.removeDaughterFromParent(CatalogJanitor.java:233)
>        at org.apache.hadoop.hbase.master.CatalogJanitor.hasReferences(CatalogJanitor.java:275)
>        at org.apache.hadoop.hbase.master.CatalogJanitor.checkDaughter(CatalogJanitor.java:202)
>        at org.apache.hadoop.hbase.master.CatalogJanitor.cleanParent(CatalogJanitor.java:166)
>        at org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:120)
>        at org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:85)
>        at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.39:37717 remote=/10.100.2.6:60020]
>        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>        at java.io.FilterInputStream.read(FilterInputStream.java:116)
>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:281)
>        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>        at java.io.DataInputStream.readInt(DataInputStream.java:370)
>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:521)
>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:459)
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
> Stack
> Sent: Wednesday, June 01, 2011 12:29 PM
> To: user@hbase.apache.org
> Subject: Re: wrong region exception
>
> On Wed, Jun 1, 2011 at 7:32 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>> We have a table copy program that copies the data from one table to another, and we can give it the start/end keys.  In this case we created a new blank table with the essential column families and let it run with start/end to be the whole range, 0-maxkey.  At about 30% of the way through, which is roughly 600 million rows, it died trying to write to the new table with the wrong region exception.  When we tried to restart the copy from that key + some delta, it still crapped out.  No explanation in the logs the first time, but a series of timeouts in the second run.  Now we are trying the copy again with a new table.
>>
>
> Robert:
>
> Do you have the master logs for this copy run still?  If so, if you 
> put them somewhere where I can pull them (or send them to me, I'll 
> take a look).   I'd like to see the logs in the cluster to which you were copying the data.
>
> St.Ack
>
>
>> -----Original Message-----
>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
>> Stack
>> Sent: Tuesday, May 31, 2011 6:42 PM
>> To: user@hbase.apache.org
>> Subject: Re: wrong region exception
>>
>> So, what about this new WrongRegionException in the new cluster.  Can you figure how it came about?  In the new cluster, is there also a hole?  Did you start the new cluster fresh or copy from old cluster?
>>
>> St.Ack
>>
>> On Tue, May 31, 2011 at 1:55 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>>> Yeah, we learned the hard way early last year to follow the guidelines religiously.  I've gone over the requirements and checked off everything.  We even re-did our tables to only have 4 column families, down from 4x that amount.   We are at a loss to find out why we seem to be cursed when it comes to HBase.  Hadoop is performing like a charm, pretty much every machine is busy 24/7.
>>>
>>> -----Original Message-----
>>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
>>> Stack
>>> Sent: Tuesday, May 31, 2011 3:03 PM
>>> To: user@hbase.apache.org
>>> Subject: Re: wrong region exception
>>>
>>> On Tue, May 31, 2011 at 10:42 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>>>> Now I'm getting the wrong region exception on the new table that I'm copying the old table to.  Running hbck reveals an inconsistency in the new table.  The frustration is unbelievable.  Like I said before, it doesn't appear that HBase is ready for prime time.  I don't know how companies are using this successfully; it doesn't appear plausible.
>>>>
>>>
>>>
>>> Sorry you are not having a good experience.  I've not seen WrongRegionException in ages (Grep these lists yourself).  Makes me suspect your environment.  For sure you've read the requirements section in the manual and set ulimits, nprocs, and xceivers up?
>>>
>>> St.Ack
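
For reference, the requirements being referred to are typically satisfied with a file-descriptor ulimit of 32768 (or higher) for the user running Hadoop/HBase, a raised nproc limit, and this hdfs-site.xml property on every datanode (note the property name's historical misspelling; the value is the book's commonly cited one, not a tuned recommendation):

    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>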
>>>
>>
>

Re: wrong region exception

Posted by Stack <st...@duboce.net>.
It can't get to c1-s19.  It times out trying to connect.  Can you
figure what's up with that?  On always going to the same server, is this
a case of http://hbase.apache.org/book.html#timeseries?  Or perhaps
regions split and go elsewhere, but distcp is writing from the src in
order?
St.Ack
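
For reference, the copy described downthread has roughly this shape (a sketch only; the table names are from the thread, everything else is assumed). Because the source is scanned in sorted key order, every Put lands in the destination's current last region until that region splits, which would explain a single server taking all the writes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TableCopySketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable src = new HTable(conf, "urlhashv2");   // source table (name from thread)
        HTable dst = new HTable(conf, "urlhashv4");   // destination table (name from thread)
        dst.setAutoFlush(false);
        Scan scan = new Scan(Bytes.toBytes(args[0]), Bytes.toBytes(args[1])); // start/end keys
        scan.setCaching(1000);
        ResultScanner rows = src.getScanner(scan);
        for (Result row : rows) {
          Put put = new Put(row.getRow());
          for (KeyValue kv : row.raw()) put.add(kv); // copy each cell verbatim
          dst.put(put); // buffered; rows arrive in sorted order, so they all hit
                        // the current tail region of the new table until it splits
        }
        rows.close();
        dst.close();  // flushCommits() here is where WrongRegionException surfaced
        src.close();
      }
    }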

On Thu, Jun 2, 2011 at 12:49 PM, Robert Gonzalez
<Ro...@maxpointinteractive.com> wrote:
> First, a clarification: everything is happening within a single cluster of 55 machines; there is no inter-cluster copying.  I restarted c1-s06, the regionserver that died, and by the way, all new data seems to be going to this server first.  Is there a reason for this?  From the beginning of the copy until it crashed, c1-s06 always served the latest keys, no other server did.  After I restarted c1-s06, it keeps dying.  Here is one of the crashes:
>
> 2011-06-02 13:29:07,546 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=c1-s06.atxd.maxpointinteractive.com,60020,1306860799744, load=(requests=0, regions=136, usedHeap=307, maxHeap=2999): Failed open of daughter urlhashv4,837743DFAE34D105BB5B1E81810627B8,1307039228513.d1c0eb0aef559f349fdaca3452f55c10.
> java.net.SocketTimeoutException: Call to c1-s19.atxd.maxpointinteractive.com/10.100.1.26:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.7:41085 remote=c1-s19.atxd.maxpointinteractive.com/10.100.1.26:60020]
>        at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:784)
>        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:757)
>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>        at $Proxy8.put(Unknown Source)
>        at org.apache.hadoop.hbase.catalog.MetaEditor.addDaughter(MetaEditor.java:97)
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:1350)
>        at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:328)
>        at org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:296)
> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.7:41085 remote=c1-s19.atxd.maxpointinteractive.com/10.100.1.26:60020]
>        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>        at java.io.FilterInputStream.read(FilterInputStream.java:116)
>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:281)
>        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>        at java.io.DataInputStream.readInt(DataInputStream.java:370)
>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:521)
>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:459)
> 2011-06-02 13:29:07,551 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=c1-s06.atxd.maxpointinteractive.com,60020,1306860799744, load=(requests=0, regions=136, usedHeap=307, maxHeap=2999): Failed open of daughter urlhashv4,836A60E0F78046975AD00B84CC0B71FB,1307039228513.be6575478b7f4c22d6540a60c7c45fbb.
> java.net.SocketTimeoutException: Call to c1-s19.atxd.maxpointinteractive.com/10.100.1.26:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.7:41085 remote=c1-s19.atxd.maxpointinteractive.com/10.100.1.26:60020]
>        at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:784)
>        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:757)
>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>        at $Proxy8.put(Unknown Source)
>        at org.apache.hadoop.hbase.catalog.MetaEditor.addDaughter(MetaEditor.java:97)
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:1350)
>        at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:328)
>        at org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:296)
> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.7:41085 remote=c1-s19.atxd.maxpointinteractive.com/10.100.1.26:60020]
>        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>        at java.io.FilterInputStream.read(FilterInputStream.java:116)
>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:281)
>        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>        at java.io.DataInputStream.readInt(DataInputStream.java:370)
>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:521)
>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:459)
>
>
>
>
> .....
>
>
>
> 2011-06-02 13:29:09,802 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/urlhashv4/2f21ad9ef74c6e6d69259e9525f7b863/splits/0cf36594d8e349830d3b4f175b5b688a/crawl/1293482649614989303.2f21ad9ef74c6e6d69259e9525f7b863 : java.io.IOException: Error Recovery for block blk_3518061384037278533_81210851 failed  because recovery from primary datanode 10.100.2.27:50010 failed 6 times.  Pipeline was 10.100.2.27:50010. Aborting...
> java.io.IOException: Error Recovery for block blk_3518061384037278533_81210851 failed  because recovery from primary datanode 10.100.2.27:50010 failed 6 times.  Pipeline was 10.100.2.27:50010. Aborting...
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2668)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2139)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2306)
> 2011-06-02 13:29:09,819 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/urlhashv4/2f21ad9ef74c6e6d69259e9525f7b863/splits/0cf36594d8e349830d3b4f175b5b688a/flags/9107094927628682840.2f21ad9ef74c6e6d69259e9525f7b863 : java.io.IOException: Error Recovery for block blk_-2414361613802877611_81210865 failed  because recovery from primary datanode 10.100.2.9:50010 failed 6 times.  Pipeline was 10.100.2.9:50010. Aborting...
> java.io.IOException: Error Recovery for block blk_-2414361613802877611_81210865 failed  because recovery from primary datanode 10.100.2.9:50010 failed 6 times.  Pipeline was 10.100.2.9:50010. Aborting...
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2668)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2139)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2306)
> 2011-06-02 13:29:09,820 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/urlhashv4/2f21ad9ef74c6e6d69259e9525f7b863/splits/0cf36594d8e349830d3b4f175b5b688a/url/6599763290219995724.2f21ad9ef74c6e6d69259e9525f7b863 : java.io.IOException: Error Recovery for block blk_798371171266377251_81210865 failed  because recovery from primary datanode 10.100.2.3:50010 failed 6 times.  Pipeline was 10.100.2.3:50010. Aborting...
> java.io.IOException: Error Recovery for block blk_798371171266377251_81210865 failed  because recovery from primary datanode 10.100.2.3:50010 failed 6 times.  Pipeline was 10.100.2.3:50010. Aborting...
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2668)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2139)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2306)
> 2011-06-02 13:29:09,820 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/urlhashv4/2f21ad9ef74c6e6d69259e9525f7b863/splits/1e83e5a02f616d7437f10a6a636e2686/thumbs/2894023573319516345.2f21ad9ef74c6e6d69259e9525f7b863 : java.io.IOException: Bad connect ack with firstBadLink 10.100.2.11:50010
> java.io.IOException: Bad connect ack with firstBadLink 10.100.2.11:50010
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2963)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2888)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1900(DFSClient.java:2139)
>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2329)
> Thu Jun  2 13:31:07 CDT 2011 Starting regionserver on c1-s06
>
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
> Sent: Thursday, June 02, 2011 2:25 PM
> To: user@hbase.apache.org
> Subject: Re: wrong region exception
>
> So, cluster is OK after the below crash?  Regions come up fine on new servers and .META. is fine?
>
> Below is interesting in that we failed a split because we could not write an edit to .META. (how many handlers are you running with?
> And what is going on on the .META. server at around this time?  Are
> you on 0.90.3 hbase?).  If we fail a split, we'll crash out the regionserver.  The recovery of the crashed regionserver should fix up the failed split so there are no holes in .META.  If this fixup did not run properly, this might be a cause of WrongRegionException.
>
> St.Ack
>
> On Thu, Jun 2, 2011 at 11:34 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>> And more info.  The copy dies on a regionserver failure.  Here is the exception when it dies:
>>
>> 2011-06-02 13:29:07,546 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=c1-s06.atxd.maxpointinteractive.com,60020,1306860799744, load=(requests=0, regions=136, usedHeap=307, maxHeap=2999): Failed open of daughter urlhashv4,837743DFAE34D105BB5B1E81810627B8,1307039228513.d1c0eb0aef559f349fdaca3452f55c10.
>> java.net.SocketTimeoutException: Call to c1-s19.atxd.maxpointinteractive.com/10.100.1.26:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.7:41085 remote=c1-s19.atxd.maxpointinteractive.com/10.100.1.26:60020]
>>        at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:784)
>>        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:757)
>>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>>        at $Proxy8.put(Unknown Source)
>>        at org.apache.hadoop.hbase.catalog.MetaEditor.addDaughter(MetaEditor.java:97)
>>        at org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:1350)
>>        at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:328)
>>        at org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:296)
>> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.7:41085 remote=c1-s19.atxd.maxpointinteractive.com/10.100.1.26:60020]
>>        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>>        at java.io.FilterInputStream.read(FilterInputStream.java:116)
>>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:281)
>>        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>> -
>>
>> -----Original Message-----
>> From: Robert Gonzalez [mailto:Robert.Gonzalez@maxpointinteractive.com]
>> Sent: Thursday, June 02, 2011 12:29 PM
>> To: 'user@hbase.apache.org'
>> Subject: RE: wrong region exception
>>
>> Here's another clue:  the process is taking up lots of cpu time, like it's in some kind of loop, but the output indicates that it's stuck on the same section.
>>
>> Robert
>>
>>
>> -----Original Message-----
>> From: Robert Gonzalez [mailto:Robert.Gonzalez@maxpointinteractive.com]
>> Sent: Thursday, June 02, 2011 12:07 PM
>> To: 'user@hbase.apache.org'
>> Subject: RE: wrong region exception
>>
>> Also, notice the output of my copy, where it's now stuck on the final line.  The first column is the number of rows, the second column is the key value:
>> total:904600000 7FECD7A2D11FFD850FDC7CA899CA3138
>> total:904700000 7FF0787C8EC28FF760BF0E38BB1F95C8
>> total:904800000 7FF418DDFCB134EFA7F1304762EA4A20
>> total:904900000 7FF7B7BC506DC77272DC9CBAE27DDD2D
>> total:905000000 7FFB5E24CC30B1FF8A9AE73068EFDB0B
>> total:905100000 7FFF0085ECE908C208BA083A99C05E42
>> total:905200000 8002A1540C309D99F587DAA712167091
>> total:905300000 800644A00B083B496A07B0633A51B528
>> total:905400000 8009E8A2EAC96846405476D294FDD999
>> total:905500000 800D8E0D6DB9259F16775B1080AE6968
>> total:905600000 80112E5D7E36AFB915DE4906BFF9F41C
>>
>> But in the hbase web page for the table urlhashv4 (the one we are copying into) it only got this far.
>>
>> urlhashv4,7FC19684831DF6E8ACCE0E690EF5BCAB,1306993100934.63e52326bc86f332a2f38056d934cbf3.       c1-s06.atxd.maxpointinteractive.com:60030       7FC19684831DF6E8ACCE0E690EF5BCAB        7FD3A81AD94CD99BEA6B4DA485BDDEBE
>> urlhashv4,7FD3A81AD94CD99BEA6B4DA485BDDEBE,1306993171155.0296cf0214b2d7dfe8ad8adac3ad7bf5.      c1-s06.atxd.maxpointinteractive.com:60030       7FD3A81AD94CD99BEA6B4DA485BDDEBE        7FE65457BB3A9492B6A0437124D6F5C7
>> urlhashv4,7FE65457BB3A9492B6A0437124D6F5C7,1306993227676.c038dcb619eabbd6d862634b83ba412e.      c1-s06.atxd.maxpointinteractive.com:60030       7FE65457BB3A9492B6A0437124D6F5C7        7FF5A0C88779F27177F1E5E8159680BE
>> urlhashv4,7FF5A0C88779F27177F1E5E8159680BE,1306993227676.5348549ea1080dca60d2e043da973258.      c1-s06.atxd.maxpointinteractive.com:60030       7FF5A0C88779F27177F1E5E8159680BE
>>
>>
>> That, in conjunction with the messages on the slave that is trying to insert the data, indicates to me that it's about to get into the same wrong region exception situation again.
>>
>>
>> -----Original Message-----
>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
>> Stack
>> Sent: Wednesday, June 01, 2011 5:34 PM
>> To: user@hbase.apache.org
>> Subject: Re: wrong region exception
>>
>> We can't reach the server carrying .META. within 60 seconds.  What's going on on that server?  The next time the CatalogJanitor below runs, does it succeed, or does it always fail?
>>
>> St.Ack
>>
>> On Wed, Jun 1, 2011 at 2:27 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>>> This is basically it (for the first time it died while copying), we have it at warn level and above:
>>>
>>> 2011-05-27 16:44:27,565 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Failed scan of catalog table
>>> java.net.SocketTimeoutException: Call to /10.100.2.6:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.39:37717 remote=/10.100.2.6:60020]
>>>        at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:784)
>>>        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:757)
>>>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>>>        at $Proxy6.delete(Unknown Source)
>>>        at org.apache.hadoop.hbase.catalog.MetaEditor.deleteDaughterReferenceInParent(MetaEditor.java:201)
>>>        at org.apache.hadoop.hbase.master.CatalogJanitor.removeDaughterFromParent(CatalogJanitor.java:233)
>>>        at org.apache.hadoop.hbase.master.CatalogJanitor.hasReferences(CatalogJanitor.java:275)
>>>        at org.apache.hadoop.hbase.master.CatalogJanitor.checkDaughter(CatalogJanitor.java:202)
>>>        at org.apache.hadoop.hbase.master.CatalogJanitor.cleanParent(CatalogJanitor.java:166)
>>>        at org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:120)
>>>        at org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:85)
>>>        at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
>>> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.39:37717 remote=/10.100.2.6:60020]
>>>        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>>>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>>>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>>>        at java.io.FilterInputStream.read(FilterInputStream.java:116)
>>>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:281)
>>>        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>>        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>>>        at java.io.DataInputStream.readInt(DataInputStream.java:370)
>>>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:521)
>>>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:459)
>>>
>>> -----Original Message-----
>>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
>>> Stack
>>> Sent: Wednesday, June 01, 2011 12:29 PM
>>> To: user@hbase.apache.org
>>> Subject: Re: wrong region exception
>>>
>>> On Wed, Jun 1, 2011 at 7:32 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>>>> We have a table copy program that copies the data from one table to another, and we can give it the start/end keys.  In this case we created a new blank table with the essential column families and let it run with start/end to be the whole range, 0-maxkey.  At about 30% of the way through, which is roughly 600 million rows, it died trying to write to the new table with the wrong region exception.  When we tried to restart the copy from that key + some delta, it still crapped out.  No explanation in the logs the first time, but a series of timeouts in the second run.  Now we are trying the copy again with a new table.
>>>>
>>>
>>> Robert:
>>>
>>> Do you have the master logs for this copy run still?  If so, if you
>>> put them somewhere where I can pull them (or send them to me, I'll
>>> take a look).   I'd like to see the logs in the cluster to which you were copying the data.
>>>
>>> St.Ack
>>>
>>>
>>>> -----Original Message-----
>>>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
>>>> Stack
>>>> Sent: Tuesday, May 31, 2011 6:42 PM
>>>> To: user@hbase.apache.org
>>>> Subject: Re: wrong region exception
>>>>
>>>> So, what about this new WrongRegionException in the new cluster.  Can you figure how it came about?  In the new cluster, is there also a hole?  Did you start the new cluster fresh or copy from old cluster?
>>>>
>>>> St.Ack
>>>>
>>>> On Tue, May 31, 2011 at 1:55 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>>>>> Yeah, we learned the hard way early last year to follow the guidelines religiously.  I've gone over the requirements and checked off everything.  We even re-did our tables to only have 4 column families, down from 4x that amount.   We are at a loss to find out why we seem to be cursed when it comes to HBase.  Hadoop is performing like a charm, pretty much every machine is busy 24/7.
>>>>>
>>>>> -----Original Message-----
>>>>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
>>>>> Stack
>>>>> Sent: Tuesday, May 31, 2011 3:03 PM
>>>>> To: user@hbase.apache.org
>>>>> Subject: Re: wrong region exception
>>>>>
>>>>> On Tue, May 31, 2011 at 10:42 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>>>>>> Now I'm getting the wrong region exception on the new table that I'm copying the old table to.  Running hbck reveals an inconsistency in the new table.  The frustration is unbelievable.  Like I said before, it doesn't appear that HBase is ready for prime time.  I don't know how companies are using this successfully; it doesn't appear plausible.
>>>>>>
>>>>>
>>>>>
>>>>> Sorry you are not having a good experience.  I've not seen WrongRegionException in ages (Grep these lists yourself).  Makes me suspect your environment.  For sure you've read the requirements section in the manual and set ulimits, nprocs, and xceivers up?
>>>>>
>>>>> St.Ack
>>>>>
>>>>
>>>
>>
>

RE: wrong region exception

Posted by Robert Gonzalez <Ro...@maxpointinteractive.com>.
First, a clarification: everything is happening within a single cluster of 55 machines; there is no inter-cluster copying.  I restarted c1-s06, the regionserver that died, and by the way, all new data seems to be going to this server first.  Is there a reason for this?  From the beginning of the copy until it crashed, c1-s06 always served the latest keys, no other server did.  After I restarted c1-s06, it keeps dying.  Here is one of the crashes:

2011-06-02 13:29:07,546 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=c1-s06.atxd.maxpointinteractive.com,60020,1306860799744, load=(requests=0, regions=136, usedHeap=307, maxHeap=2999): Failed open of daughter urlhashv4,837743DFAE34D105BB5B1E81810627B8,1307039228513.d1c0eb0aef559f349fdaca3452f55c10.
java.net.SocketTimeoutException: Call to c1-s19.atxd.maxpointinteractive.com/10.100.1.26:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.7:41085 remote=c1-s19.atxd.maxpointinteractive.com/10.100.1.26:60020]
        at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:784)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:757)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
        at $Proxy8.put(Unknown Source)
        at org.apache.hadoop.hbase.catalog.MetaEditor.addDaughter(MetaEditor.java:97)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:1350)
        at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:328)
        at org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:296)
Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.7:41085 remote=c1-s19.atxd.maxpointinteractive.com/10.100.1.26:60020]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:281)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at java.io.DataInputStream.readInt(DataInputStream.java:370)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:521)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:459)
2011-06-02 13:29:07,551 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=c1-s06.atxd.maxpointinteractive.com,60020,1306860799744, load=(requests=0, regions=136, usedHeap=307, maxHeap=2999): Failed open of daughter urlhashv4,836A60E0F78046975AD00B84CC0B71FB,1307039228513.be6575478b7f4c22d6540a60c7c45fbb.
java.net.SocketTimeoutException: Call to c1-s19.atxd.maxpointinteractive.com/10.100.1.26:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.7:41085 remote=c1-s19.atxd.maxpointinteractive.com/10.100.1.26:60020]
        at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:784)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:757)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
        at $Proxy8.put(Unknown Source)
        at org.apache.hadoop.hbase.catalog.MetaEditor.addDaughter(MetaEditor.java:97)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:1350)
        at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:328)
        at org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:296)
Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.7:41085 remote=c1-s19.atxd.maxpointinteractive.com/10.100.1.26:60020]
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.FilterInputStream.read(FilterInputStream.java:116)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:281)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        at java.io.DataInputStream.readInt(DataInputStream.java:370)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:521)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:459)




.....



2011-06-02 13:29:09,802 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/urlhashv4/2f21ad9ef74c6e6d69259e9525f7b863/splits/0cf36594d8e349830d3b4f175b5b688a/crawl/1293482649614989303.2f21ad9ef74c6e6d69259e9525f7b863 : java.io.IOException: Error Recovery for block blk_3518061384037278533_81210851 failed  because recovery from primary datanode 10.100.2.27:50010 failed 6 times.  Pipeline was 10.100.2.27:50010. Aborting...
java.io.IOException: Error Recovery for block blk_3518061384037278533_81210851 failed  because recovery from primary datanode 10.100.2.27:50010 failed 6 times.  Pipeline was 10.100.2.27:50010. Aborting...
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2668)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2139)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2306)
2011-06-02 13:29:09,819 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/urlhashv4/2f21ad9ef74c6e6d69259e9525f7b863/splits/0cf36594d8e349830d3b4f175b5b688a/flags/9107094927628682840.2f21ad9ef74c6e6d69259e9525f7b863 : java.io.IOException: Error Recovery for block blk_-2414361613802877611_81210865 failed  because recovery from primary datanode 10.100.2.9:50010 failed 6 times.  Pipeline was 10.100.2.9:50010. Aborting...
java.io.IOException: Error Recovery for block blk_-2414361613802877611_81210865 failed  because recovery from primary datanode 10.100.2.9:50010 failed 6 times.  Pipeline was 10.100.2.9:50010. Aborting...
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2668)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2139)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2306)
2011-06-02 13:29:09,820 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/urlhashv4/2f21ad9ef74c6e6d69259e9525f7b863/splits/0cf36594d8e349830d3b4f175b5b688a/url/6599763290219995724.2f21ad9ef74c6e6d69259e9525f7b863 : java.io.IOException: Error Recovery for block blk_798371171266377251_81210865 failed  because recovery from primary datanode 10.100.2.3:50010 failed 6 times.  Pipeline was 10.100.2.3:50010. Aborting...
java.io.IOException: Error Recovery for block blk_798371171266377251_81210865 failed  because recovery from primary datanode 10.100.2.3:50010 failed 6 times.  Pipeline was 10.100.2.3:50010. Aborting...
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2668)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2139)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2306)
2011-06-02 13:29:09,820 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/urlhashv4/2f21ad9ef74c6e6d69259e9525f7b863/splits/1e83e5a02f616d7437f10a6a636e2686/thumbs/2894023573319516345.2f21ad9ef74c6e6d69259e9525f7b863 : java.io.IOException: Bad connect ack with firstBadLink 10.100.2.11:50010
java.io.IOException: Bad connect ack with firstBadLink 10.100.2.11:50010
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2963)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2888)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1900(DFSClient.java:2139)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2329)
Thu Jun  2 13:31:07 CDT 2011 Starting regionserver on c1-s06


-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Thursday, June 02, 2011 2:25 PM
To: user@hbase.apache.org
Subject: Re: wrong region exception

So, cluster is OK after the below crash?  Regions come up fine on new servers and .META. is fine?

Below is interesting in that we failed a split because we could not write an edit to .META. (how many handlers are you running with?
And what is going on on the .META. server at around this time?  Are
you on 0.90.3 hbase?).  If we fail a split, we'll crash out the regionserver.  The recovery of the crashed regionserver should fix up the failed split so there are no holes in .META.  If this fixup did not run properly, this might be a cause of WrongRegionException.

St.Ack

On Thu, Jun 2, 2011 at 11:34 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
> And more info.  The copy dies on a regionserver failure.  Here is the exception when it dies:
>
> 2011-06-02 13:29:07,546 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=c1-s06.atxd.maxpointinteractive.com,60020,1306860799744, load=(requests=0, regions=136, usedHeap=307, maxHeap=2999): Failed open of daughter urlhashv4,837743DFAE34D105BB5B1E81810627B8,1307039228513.d1c0eb0aef559f349fdaca3452f55c10.
> java.net.SocketTimeoutException: Call to c1-s19.atxd.maxpointinteractive.com/10.100.1.26:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.7:41085 remote=c1-s19.atxd.maxpointinteractive.com/10.100.1.26:60020]
>        at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:784)
>        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:757)
>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>        at $Proxy8.put(Unknown Source)
>        at org.apache.hadoop.hbase.catalog.MetaEditor.addDaughter(MetaEditor.java:97)
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:1350)
>        at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:328)
>        at org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:296)
> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.7:41085 remote=c1-s19.atxd.maxpointinteractive.com/10.100.1.26:60020]
>        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>        at java.io.FilterInputStream.read(FilterInputStream.java:116)
>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:281)
>        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> -
>
> -----Original Message-----
> From: Robert Gonzalez [mailto:Robert.Gonzalez@maxpointinteractive.com]
> Sent: Thursday, June 02, 2011 12:29 PM
> To: 'user@hbase.apache.org'
> Subject: RE: wrong region exception
>
> Here's another clue:  the process is taking up lots of cpu time, like it's in some kind of loop, but the output indicates that it's stuck on the same section.
>
> Robert
>
>
> -----Original Message-----
> From: Robert Gonzalez [mailto:Robert.Gonzalez@maxpointinteractive.com]
> Sent: Thursday, June 02, 2011 12:07 PM
> To: 'user@hbase.apache.org'
> Subject: RE: wrong region exception
>
> Also, notice the output of my copy, where it's now stuck on the final line.  The first column is the number of rows, the second column is the key value:
> total:904600000 7FECD7A2D11FFD850FDC7CA899CA3138
> total:904700000 7FF0787C8EC28FF760BF0E38BB1F95C8
> total:904800000 7FF418DDFCB134EFA7F1304762EA4A20
> total:904900000 7FF7B7BC506DC77272DC9CBAE27DDD2D
> total:905000000 7FFB5E24CC30B1FF8A9AE73068EFDB0B
> total:905100000 7FFF0085ECE908C208BA083A99C05E42
> total:905200000 8002A1540C309D99F587DAA712167091
> total:905300000 800644A00B083B496A07B0633A51B528
> total:905400000 8009E8A2EAC96846405476D294FDD999
> total:905500000 800D8E0D6DB9259F16775B1080AE6968
> total:905600000 80112E5D7E36AFB915DE4906BFF9F41C
>
> But in the hbase web page for the table urlhashv4 (the one we are copying into) it only got this far.
>
> urlhashv4,7FC19684831DF6E8ACCE0E690EF5BCAB,1306993100934.63e52326bc86f332a2f38056d934cbf3.       c1-s06.atxd.maxpointinteractive.com:60030       7FC19684831DF6E8ACCE0E690EF5BCAB        7FD3A81AD94CD99BEA6B4DA485BDDEBE
> urlhashv4,7FD3A81AD94CD99BEA6B4DA485BDDEBE,1306993171155.0296cf0214b2d7dfe8ad8adac3ad7bf5.      c1-s06.atxd.maxpointinteractive.com:60030       7FD3A81AD94CD99BEA6B4DA485BDDEBE        7FE65457BB3A9492B6A0437124D6F5C7
> urlhashv4,7FE65457BB3A9492B6A0437124D6F5C7,1306993227676.c038dcb619eabbd6d862634b83ba412e.      c1-s06.atxd.maxpointinteractive.com:60030       7FE65457BB3A9492B6A0437124D6F5C7        7FF5A0C88779F27177F1E5E8159680BE
> urlhashv4,7FF5A0C88779F27177F1E5E8159680BE,1306993227676.5348549ea1080dca60d2e043da973258.      c1-s06.atxd.maxpointinteractive.com:60030       7FF5A0C88779F27177F1E5E8159680BE
>
>
> That, in conjunction with the messages on the slave that is trying to insert the data, indicates to me that it's about to get into the same wrong region exception situation again.
>
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
> Stack
> Sent: Wednesday, June 01, 2011 5:34 PM
> To: user@hbase.apache.org
> Subject: Re: wrong region exception
>
> We can't reach the server carrying .META. within 60 seconds.  What's going on on that server?  The next time the CatalogJanitor below runs, does it succeed, or does it always fail?
>
> St.Ack
>
> On Wed, Jun 1, 2011 at 2:27 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>> This is basically it (for the first time it died while copying), we have it at warn level and above:
>>
>> 2011-05-27 16:44:27,565 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Failed scan of catalog table
>> java.net.SocketTimeoutException: Call to /10.100.2.6:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.39:37717 remote=/10.100.2.6:60020]
>>        at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:784)
>>        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:757)
>>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>>        at $Proxy6.delete(Unknown Source)
>>        at org.apache.hadoop.hbase.catalog.MetaEditor.deleteDaughterReferenceInParent(MetaEditor.java:201)
>>        at org.apache.hadoop.hbase.master.CatalogJanitor.removeDaughterFromParent(CatalogJanitor.java:233)
>>        at org.apache.hadoop.hbase.master.CatalogJanitor.hasReferences(CatalogJanitor.java:275)
>>        at org.apache.hadoop.hbase.master.CatalogJanitor.checkDaughter(CatalogJanitor.java:202)
>>        at org.apache.hadoop.hbase.master.CatalogJanitor.cleanParent(CatalogJanitor.java:166)
>>        at org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:120)
>>        at org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:85)
>>        at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
>> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.39:37717 remote=/10.100.2.6:60020]
>>        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>>        at java.io.FilterInputStream.read(FilterInputStream.java:116)
>>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:281)
>>        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>>        at java.io.DataInputStream.readInt(DataInputStream.java:370)
>>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:521)
>>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:459)
>>
>> -----Original Message-----
>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
>> Stack
>> Sent: Wednesday, June 01, 2011 12:29 PM
>> To: user@hbase.apache.org
>> Subject: Re: wrong region exception
>>
>> On Wed, Jun 1, 2011 at 7:32 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>>> We have a table copy program that copies the data from one table to another, and we can give it the start/end keys.  In this case we created a new blank table with the essential column families and let it run with start/end to be the whole range, 0-maxkey.  At about 30% of the way through, which is roughly 600 million rows, it died trying to write to the new table with the wrong region exception.  When we tried to restart the copy from that key + some delta, it still crapped out.  No explanation in the logs the first time, but a series of timeouts in the second run.  Now we are trying the copy again with a new table.
>>>
>>
>> Robert:
>>
>> Do you have the master logs for this copy run still?  If so, if you 
>> put them somewhere where I can pull them (or send them to me, I'll 
>> take a look).   I'd like to see the logs in the cluster to which you were copying the data.
>>
>> St.Ack
>>
>>
>>> -----Original Message-----
>>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
>>> Stack
>>> Sent: Tuesday, May 31, 2011 6:42 PM
>>> To: user@hbase.apache.org
>>> Subject: Re: wrong region exception
>>>
>>> So, what about this new WrongRegionException in the new cluster.  Can you figure how it came about?  In the new cluster, is there also a hole?  Did you start the new cluster fresh or copy from old cluster?
>>>
>>> St.Ack
>>>
>>> On Tue, May 31, 2011 at 1:55 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>>>> Yeah, we learned the hard way early last year to follow the guidelines religiously.  I've gone over the requirements and checked off everything.  We even re-did our tables to only have 4 column families, down from 4x that amount.   We are at a loss to find out why we seem to be cursed when it comes to HBase.  Hadoop is performing like a charm, pretty much every machine is busy 24/7.
>>>>
>>>> -----Original Message-----
>>>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
>>>> Stack
>>>> Sent: Tuesday, May 31, 2011 3:03 PM
>>>> To: user@hbase.apache.org
>>>> Subject: Re: wrong region exception
>>>>
>>>> On Tue, May 31, 2011 at 10:42 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>>>>> Now I'm getting the wrong region exception on the new table that I'm copying the old table to.  Running hbck reveals an inconsistency in the new table.  The frustration is unbelievable.  Like I said before, it doesn't appear that HBase is ready for prime time.  I don't know how companies are using this successfully; it doesn't appear plausible.
>>>>>
>>>>
>>>>
>>>> Sorry you are not having a good experience.  I've not seen WrongRegionException in ages (Grep these lists yourself).  Makes me suspect your environment.  For sure you've read the requirements section in the manual and set ulimits, nprocs, and xceivers up?
>>>>
>>>> St.Ack
>>>>
>>>
>>
>

Re: wrong region exception

Posted by Stack <st...@duboce.net>.
So, the cluster is OK after the below crash?  Regions come up fine on new
servers and .META. is fine?

The below is interesting in that we failed a split because we could not
write an edit to .META. (how many handlers are you running with?
And what is going on on the .META. server at around this time?  Are
you on 0.90.3 hbase?).  If we fail a split, we'll crash out the
regionserver.  Recovery of the crashed regionserver should fix up the
failed split so there are no holes in .META.  If that fixup did not run
properly, it might be the cause of the WrongRegionException.
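
For reference, a hole like this shows up directly in .META.: each region's start key should equal the previous region's end key. Below is a rough sketch that walks the chain much the way hbck does, written against the 0.90-era client API (the class name and output format are made up; error handling is omitted):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Writables;

public class MetaHoleCheck {
  public static void main(String[] args) throws Exception {
    String table = args[0];  // e.g. "urlhashv4"
    Configuration conf = HBaseConfiguration.create();
    HTable meta = new HTable(conf, HConstants.META_TABLE_NAME);
    // .META. rows for one table sort together; start scanning at "table,,"
    Scan scan = new Scan(Bytes.toBytes(table + ",,"));
    scan.addColumn(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER);
    ResultScanner rs = meta.getScanner(scan);
    byte[] expected = HConstants.EMPTY_BYTE_ARRAY;  // first region starts at the empty key
    for (Result r : rs) {
      byte[] bytes = r.getValue(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER);
      if (bytes == null) continue;
      HRegionInfo info = Writables.getHRegionInfo(bytes);
      if (!info.getRegionNameAsString().startsWith(table + ",")) break;  // past our table
      if (info.isOffline() || info.isSplit()) continue;  // split parent, not serving
      if (!Bytes.equals(expected, info.getStartKey())) {
        System.out.println("HOLE before " + info.getRegionNameAsString()
            + ": expected start " + Bytes.toStringBinary(expected)
            + ", found " + Bytes.toStringBinary(info.getStartKey()));
      }
      expected = info.getEndKey();
    }
    rs.close();
    meta.close();
  }
}

Any line it prints marks a gap of the kind hbck complains about.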

St.Ack

On Thu, Jun 2, 2011 at 11:34 AM, Robert Gonzalez
<Ro...@maxpointinteractive.com> wrote:
> And more info.  The copy dies on a regionserver failure.  Here is the exception when it dies:
>
> 2011-06-02 13:29:07,546 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=c1-s06.atxd.maxpointinteractive.com,60020,1306860799744, load=(requests=0, regions=136, usedHeap=307, maxHeap=2999): Failed open of daughter urlhashv4,837743DFAE34D105BB5B1E81810627B8,1307039228513.d1c0eb0aef559f349fdaca3452f55c10.
> java.net.SocketTimeoutException: Call to c1-s19.atxd.maxpointinteractive.com/10.100.1.26:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.7:41085 remote=c1-s19.atxd.maxpointinteractive.com/10.100.1.26:60020]
>        at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:784)
>        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:757)
>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>        at $Proxy8.put(Unknown Source)
>        at org.apache.hadoop.hbase.catalog.MetaEditor.addDaughter(MetaEditor.java:97)
>        at org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:1350)
>        at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:328)
>        at org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:296)
> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.7:41085 remote=c1-s19.atxd.maxpointinteractive.com/10.100.1.26:60020]
>        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>        at java.io.FilterInputStream.read(FilterInputStream.java:116)
>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:281)
>        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> -

RE: wrong region exception

Posted by Robert Gonzalez <Ro...@maxpointinteractive.com>.
And more info.  The copy dies on a regionserver failure.  Here is the exception when it dies:

2011-06-02 13:29:07,546 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=c1-s06.atxd.maxpointinteractive.com,60020,1306860799744, load=(requests=0, regions=136, usedHeap=307, maxHeap=2999): Failed open of daughter urlhashv4,837743DFAE34D105BB5B1E81810627B8,1307039228513.d1c0eb0aef559f349fdaca3452f55c10.
java.net.SocketTimeoutException: Call to c1-s19.atxd.maxpointinteractive.com/10.100.1.26:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.7:41085 remote=c1-s19.atxd.maxpointinteractive.com/10.100.1.26:60020]
	at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:784)
	at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:757)
	at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
	at $Proxy8.put(Unknown Source)
	at org.apache.hadoop.hbase.catalog.MetaEditor.addDaughter(MetaEditor.java:97)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.postOpenDeployTasks(HRegionServer.java:1350)
	at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:328)
	at org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:296)
Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.7:41085 remote=c1-s19.atxd.maxpointinteractive.com/10.100.1.26:60020]
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
	at java.io.FilterInputStream.read(FilterInputStream.java:116)
	at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:281)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
-


RE: wrong region exception

Posted by Robert Gonzalez <Ro...@maxpointinteractive.com>.
Here's another clue:  the process is taking up lots of CPU time, like it's in some kind of loop, but the output indicates that it's stuck on the same section.

Robert



RE: wrong region exception

Posted by Robert Gonzalez <Ro...@maxpointinteractive.com>.
Ok, I think I know why it gets stuck there.  That's where the hole is in the original table.  I skipped past the hole and it is off and running again.  Knock on wood!

Robert
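
Skipping past the hole, as described above, amounts to restarting the copy from the smallest key that sorts after the wedged one. A rough sketch of that driver logic, illustrative only: copyRange() is an assumed helper that scans [start, stop) and puts each row into the destination, and a real driver would resume from the last row it actually wrote rather than from the range start:

import org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException;
import org.apache.hadoop.hbase.util.Bytes;

public class SkipPastHole {
  // Smallest same-length key sorting strictly after k: plus-one with carry.
  static byte[] increment(byte[] k) {
    byte[] next = k.clone();
    for (int i = next.length - 1; i >= 0 && ++next[i] == 0; i--) {
      // byte overflowed back to zero; carry into the next byte to the left
    }
    return next;
  }

  static void copyWithSkips(byte[] start, byte[] stop) throws Exception {
    byte[] cursor = start;
    while (Bytes.compareTo(cursor, stop) < 0) {
      try {
        copyRange(cursor, stop);  // assumed helper: scan [cursor, stop), put rows
        return;                   // reached stop cleanly
      } catch (RetriesExhaustedWithDetailsException e) {
        System.err.println("Wedged at " + Bytes.toStringBinary(cursor)
            + ", skipping forward: " + e.getMessage());
        cursor = increment(cursor);  // hop just past the bad key and retry
      }
    }
  }

  static void copyRange(byte[] start, byte[] stop) throws Exception {
    // placeholder; see the scan-and-put sketch near the end of the thread
  }
}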



RE: wrong region exception

Posted by Robert Gonzalez <Ro...@maxpointinteractive.com>.
Also, notice the output of my copy, where it's now stuck on the final line.  The first column is the number of rows, the second is the key value:
total:904600000	7FECD7A2D11FFD850FDC7CA899CA3138
total:904700000	7FF0787C8EC28FF760BF0E38BB1F95C8
total:904800000	7FF418DDFCB134EFA7F1304762EA4A20
total:904900000	7FF7B7BC506DC77272DC9CBAE27DDD2D
total:905000000	7FFB5E24CC30B1FF8A9AE73068EFDB0B
total:905100000	7FFF0085ECE908C208BA083A99C05E42
total:905200000	8002A1540C309D99F587DAA712167091
total:905300000	800644A00B083B496A07B0633A51B528
total:905400000	8009E8A2EAC96846405476D294FDD999
total:905500000	800D8E0D6DB9259F16775B1080AE6968
total:905600000	80112E5D7E36AFB915DE4906BFF9F41C

But in the HBase web page for the table urlhashv4 (the one we are copying into), it only got this far (region, server, start key, end key):

urlhashv4,7FC19684831DF6E8ACCE0E690EF5BCAB,1306993100934.63e52326bc86f332a2f38056d934cbf3.	c1-s06.atxd.maxpointinteractive.com:60030	7FC19684831DF6E8ACCE0E690EF5BCAB	7FD3A81AD94CD99BEA6B4DA485BDDEBE
urlhashv4,7FD3A81AD94CD99BEA6B4DA485BDDEBE,1306993171155.0296cf0214b2d7dfe8ad8adac3ad7bf5.	c1-s06.atxd.maxpointinteractive.com:60030	7FD3A81AD94CD99BEA6B4DA485BDDEBE	7FE65457BB3A9492B6A0437124D6F5C7
urlhashv4,7FE65457BB3A9492B6A0437124D6F5C7,1306993227676.c038dcb619eabbd6d862634b83ba412e.	c1-s06.atxd.maxpointinteractive.com:60030	7FE65457BB3A9492B6A0437124D6F5C7	7FF5A0C88779F27177F1E5E8159680BE
urlhashv4,7FF5A0C88779F27177F1E5E8159680BE,1306993227676.5348549ea1080dca60d2e043da973258.	c1-s06.atxd.maxpointinteractive.com:60030	7FF5A0C88779F27177F1E5E8159680BE	(empty: end of table)


That, in conjunction with the messages on the slave that is trying to insert the data, indicates to me that it's about to get into the same wrong region exception situation again.
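
One way to narrow this down is to ask the client which region it thinks should host the stuck key. A hedged snippet, assuming the row keys are the plain strings printed above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class WhereIsKey {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable t = new HTable(conf, "urlhashv4");
    // The key the copy is wedged on, taken from the output above.
    HRegionLocation loc = t.getRegionLocation(Bytes.toBytes("80112E5D7E36AFB915DE4906BFF9F41C"));
    System.out.println(loc.getRegionInfo().getRegionNameAsString()
        + " on " + loc.getServerAddress().getHostname());
    t.close();
  }
}

If the key falls into a hole, this either fails outright or returns a region whose start/end keys do not actually bracket the key.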



Re: wrong region exception

Posted by Stack <st...@duboce.net>.
We can't reach the server carrying .META. within 60 seconds.  What's
going on on that server?  The next time the below CatalogJanitor runs,
does it succeed, or does it always fail?

St.Ack

On Wed, Jun 1, 2011 at 2:27 PM, Robert Gonzalez
<Ro...@maxpointinteractive.com> wrote:
> This is basically it (from the first time it died while copying); we have logging at WARN level and above:
>
> 2011-05-27 16:44:27,565 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Failed scan of catalog table
> java.net.SocketTimeoutException: Call to /10.100.2.6:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.39:37717 remote=/10.100.2.6:60020]
>        at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:784)
>        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:757)
>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
>        at $Proxy6.delete(Unknown Source)
>        at org.apache.hadoop.hbase.catalog.MetaEditor.deleteDaughterReferenceInParent(MetaEditor.java:201)
>        at org.apache.hadoop.hbase.master.CatalogJanitor.removeDaughterFromParent(CatalogJanitor.java:233)
>        at org.apache.hadoop.hbase.master.CatalogJanitor.hasReferences(CatalogJanitor.java:275)
>        at org.apache.hadoop.hbase.master.CatalogJanitor.checkDaughter(CatalogJanitor.java:202)
>        at org.apache.hadoop.hbase.master.CatalogJanitor.cleanParent(CatalogJanitor.java:166)
>        at org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:120)
>        at org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:85)
>        at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.39:37717 remote=/10.100.2.6:60020]
>        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>        at java.io.FilterInputStream.read(FilterInputStream.java:116)
>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:281)
>        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>        at java.io.DataInputStream.readInt(DataInputStream.java:370)
>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:521)
>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:459)

RE: wrong region exception

Posted by Robert Gonzalez <Ro...@maxpointinteractive.com>.
This is basically it (from the first time it died while copying); we have logging at WARN level and above:

2011-05-27 16:44:27,565 WARN org.apache.hadoop.hbase.master.CatalogJanitor: Failed scan of catalog table
java.net.SocketTimeoutException: Call to /10.100.2.6:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.39:37717 remote=/10.100.2.6:60020]
	at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:784)
	at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:757)
	at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
	at $Proxy6.delete(Unknown Source)
	at org.apache.hadoop.hbase.catalog.MetaEditor.deleteDaughterReferenceInParent(MetaEditor.java:201)
	at org.apache.hadoop.hbase.master.CatalogJanitor.removeDaughterFromParent(CatalogJanitor.java:233)
	at org.apache.hadoop.hbase.master.CatalogJanitor.hasReferences(CatalogJanitor.java:275)
	at org.apache.hadoop.hbase.master.CatalogJanitor.checkDaughter(CatalogJanitor.java:202)
	at org.apache.hadoop.hbase.master.CatalogJanitor.cleanParent(CatalogJanitor.java:166)
	at org.apache.hadoop.hbase.master.CatalogJanitor.scan(CatalogJanitor.java:120)
	at org.apache.hadoop.hbase.master.CatalogJanitor.chore(CatalogJanitor.java:85)
	at org.apache.hadoop.hbase.Chore.run(Chore.java:66)
Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.100.1.39:37717 remote=/10.100.2.6:60020]
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
	at java.io.FilterInputStream.read(FilterInputStream.java:116)
	at org.apache.hadoop.hbase.ipc.HBaseClient$Connection$PingInputStream.read(HBaseClient.java:281)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
	at java.io.DataInputStream.readInt(DataInputStream.java:370)
	at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.receiveResponse(HBaseClient.java:521)
	at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.run(HBaseClient.java:459)

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Wednesday, June 01, 2011 12:29 PM
To: user@hbase.apache.org
Subject: Re: wrong region exception

On Wed, Jun 1, 2011 at 7:32 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
> We have a table copy program that copies the data from one table to another, and we can give it the start/end keys.  In this case we created a new blank table with the essential column families and let it run with start/end to be the whole range, 0-maxkey.  At about 30% of the way through, which is roughly 600 million rows, it died trying to write to the new table with the wrong region exception.  When we tried to restart the copy from that key + some delta, it still crapped out.  No explanation in the logs the first time, but a series of timeouts in the second run.  Now we are trying the copy again with a new table.
>

Robert:

Do you have the master logs for this copy run still?  If so, if you put them somewhere where I can pull them (or send them to me, I'll
take a look).   I'd like to see the logs in the cluster to which you
were copying the data.

St.Ack


> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
> Stack
> Sent: Tuesday, May 31, 2011 6:42 PM
> To: user@hbase.apache.org
> Subject: Re: wrong region exception
>
> So, what about this new WrongRegionException in the new cluster.  Can you figure how it came about?  In the new cluster, is there also a hole?  Did you start the new cluster fresh or copy from old cluster?
>
> St.Ack
>
> On Tue, May 31, 2011 at 1:55 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>> Yeah, we learned the hard way early last year to follow the guidelines religiously.  I've gone over the requirements and checked off everything.  We even re-did our tables to only have 4 column families, down from 4x that amount.  We are at a loss to find out why we seem to be cursed when it comes to HBase.  Hadoop is performing like a charm, pretty much every machine is busy 24/7.
>>
>> -----Original Message-----
>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
>> Stack
>> Sent: Tuesday, May 31, 2011 3:03 PM
>> To: user@hbase.apache.org
>> Subject: Re: wrong region exception
>>
>> On Tue, May 31, 2011 at 10:42 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>>> Now I'm getting the wrong region exception on the new table that I'm copying the old table to.  Running hbck reveals an inconsistency in the new table.  The frustration is unbelievable.  Like I said before, it doesn't appear that HBase is ready for prime time.  I don't know how companies are using this successfully, it doesn't appear plausible.
>>>
>>
>>
>> Sorry you are not having a good experience.  I've not seen WrongRegionException in ages (Grep these lists yourself).  Makes me suspect your environment.  For sure you've read the requirements section in the manual and set ulimits, nprocs and xceivers up?
>>
>> St.Ack
>>
>

Re: wrong region exception

Posted by Stack <st...@duboce.net>.
On Wed, Jun 1, 2011 at 7:32 AM, Robert Gonzalez
<Ro...@maxpointinteractive.com> wrote:
> We have a table copy program that copies the data from one table to another, and we can give it the start/end keys.  In this case we created a new blank table with the essential column families and let it run with start/end to be the whole range, 0-maxkey.  At about 30% of the way through, which is roughly 600 million rows, it died trying to write to the new table with the wrong region exception.  When we tried to restart the copy from that key + some delta, it still crapped out.  No explanation in the logs the first time, but a series of timeouts in the second run.  Now we are trying the copy again with a new table.
>

Robert:

Do you have the master logs for this copy run still?  If so, if you
put them somewhere where I can pull them (or send them to me, I'll
take a look).   I'd like to see the logs in the cluster to which you
were copying the data.

St.Ack


> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
> Sent: Tuesday, May 31, 2011 6:42 PM
> To: user@hbase.apache.org
> Subject: Re: wrong region exception
>
> So, what about this new WrongRegionException in the new cluster.  Can you figure how it came about?  In the new cluster, is there also a hole?  Did you start the new cluster fresh or copy from old cluster?
>
> St.Ack
>
> On Tue, May 31, 2011 at 1:55 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>> Yeah, we learned the hard way early last year to follow the guidelines religiously.  I've gone over the requirements and checked off everything.  We even re-did our tables to only have 4 column families, down from 4x that amount.  We are at a loss to find out why we seem to be cursed when it comes to HBase.  Hadoop is performing like a charm, pretty much every machine is busy 24/7.
>>
>> -----Original Message-----
>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
>> Stack
>> Sent: Tuesday, May 31, 2011 3:03 PM
>> To: user@hbase.apache.org
>> Subject: Re: wrong region exception
>>
>> On Tue, May 31, 2011 at 10:42 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>>> Now I'm getting the wrong region exception on the new table that I'm copying the old table to.  Running hbck reveals an inconsistency in the new table.  The frustration is unbelievable.  Like I said before, it doesn't appear that HBase is ready for prime time.  I don't know how companies are using this successfully, it doesn't appear plausible.
>>>
>>
>>
>> Sorry you are not having a good experience.  I've not seen WrongRegionException in ages (Grep these lists yourself).  Makes me suspect your environment.  For sure you've read the requirements section in the manual and set ulimits, nprocs and xceivers up?
>>
>> St.Ack
>>
>

RE: wrong region exception

Posted by Robert Gonzalez <Ro...@maxpointinteractive.com>.
We have a table copy program that copies the data from one table to another, and we can give it the start/end keys.  In this case we created a new blank table with the essential column families and let it run with start/end to be the whole range, 0-maxkey.  At about 30% of the way through, which is roughly 600 million rows, it died trying to write to the new table with the wrong region exception.  When we tried to restart the copy from that key + some delta, it still crapped out.  No explanation in the logs the first time, but a series of timeouts in the second run.  Now we are trying the copy again with a new table.
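
A minimal sketch of such a copy loop, assuming the 0.90-era client API; the table names are the ones from this thread, the key range comes in on the command line, and error handling is omitted:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TableCopy {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable src = new HTable(conf, "urlhashv2");  // old table
    HTable dst = new HTable(conf, "urlhashv4");  // new, pre-created table
    // Scan only the requested key range: args[0] = start key, args[1] = end key.
    Scan scan = new Scan(Bytes.toBytes(args[0]), Bytes.toBytes(args[1]));
    ResultScanner scanner = src.getScanner(scan);
    try {
      for (Result r : scanner) {
        Put p = new Put(r.getRow());
        for (KeyValue kv : r.raw()) {
          p.add(kv);           // carry each cell over verbatim
        }
        dst.put(p);
      }
    } finally {
      scanner.close();
      dst.close();             // flushes any buffered commits
      src.close();
    }
  }
}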

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Tuesday, May 31, 2011 6:42 PM
To: user@hbase.apache.org
Subject: Re: wrong region exception

So, what about this new WrongRegionException in the new cluster.  Can you figure how it came about?  In the new cluster, is there also a hole?  Did you start the new cluster fresh or copy from old cluster?

St.Ack

On Tue, May 31, 2011 at 1:55 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
> Yeah, we learned the hard way early last year to follow the guidelines religiously.  I've gone over the requirements and checked off everything.  We even re-did our tables to only have 4 column families, down from 4x that amount.  We are at a loss to find out why we seem to be cursed when it comes to HBase.  Hadoop is performing like a charm, pretty much every machine is busy 24/7.
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
> Stack
> Sent: Tuesday, May 31, 2011 3:03 PM
> To: user@hbase.apache.org
> Subject: Re: wrong region exception
>
> On Tue, May 31, 2011 at 10:42 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>> Now I'm getting the wrong region exception on the new table that I'm copying the old table to.  Running hbck reveals an inconsistency in the new table.  The frustration is unbelievable.  Like I said before, it doesn't appear that HBase is ready for prime time.  I don't know how companies are using this successfully, it doesn't appear plausible.
>>
>
>
> Sorry you are not having a good experience.  I've not seen WrongRegionException in ages (Grep these lists yourself).  Makes me suspect your environment.  For sure you've read the requirements section in the manual and set ulimits, nprocs and xceivers up?
>
> St.Ack
>

Re: wrong region exception

Posted by Stack <st...@duboce.net>.
So, what about this new WrongRegionException in the new cluster.  Can
you figure how it came about?  In the new cluster, is there also a
hole?  Did you start the new cluster fresh or copy from old cluster?

St.Ack

On Tue, May 31, 2011 at 1:55 PM, Robert Gonzalez
<Ro...@maxpointinteractive.com> wrote:
> Yeah, we learned the hard way early last year to follow the guidelines religiously.  I've gone over the requirements and checked off everything.  We even re-did our tables to only have 4 column families, down from 4x that amount.  We are at a loss to find out why we seem to be cursed when it comes to HBase.  Hadoop is performing like a charm, pretty much every machine is busy 24/7.
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
> Sent: Tuesday, May 31, 2011 3:03 PM
> To: user@hbase.apache.org
> Subject: Re: wrong region exception
>
> On Tue, May 31, 2011 at 10:42 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>> Now I'm getting the wrong region exception on the new table that I'm copying the old table to.  Running hbck reveals an inconsistency in the new table.  The frustration is unbelievable.  Like I said before, it doesn't appear that HBase is ready for prime time.  I don't know how companies are using this successfully, it doesn't appear plausible.
>>
>
>
> Sorry you are not having a good experience.  I've not seen WrongRegionException in ages (Grep these lists yourself).  Makes me suspect your environment.  For sure you've read the requirements section in the manual and set ulimits, nprocs and xceivers up?
>
> St.Ack
>

RE: wrong region exception

Posted by Robert Gonzalez <Ro...@maxpointinteractive.com>.
Yeah, we learned the hard way early last year to follow the guidelines religiously.  I've gone over the requirements and checked off everything.  We even re-did our tables to only have 4 column families, down from 4x that amount.  We are at a loss to find out why we seem to be cursed when it comes to HBase.  Hadoop is performing like a charm, pretty much every machine is busy 24/7.

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Tuesday, May 31, 2011 3:03 PM
To: user@hbase.apache.org
Subject: Re: wrong region exception

On Tue, May 31, 2011 at 10:42 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
> Now I'm getting the wrong region exception on the new table that I'm copying the old table to.  Running hbck reveals an inconsistency in the new table.  The frustration is unbelievable.  Like I said before, it doesn't appear that HBase is ready for prime time.  I don't know how companies are using this successfully, it doesn't appear plausible.
>


Sorry you are not having a good experience.  I've not seen WrongRegionException in ages (Grep these lists yourself).  Makes me suspect your environment.  For sure you've read the requirements section in the manual and set ulimits, nprocs and xceivers up?

St.Ack

Re: wrong region exception

Posted by Stack <st...@duboce.net>.
On Tue, May 31, 2011 at 10:42 AM, Robert Gonzalez
<Ro...@maxpointinteractive.com> wrote:
> Now I'm getting the wrong region exception on the new table that I'm copying the old table to.  Running hbck reveals an inconsistency in the new table.  The frustration is unbelievable.  Like I said before, it doesn't appear that HBase is ready for prime time.  I don't know how companies are using this successfully, it doesn't appear plausible.
>


Sorry you are not having a good experience.  I've not seen
WrongRegionException in ages (Grep these lists yourself).  Makes me
suspect your environment.  For sure you've read the requirements
section in the manual and set ulimits, nprocs and xceivers up?

St.Ack

RE: wrong region exception

Posted by Robert Gonzalez <Ro...@maxpointinteractive.com>.
Now I'm getting the wrong region exception on the new table that I'm copying the old table to.  Running hbck reveals an inconsistency in the new table.  The frustration is unbelievable.  Like I said before, it doesn't appear that HBase is ready for prime time.  I don't know how companies are using this successfully, it doesn't appear plausible.

Robert


-----Original Message-----
From: Robert Gonzalez [mailto:Robert.Gonzalez@maxpointinteractive.com] 
Sent: Tuesday, May 31, 2011 11:03 AM
To: 'user@hbase.apache.org'
Subject: RE: wrong region exception

I'm trying my "nuclear" option: basically copy the data from the old db to a new one, skipping over bad regions.  The bad news is that it is taking forever.
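
On the "taking forever" part: with a scan-and-put copy, two client-side knobs usually dominate throughput, scanner caching on the read side and the write buffer on the put side. A sketch under those assumptions (the numbers are starting points to tune, not recommendations):

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Scan;

public class CopyTuning {
  // Hypothetical helper: apply the usual throughput knobs to a
  // scan-and-put copy before entering the loop.
  static void tune(Scan scan, HTable dst) throws IOException {
    scan.setCaching(1000);                    // rows fetched per scanner RPC
    dst.setAutoFlush(false);                  // buffer puts client-side
    dst.setWriteBufferSize(8 * 1024 * 1024);  // flush in roughly 8 MB batches
  }
}

With auto-flush off, dst.flushCommits() (or close()) has to run at the end so the tail of the buffer is not lost.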


I get a stack trace just trying to run check_meta.rb:

maxpoint@c1-m02:/usr/lib/hbase/bin$ ./hbase org.jruby.Main check_meta.rb
Writables.java:75:in `org.apache.hadoop.hbase.util.Writables.getWritable': java.lang.NullPointerException: null (NativeException)
	from Writables.java:119:in `org.apache.hadoop.hbase.util.Writables.getHRegionInfo'
	from NativeMethodAccessorImpl.java:-2:in `sun.reflect.NativeMethodAccessorImpl.invoke0'
	from NativeMethodAccessorImpl.java:39:in `sun.reflect.NativeMethodAccessorImpl.invoke'
	from DelegatingMethodAccessorImpl.java:25:in `sun.reflect.DelegatingMethodAccessorImpl.invoke'
	from Method.java:597:in `java.lang.reflect.Method.invoke'
	from JavaMethod.java:196:in `org.jruby.javasupport.JavaMethod.invokeWithExceptionHandling'
	from JavaMethod.java:182:in `org.jruby.javasupport.JavaMethod.invoke_static'
	from JavaClass.java:371:in `org.jruby.javasupport.JavaClass$StaticMethodInvoker.execute'
	 ... 17 levels...
	from Main.java:183:in `org.jruby.Main.runInterpreter'
	from Main.java:120:in `org.jruby.Main.run'
	from Main.java:95:in `org.jruby.Main.main'
Complete Java stackTrace
java.lang.NullPointerException
	at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)
	at org.apache.hadoop.hbase.util.Writables.getHRegionInfo(Writables.java:119)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.jruby.javasupport.JavaMethod.invokeWithExceptionHandling(JavaMethod.java:196)
	at org.jruby.javasupport.JavaMethod.invoke_static(JavaMethod.java:182)
	at org.jruby.javasupport.JavaClass$StaticMethodInvoker.execute(JavaClass.java:371)
	at org.jruby.internal.runtime.methods.SimpleCallbackMethod.call(SimpleCallbackMethod.java:81)
	at org.jruby.evaluator.EvaluationState.callNode(EvaluationState.java:571)
	at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:207)
	at org.jruby.evaluator.EvaluationState.localAsgnNode(EvaluationState.java:1254)
	at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:286)
	at org.jruby.evaluator.EvaluationState.blockNode(EvaluationState.java:533)
	at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:201)
	at org.jruby.evaluator.EvaluationState.whileNode(EvaluationState.java:1793)
	at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:387)
	at org.jruby.evaluator.EvaluationState.blockNode(EvaluationState.java:533)
	at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:201)
	at org.jruby.evaluator.EvaluationState.rootNode(EvaluationState.java:1628)
	at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:356)
	at org.jruby.evaluator.EvaluationState.eval(EvaluationState.java:164)
	at org.jruby.Ruby.eval(Ruby.java:278)
	at org.jruby.Ruby.compileOrFallbackAndRun(Ruby.java:306)
	at org.jruby.Main.runInterpreter(Main.java:238)
	at org.jruby.Main.runInterpreter(Main.java:183)
	at org.jruby.Main.run(Main.java:120)
	at org.jruby.Main.main(Main.java:95)

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Friday, May 27, 2011 12:43 AM
To: user@hbase.apache.org
Subject: Re: wrong region exception

Robert:

Looks like script already exists.  Check bin/check_meta.rb.  If you pass it --fix it should plug the hole.  Read the head of the script for how to run it.

Good luck,
St.Ack

On Thu, May 26, 2011 at 1:06 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
> I sent the meta.txt to your saint.ack@gmail .com account due to the attachment.
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
> Stack
> Sent: Thursday, May 26, 2011 1:35 PM
> To: user@hbase.apache.org
> Subject: Re: wrong region exception
>
> On Thu, May 26, 2011 at 8:06 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>> It looks like an entire region is missing; here is the online table:
>>
>> urlhashv2,7FF1A5BF839C37078083B4F8267008F6,1303028235302.b0d55566fd0e02ae98541618396aa7b1.   c1-s03.atxd.maxpointinteractive.com:60030   7FF1A5BF839C37078083B4F8267008F6   80116D7E506D87ED39EAFFE784B5B590
>> urlhashv2,8031483E0B3B7F587020FCBB764272D9,1305226123483.3ed065ad87f89aece6b994dd31b42b2a.   c1-s33.atxd.maxpointinteractive.com:60030   8031483E0B3B7F587020FCBB764272D9   8041346D0B05617FA4B9152BFE9B18B9
>>
>> One ends at 80116D7E506D87ED39EAFFE784B5B590, but the next one doesn't start there.
>>
>
> So, make sure you actually have a hole.  Dump out your meta table:
>
> echo "scan '.META.'"| ./bin/hbase shell &> /tmp/meta.txt
>
> Then look to ensure that there is a hole between the above regions (compare start and end keys... the end key of one region needs to match the start key of the next).
>
> If indeed a hole, you need to do a little surgery inserting a new missing region (hbck should fix this but it doesn't have the smarts just yet).
>
> Basically, you create a new region with start and end keys to fill the hole, then you insert it into .META. and then assign it.  There are some scripts in our bin directory that do various parts of this.  I'm pretty sure it's beyond all but a few to figure this mess out, so if you do the above footwork and provide a few more details, I'll hack up something for you (and hopefully something generalized to be used by others later, and later to be integrated into hbck).
>
> St.Ack
>

RE: wrong region exception

Posted by Robert Gonzalez <Ro...@maxpointinteractive.com>.
If I write manually, I get this:

hbase(main):005:0> put 'urlhashv4', "7BB16418308C2CB6B8AE56982781A5C7", 'url', "bogus"   

ERROR: org.apache.hadoop.hbase.client.NoServerForRegionException: No server address listed in .META. for region urlhashv4,7BB16418308C2CB6B8AE56982781A5C6,1308776365880.870c76fc43287036d776cca6c4ac6e6f.
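
That error is consistent with the hand-inserted .META. row never having been assigned: the info:regioninfo cell is present, but no server column is. Assuming the 0.90 admin client exposes an assign() overload taking a region name (worth verifying against the jar in use), a sketch of forcing the assignment:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class AssignRegion {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    // args[0] is the full region name as it appears in .META., e.g. the
    // urlhashv4,7BB16418308C2CB6B8AE56982781A5C6,... name from the error above.
    // Assumption: assign(byte[], boolean) exists in this client version.
    admin.assign(Bytes.toBytes(args[0]), true);
  }
}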



-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Wednesday, June 22, 2011 11:42 PM
To: user@hbase.apache.org
Subject: Re: wrong region exception

On Wed, Jun 22, 2011 at 2:23 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
> Stack,
>
> This script does work from one perspective.  It makes an entry in the .META. table with the missing region.  But it does not create a region file for it.  How does one go about doing that?
>

That will be created as soon as you've written enough edits to cause a flush.
St.Ack

RE: wrong region exception

Posted by Robert Gonzalez <Ro...@maxpointinteractive.com>.
It's been writing for a while now.  When I try to read from that range, I get an exception that is caught.  Is this normal?

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Wednesday, June 22, 2011 11:42 PM
To: user@hbase.apache.org
Subject: Re: wrong region exception

On Wed, Jun 22, 2011 at 2:23 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
> Stack,
>
> This script does work from one perspective.  It makes an entry in the .META. table with the missing region.  But it does not create a region file for it.  How does one go about doing that?
>

That will be created as soon as you've written enough edits to cause a flush.
St.Ack

Re: wrong region exception

Posted by Stack <st...@duboce.net>.
On Wed, Jun 22, 2011 at 2:23 PM, Robert Gonzalez
<Ro...@maxpointinteractive.com> wrote:
> Stack,
>
> This script does work from one perspective.  It makes an entry in the .META. table with the missing region.  But it does not create a region file for it.  How does one go about doing that?
>

That will be created as soon as you've written enough edits to cause a flush.
St.Ack
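
Put differently, nothing shows up on disk for the plugged region until a memstore flush runs. If waiting on organic writes is not an option, a flush can be requested through the admin API; a sketch, assuming 0.90's HBaseAdmin.flush():

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class ForceFlush {
  public static void main(String[] args) throws Exception {
    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    // Flush the whole table (a region name can be passed instead) so its
    // memstores are written out and files appear under the region directory.
    admin.flush("urlhashv4"); // table name is a placeholder
  }
}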

RE: wrong region exception

Posted by Robert Gonzalez <Ro...@maxpointinteractive.com>.
Stack,

This script does work from one perspective.  It makes an entry in the .META. table with the missing region.  But it does not create a region file for it.  How does one go about doing that?

Thanks,
Robert

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Tuesday, May 31, 2011 6:39 PM
To: user@hbase.apache.org
Subject: Re: wrong region exception

On Tue, May 31, 2011 at 3:34 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
> The script doesn't work because it attempts to fix the hole by finding a region in the hdfs filesystem that fills the hole.  But in this case there is no such file.  The hole is just there.
>

OK.  The fixup method has the left and right edges.  Could we use the previous row's regioninfo and adjust it to fill the hole (or create a new one)?  Something like the below?


diff --git a/bin/check_meta.rb b/bin/check_meta.rb
index d874922..82c6ac0 100644
--- a/bin/check_meta.rb
+++ b/bin/check_meta.rb
@@ -80,32 +80,12 @@ def getConfiguration
 end

 def fixup(leftEdge, rightEdge, metatable, fs, rootdir)
-  plugged = nil
-  # Try and fix the passed holes in meta.
-  tabledir = HTableDescriptor::getTableDir(rootdir, leftEdge.getTableDesc().getName())
-  statuses = fs.listStatus(tabledir)
-  for status in statuses
-    next unless status.isDir()
-    next if status.getPath().getName() == "compaction.dir"
-    regioninfofile =  Path.new(status.getPath(), ".regioninfo")
-    unless fs.exists(regioninfofile)
-      LOG.warn("Missing .regioninfo: " + regioninfofile.toString())
-      next
-    end
-    is = fs.open(regioninfofile)
-    hri = HRegionInfo.new()
-    hri.readFields(is)
-    is.close()
-    next unless Bytes.equals(leftEdge.getEndKey(), hri.getStartKey())
-    # TODO: Check against right edge to make sure this addition does not overflow right edge.
-    # TODO: Check that the schema matches both left and right edges schemas.
-    p = Put.new(hri.getRegionName())
-    p.add(HConstants::CATALOG_FAMILY, HConstants::REGIONINFO_QUALIFIER, Writables.getBytes(hri))
-    metatable.put(p)
-    LOG.info("Plugged hole in .META. at: " + hri.toString())
-    plugged = true
-  end
-  return plugged
+  hri = HRegionInfo.new(leftEdge.getTableDesc(), leftEdge.getEndRow(), rightEdge.getStartRow())
+  p = Put.new(hri.getRegionName())
+  p.add(HConstants::CATALOG_FAMILY, HConstants::REGIONINFO_QUALIFIER, Writables.getBytes(hri))
+  metatable.put(p)
+  LOG.info("Plugged hole in .META. at: " + hri.toString())
+  return true
 end

 fixup = isFixup()


St.Ack

RE: wrong region exception

Posted by Robert Gonzalez <Ro...@maxpointinteractive.com>.
I'll create a new script, fix_hole, that does this.  We are going to let the copy finish first, then we will try this with the broken original table.

Thanks Stack,

Robert


-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Tuesday, May 31, 2011 6:39 PM
To: user@hbase.apache.org
Subject: Re: wrong region exception

On Tue, May 31, 2011 at 3:34 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
> The script doesn't work because it attempts to fix the hole by finding a region in the hdfs filesystem that fills the hole.  But in this case there is no such file.  The hole is just there.
>

OK.  The fixup method has the left and right edges.  Could we use the previous row's regioninfo and adjust it to fill the hole (or create a new one)?  Something like the below?


diff --git a/bin/check_meta.rb b/bin/check_meta.rb
index d874922..82c6ac0 100644
--- a/bin/check_meta.rb
+++ b/bin/check_meta.rb
@@ -80,32 +80,12 @@ def getConfiguration
 end

 def fixup(leftEdge, rightEdge, metatable, fs, rootdir)
-  plugged = nil
-  # Try and fix the passed holes in meta.
-  tabledir = HTableDescriptor::getTableDir(rootdir, leftEdge.getTableDesc().getName())
-  statuses = fs.listStatus(tabledir)
-  for status in statuses
-    next unless status.isDir()
-    next if status.getPath().getName() == "compaction.dir"
-    regioninfofile =  Path.new(status.getPath(), ".regioninfo")
-    unless fs.exists(regioninfofile)
-      LOG.warn("Missing .regioninfo: " + regioninfofile.toString())
-      next
-    end
-    is = fs.open(regioninfofile)
-    hri = HRegionInfo.new()
-    hri.readFields(is)
-    is.close()
-    next unless Bytes.equals(leftEdge.getEndKey(), hri.getStartKey())
-    # TODO: Check against right edge to make sure this addition does not overflow right edge.
-    # TODO: Check that the schema matches both left and right edges schemas.
-    p = Put.new(hri.getRegionName())
-    p.add(HConstants::CATALOG_FAMILY, HConstants::REGIONINFO_QUALIFIER, Writables.getBytes(hri))
-    metatable.put(p)
-    LOG.info("Plugged hole in .META. at: " + hri.toString())
-    plugged = true
-  end
-  return plugged
+  hri = HRegionInfo.new(leftEdge.getTableDesc(), leftEdge.getEndRow(), rightEdge.getStartRow())
+  p = Put.new(hri.getRegionName())
+  p.add(HConstants::CATALOG_FAMILY, HConstants::REGIONINFO_QUALIFIER, Writables.getBytes(hri))
+  metatable.put(p)
+  LOG.info("Plugged hole in .META. at: " + hri.toString())
+  return true
 end

 fixup = isFixup()


St.Ack

Re: wrong region exception

Posted by Stack <st...@duboce.net>.
On Tue, May 31, 2011 at 3:34 PM, Robert Gonzalez
<Ro...@maxpointinteractive.com> wrote:
> The script doesn't work because it attempts to fix the hole by finding a region in the hdfs filesystem that fills the hole.  But in this case there is no such file.  The hole is just there.
>

OK.  The fixup method has the left and right edges.  Could we use the
previous row's regioninfo and adjust it to fill the hole (or create a
new one)?  Something like the below?


diff --git a/bin/check_meta.rb b/bin/check_meta.rb
index d874922..82c6ac0 100644
--- a/bin/check_meta.rb
+++ b/bin/check_meta.rb
@@ -80,32 +80,12 @@ def getConfiguration
 end

 def fixup(leftEdge, rightEdge, metatable, fs, rootdir)
-  plugged = nil
-  # Try and fix the passed holes in meta.
-  tabledir = HTableDescriptor::getTableDir(rootdir, leftEdge.getTableDesc().getName())
-  statuses = fs.listStatus(tabledir)
-  for status in statuses
-    next unless status.isDir()
-    next if status.getPath().getName() == "compaction.dir"
-    regioninfofile =  Path.new(status.getPath(), ".regioninfo")
-    unless fs.exists(regioninfofile)
-      LOG.warn("Missing .regioninfo: " + regioninfofile.toString())
-      next
-    end
-    is = fs.open(regioninfofile)
-    hri = HRegionInfo.new()
-    hri.readFields(is)
-    is.close()
-    next unless Bytes.equals(leftEdge.getEndKey(), hri.getStartKey())
-    # TODO: Check against right edge to make sure this addition does not overflow right edge.
-    # TODO: Check that the schema matches both left and right edges schemas.
-    p = Put.new(hri.getRegionName())
-    p.add(HConstants::CATALOG_FAMILY, HConstants::REGIONINFO_QUALIFIER, Writables.getBytes(hri))
-    metatable.put(p)
-    LOG.info("Plugged hole in .META. at: " + hri.toString())
-    plugged = true
-  end
-  return plugged
+  hri = HRegionInfo.new(leftEdge.getTableDesc(), leftEdge.getEndRow(), rightEdge.getStartRow())
+  p = Put.new(hri.getRegionName())
+  p.add(HConstants::CATALOG_FAMILY, HConstants::REGIONINFO_QUALIFIER, Writables.getBytes(hri))
+  metatable.put(p)
+  LOG.info("Plugged hole in .META. at: " + hri.toString())
+  return true
 end

 fixup = isFixup()


St.Ack
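
As a usage note: with the change above applied to bin/check_meta.rb, the script is run the same way as before, ./bin/hbase org.jruby.Main check_meta.rb --fix, and the fixup should now plug the hole even when there is no on-disk region to adopt, which is the case described above.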

RE: wrong region exception

Posted by Robert Gonzalez <Ro...@maxpointinteractive.com>.
The script doesn't work because it attempts to fix the hole by finding a region in the hdfs filesystem that fills the hole.  But in this case there is no such file.  The hole is just there.

-----Original Message-----
From: Robert Gonzalez [mailto:Robert.Gonzalez@maxpointinteractive.com] 
Sent: Tuesday, May 31, 2011 5:20 PM
To: 'user@hbase.apache.org'
Subject: RE: wrong region exception

The script ran without the previous error, but it did not fix the problem.  When I ran hbck or check_meta.rb again, they indicated that the problem was still there.  Do I need to do something else in preparation before running check_meta?

Thanks,

Robert


-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Tuesday, May 31, 2011 2:57 PM
To: user@hbase.apache.org
Subject: Re: wrong region exception

Try adding this change:

Index: bin/check_meta.rb
===================================================================
--- bin/check_meta.rb   (revision 1129468)
+++ bin/check_meta.rb   (working copy)
@@ -127,11 +127,13 @@
 scan = Scan.new()
 scanner = metatable.getScanner(scan)
 oldHRI = nil
-bad = nil
+bad = 0
 while (result = scanner.next())
   rowid = Bytes.toString(result.getRow())
   rowidStr = java.lang.String.new(rowid)
   bytes = result.getValue(HConstants::CATALOG_FAMILY, HConstants::REGIONINFO_QUALIFIER)
+  next if not bytes
+  next if bytes.length == 0
   hri = Writables.getHRegionInfo(bytes)
   if oldHRI
     if oldHRI.isOffline() && Bytes.equals(oldHRI.getStartKey(), hri.getStartKey())


You might print out the result you have when the regioninfo qualifier is null, just to see what rows are missing an HRegionInfo.

St.Ack

On Tue, May 31, 2011 at 9:02 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
> I'm trying my "nuclear" option: basically copy the data from the old db to a new one, skipping over bad regions.  The bad news is that it is taking forever.
>
>
> I get a stack trace just trying to run check_meta.rb:
>
> maxpoint@c1-m02:/usr/lib/hbase/bin$ ./hbase org.jruby.Main check_meta.rb
> Writables.java:75:in `org.apache.hadoop.hbase.util.Writables.getWritable': java.lang.NullPointerException: null (NativeException)
>        from Writables.java:119:in `org.apache.hadoop.hbase.util.Writables.getHRegionInfo'
>        from NativeMethodAccessorImpl.java:-2:in `sun.reflect.NativeMethodAccessorImpl.invoke0'
>        from NativeMethodAccessorImpl.java:39:in `sun.reflect.NativeMethodAccessorImpl.invoke'
>        from DelegatingMethodAccessorImpl.java:25:in `sun.reflect.DelegatingMethodAccessorImpl.invoke'
>        from Method.java:597:in `java.lang.reflect.Method.invoke'
>        from JavaMethod.java:196:in `org.jruby.javasupport.JavaMethod.invokeWithExceptionHandling'
>        from JavaMethod.java:182:in `org.jruby.javasupport.JavaMethod.invoke_static'
>        from JavaClass.java:371:in `org.jruby.javasupport.JavaClass$StaticMethodInvoker.execute'
>         ... 17 levels...
>        from Main.java:183:in `org.jruby.Main.runInterpreter'
>        from Main.java:120:in `org.jruby.Main.run'
>        from Main.java:95:in `org.jruby.Main.main'
> Complete Java stackTrace
> java.lang.NullPointerException
>        at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)
>        at org.apache.hadoop.hbase.util.Writables.getHRegionInfo(Writables.java:119)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.jruby.javasupport.JavaMethod.invokeWithExceptionHandling(JavaMethod.java:196)
>        at org.jruby.javasupport.JavaMethod.invoke_static(JavaMethod.java:182)
>        at org.jruby.javasupport.JavaClass$StaticMethodInvoker.execute(JavaClass.java:371)
>        at org.jruby.internal.runtime.methods.SimpleCallbackMethod.call(SimpleCallbackMethod.java:81)
>        at org.jruby.evaluator.EvaluationState.callNode(EvaluationState.java:571)
>        at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:207)
>        at org.jruby.evaluator.EvaluationState.localAsgnNode(EvaluationState.java:1254)
>        at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:286)
>        at org.jruby.evaluator.EvaluationState.blockNode(EvaluationState.java:533)
>        at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:201)
>        at org.jruby.evaluator.EvaluationState.whileNode(EvaluationState.java:1793)
>        at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:387)
>        at org.jruby.evaluator.EvaluationState.blockNode(EvaluationState.java:533)
>        at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:201)
>        at org.jruby.evaluator.EvaluationState.rootNode(EvaluationState.java:1628)
>        at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:356)
>        at org.jruby.evaluator.EvaluationState.eval(EvaluationState.java:164)
>        at org.jruby.Ruby.eval(Ruby.java:278)
>        at org.jruby.Ruby.compileOrFallbackAndRun(Ruby.java:306)
>        at org.jruby.Main.runInterpreter(Main.java:238)
>        at org.jruby.Main.runInterpreter(Main.java:183)
>        at org.jruby.Main.run(Main.java:120)
>        at org.jruby.Main.main(Main.java:95)
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
> Stack
> Sent: Friday, May 27, 2011 12:43 AM
> To: user@hbase.apache.org
> Subject: Re: wrong region exception
>
> Robert:
>
> Looks like script already exists.  Check bin/check_meta.rb.  If you pass it --fix it should plug the hole.  Read the head of the script for how to run it.
>
> Good luck,
> St.Ack
>
> On Thu, May 26, 2011 at 1:06 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>> I sent the meta.txt to your saint.ack@gmail .com account due to the attachment.
>>
>> -----Original Message-----
>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
>> Stack
>> Sent: Thursday, May 26, 2011 1:35 PM
>> To: user@hbase.apache.org
>> Subject: Re: wrong region exception
>>
>> On Thu, May 26, 2011 at 8:06 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>>> It looks like an entire region is missing; here is the online table:
>>>
>>> urlhashv2,7FF1A5BF839C37078083B4F8267008F6,1303028235302.b0d55566fd0e02ae98541618396aa7b1.   c1-s03.atxd.maxpointinteractive.com:60030   7FF1A5BF839C37078083B4F8267008F6   80116D7E506D87ED39EAFFE784B5B590
>>> urlhashv2,8031483E0B3B7F587020FCBB764272D9,1305226123483.3ed065ad87f89aece6b994dd31b42b2a.   c1-s33.atxd.maxpointinteractive.com:60030   8031483E0B3B7F587020FCBB764272D9   8041346D0B05617FA4B9152BFE9B18B9
>>>
>>> One ends at 80116D7E506D87ED39EAFFE784B5B590, but the next one doesn't start there.
>>>
>>
>> So, make sure you actually have a hole.  Dump out your meta table:
>>
>> echo "scan '.META.'"| ./bin/hbase shell &> /tmp/meta.txt
>>
>> Then look to ensure that there is a hole between the above regions (compare start and end keys... the end key of one region needs to match the start key of the next).
>>
>> If indeed a hole, you need to do a little surgery inserting a new missing region (hbck should fix this but it doesn't have the smarts just yet).
>>
>> Basically, you create a new region with start and end keys to fill the hole, then you insert it into .META. and then assign it.  There are some scripts in our bin directory that do various parts of this.  I'm pretty sure it's beyond all but a few to figure this mess out, so if you do the above footwork and provide a few more details, I'll hack up something for you (and hopefully something generalized to be used by others later, and later to be integrated into hbck).
>>
>> St.Ack
>>
>

RE: wrong region exception

Posted by Robert Gonzalez <Ro...@maxpointinteractive.com>.
The script ran without the previous error, but it did not fix the problem.  When I ran hbck or check_meta.rb again, they indicated that the problem was still there.  Do I need to do something else in preparation before running check_meta?

Thanks,

Robert


-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Tuesday, May 31, 2011 2:57 PM
To: user@hbase.apache.org
Subject: Re: wrong region exception

Try adding this change:

Index: bin/check_meta.rb
===================================================================
--- bin/check_meta.rb   (revision 1129468)
+++ bin/check_meta.rb   (working copy)
@@ -127,11 +127,13 @@
 scan = Scan.new()
 scanner = metatable.getScanner(scan)
 oldHRI = nil
-bad = nil
+bad = 0
 while (result = scanner.next())
   rowid = Bytes.toString(result.getRow())
   rowidStr = java.lang.String.new(rowid)
   bytes = result.getValue(HConstants::CATALOG_FAMILY, HConstants::REGIONINFO_QUALIFIER)
+  next if not bytes
+  next if bytes.length == 0
   hri = Writables.getHRegionInfo(bytes)
   if oldHRI
     if oldHRI.isOffline() && Bytes.equals(oldHRI.getStartKey(), hri.getStartKey())


You might print out the result you have when the regioninfo qualifier is null, just to see what rows are missing an HRegionInfo.

St.Ack

On Tue, May 31, 2011 at 9:02 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
> I'm trying my "nuclear" option: basically copy the data from the old db to a new one, skipping over bad regions.  The bad news is that it is taking forever.
>
>
> I get a stack trace just trying to run check_meta.rb:
>
> maxpoint@c1-m02:/usr/lib/hbase/bin$ ./hbase org.jruby.Main check_meta.rb
> Writables.java:75:in `org.apache.hadoop.hbase.util.Writables.getWritable': java.lang.NullPointerException: null (NativeException)
>        from Writables.java:119:in `org.apache.hadoop.hbase.util.Writables.getHRegionInfo'
>        from NativeMethodAccessorImpl.java:-2:in `sun.reflect.NativeMethodAccessorImpl.invoke0'
>        from NativeMethodAccessorImpl.java:39:in `sun.reflect.NativeMethodAccessorImpl.invoke'
>        from DelegatingMethodAccessorImpl.java:25:in `sun.reflect.DelegatingMethodAccessorImpl.invoke'
>        from Method.java:597:in `java.lang.reflect.Method.invoke'
>        from JavaMethod.java:196:in `org.jruby.javasupport.JavaMethod.invokeWithExceptionHandling'
>        from JavaMethod.java:182:in `org.jruby.javasupport.JavaMethod.invoke_static'
>        from JavaClass.java:371:in `org.jruby.javasupport.JavaClass$StaticMethodInvoker.execute'
>         ... 17 levels...
>        from Main.java:183:in `org.jruby.Main.runInterpreter'
>        from Main.java:120:in `org.jruby.Main.run'
>        from Main.java:95:in `org.jruby.Main.main'
> Complete Java stackTrace
> java.lang.NullPointerException
>        at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)
>        at org.apache.hadoop.hbase.util.Writables.getHRegionInfo(Writables.java:119)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.jruby.javasupport.JavaMethod.invokeWithExceptionHandling(JavaMethod.java:196)
>        at org.jruby.javasupport.JavaMethod.invoke_static(JavaMethod.java:182)
>        at org.jruby.javasupport.JavaClass$StaticMethodInvoker.execute(JavaClass.java:371)
>        at org.jruby.internal.runtime.methods.SimpleCallbackMethod.call(SimpleCallbackMethod.java:81)
>        at org.jruby.evaluator.EvaluationState.callNode(EvaluationState.java:571)
>        at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:207)
>        at org.jruby.evaluator.EvaluationState.localAsgnNode(EvaluationState.java:1254)
>        at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:286)
>        at org.jruby.evaluator.EvaluationState.blockNode(EvaluationState.java:533)
>        at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:201)
>        at org.jruby.evaluator.EvaluationState.whileNode(EvaluationState.java:1793)
>        at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:387)
>        at org.jruby.evaluator.EvaluationState.blockNode(EvaluationState.java:533)
>        at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:201)
>        at org.jruby.evaluator.EvaluationState.rootNode(EvaluationState.java:1628)
>        at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:356)
>        at org.jruby.evaluator.EvaluationState.eval(EvaluationState.java:164)
>        at org.jruby.Ruby.eval(Ruby.java:278)
>        at org.jruby.Ruby.compileOrFallbackAndRun(Ruby.java:306)
>        at org.jruby.Main.runInterpreter(Main.java:238)
>        at org.jruby.Main.runInterpreter(Main.java:183)
>        at org.jruby.Main.run(Main.java:120)
>        at org.jruby.Main.main(Main.java:95)
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
> Stack
> Sent: Friday, May 27, 2011 12:43 AM
> To: user@hbase.apache.org
> Subject: Re: wrong region exception
>
> Robert:
>
> Looks like script already exists.  Check bin/check_meta.rb.  If you pass it --fix it should plug the hole.  Read the head of the script for how to run it.
>
> Good luck,
> St.Ack
>
> On Thu, May 26, 2011 at 1:06 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>> I sent the meta.txt to your saint.ack@gmail .com account due to the attachment.
>>
>> -----Original Message-----
>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
>> Stack
>> Sent: Thursday, May 26, 2011 1:35 PM
>> To: user@hbase.apache.org
>> Subject: Re: wrong region exception
>>
>> On Thu, May 26, 2011 at 8:06 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>>> It looks like an entire region is missing; here is the online table:
>>>
>>> urlhashv2,7FF1A5BF839C37078083B4F8267008F6,1303028235302.b0d55566fd0e02ae98541618396aa7b1.   c1-s03.atxd.maxpointinteractive.com:60030   7FF1A5BF839C37078083B4F8267008F6   80116D7E506D87ED39EAFFE784B5B590
>>> urlhashv2,8031483E0B3B7F587020FCBB764272D9,1305226123483.3ed065ad87f89aece6b994dd31b42b2a.   c1-s33.atxd.maxpointinteractive.com:60030   8031483E0B3B7F587020FCBB764272D9   8041346D0B05617FA4B9152BFE9B18B9
>>>
>>> One ends at 80116D7E506D87ED39EAFFE784B5B590, but the next one doesn't start there.
>>>
>>
>> So, make sure you actually have a hole.  Dump out your meta table:
>>
>> echo "scan '.META.'"| ./bin/hbase shell &> /tmp/meta.txt
>>
>> Then look to ensure that there is a hole between the above regions (compare start and end keys... the end key of one region needs to match the start key of the next).
>>
>> If indeed a hole, you need to do a little surgery inserting a new missing region (hbck should fix this but it doesn't have the smarts just yet).
>>
>> Basically, you create a new region with start and end keys to fill the hole, then you insert it into .META. and then assign it.  There are some scripts in our bin directory that do various parts of this.  I'm pretty sure it's beyond all but a few to figure this mess out, so if you do the above footwork and provide a few more details, I'll hack up something for you (and hopefully something generalized to be used by others later, and later to be integrated into hbck).
>>
>> St.Ack
>>
>

Re: wrong region exception

Posted by Stack <st...@duboce.net>.
Try adding this change:

Index: bin/check_meta.rb
===================================================================
--- bin/check_meta.rb   (revision 1129468)
+++ bin/check_meta.rb   (working copy)
@@ -127,11 +127,13 @@
 scan = Scan.new()
 scanner = metatable.getScanner(scan)
 oldHRI = nil
-bad = nil
+bad = 0
 while (result = scanner.next())
   rowid = Bytes.toString(result.getRow())
   rowidStr = java.lang.String.new(rowid)
   bytes = result.getValue(HConstants::CATALOG_FAMILY, HConstants::REGIONINFO_QUALIFIER)
+  next if not bytes
+  next if bytes.length == 0
   hri = Writables.getHRegionInfo(bytes)
   if oldHRI
     if oldHRI.isOffline() && Bytes.equals(oldHRI.getStartKey(), hri.getStartKey())


You might print out the result you have when the regioninfo qualifier
is null, just to see what rows are missing an HRegionInfo.

St.Ack
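
Acting on that suggestion, a hedged sketch in Java of the same .META. walk, printing the rows whose info:regioninfo cell is empty (0.90-era client API assumed); these are exactly the rows the patched script now skips:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class FindBadMetaRows {
  public static void main(String[] args) throws Exception {
    HTable meta = new HTable(HBaseConfiguration.create(), ".META.");
    ResultScanner scanner = meta.getScanner(new Scan());
    try {
      for (Result r : scanner) {
        byte[] bytes = r.getValue(HConstants.CATALOG_FAMILY,
            HConstants.REGIONINFO_QUALIFIER);
        if (bytes == null || bytes.length == 0) {
          // Rows with no serialized HRegionInfo at all.
          System.out.println("No info:regioninfo for row: "
              + Bytes.toString(r.getRow()));
        }
      }
    } finally {
      scanner.close();
    }
  }
}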

On Tue, May 31, 2011 at 9:02 AM, Robert Gonzalez
<Ro...@maxpointinteractive.com> wrote:
> I'm trying my "nuclear" option: basically copy the data from the old db to a new one, skipping over bad regions.  The bad news is that it is taking forever.
>
>
> I get a stack trace just trying to run check_meta.rb:
>
> maxpoint@c1-m02:/usr/lib/hbase/bin$ ./hbase org.jruby.Main check_meta.rb
> Writables.java:75:in `org.apache.hadoop.hbase.util.Writables.getWritable': java.lang.NullPointerException: null (NativeException)
>        from Writables.java:119:in `org.apache.hadoop.hbase.util.Writables.getHRegionInfo'
>        from NativeMethodAccessorImpl.java:-2:in `sun.reflect.NativeMethodAccessorImpl.invoke0'
>        from NativeMethodAccessorImpl.java:39:in `sun.reflect.NativeMethodAccessorImpl.invoke'
>        from DelegatingMethodAccessorImpl.java:25:in `sun.reflect.DelegatingMethodAccessorImpl.invoke'
>        from Method.java:597:in `java.lang.reflect.Method.invoke'
>        from JavaMethod.java:196:in `org.jruby.javasupport.JavaMethod.invokeWithExceptionHandling'
>        from JavaMethod.java:182:in `org.jruby.javasupport.JavaMethod.invoke_static'
>        from JavaClass.java:371:in `org.jruby.javasupport.JavaClass$StaticMethodInvoker.execute'
>         ... 17 levels...
>        from Main.java:183:in `org.jruby.Main.runInterpreter'
>        from Main.java:120:in `org.jruby.Main.run'
>        from Main.java:95:in `org.jruby.Main.main'
> Complete Java stackTrace
> java.lang.NullPointerException
>        at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)
>        at org.apache.hadoop.hbase.util.Writables.getHRegionInfo(Writables.java:119)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.jruby.javasupport.JavaMethod.invokeWithExceptionHandling(JavaMethod.java:196)
>        at org.jruby.javasupport.JavaMethod.invoke_static(JavaMethod.java:182)
>        at org.jruby.javasupport.JavaClass$StaticMethodInvoker.execute(JavaClass.java:371)
>        at org.jruby.internal.runtime.methods.SimpleCallbackMethod.call(SimpleCallbackMethod.java:81)
>        at org.jruby.evaluator.EvaluationState.callNode(EvaluationState.java:571)
>        at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:207)
>        at org.jruby.evaluator.EvaluationState.localAsgnNode(EvaluationState.java:1254)
>        at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:286)
>        at org.jruby.evaluator.EvaluationState.blockNode(EvaluationState.java:533)
>        at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:201)
>        at org.jruby.evaluator.EvaluationState.whileNode(EvaluationState.java:1793)
>        at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:387)
>        at org.jruby.evaluator.EvaluationState.blockNode(EvaluationState.java:533)
>        at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:201)
>        at org.jruby.evaluator.EvaluationState.rootNode(EvaluationState.java:1628)
>        at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:356)
>        at org.jruby.evaluator.EvaluationState.eval(EvaluationState.java:164)
>        at org.jruby.Ruby.eval(Ruby.java:278)
>        at org.jruby.Ruby.compileOrFallbackAndRun(Ruby.java:306)
>        at org.jruby.Main.runInterpreter(Main.java:238)
>        at org.jruby.Main.runInterpreter(Main.java:183)
>        at org.jruby.Main.run(Main.java:120)
>        at org.jruby.Main.main(Main.java:95)
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
> Sent: Friday, May 27, 2011 12:43 AM
> To: user@hbase.apache.org
> Subject: Re: wrong region exception
>
> Robert:
>
> Looks like script already exists.  Check bin/check_meta.rb.  If you pass it --fix it should plug the hole.  Read the head of the script for how to run it.
>
> Good luck,
> St.Ack
>
> On Thu, May 26, 2011 at 1:06 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>> I sent the meta.txt to your saint.ack@gmail .com account due to the attachment.
>>
>> -----Original Message-----
>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
>> Stack
>> Sent: Thursday, May 26, 2011 1:35 PM
>> To: user@hbase.apache.org
>> Subject: Re: wrong region exception
>>
>> On Thu, May 26, 2011 at 8:06 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>>> It looks like an entire region is missing; here is the online table:
>>>
>>> urlhashv2,7FF1A5BF839C37078083B4F8267008F6,1303028235302.b0d55566fd0e02ae98541618396aa7b1.   c1-s03.atxd.maxpointinteractive.com:60030   7FF1A5BF839C37078083B4F8267008F6   80116D7E506D87ED39EAFFE784B5B590
>>> urlhashv2,8031483E0B3B7F587020FCBB764272D9,1305226123483.3ed065ad87f89aece6b994dd31b42b2a.   c1-s33.atxd.maxpointinteractive.com:60030   8031483E0B3B7F587020FCBB764272D9   8041346D0B05617FA4B9152BFE9B18B9
>>>
>>> One ends at 80116D7E506D87ED39EAFFE784B5B590, but the next one doesn't start there.
>>>
>>
>> So, make sure you actually have a hole.  Dump out your meta table:
>>
>> echo "scan '.META.'"| ./bin/hbase shell &> /tmp/meta.txt
>>
>> Then look to ensure that there is a hole between the above regions (compare start and end keys... the end key of one region needs to match the start key of the next).
>>
>> If indeed a hole, you need to do a little surgery inserting a new missing region (hbck should fix this but it doesn't have the smarts just yet).
>>
>> Basically, you create a new region with start and end keys to fill the hole, then you insert it into .META. and then assign it.  There are some scripts in our bin directory that do various parts of this.  I'm pretty sure it's beyond all but a few to figure this mess out, so if you do the above footwork and provide a few more details, I'll hack up something for you (and hopefully something generalized to be used by others later, and later to be integrated into hbck).
>>
>> St.Ack
>>
>

RE: wrong region exception

Posted by Robert Gonzalez <Ro...@maxpointinteractive.com>.
I'm trying my "nuclear" option: basically copy the data from the old db to a new one, skipping over bad regions.  The bad news is that it is taking forever.


I get a stack trace just trying to run check_meta.rb:

maxpoint@c1-m02:/usr/lib/hbase/bin$ ./hbase org.jruby.Main check_meta.rb 
Writables.java:75:in `org.apache.hadoop.hbase.util.Writables.getWritable': java.lang.NullPointerException: null (NativeException)
	from Writables.java:119:in `org.apache.hadoop.hbase.util.Writables.getHRegionInfo'
	from NativeMethodAccessorImpl.java:-2:in `sun.reflect.NativeMethodAccessorImpl.invoke0'
	from NativeMethodAccessorImpl.java:39:in `sun.reflect.NativeMethodAccessorImpl.invoke'
	from DelegatingMethodAccessorImpl.java:25:in `sun.reflect.DelegatingMethodAccessorImpl.invoke'
	from Method.java:597:in `java.lang.reflect.Method.invoke'
	from JavaMethod.java:196:in `org.jruby.javasupport.JavaMethod.invokeWithExceptionHandling'
	from JavaMethod.java:182:in `org.jruby.javasupport.JavaMethod.invoke_static'
	from JavaClass.java:371:in `org.jruby.javasupport.JavaClass$StaticMethodInvoker.execute'
	 ... 17 levels...
	from Main.java:183:in `org.jruby.Main.runInterpreter'
	from Main.java:120:in `org.jruby.Main.run'
	from Main.java:95:in `org.jruby.Main.main'
Complete Java stackTrace
java.lang.NullPointerException
	at org.apache.hadoop.hbase.util.Writables.getWritable(Writables.java:75)
	at org.apache.hadoop.hbase.util.Writables.getHRegionInfo(Writables.java:119)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.jruby.javasupport.JavaMethod.invokeWithExceptionHandling(JavaMethod.java:196)
	at org.jruby.javasupport.JavaMethod.invoke_static(JavaMethod.java:182)
	at org.jruby.javasupport.JavaClass$StaticMethodInvoker.execute(JavaClass.java:371)
	at org.jruby.internal.runtime.methods.SimpleCallbackMethod.call(SimpleCallbackMethod.java:81)
	at org.jruby.evaluator.EvaluationState.callNode(EvaluationState.java:571)
	at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:207)
	at org.jruby.evaluator.EvaluationState.localAsgnNode(EvaluationState.java:1254)
	at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:286)
	at org.jruby.evaluator.EvaluationState.blockNode(EvaluationState.java:533)
	at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:201)
	at org.jruby.evaluator.EvaluationState.whileNode(EvaluationState.java:1793)
	at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:387)
	at org.jruby.evaluator.EvaluationState.blockNode(EvaluationState.java:533)
	at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:201)
	at org.jruby.evaluator.EvaluationState.rootNode(EvaluationState.java:1628)
	at org.jruby.evaluator.EvaluationState.evalInternal(EvaluationState.java:356)
	at org.jruby.evaluator.EvaluationState.eval(EvaluationState.java:164)
	at org.jruby.Ruby.eval(Ruby.java:278)
	at org.jruby.Ruby.compileOrFallbackAndRun(Ruby.java:306)
	at org.jruby.Main.runInterpreter(Main.java:238)
	at org.jruby.Main.runInterpreter(Main.java:183)
	at org.jruby.Main.run(Main.java:120)
	at org.jruby.Main.main(Main.java:95)

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Friday, May 27, 2011 12:43 AM
To: user@hbase.apache.org
Subject: Re: wrong region exception

Robert:

Looks like script already exists.  Check bin/check_meta.rb.  If you pass it --fix it should plug the hole.  Read the head of the script for how to run it.

Good luck,
St.Ack

On Thu, May 26, 2011 at 1:06 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
> I sent the meta.txt to your saint.ack@gmail .com account due to the attachment.
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
> Stack
> Sent: Thursday, May 26, 2011 1:35 PM
> To: user@hbase.apache.org
> Subject: Re: wrong region exception
>
> On Thu, May 26, 2011 at 8:06 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>> It looks like an entire region is missing; here is the online table:
>>
>> urlhashv2,7FF1A5BF839C37078083B4F8267008F6,1303028235302.b0d55566fd0e02ae98541618396aa7b1.   c1-s03.atxd.maxpointinteractive.com:60030   7FF1A5BF839C37078083B4F8267008F6   80116D7E506D87ED39EAFFE784B5B590
>> urlhashv2,8031483E0B3B7F587020FCBB764272D9,1305226123483.3ed065ad87f89aece6b994dd31b42b2a.   c1-s33.atxd.maxpointinteractive.com:60030   8031483E0B3B7F587020FCBB764272D9   8041346D0B05617FA4B9152BFE9B18B9
>>
>> One ends at 80116D7E506D87ED39EAFFE784B5B590, but the next one doesn't start there.
>>
>
> So, make sure you actually have a hole.  Dump out your meta table:
>
> echo "scan '.META.'"| ./bin/hbase shell &> /tmp/meta.txt
>
> Then look to ensure that there is a hole between the above regions (compare start and end keys... the end key of one region needs to match the start key of the next).
>
> If indeed a hole, you need to do a little surgery inserting a new missing region (hbck should fix this but it doesn't have the smarts just yet).
>
> Basically, you create a new region with start and end keys to fill the hole, then you insert it into .META. and then assign it.  There are some scripts in our bin directory that do various parts of this.  I'm pretty sure it's beyond all but a few to figure this mess out, so if you do the above footwork and provide a few more details, I'll hack up something for you (and hopefully something generalized to be used by others later, and later to be integrated into hbck).
>
> St.Ack
>

Re: wrong region exception

Posted by Stack <st...@duboce.net>.
Robert:

Looks like script already exists.  Check bin/check_meta.rb.  If you
pass it --fix it should plug the hole.  Read the head of the script
for how to run it.

Good luck,
St.Ack

On Thu, May 26, 2011 at 1:06 PM, Robert Gonzalez
<Ro...@maxpointinteractive.com> wrote:
> I sent the meta.txt to your saint.ack@gmail .com account due to the attachment.
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
> Sent: Thursday, May 26, 2011 1:35 PM
> To: user@hbase.apache.org
> Subject: Re: wrong region exception
>
> On Thu, May 26, 2011 at 8:06 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>> It looks like an entire region is missing, here is the online table:
>>
>> urlhashv2,7FF1A5BF839C37078083B4F8267008F6,1303028235302.b0d55566fd0e02ae98541618396aa7b1.       c1-s03.atxd.maxpointinteractive.com:60030      7FF1A5BF839C37078083B4F8267008F6        80116D7E506D87ED39EAFFE784B5B590
>> urlhashv2,8031483E0B3B7F587020FCBB764272D9,1305226123483.3ed065ad87f89aece6b994dd31b42b2a.      c1-s33.atxd.maxpointinteractive.com:60030       8031483E0B3B7F587020FCBB764272D9        8041346D0B05617FA4B9152BFE9B18B9
>>
>> One ends at 80116D7E506D87ED39EAFFE784B5B590, but the next one doesn't start there.
>>
>
> So, make sure you actually have a hole.  Dump out your meta table:
>
> echo "scan '.META.'"| ./bin/hbase shell &> /tmp/meta.txt
>
> Then ensure that there is a hole between the above regions (compare start and end keys... the end key of one region needs to match the start key of the next).
>
> If there is indeed a hole, you need to do a little surgery, inserting a new region to fill it (hbck should fix this but it doesn't have the smarts just yet).
>
> Basically, you create a new region with start and end keys that fill the hole, then you insert it into .META. and assign it.  There are some scripts in our bin directory that do various parts of this.  I'm pretty sure it's beyond all but a few to figure this mess out, so if you do the above footwork and provide a few more details, I'll hack up something for you (and hopefully something generalized to be used by others later, and later integrated into hbck).
>
> St.Ack
>

RE: wrong region exception

Posted by Robert Gonzalez <Ro...@maxpointinteractive.com>.
I sent the meta.txt to your saint.ack@gmail.com account due to the attachment.

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Thursday, May 26, 2011 1:35 PM
To: user@hbase.apache.org
Subject: Re: wrong region exception

On Thu, May 26, 2011 at 8:06 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
> It looks like an entire region is missing, here is the online table:
>
> urlhashv2,7FF1A5BF839C37078083B4F8267008F6,1303028235302.b0d55566fd0e02ae98541618396aa7b1.       c1-s03.atxd.maxpointinteractive.com:60030      7FF1A5BF839C37078083B4F8267008F6        80116D7E506D87ED39EAFFE784B5B590
> urlhashv2,8031483E0B3B7F587020FCBB764272D9,1305226123483.3ed065ad87f89aece6b994dd31b42b2a.      c1-s33.atxd.maxpointinteractive.com:60030       8031483E0B3B7F587020FCBB764272D9        8041346D0B05617FA4B9152BFE9B18B9
>
> One ends at 80116D7E506D87ED39EAFFE784B5B590, but the next one doesn't start there.
>

So, make sure you actually have a hole.  Dump out your meta table:

echo "scan '.META.'"| ./bin/hbase shell &> /tmp/meta.txt

Then ensure that there is a hole between the above regions (compare start and end keys... the end key of one region needs to match the start key of the next).

If there is indeed a hole, you need to do a little surgery, inserting a new region to fill it (hbck should fix this but it doesn't have the smarts just yet).

Basically, you create a new region with start and end keys that fill the hole, then you insert it into .META. and assign it.  There are some scripts in our bin directory that do various parts of this.  I'm pretty sure it's beyond all but a few to figure this mess out, so if you do the above footwork and provide a few more details, I'll hack up something for you (and hopefully something generalized to be used by others later, and later integrated into hbck).

St.Ack

Re: wrong region exception

Posted by Stack <st...@duboce.net>.
On Thu, May 26, 2011 at 8:06 AM, Robert Gonzalez
<Ro...@maxpointinteractive.com> wrote:
> It looks like an entire region is missing, here is the online table:
>
> urlhashv2,7FF1A5BF839C37078083B4F8267008F6,1303028235302.b0d55566fd0e02ae98541618396aa7b1.       c1-s03.atxd.maxpointinteractive.com:60030      7FF1A5BF839C37078083B4F8267008F6        80116D7E506D87ED39EAFFE784B5B590
> urlhashv2,8031483E0B3B7F587020FCBB764272D9,1305226123483.3ed065ad87f89aece6b994dd31b42b2a.      c1-s33.atxd.maxpointinteractive.com:60030       8031483E0B3B7F587020FCBB764272D9        8041346D0B05617FA4B9152BFE9B18B9
>
> One ends at 80116D7E506D87ED39EAFFE784B5B590, but the next one doesn't start there.
>

So, make sure you actually have a hole.  Dump out your meta table:

echo "scan '.META.'"| ./bin/hbase shell &> /tmp/meta.txt

Then ensure that there is a hole between the above regions
(compare start and end keys... the end key of one region needs to
match the start key of the next).
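
If eyeballing the scan output gets tedious, the same check can be
scripted against the 0.90 client API.  A sketch (only the table name
is taken from this thread; the rest is stock client calls):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Writables;

public class FindMetaHoles {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable meta = new HTable(conf, HConstants.META_TABLE_NAME);
    ResultScanner scanner = meta.getScanner(new Scan());
    byte[] prevEndKey = null;
    for (Result r : scanner) {
      byte[] bytes = r.getValue(HConstants.CATALOG_FAMILY,
          HConstants.REGIONINFO_QUALIFIER);
      if (bytes == null) continue;
      HRegionInfo hri = Writables.getHRegionInfo(bytes);
      // Only the table in question; skip offlined split parents,
      // which linger in .META. after a split.
      if (!hri.getTableDesc().getNameAsString().equals("urlhashv2")) continue;
      if (hri.isOffline() || hri.isSplit()) continue;
      // .META. rows come back sorted by start key, so each region's
      // start key should equal the previous region's end key.
      if (prevEndKey != null && !Bytes.equals(prevEndKey, hri.getStartKey())) {
        System.out.println("Hole between " + Bytes.toStringBinary(prevEndKey)
            + " and " + Bytes.toStringBinary(hri.getStartKey()));
      }
      prevEndKey = hri.getEndKey();
    }
    scanner.close();
    meta.close();
  }
}

Running it with the HBase classpath (e.g. java -cp "$(./bin/hbase
classpath)" FindMetaHoles) should print each break in the chain.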

If there is indeed a hole, you need to do a little surgery, inserting
a new region to fill it (hbck should fix this but it doesn't have the
smarts just yet).

Basically, you create a new region with start and end keys that fill
the hole, then you insert it into .META. and assign it.  There are
some scripts in our bin directory that do various parts of this.  I'm
pretty sure it's beyond all but a few to figure this mess out, so if
you do the above footwork and provide a few more details, I'll hack up
something for you (and hopefully something generalized to be used by
others later, and later integrated into hbck).
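
Sketched against the 0.90 API, that surgery looks roughly like the
following (the start key, end key, and table name are the ones from
this thread; treat it as an illustration to run against a quiesced
table, not a vetted tool):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.regionserver.HRegion;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Writables;

public class PlugMetaHole {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // The hole reported in this thread: one region's end key and the
    // next region's start key.
    byte[] startKey = Bytes.toBytes("80116D7E506D87ED39EAFFE784B5B590");
    byte[] endKey = Bytes.toBytes("8031483E0B3B7F587020FCBB764272D9");

    // In 0.90 the region info still carries the table descriptor.
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor htd = admin.getTableDescriptor(Bytes.toBytes("urlhashv2"));
    HRegionInfo hri = new HRegionInfo(htd, startKey, endKey);

    // Create an empty region directory under hbase.rootdir.
    Path rootDir = new Path(conf.get(HConstants.HBASE_DIR));
    HRegion region = HRegion.createHRegion(hri, rootDir, conf);
    region.close();
    region.getLog().closeAndDelete();

    // Insert the new region into .META. ...
    HTable meta = new HTable(conf, HConstants.META_TABLE_NAME);
    Put p = new Put(hri.getRegionName());
    p.add(HConstants.CATALOG_FAMILY, HConstants.REGIONINFO_QUALIFIER,
        Writables.getBytes(hri));
    meta.put(p);
    meta.close();

    // ... then ask the master to assign it.  The shell's "assign"
    // command does the same thing.
    admin.assign(hri.getRegionName(), true);
  }
}

Once the region is assigned, re-run hbck (or check_meta.rb) to confirm
the chain of regions is whole again.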

St.Ack

RE: wrong region exception

Posted by Robert Gonzalez <Ro...@maxpointinteractive.com>.
It looks like an entire region is missing, here is the online table:

urlhashv2,7FF1A5BF839C37078083B4F8267008F6,1303028235302.b0d55566fd0e02ae98541618396aa7b1.  	 c1-s03.atxd.maxpointinteractive.com:60030   	7FF1A5BF839C37078083B4F8267008F6  	80116D7E506D87ED39EAFFE784B5B590
urlhashv2,8031483E0B3B7F587020FCBB764272D9,1305226123483.3ed065ad87f89aece6b994dd31b42b2a. 	c1-s33.atxd.maxpointinteractive.com:60030 	8031483E0B3B7F587020FCBB764272D9 	8041346D0B05617FA4B9152BFE9B18B9

One ends at 80116D7E506D87ED39EAFFE784B5B590, but the next one doesn't start there.

How do I fix this?

-----Original Message-----
From: Robert Gonzalez [mailto:Robert.Gonzalez@maxpointinteractive.com] 
Sent: Wednesday, May 25, 2011 3:41 PM
To: 'user@hbase.apache.org'
Subject: RE: wrong region exception

That region is not in the urlhashv2 directory.

I'll grep all the logs to see if it shows up.



-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Wednesday, May 25, 2011 3:30 PM
To: user@hbase.apache.org
Subject: Re: wrong region exception

Can you find this region in the filesystem?  Look under the urlhashv2 table directory for a directory named 80116D7E506D87ED39EAFFE784B5B590.  Grep your master log to see if you can figure out the history of this region.
St.Ack

On Wed, May 25, 2011 at 1:21 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
> The detailed error is :
>
> Chain of regions in table urlhashv2 is broken; edges does not contain 80116D7E506D87ED39EAFFE784B5B590
> Table urlhashv2 is inconsistent.
>
> How does one fix this?
>
> Thanks,
>
> Robert
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
> Stack
> Sent: Monday, May 16, 2011 2:35 PM
> To: user@hbase.apache.org
> Subject: Re: wrong region exception
>
> Says you have an inconsistency in your table.  Add -details and try to figure out where the inconsistency is.  Grep master logs to try to figure out what happened to the problematic regions.  See if adding -fix to hbck will clean up your problem.
>
> St.Ack
>
> On Mon, May 16, 2011 at 12:04 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>> attached
>>
>>
>> -----Original Message-----
>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
>> Stack
>> Sent: Monday, May 16, 2011 12:57 PM
>> To: user@hbase.apache.org
>> Subject: Re: wrong region exception
>>
>> See the rest of my email.
>> St.Ack
>>
>> On Mon, May 16, 2011 at 8:18 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>>> 0.90.0
>>>
>>> -----Original Message-----
>>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
>>> Stack
>>> Sent: Friday, May 13, 2011 2:21 PM
>>> To: user@hbase.apache.org
>>> Subject: Re: wrong region exception
>>>
>>> What version of hbase?  We used to see those from time to time in
>>> old 0.20 hbase but haven't seen one recently.  Usually the .META. table is 'off'.  If 0.90.x, try running ./bin/hbase hbck.  See what it says.
>>>
>>> St.Ack
>>>
>>> On Fri, May 13, 2011 at 11:57 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>>>> Anyone ever see one of these?
>>>>
>>>> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
>>>> Failed 25 actions: WrongRegionException: 25 times, servers with
>>>> issues: c1-s49.atxd.maxpointinteractive.com:60020,
>>>> c1-s03.atxd.maxpointinteractive.com:60020,
>>>>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1220)
>>>>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchOfPuts(HConnectionManager.java:1234)
>>>>                at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:819)
>>>>                at org.apache.hadoop.hbase.client.HTable.close(HTable.java:831)
>>>>                at com.maxpoint.crawl.crawlmgr.SelectThumbs$SelTReducer.cleanup(SelectThumbs.java:453)
>>>>                at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:178)
>>>>                at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
>>>>                at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
>>>>                at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>>
>>>> thanks,
>>>>
>>>> Gonz
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

RE: wrong region exception

Posted by Robert Gonzalez <Ro...@maxpointinteractive.com>.
That region is not in the urlhashv2 directory.

I'll grep all the logs to see if it shows up.



-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Wednesday, May 25, 2011 3:30 PM
To: user@hbase.apache.org
Subject: Re: wrong region exception

Can you find this region in the filesystem?  Look under the urlhashv2 table directory for a directory named 80116D7E506D87ED39EAFFE784B5B590.  Grep your master log to see if you can figure out the history of this region.
St.Ack

On Wed, May 25, 2011 at 1:21 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
> The detailed error is :
>
> Chain of regions in table urlhashv2 is broken; edges does not contain 80116D7E506D87ED39EAFFE784B5B590
> Table urlhashv2 is inconsistent.
>
> How does one fix this?
>
> Thanks,
>
> Robert
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
> Stack
> Sent: Monday, May 16, 2011 2:35 PM
> To: user@hbase.apache.org
> Subject: Re: wrong region exception
>
> Says you have an inconsistency in your table.  Add -details and try to figure out where the inconsistency is.  Grep master logs to try to figure out what happened to the problematic regions.  See if adding -fix to hbck will clean up your problem.
>
> St.Ack
>
> On Mon, May 16, 2011 at 12:04 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>> attached
>>
>>
>> -----Original Message-----
>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
>> Stack
>> Sent: Monday, May 16, 2011 12:57 PM
>> To: user@hbase.apache.org
>> Subject: Re: wrong region exception
>>
>> See the rest of my email.
>> St.Ack
>>
>> On Mon, May 16, 2011 at 8:18 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>>> 0.90.0
>>>
>>> -----Original Message-----
>>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
>>> Stack
>>> Sent: Friday, May 13, 2011 2:21 PM
>>> To: user@hbase.apache.org
>>> Subject: Re: wrong region exception
>>>
>>> What version of hbase?  We used to see those from time to time in
>>> old 0.20 hbase but haven't seen one recently.  Usually the .META. table is 'off'.  If 0.90.x, try running ./bin/hbase hbck.  See what it says.
>>>
>>> St.Ack
>>>
>>> On Fri, May 13, 2011 at 11:57 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>>>> Anyone ever see one of these?
>>>>
>>>> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
>>>> Failed 25 actions: WrongRegionException: 25 times, servers with
>>>> issues: c1-s49.atxd.maxpointinteractive.com:60020,
>>>> c1-s03.atxd.maxpointinteractive.com:60020,
>>>>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1220)
>>>>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchOfPuts(HConnectionManager.java:1234)
>>>>                at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:819)
>>>>                at org.apache.hadoop.hbase.client.HTable.close(HTable.java:831)
>>>>                at com.maxpoint.crawl.crawlmgr.SelectThumbs$SelTReducer.cleanup(SelectThumbs.java:453)
>>>>                at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:178)
>>>>                at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
>>>>                at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
>>>>                at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>>
>>>> thanks,
>>>>
>>>> Gonz
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

Re: wrong region exception

Posted by Stack <st...@duboce.net>.
Can you find this region in the filesystem?  Look under the urlhashv2
table directory for a directory named
80116D7E506D87ED39EAFFE784B5B590.  Grep your master log to see if you
can figure out the history of this region.
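
For example (assuming the default /hbase root directory and stock log
file names):

$ hadoop fs -ls /hbase/urlhashv2
$ grep 80116D7E506D87ED39EAFFE784B5B590 logs/hbase-*-master-*.log
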
St.Ack

On Wed, May 25, 2011 at 1:21 PM, Robert Gonzalez
<Ro...@maxpointinteractive.com> wrote:
> The detailed error is :
>
> Chain of regions in table urlhashv2 is broken; edges does not contain 80116D7E506D87ED39EAFFE784B5B590
> Table urlhashv2 is inconsistent.
>
> How does one fix this?
>
> Thanks,
>
> Robert
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
> Sent: Monday, May 16, 2011 2:35 PM
> To: user@hbase.apache.org
> Subject: Re: wrong region exception
>
> Says you have an inconsistency in your table.  Add -details and try to figure out where the inconsistency is.  Grep master logs to try to figure out what happened to the problematic regions.  See if adding -fix to hbck will clean up your problem.
>
> St.Ack
>
> On Mon, May 16, 2011 at 12:04 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>> attached
>>
>>
>> -----Original Message-----
>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
>> Stack
>> Sent: Monday, May 16, 2011 12:57 PM
>> To: user@hbase.apache.org
>> Subject: Re: wrong region exception
>>
>> See the rest of my email.
>> St.Ack
>>
>> On Mon, May 16, 2011 at 8:18 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>>> 0.90.0
>>>
>>> -----Original Message-----
>>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
>>> Stack
>>> Sent: Friday, May 13, 2011 2:21 PM
>>> To: user@hbase.apache.org
>>> Subject: Re: wrong region exception
>>>
>>> What version of hbase?  We used to see those from time to time in old
>>> 0.20 hbase but haven't seen one recently.  Usually the .META. table is 'off'.  If 0.90.x, try running ./bin/hbase hbck.  See what it says.
>>>
>>> St.Ack
>>>
>>> On Fri, May 13, 2011 at 11:57 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>>>> Anyone ever see one of these?
>>>>
>>>> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
>>>> Failed 25 actions: WrongRegionException: 25 times, servers with
>>>> issues: c1-s49.atxd.maxpointinteractive.com:60020,
>>>> c1-s03.atxd.maxpointinteractive.com:60020,
>>>>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1220)
>>>>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchOfPuts(HConnectionManager.java:1234)
>>>>                at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:819)
>>>>                at org.apache.hadoop.hbase.client.HTable.close(HTable.java:831)
>>>>                at com.maxpoint.crawl.crawlmgr.SelectThumbs$SelTReducer.cleanup(SelectThumbs.java:453)
>>>>                at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:178)
>>>>                at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
>>>>                at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
>>>>                at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>>
>>>> thanks,
>>>>
>>>> Gonz
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

RE: wrong region exception

Posted by Robert Gonzalez <Ro...@maxpointinteractive.com>.
The detailed error is :

Chain of regions in table urlhashv2 is broken; edges does not contain 80116D7E506D87ED39EAFFE784B5B590
Table urlhashv2 is inconsistent.

How does one fix this?

Thanks,

Robert

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Monday, May 16, 2011 2:35 PM
To: user@hbase.apache.org
Subject: Re: wrong region exception

Says you have an inconsistency in your table.  Add -details and try to figure out where the inconsistency is.  Grep master logs to try to figure out what happened to the problematic regions.  See if adding -fix to hbck will clean up your problem.

St.Ack

On Mon, May 16, 2011 at 12:04 PM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
> attached
>
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
> Stack
> Sent: Monday, May 16, 2011 12:57 PM
> To: user@hbase.apache.org
> Subject: Re: wrong region exception
>
> See the rest of my email.
> St.Ack
>
> On Mon, May 16, 2011 at 8:18 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>> 0.90.0
>>
>> -----Original Message-----
>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
>> Stack
>> Sent: Friday, May 13, 2011 2:21 PM
>> To: user@hbase.apache.org
>> Subject: Re: wrong region exception
>>
>> What version of hbase?  We used to see those from time to time in old
>> 0.20 hbase but haven't seen one recently.  Usually the .META. table is 'off'.  If 0.90.x, try running ./bin/hbase hbck.  See what it says.
>>
>> St.Ack
>>
>> On Fri, May 13, 2011 at 11:57 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>>> Anyone ever see one of these?
>>>
>>> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
>>> Failed 25 actions: WrongRegionException: 25 times, servers with
>>> issues: c1-s49.atxd.maxpointinteractive.com:60020,
>>> c1-s03.atxd.maxpointinteractive.com:60020,
>>>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1220)
>>>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchOfPuts(HConnectionManager.java:1234)
>>>                at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:819)
>>>                at org.apache.hadoop.hbase.client.HTable.close(HTable.java:831)
>>>                at com.maxpoint.crawl.crawlmgr.SelectThumbs$SelTReducer.cleanup(SelectThumbs.java:453)
>>>                at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:178)
>>>                at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
>>>                at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
>>>                at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>
>>> thanks,
>>>
>>> Gonz
>>>
>>>
>>>
>>>
>>>
>>
>

Re: wrong region exception

Posted by Stack <st...@duboce.net>.
Says you have an inconsistency in your table.  Add -details and try
to figure out where the inconsistency is.  Grep master logs to try to
figure out what happened to the problematic regions.  See if adding -fix
to hbck will clean up your problem.
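
Concretely, with the flags named above:

$ ./bin/hbase hbck -details
$ ./bin/hbase hbck -fix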

St.Ack

On Mon, May 16, 2011 at 12:04 PM, Robert Gonzalez
<Ro...@maxpointinteractive.com> wrote:
> attached
>
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
> Sent: Monday, May 16, 2011 12:57 PM
> To: user@hbase.apache.org
> Subject: Re: wrong region exception
>
> See the rest of my email.
> St.Ack
>
> On Mon, May 16, 2011 at 8:18 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>> 0.90.0
>>
>> -----Original Message-----
>> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
>> Stack
>> Sent: Friday, May 13, 2011 2:21 PM
>> To: user@hbase.apache.org
>> Subject: Re: wrong region exception
>>
>> What version of hbase?  We used to see those from time to time in old
>> 0.20 hbase but haven't seen one recently.  Usually the .META. table is 'off'.  If 0.90.x, try running ./bin/hbase hbck.  See what it says.
>>
>> St.Ack
>>
>> On Fri, May 13, 2011 at 11:57 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>>> Anyone ever see one of these?
>>>
>>> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
>>> Failed 25 actions: WrongRegionException: 25 times, servers with
>>> issues: c1-s49.atxd.maxpointinteractive.com:60020,
>>> c1-s03.atxd.maxpointinteractive.com:60020,
>>>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1220)
>>>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchOfPuts(HConnectionManager.java:1234)
>>>                at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:819)
>>>                at org.apache.hadoop.hbase.client.HTable.close(HTable.java:831)
>>>                at com.maxpoint.crawl.crawlmgr.SelectThumbs$SelTReducer.cleanup(SelectThumbs.java:453)
>>>                at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:178)
>>>                at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
>>>                at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
>>>                at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>>
>>> thanks,
>>>
>>> Gonz
>>>
>>>
>>>
>>>
>>>
>>
>

RE: wrong region exception

Posted by Robert Gonzalez <Ro...@maxpointinteractive.com>.
attached


-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Monday, May 16, 2011 12:57 PM
To: user@hbase.apache.org
Subject: Re: wrong region exception

See the rest of my email.
St.Ack

On Mon, May 16, 2011 at 8:18 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
> 0.90.0
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of 
> Stack
> Sent: Friday, May 13, 2011 2:21 PM
> To: user@hbase.apache.org
> Subject: Re: wrong region exception
>
> What version of hbase?  We used to see those from time to time in old
> 0.20 hbase but haven't seen one recently.  Usually the .META. table is 'off'.  If 0.90.x, try running ./bin/hbase hbck.  See what it says.
>
> St.Ack
>
> On Fri, May 13, 2011 at 11:57 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>> Anyone ever see one of these?
>>
>> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
>> Failed 25 actions: WrongRegionException: 25 times, servers with
>> issues: c1-s49.atxd.maxpointinteractive.com:60020,
>> c1-s03.atxd.maxpointinteractive.com:60020,
>>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1220)
>>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchOfPuts(HConnectionManager.java:1234)
>>                at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:819)
>>                at org.apache.hadoop.hbase.client.HTable.close(HTable.java:831)
>>                at com.maxpoint.crawl.crawlmgr.SelectThumbs$SelTReducer.cleanup(SelectThumbs.java:453)
>>                at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:178)
>>                at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
>>                at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
>>                at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>
>> thanks,
>>
>> Gonz
>>
>>
>>
>>
>>
>

Re: wrong region exception

Posted by Stack <st...@duboce.net>.
See the rest of my email.
St.Ack

On Mon, May 16, 2011 at 8:18 AM, Robert Gonzalez
<Ro...@maxpointinteractive.com> wrote:
> 0.90.0
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
> Sent: Friday, May 13, 2011 2:21 PM
> To: user@hbase.apache.org
> Subject: Re: wrong region exception
>
> What version of hbase?  We used to see those from time to time in old
> 0.20 hbase but haven't seen one recently.  Usually the .META. table is 'off'.  If 0.90.x, try running ./bin/hbase hbck.  See what it says.
>
> St.Ack
>
> On Fri, May 13, 2011 at 11:57 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
>> Anyone ever see one of these?
>>
>> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException:
>> Failed 25 actions: WrongRegionException: 25 times, servers with
>> issues: c1-s49.atxd.maxpointinteractive.com:60020,
>> c1-s03.atxd.maxpointinteractive.com:60020,
>>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1220)
>>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchOfPuts(HConnectionManager.java:1234)
>>                at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:819)
>>                at org.apache.hadoop.hbase.client.HTable.close(HTable.java:831)
>>                at com.maxpoint.crawl.crawlmgr.SelectThumbs$SelTReducer.cleanup(SelectThumbs.java:453)
>>                at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:178)
>>                at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
>>                at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
>>                at org.apache.hadoop.mapred.Child.main(Child.java:170)
>>
>> thanks,
>>
>> Gonz
>>
>>
>>
>>
>>
>

RE: wrong region exception

Posted by Robert Gonzalez <Ro...@maxpointinteractive.com>.
0.90.0

-----Original Message-----
From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
Sent: Friday, May 13, 2011 2:21 PM
To: user@hbase.apache.org
Subject: Re: wrong region exception

What version of hbase?  We used to see those from time to time in old
0.20 hbase but haven't seen one recently.  Usually the .META. table is 'off'.  If 0.90.x, try running ./bin/hbase hbck.  See what it says.

St.Ack

On Fri, May 13, 2011 at 11:57 AM, Robert Gonzalez <Ro...@maxpointinteractive.com> wrote:
> Anyone ever see one of these?
>
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: 
> Failed 25 actions: WrongRegionException: 25 times, servers with 
> issues: c1-s49.atxd.maxpointinteractive.com:60020, 
> c1-s03.atxd.maxpointinteractive.com:60020,
>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1220)
>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchOfPuts(HConnectionManager.java:1234)
>                at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:819)
>                at org.apache.hadoop.hbase.client.HTable.close(HTable.java:831)
>                at com.maxpoint.crawl.crawlmgr.SelectThumbs$SelTReducer.cleanup(SelectThumbs.java:453)
>                at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:178)
>                at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
>                at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
>                at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> thanks,
>
> Gonz
>
>
>
>
>

Re: wrong region exception

Posted by Stack <st...@duboce.net>.
What version of hbase?  We used to see those from time to time in old
0.20 hbase but haven't seen one recently.  Usually the .META. table is
'off'.  If 0.90.x, try running ./bin/hbase hbck.  See what it says.

St.Ack

On Fri, May 13, 2011 at 11:57 AM, Robert Gonzalez
<Ro...@maxpointinteractive.com> wrote:
> Anyone ever see one of these?
>
> org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 25 actions: WrongRegionException: 25 times, servers with issues: c1-s49.atxd.maxpointinteractive.com:60020, c1-s03.atxd.maxpointinteractive.com:60020,
>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1220)
>                at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchOfPuts(HConnectionManager.java:1234)
>                at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:819)
>                at org.apache.hadoop.hbase.client.HTable.close(HTable.java:831)
>                at com.maxpoint.crawl.crawlmgr.SelectThumbs$SelTReducer.cleanup(SelectThumbs.java:453)
>                at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:178)
>                at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:566)
>                at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
>                at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> thanks,
>
> Gonz
>
>
>
>
>