Posted to issues@hbase.apache.org by "Cosmin Lehene (Created) (JIRA)" <ji...@apache.org> on 2012/03/28 20:47:29 UTC

[jira] [Created] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

Repeated split causes HRegionServer failures and breaks table 
--------------------------------------------------------------

                 Key: HBASE-5665
                 URL: https://issues.apache.org/jira/browse/HBASE-5665
             Project: HBase
          Issue Type: Bug
          Components: regionserver
    Affects Versions: 0.92.1, 0.92.0
            Reporter: Cosmin Lehene
            Priority: Blocker


Repeated splits on a large table (two consecutive splits suffice) essentially "break" the table (and the cluster) unrecoverably.
The region server performing the split dies, and the master gets into an infinite loop trying to assign regions whose files appear to be missing from HDFS.

The table can be disabled once; upon trying to re-enable it, it remains stuck in an intermediary state forever.

I was able to reproduce this consistently on a smaller table.

{code}
hbase(main):030:0> (0..10000).each{|x| put 't1', "#{x}", 'f1:t', 'dd'}
hbase(main):030:0> (0..1000).each{|x| split 't1', "#{x*10}"}
{code}

Running overlapping splits in parallel (e.g. "#{x*10+1}", "#{x*10+2}", ...) reproduces the issue almost instantly and consistently.

{code}
2012-03-28 10:57:16,320 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Offlined parent region t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1. in META
2012-03-28 10:57:16,321 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Split requested for t1,5,1332957435767.648d30de55a5cec6fc2f56dcb3c7eee1..  compaction_queue=(0:1), split_queue=10
2012-03-28 10:57:16,343 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup of failed split of t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1.; Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
java.io.IOException: Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
        at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughters(SplitTransaction.java:363)
        at org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:451)
        at org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:67)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.FileNotFoundException: File does not exist: /hbase/t1/589c44cabba419c6ad8c9b427e5894e3.2fb0473f4e71339e88dab0ee0d4dffa1/f1/d62a852c25ad44e09518e102ca557237
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1822)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1813)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:544)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:187)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
        at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:341)
        at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1008)
        at org.apache.hadoop.hbase.io.HalfStoreFileReader.<init>(HalfStoreFileReader.java:65)
        at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:467)
        at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:548)
        at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:284)
        at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
        at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:2511)
        at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:450)
        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3229)
        at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:504)
        at org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:484)
        ... 1 more
2012-03-28 10:57:16,345 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server ld2,60020,1332957343833: Abort; we got an error after point-of-no-return
{code}


http://hastebin.com/diqinibajo.avrasm

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241556#comment-13241556 ] 

stack commented on HBASE-5665:
------------------------------

Or, is this a problem only with forced splits?  It doesn't happen when we split 'naturally' because we'll check for references?
                
> later edit:
> (I'm using the last 4 characters of each encoded region/file name below.)
> Region 94e3 has store file 7237.
> Region 94e3 gets split into daughters a: ffa1 and b: eee1.
> Daughter region ffa1 gets split into daughters a: 3124 and b: dc77.
> ffa1 has a reference, 7237.94e3, for its store file.
> When ffa1 gets split, it creates another reference: 7237.94e3.ffa1.
> When SplitTransaction executes (openDaughters above), it tries to open that file and matches it left to right as [storefile].[region]:
> {code}
> "^([0-9a-f]+)(?:\\.(.+))?$"
> {code}
> It then attempts to go to /hbase/t1/[region], which resolves to
> /hbase/t1/94e3.ffa1/f1/7237 - a path that obviously doesn't exist, so the open fails.
> This looks like a design problem: we should either refuse to split while a store file path is a reference, or be able to resolve reference paths recursively (e.g. parse right to left: 7237.94e3.ffa1 -> [7237.94e3].ffa1 -> open /hbase/t1/ffa1/f1/7237.94e3 -> [7237].94e3 -> open /hbase/t1/94e3/f1/7237)
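The two parsing strategies described above can be sketched in plain Ruby. This is a hypothetical illustration only: the method names and the path layout are illustrative, not HBase's actual implementation, and the right-to-left sketch just computes the final path rather than opening each intermediate reference file in turn.

```ruby
# The left-to-right reference regex from the description: the FIRST dot
# splits [storefile] from [region], so a reference-to-a-reference like
# "7237.94e3.ffa1" is misread as storefile "7237" in region "94e3.ffa1".
REF_REGEX = /^([0-9a-f]+)(?:\.(.+))?$/

def parse_left_to_right(name)
  m = REF_REGEX.match(name)
  { storefile: m[1], region: m[2] }
end

# The proposed right-to-left resolution: peel one reference level per step
# by splitting at the LAST dot, until the remaining name has no dot and is
# therefore a real store file. Returns the final physical path, or nil if
# the name was not a reference at all.
def resolve_right_to_left(table_dir, family, name)
  path = nil
  while name.include?('.')
    file, _sep, region = name.rpartition('.')
    path = "#{table_dir}/#{region}/#{family}/#{file}"
    name = file
  end
  path
end

parse_left_to_right('7237.94e3.ffa1')
# => { storefile: "7237", region: "94e3.ffa1" }, i.e. the nonexistent
#    /hbase/t1/94e3.ffa1/f1/7237 from the FileNotFoundException above

resolve_right_to_left('/hbase/t1', 'f1', '7237.94e3.ffa1')
# => "/hbase/t1/94e3/f1/7237", the real store file
```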


[jira] [Updated] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

Posted by "Cosmin Lehene (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HBASE-5665:
---------------------------------

    Affects Version/s:     (was: 0.94.1)
                           (was: 0.96.0)
                           (was: 0.94.0)
               Status: Patch Available  (was: Open)
    


[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

Posted by "Cosmin Lehene (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241593#comment-13241593 ] 

Cosmin Lehene commented on HBASE-5665:
--------------------------------------

BTW - I don't think getSplitPoint should do that check, and we also shouldn't have two places where we check for references - perhaps we should open another JIRA to fix this in trunk?
                


[jira] [Updated] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-5665:
-------------------------

       Resolution: Fixed
    Fix Version/s: 0.94.0
                   0.92.2
     Hadoop Flags: Reviewed
           Status: Resolved  (was: Patch Available)

Committed to 0.92, 0.94, and trunk.  Thanks Cosmin and Matteo.
                


[jira] [Updated] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

Posted by "Cosmin Lehene (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HBASE-5665:
---------------------------------

    Attachment: HBASE-5665-0.92.patch

Adding patch.
From the Reference.java documentation:

{code}
 * Note, a region is itself not splitable if it has instances of store file
 * references.  References are cleaned up by compactions.
{code}

SplitTransaction.prepare() should check whether the parent region has references.

I added a unit test and a patch. Funnily enough, HRegion.hasReferences was already implemented but only used from a unit test.

I think recursive references wouldn't be hard to support if there were a good reason to have them in the first place.
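The idea behind the prepare()-time check can be sketched in a few lines of Ruby. The method names here are hypothetical; the actual fix lives in HBase's Java code (HRegion.hasReferences, called from SplitTransaction.prepare()).

```ruby
# A store file name is a reference iff it carries a ".<parentRegion>" suffix
# (e.g. "7237.94e3"); a plain store file is a bare name (e.g. "7237").
def reference?(storefile_name)
  storefile_name.include?('.')
end

# The guard: a region is not splittable while any of its store files is
# still a reference. References are cleaned up by compaction, after which
# the region becomes splittable again.
def splittable?(store_file_names)
  store_file_names.none? { |n| reference?(n) }
end

splittable?(%w[7237 dc77])       # => true: only real store files
splittable?(%w[7237.94e3 dc77])  # => false: still holds a reference
```

With this guard in prepare(), a second forced split of a freshly split daughter is simply refused until compaction rewrites the referenced data locally, instead of producing the unopenable double reference seen in the stack trace.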
                
> Repeated split causes HRegionServer failures and breaks table 
> --------------------------------------------------------------
>
>                 Key: HBASE-5665
>                 URL: https://issues.apache.org/jira/browse/HBASE-5665
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.92.0, 0.92.1
>            Reporter: Cosmin Lehene
>            Assignee: Cosmin Lehene
>            Priority: Blocker
>         Attachments: HBASE-5665-0.92.patch
>
>
> Repeated splits on large tables (2 consecutive would suffice) will essentially "break" the table (and the cluster), unrecoverably.
> The regionserver doing the split dies, and the master gets into an infinite loop trying to assign regions whose files seem to be missing from HDFS.
> The table can be disabled once; upon trying to re-enable it, it will remain in an intermediary state forever.
> I was able to reproduce this on a smaller table consistently.
> {code}
> hbase(main):030:0> (0..10000).each{|x| put 't1', "#{x}", 'f1:t', 'dd'}
> hbase(main):030:0> (0..1000).each{|x| split 't1', "#{x*10}"}
> {code}
> Running overlapping splits in parallel (e.g. "#{x*10+1}", "#{x*10+2}"... ) will reproduce the issue almost instantly and consistently. 
> {code}
> 2012-03-28 10:57:16,320 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Offlined parent region t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1. in META
> 2012-03-28 10:57:16,321 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Split requested for t1,5,1332957435767.648d30de55a5cec6fc2f56dcb3c7eee1..  compaction_queue=(0:1), split_queue=10
> 2012-03-28 10:57:16,343 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup of failed split of t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1.; Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
> java.io.IOException: Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
>         at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughters(SplitTransaction.java:363)
>         at org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:451)
>         at org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:67)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.FileNotFoundException: File does not exist: /hbase/t1/589c44cabba419c6ad8c9b427e5894e3.2fb0473f4e71339e88dab0ee0d4dffa1/f1/d62a852c25ad44e09518e102ca557237
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1822)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1813)
>         at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:544)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:187)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
>         at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:341)
>         at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1008)
>         at org.apache.hadoop.hbase.io.HalfStoreFileReader.<init>(HalfStoreFileReader.java:65)
>         at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:467)
>         at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:548)
>         at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:284)
>         at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
>         at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:2511)
>         at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:450)
>         at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3229)
>         at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:504)
>         at org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:484)
>         ... 1 more
> 2012-03-28 10:57:16,345 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server ld2,60020,1332957343833: Abort; we got an error after point-of-no-return
> {code}
> http://hastebin.com/diqinibajo.avrasm
> Later edit:
> (I'm using the last 4 characters of each name)
> Region 94e3 has storefile 7237.
> Region 94e3 gets split into daughters a: ffa1 and b: eee1.
> Daughter region ffa1 gets split into daughters a: 3124 and b: dc77.
> ffa1 has a reference, 7237.94e3, for its store file.
> When ffa1 gets split, it will create another reference: 7237.94e3.ffa1.
> When SplitTransaction.execute() runs, it will try to open that file (openDaughters above) and will match the name from left to right as [storefile].[region]
> {code}
> "^([0-9a-f]+)(?:\\.(.+))?$"
> {code}
> and will attempt to go to /hbase/t1/[region], which resolves to
> /hbase/t1/94e3.ffa1/f1/7237 - which obviously doesn't exist, so the open fails.
> This seems like a design problem: we should either refuse to split while a store file path is a reference, or be able to resolve reference paths recursively (e.g. parse right to left: 7237.94e3.ffa1 -> [7237.94e3].ffa1 -> open /hbase/t1/ffa1/f1/7237.94e3 -> [7237].94e3 -> open /hbase/t1/94e3/7237)
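To make the left-to-right vs. right-to-left parsing above concrete, here is a small standalone sketch (hypothetical helper names, not the actual StoreFile code) that parses a reference name both ways using the same regex:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of how reference file names could be parsed.
// A reference to a store file in a split parent is named
// <storefile>.<parentRegion>; a reference-to-a-reference ends up as
// <storefile>.<grandparent>.<parent>.
public class ReferenceNameSketch {

  // The pattern quoted above: a hex file name, then an optional ".<rest>".
  private static final Pattern REF =
      Pattern.compile("^([0-9a-f]+)(?:\\.(.+))?$");

  // Left-to-right parse, as the regex does it: everything after the
  // first dot is treated as a single region name.
  public static String[] parseLeftToRight(String name) {
    Matcher m = REF.matcher(name);
    if (!m.matches()) {
      throw new IllegalArgumentException("not a store file name: " + name);
    }
    return new String[] { m.group(1), m.group(2) }; // {file, region or null}
  }

  // Right-to-left parse: only the last dot-separated token is the parent
  // region; the remainder may itself be a reference, to be resolved
  // recursively one level at a time.
  public static String[] parseRightToLeft(String name) {
    int lastDot = name.lastIndexOf('.');
    if (lastDot < 0) {
      return new String[] { name, null }; // plain store file, no reference
    }
    return new String[] { name.substring(0, lastDot),
                          name.substring(lastDot + 1) };
  }
}
```

On the double reference 7237.94e3.ffa1 from the description, the left-to-right parse yields file 7237 in "region" 94e3.ffa1 (a nonexistent directory), while the right-to-left parse yields region ffa1 plus the nested reference 7237.94e3, which could then be resolved one more step.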


        

[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

Posted by "Ted Yu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243842#comment-13243842 ] 

Ted Yu commented on HBASE-5665:
-------------------------------

HBASE-5665-trunk.patch looks good.
                
> Repeated split causes HRegionServer failures and breaks table 
> --------------------------------------------------------------
>
>                 Key: HBASE-5665
>                 URL: https://issues.apache.org/jira/browse/HBASE-5665
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.92.0, 0.92.1
>            Reporter: Cosmin Lehene
>            Assignee: Cosmin Lehene
>            Priority: Blocker
>         Attachments: HBASE-5665-0.92.patch, HBASE-5665-trunk.patch
>
>


        

[jira] [Updated] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

Posted by "Cosmin Lehene (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HBASE-5665:
---------------------------------

    Affects Version/s: 0.94.1
                       0.96.0
                       0.94.0

0.94 and trunk seem to suffer from this as well: they don't check whether the parent has references either.
                
> Repeated split causes HRegionServer failures and breaks table 
> --------------------------------------------------------------
>
>                 Key: HBASE-5665
>                 URL: https://issues.apache.org/jira/browse/HBASE-5665
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.92.0, 0.92.1, 0.94.0, 0.96.0, 0.94.1
>            Reporter: Cosmin Lehene
>            Assignee: Cosmin Lehene
>            Priority: Blocker
>         Attachments: HBASE-5665-0.92.patch
>
>


        

[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13246965#comment-13246965 ] 

Hudson commented on HBASE-5665:
-------------------------------

Integrated in HBase-0.92-security #104 (See [https://builds.apache.org/job/HBase-0.92-security/104/])
    HBASE-5665 Repeated split causes HRegionServer failures and breaks table (Revision 1308549)

     Result = FAILURE
stack : 
Files : 
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransaction.java

                
> Repeated split causes HRegionServer failures and breaks table 
> --------------------------------------------------------------
>
>                 Key: HBASE-5665
>                 URL: https://issues.apache.org/jira/browse/HBASE-5665
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.92.0, 0.92.1
>            Reporter: Cosmin Lehene
>            Assignee: Cosmin Lehene
>            Priority: Blocker
>             Fix For: 0.92.2, 0.94.0
>
>         Attachments: 5665trunk.v2.patch, HBASE-5665-0.92.patch, HBASE-5665-trunk.patch
>
>


        

[jira] [Updated] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

Posted by "stack (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-5665:
-------------------------

    Attachment: 5665trunk.v2.patch

Same as the last patch but w/ fixed javadoc: isAvailable means the region is not closed and not closing.
                
> Repeated split causes HRegionServer failures and breaks table 
> --------------------------------------------------------------
>
>                 Key: HBASE-5665
>                 URL: https://issues.apache.org/jira/browse/HBASE-5665
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.92.0, 0.92.1
>            Reporter: Cosmin Lehene
>            Assignee: Cosmin Lehene
>            Priority: Blocker
>             Fix For: 0.92.2, 0.94.0
>
>         Attachments: 5665trunk.v2.patch, HBASE-5665-0.92.patch, HBASE-5665-trunk.patch
>
>


        

[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

Posted by "Cosmin Lehene (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241586#comment-13241586 ] 

Cosmin Lehene commented on HBASE-5665:
--------------------------------------

Indeed, it seems to be a problem with forced splits. I'm not sure, though, whether natural splits are safe - they seem to be, but I need to test that too.

RegionSplitPolicy.getSplitPoint() calls Store.getSplitPoint(), which seems to do the check:

{code}
    for (StoreFile sf : storefiles) {
      if (sf.isReference()) {
        // Should already be enforced since we return false in this case
        assert false : "getSplitPoint() called on a region that can't split!";
        return null;
      }
    }
{code}

BTW, we also have Store.hasReferences()
{code}
  private boolean hasReferences(Collection<StoreFile> files) {
    if (files != null && files.size() > 0) {
      for (StoreFile hsf: files) {
        if (hsf.isReference()) {
          return true;
        }
      }
    }
    return false;
  }

{code}


However, here's the code in HRegion.checkSplit(): if there's an explicit split point, it returns early and never reaches the reference check.

{code}
 public byte[] checkSplit() {
    // Can't split META
    if (getRegionInfo().isMetaRegion()) {
      if (shouldForceSplit()) {
        LOG.warn("Cannot split meta regions in HBase 0.20 and above");
      }
      return null;
    }

    if (this.explicitSplitPoint != null) {
      return this.explicitSplitPoint;
    }

    if (!splitPolicy.shouldSplit()) {
      return null;
    }

    byte[] ret = splitPolicy.getSplitPoint();

    if (ret != null) {
      try {
        checkRow(ret, "calculated split");
      } catch (IOException e) {
        LOG.error("Ignoring invalid split", e);
        return null;
      }
    }
    return ret;
  }
{code}

Multiple return points + a ret variable - this could use some polishing too :)
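One way to close the gap described above is to run the reference check before honoring an explicit split point. The following is a hedged, standalone model of that reordered logic (plain fields instead of real HBase classes; this is an illustration of the idea, not the actual HBASE-5665 patch):

```java
public class SplitCheckModel {
    // Simplified stand-ins for region state; real HBase consults the Stores.
    final boolean metaRegion;
    final boolean hasReferences;   // is any store file still a reference?
    final byte[] explicitSplitPoint;

    SplitCheckModel(boolean meta, boolean refs, byte[] explicit) {
        this.metaRegion = meta;
        this.hasReferences = refs;
        this.explicitSplitPoint = explicit;
    }

    /** Returns a split point, or null if the region must not split. */
    byte[] checkSplit() {
        if (metaRegion) {
            return null;             // can't split META
        }
        if (hasReferences) {
            return null;             // store files still reference a parent:
                                     // refuse even an explicit split point
        }
        if (explicitSplitPoint != null) {
            return explicitSplitPoint;
        }
        return null;                 // policy-based split elided in this model
    }
}
```

With this ordering, a forced split from the shell or UI on a freshly split daughter is rejected instead of creating a reference-to-a-reference.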

I'm a bit puzzled about the natural split because I've seen the problem with a forced split from the UI, where I don't think we provide an explicit split point.

Cosmin
                
> Repeated split causes HRegionServer failures and breaks table 
> --------------------------------------------------------------
>
>                 Key: HBASE-5665
>                 URL: https://issues.apache.org/jira/browse/HBASE-5665
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.92.0, 0.92.1, 0.94.0, 0.96.0, 0.94.1
>            Reporter: Cosmin Lehene
>            Assignee: Cosmin Lehene
>            Priority: Blocker
>         Attachments: HBASE-5665-0.92.patch
>


[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

Posted by "Matteo Bertozzi (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243771#comment-13243771 ] 

Matteo Bertozzi commented on HBASE-5665:
----------------------------------------

Can we also add a couple of methods to the region, like isSplittable() and isAvailable()?
{code}
boolean isAvailable() {
  return !isClosed() && !isClosing();
}

boolean isSplittable() {
  return isAvailable() && !hasReferences();
}
{code}

just to avoid similar problems in the future.
For example, in HRegionServer both getMostLoadedRegions() and closeUserRegions() do the same "isAvailable()" check.
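A sketch of how such predicates might be used as a split-time guard. This is an assumption-laden illustration (the Region interface and FakeRegion class here are stand-ins, not the real HRegion API):

```java
public class SplitGuard {
    /** Minimal view of the region state the predicates need. */
    interface Region {
        boolean isClosed();
        boolean isClosing();
        boolean hasReferences();
    }

    static boolean isAvailable(Region r) {
        return !r.isClosed() && !r.isClosing();
    }

    static boolean isSplittable(Region r) {
        // A region whose store files are still references to a parent
        // must finish compacting them away before it can split again.
        return isAvailable(r) && !r.hasReferences();
    }

    /** Test double used to exercise the predicates. */
    static class FakeRegion implements Region {
        final boolean closed, closing, refs;
        FakeRegion(boolean closed, boolean closing, boolean refs) {
            this.closed = closed;
            this.closing = closing;
            this.refs = refs;
        }
        public boolean isClosed() { return closed; }
        public boolean isClosing() { return closing; }
        public boolean hasReferences() { return refs; }
    }
}
```

A split request would then bail out early when isSplittable() is false, instead of discovering the reference files after the point of no return.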
                
> Repeated split causes HRegionServer failures and breaks table 
> --------------------------------------------------------------
>
>                 Key: HBASE-5665
>                 URL: https://issues.apache.org/jira/browse/HBASE-5665
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.92.0, 0.92.1
>            Reporter: Cosmin Lehene
>            Assignee: Cosmin Lehene
>            Priority: Blocker
>         Attachments: HBASE-5665-0.92.patch
>


[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245008#comment-13245008 ] 

Hudson commented on HBASE-5665:
-------------------------------

Integrated in HBase-TRUNK-security #156 (See [https://builds.apache.org/job/HBase-TRUNK-security/156/])
    HBASE-5665 Repeated split causes HRegionServer failures and breaks table (Revision 1308545)

     Result = FAILURE
stack : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransaction.java

                
> Repeated split causes HRegionServer failures and breaks table 
> --------------------------------------------------------------
>
>                 Key: HBASE-5665
>                 URL: https://issues.apache.org/jira/browse/HBASE-5665
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.92.0, 0.92.1
>            Reporter: Cosmin Lehene
>            Assignee: Cosmin Lehene
>            Priority: Blocker
>             Fix For: 0.92.2, 0.94.0
>
>         Attachments: 5665trunk.v2.patch, HBASE-5665-0.92.patch, HBASE-5665-trunk.patch
>


[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13246905#comment-13246905 ] 

Hudson commented on HBASE-5665:
-------------------------------

Integrated in HBase-0.94-security #7 (See [https://builds.apache.org/job/HBase-0.94-security/7/])
    HBASE-5665 Repeated split causes HRegionServer failures and breaks table (Revision 1308547)

     Result = SUCCESS
stack : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransaction.java

                
> Repeated split causes HRegionServer failures and breaks table 
> --------------------------------------------------------------
>
>                 Key: HBASE-5665
>                 URL: https://issues.apache.org/jira/browse/HBASE-5665
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.92.0, 0.92.1
>            Reporter: Cosmin Lehene
>            Assignee: Cosmin Lehene
>            Priority: Blocker
>             Fix For: 0.92.2, 0.94.0
>
>         Attachments: 5665trunk.v2.patch, HBASE-5665-0.92.patch, HBASE-5665-trunk.patch
>


[jira] [Assigned] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

Posted by "Cosmin Lehene (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene reassigned HBASE-5665:
------------------------------------

    Assignee: Cosmin Lehene
    
> Repeated split causes HRegionServer failures and breaks table 
> --------------------------------------------------------------
>
>                 Key: HBASE-5665
>                 URL: https://issues.apache.org/jira/browse/HBASE-5665
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.92.0, 0.92.1
>            Reporter: Cosmin Lehene
>            Assignee: Cosmin Lehene
>            Priority: Blocker
>
> Repeated splits on large tables (2 consecutive would suffice) will essentially "break" the table (and the cluster), unrecoverable.
> The regionserver doing the split dies and the master will get into an infinite loop trying to assign regions that seem to have the files missing from HDFS.
> The table can be disabled once. upon trying to re-enable it, it will remain in an intermediary state forever.
> I was able to reproduce this on a smaller table consistently.
> {code}
> hbase(main):030:0> (0..10000).each{|x| put 't1', "#{x}", 'f1:t', 'dd'}
> hbase(main):030:0> (0..1000).each{|x| split 't1', "#{x*10}"}
> {code}
> Running overlapping splits in parallel (e.g. "#{x*10+1}", "#{x*10+2}"... ) will reproduce the issue almost instantly and consistently. 
> {code}
> 2012-03-28 10:57:16,320 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Offlined parent region t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1. in META
> 2012-03-28 10:57:16,321 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Split requested for t1,5,1332957435767.648d30de55a5cec6fc2f56dcb3c7eee1..  compaction_queue=(0:1), split_queue=10
> 2012-03-28 10:57:16,343 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup of failed split of t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1.; Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
> java.io.IOException: Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
>         at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughters(SplitTransaction.java:363)
>         at org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:451)
>         at org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:67)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.FileNotFoundException: File does not exist: /hbase/t1/589c44cabba419c6ad8c9b427e5894e3.2fb0473f4e71339e88dab0ee0d4dffa1/f1/d62a852c25ad44e09518e102ca557237
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1822)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1813)
>         at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:544)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:187)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
>         at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:341)
>         at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1008)
>         at org.apache.hadoop.hbase.io.HalfStoreFileReader.<init>(HalfStoreFileReader.java:65)
>         at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:467)
>         at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:548)
>         at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:284)
>         at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
>         at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:2511)
>         at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:450)
>         at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3229)
>         at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:504)
>         at org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:484)
>         ... 1 more
> 2012-03-28 10:57:16,345 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server ld2,60020,1332957343833: Abort; we got an error after point-of-no-return
> {code}
> http://hastebin.com/diqinibajo.avrasm
> Later edit:
> (using the last 4 characters of each hash for brevity)
> Region 94e3 has storefile 7237.
> Region 94e3 gets split into daughters a: ffa1 and b: eee1.
> Daughter region ffa1 gets split into daughters a: 3124 and b: dc77.
> ffa1 holds a reference, 7237.94e3, for its store file.
> When ffa1 gets split, it creates another reference: 7237.94e3.ffa1.
> When SplitTransaction executes (openDaughters above), it tries to open that file and matches the name left to right as [storefile].[region] with
> {code}
> "^([0-9a-f]+)(?:\\.(.+))?$"
> {code}
> and will attempt to open /hbase/t1/[region], which resolves to
> /hbase/t1/94e3.ffa1/f1/7237 - a path that obviously doesn't exist, so the open fails.
> This looks like a design problem: we should either refuse to split a region while its store files are still references, or resolve reference paths recursively, right to left (e.g. 7237.94e3.ffa1 -> [7237.94e3].ffa1 -> open /hbase/t1/ffa1/f1/7237.94e3 -> [7237].94e3 -> open /hbase/t1/94e3/f1/7237)
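To make the parse failure concrete, here is a small illustrative Ruby sketch (not HBase source; the path layout and the `resolution_chain` helper are hypothetical) showing how the quoted left-to-right pattern mis-parses a second-level reference name, and how a right-to-left resolver would recover the real chain of lookup paths:

```ruby
# Illustrative sketch only -- not HBase code. Names use the shortened
# 4-character hashes from the analysis above.

# The left-to-right reference pattern quoted above:
REF = /^([0-9a-f]+)(?:\.(.+))?$/

m = REF.match("7237.94e3")       # first-level reference
# m[1] => "7237" (storefile), m[2] => "94e3" (parent region): correct

m = REF.match("7237.94e3.ffa1")  # reference to a reference
# m[2] => "94e3.ffa1" -- a bogus "region", so the opener looks for
#   /hbase/t1/94e3.ffa1/f1/7237, which does not exist

# Hypothetical right-to-left resolver: peel the rightmost region id off
# the name until no dot remains, yielding the chain of paths to open.
def resolution_chain(table_dir, family, name)
  chain = []
  while name =~ /\A(.+)\.([0-9a-f]+)\z/  # greedy (.+) splits at the LAST dot
    storefile, region = $1, $2
    chain << "#{table_dir}/#{region}/#{family}/#{storefile}"
    name = storefile                     # may itself still be a reference
  end
  chain
end

resolution_chain("/hbase/t1", "f1", "7237.94e3.ffa1")
# => ["/hbase/t1/ffa1/f1/7237.94e3", "/hbase/t1/94e3/f1/7237"]
```

Note that the right-to-left split falls out of regex greediness: because `(.+)` is greedy, the match anchors the region capture at the last dot, so each step strips exactly one level of reference.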

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241611#comment-13241611 ] 

Hadoop QA commented on HBASE-5665:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12520458/HBASE-5665-0.92.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

     -1 core tests.  The patch failed these unit tests:
     

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1347//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1347//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1347//console

This message is automatically generated.
                

        

[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244830#comment-13244830 ] 

Hudson commented on HBASE-5665:
-------------------------------

Integrated in HBase-TRUNK #2704 (See [https://builds.apache.org/job/HBase-TRUNK/2704/])
    HBASE-5665 Repeated split causes HRegionServer failures and breaks table (Revision 1308545)

     Result = FAILURE
stack : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransaction.java

                
> Repeated split causes HRegionServer failures and breaks table 
> --------------------------------------------------------------
>
>                 Key: HBASE-5665
>                 URL: https://issues.apache.org/jira/browse/HBASE-5665
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.92.0, 0.92.1
>            Reporter: Cosmin Lehene
>            Assignee: Cosmin Lehene
>            Priority: Blocker
>             Fix For: 0.92.2, 0.94.0
>
>         Attachments: 5665trunk.v2.patch, HBASE-5665-0.92.patch, HBASE-5665-trunk.patch
>
>

        

[jira] [Updated] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

Posted by "Matteo Bertozzi (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matteo Bertozzi updated HBASE-5665:
-----------------------------------

    Attachment: HBASE-5665-trunk.patch
    

        

[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

Posted by "Hadoop QA (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243898#comment-13243898 ] 

Hadoop QA commented on HBASE-5665:
----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12520847/HBASE-5665-trunk.patch
  against trunk revision .

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 1.3.9) warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

     -1 core tests.  The patch failed these unit tests:
     

Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/1362//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/1362//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/1362//console

This message is automatically generated.
                

        

[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241555#comment-13241555 ] 

stack commented on HBASE-5665:
------------------------------

Did we drop the check for references along the way, Cosmin?  It used to be impossible to even attempt a split of a region with references.  Are you working on it?  I agree this is a blocker all around.
                
> Repeated split causes HRegionServer failures and breaks table 
> --------------------------------------------------------------
>
>                 Key: HBASE-5665
>                 URL: https://issues.apache.org/jira/browse/HBASE-5665
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.92.0, 0.92.1, 0.94.0, 0.96.0, 0.94.1
>            Reporter: Cosmin Lehene
>            Assignee: Cosmin Lehene
>            Priority: Blocker
>         Attachments: HBASE-5665-0.92.patch
>
>
> {code}
> http://hastebin.com/diqinibajo.avrasm
> later edit:
> (I'm using the last 4 characters of each identifier)
> Region 94e3 has store file 7237
> Region 94e3 gets split into daughters a: ffa1 and b: eee1
> Daughter region ffa1 gets split into daughters a: 3124 and b: dc77
> ffa1 has a reference, 7237.94e3, for its store file
> When ffa1 gets split, it creates another reference: 7237.94e3.ffa1
> When SplitTransaction execute()s, it tries to open that reference (openDaughters above) and matches it left to right as [storefile].[region]
> {code}
> "^([0-9a-f]+)(?:\\.(.+))?$"
> {code}
> and attempts to open /hbase/t1/[region], which resolves to 
> /hbase/t1/94e3.ffa1/f1/7237 - a path that obviously doesn't exist, so the open fails. 
> This looks like a design problem: we should either refuse to split a region whose store files are still references, or resolve reference paths recursively (e.g. parse right to left: 7237.94e3.ffa1 -> [7237.94e3].ffa1 -> open /hbase/t1/ffa1/f1/7237.94e3 -> [7237].94e3 -> open /hbase/t1/94e3/7237)
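The mis-parse described above can be demonstrated in plain Ruby (a hypothetical sketch, not HBase code) using the same pattern, written as a Ruby literal instead of a Java string:

```ruby
# Left-to-right matching: group 1 stops at the FIRST dot, so the whole
# remainder is taken as the parent region name.
REF_NAME = /^([0-9a-f]+)(?:\.(.+))?$/

name = "7237.94e3.ffa1"

file, region = name.match(REF_NAME).captures
# file = "7237", region = "94e3.ffa1"
# -> StoreFile looks under /hbase/t1/94e3.ffa1/f1/7237, which doesn't exist.

# Right-to-left resolution (the recursive fix suggested above): split on
# the LAST dot, so each step peels off exactly one parent region.
file, region = name.rpartition(".").values_at(0, 2)
# file = "7237.94e3", region = "ffa1"
# -> open /hbase/t1/ffa1/f1/7237.94e3, then recurse on "7237.94e3".
```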

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244713#comment-13244713 ] 

Hudson commented on HBASE-5665:
-------------------------------

Integrated in HBase-0.94 #79 (See [https://builds.apache.org/job/HBase-0.94/79/])
    HBASE-5665 Repeated split causes HRegionServer failures and breaks table (Revision 1308547)

     Result = FAILURE
stack : 
Files : 
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/branches/0.94/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
* /hbase/branches/0.94/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransaction.java

                
> Repeated split causes HRegionServer failures and breaks table 
> --------------------------------------------------------------
>
>                 Key: HBASE-5665
>                 URL: https://issues.apache.org/jira/browse/HBASE-5665
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.92.0, 0.92.1
>            Reporter: Cosmin Lehene
>            Assignee: Cosmin Lehene
>            Priority: Blocker
>             Fix For: 0.92.2, 0.94.0
>
>         Attachments: 5665trunk.v2.patch, HBASE-5665-0.92.patch, HBASE-5665-trunk.patch
>
>


        

[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244858#comment-13244858 ] 

Hudson commented on HBASE-5665:
-------------------------------

Integrated in HBase-0.92 #351 (See [https://builds.apache.org/job/HBase-0.92/351/])
    HBASE-5665 Repeated split causes HRegionServer failures and breaks table (Revision 1308549)

     Result = FAILURE
stack : 
Files : 
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java
* /hbase/branches/0.92/src/main/java/org/apache/hadoop/hbase/regionserver/SplitTransaction.java
* /hbase/branches/0.92/src/test/java/org/apache/hadoop/hbase/regionserver/TestSplitTransaction.java

                
> Repeated split causes HRegionServer failures and breaks table 
> --------------------------------------------------------------
>
>                 Key: HBASE-5665
>                 URL: https://issues.apache.org/jira/browse/HBASE-5665
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.92.0, 0.92.1
>            Reporter: Cosmin Lehene
>            Assignee: Cosmin Lehene
>            Priority: Blocker
>             Fix For: 0.92.2, 0.94.0
>
>         Attachments: 5665trunk.v2.patch, HBASE-5665-0.92.patch, HBASE-5665-trunk.patch
>
>


        

[jira] [Commented] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

Posted by "stack (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13241765#comment-13241765 ] 

stack commented on HBASE-5665:
------------------------------

Nice test Cosmin
                
> Repeated split causes HRegionServer failures and breaks table 
> --------------------------------------------------------------
>
>                 Key: HBASE-5665
>                 URL: https://issues.apache.org/jira/browse/HBASE-5665
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.92.0, 0.92.1
>            Reporter: Cosmin Lehene
>            Assignee: Cosmin Lehene
>            Priority: Blocker
>         Attachments: HBASE-5665-0.92.patch
>
>


        

[jira] [Updated] (HBASE-5665) Repeated split causes HRegionServer failures and breaks table

Posted by "Cosmin Lehene (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-5665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cosmin Lehene updated HBASE-5665:
---------------------------------

    Description: 
Repeated splits on large tables (two consecutive splits would suffice) will essentially "break" the table (and the cluster), unrecoverably.
The region server performing the split dies, and the master gets into an infinite loop trying to assign regions whose files appear to be missing from HDFS.

The table can be disabled once; upon trying to re-enable it, it remains in an intermediate state forever.

I was able to reproduce this on a smaller table consistently.

{code}
hbase(main):030:0> (0..10000).each{|x| put 't1', "#{x}", 'f1:t', 'dd'}
hbase(main):030:0> (0..1000).each{|x| split 't1', "#{x*10}"}
{code}

Running overlapping splits in parallel (e.g. "#{x*10+1}", "#{x*10+2}"... ) will reproduce the issue almost instantly and consistently. 

{code}
2012-03-28 10:57:16,320 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Offlined parent region t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1. in META
2012-03-28 10:57:16,321 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Split requested for t1,5,1332957435767.648d30de55a5cec6fc2f56dcb3c7eee1..  compaction_queue=(0:1), split_queue=10
2012-03-28 10:57:16,343 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup of failed split of t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1.; Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
java.io.IOException: Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
        at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughters(SplitTransaction.java:363)
        at org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:451)
        at org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:67)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.FileNotFoundException: File does not exist: /hbase/t1/589c44cabba419c6ad8c9b427e5894e3.2fb0473f4e71339e88dab0ee0d4dffa1/f1/d62a852c25ad44e09518e102ca557237
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1822)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1813)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:544)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:187)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
        at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:341)
        at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1008)
        at org.apache.hadoop.hbase.io.HalfStoreFileReader.<init>(HalfStoreFileReader.java:65)
        at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:467)
        at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:548)
        at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:284)
        at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
        at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:2511)
        at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:450)
        at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3229)
        at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:504)
        at org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:484)
        ... 1 more
2012-03-28 10:57:16,345 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server ld2,60020,1332957343833: Abort; we got an error after point-of-no-return
{code}

http://hastebin.com/diqinibajo.avrasm

later edit:

(I'm using the last 4 characters of each identifier)
Region 94e3 has store file 7237
Region 94e3 gets split into daughters a: ffa1 and b: eee1
Daughter region ffa1 gets split into daughters a: 3124 and b: dc77
ffa1 has a reference, 7237.94e3, for its store file
When ffa1 gets split, it creates another reference: 7237.94e3.ffa1
When SplitTransaction execute()s, it tries to open that reference (openDaughters above) and matches it left to right as [storefile].[region]
{code}
"^([0-9a-f]+)(?:\\.(.+))?$"
{code}
and attempts to open /hbase/t1/[region], which resolves to 
/hbase/t1/94e3.ffa1/f1/7237 - a path that obviously doesn't exist, so the open fails. 

This looks like a design problem: we should either refuse to split a region whose store files are still references, or resolve reference paths recursively (e.g. parse right to left: 7237.94e3.ffa1 -> [7237.94e3].ffa1 -> open /hbase/t1/ffa1/f1/7237.94e3 -> [7237].94e3 -> open /hbase/t1/94e3/7237)



    
> Repeated split causes HRegionServer failures and breaks table 
> --------------------------------------------------------------
>
>                 Key: HBASE-5665
>                 URL: https://issues.apache.org/jira/browse/HBASE-5665
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.92.0, 0.92.1
>            Reporter: Cosmin Lehene
>            Priority: Blocker
>
> Repeated splits on a large table (2 consecutive splits would suffice) will essentially "break" the table (and the cluster) beyond recovery.
> The regionserver doing the split dies, and the master gets into an infinite loop trying to assign regions whose files appear to be missing from HDFS.
> The table can be disabled once; upon trying to re-enable it, it remains in an intermediate state forever.
> I was able to reproduce this consistently on a smaller table.
> {code}
> hbase(main):030:0> (0..10000).each{|x| put 't1', "#{x}", 'f1:t', 'dd'}
> hbase(main):030:0> (0..1000).each{|x| split 't1', "#{x*10}"}
> {code}
> Running overlapping splits in parallel (e.g. "#{x*10+1}", "#{x*10+2}"... ) will reproduce the issue almost instantly and consistently. 
> {code}
> 2012-03-28 10:57:16,320 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Offlined parent region t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1. in META
> 2012-03-28 10:57:16,321 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Split requested for t1,5,1332957435767.648d30de55a5cec6fc2f56dcb3c7eee1..  compaction_queue=(0:1), split_queue=10
> 2012-03-28 10:57:16,343 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup of failed split of t1,,1332957435767.2fb0473f4e71339e88dab0ee0d4dffa1.; Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
> java.io.IOException: Failed ld2,60020,1332957343833-daughterOpener=2469c5650ea2aeed631eb85d3cdc3124
>         at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughters(SplitTransaction.java:363)
>         at org.apache.hadoop.hbase.regionserver.SplitTransaction.execute(SplitTransaction.java:451)
>         at org.apache.hadoop.hbase.regionserver.SplitRequest.run(SplitRequest.java:67)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.FileNotFoundException: File does not exist: /hbase/t1/589c44cabba419c6ad8c9b427e5894e3.2fb0473f4e71339e88dab0ee0d4dffa1/f1/d62a852c25ad44e09518e102ca557237
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1822)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1813)
>         at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:544)
>         at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:187)
>         at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:456)
>         at org.apache.hadoop.hbase.io.hfile.HFile.createReader(HFile.java:341)
>         at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1008)
>         at org.apache.hadoop.hbase.io.HalfStoreFileReader.<init>(HalfStoreFileReader.java:65)
>         at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:467)
>         at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:548)
>         at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:284)
>         at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:221)
>         at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:2511)
>         at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:450)
>         at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:3229)
>         at org.apache.hadoop.hbase.regionserver.SplitTransaction.openDaughterRegion(SplitTransaction.java:504)
>         at org.apache.hadoop.hbase.regionserver.SplitTransaction$DaughterOpener.run(SplitTransaction.java:484)
>         ... 1 more
> 2012-03-28 10:57:16,345 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server ld2,60020,1332957343833: Abort; we got an error after point-of-no-return
> {code}
> http://hastebin.com/diqinibajo.avrasm
> later edit:
> (I'm using the last 4 characters of each hash below)
> Region 94e3 has store file 7237.
> Region 94e3 gets split into daughters a: ffa1 and b: eee1.
> Daughter region ffa1 gets split into daughters a: 3124 and b: dc77.
> ffa1 has a reference, 7237.94e3, for its store file.
> When ffa1 gets split, it creates another reference: 7237.94e3.ffa1.
> When SplitTransaction runs execute(), it tries to open that file (openDaughters above) and matches the name left to right as [storefile].[region]
> {code}
> "^([0-9a-f]+)(?:\\.(.+))?$"
> {code}
> and attempts to open /hbase/t1/[region], which resolves to
> /hbase/t1/94e3.ffa1/f1/7237 - a path that obviously doesn't exist, so the open fails.
> This seems like a design problem: we should either refuse to split a region whose store files are still references, or be able to resolve reference paths recursively (e.g. parse right to left: 7237.94e3.ffa1 -> [7237.94e3].ffa1 -> open /hbase/t1/ffa1/f1/7237.94e3 -> [7237].94e3 -> open /hbase/t1/94e3/f1/7237)
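The right-to-left resolution idea above can be sketched as follows. This is an illustrative standalone snippet, not HBase's actual SplitTransaction/StoreFile code; the path layout /hbase/[table]/[region]/[family]/[hfile] is taken from the log paths in the report, and the class and method names are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReferenceResolver {
    // Regex quoted in the report: a greedy hex prefix plus an optional
    // ".<suffix>" — it splits a reference name at the FIRST dot.
    static final Pattern REF = Pattern.compile("^([0-9a-f]+)(?:\\.(.+))?$");

    // Left-to-right parse as described above; returns {hfile, region-or-null}.
    // For "7237.94e3.ffa1" this yields region "94e3.ffa1", producing the
    // nonexistent path /hbase/t1/94e3.ffa1/f1/7237.
    static String[] parseLeftToRight(String name) {
        Matcher m = REF.matcher(name);
        if (!m.matches()) throw new IllegalArgumentException(name);
        return new String[] { m.group(1), m.group(2) };
    }

    // Hypothetical right-to-left resolution: peel one ".<region>" suffix per
    // split level; the rightmost suffix names the region that holds the
    // referenced file, and the remainder may itself be another reference.
    static String resolveRightToLeft(String table, String family, String name) {
        String path = null;
        int i;
        while ((i = name.lastIndexOf('.')) >= 0) {
            String region = name.substring(i + 1);
            name = name.substring(0, i);
            path = "/hbase/" + table + "/" + region + "/" + family + "/" + name;
            // A real fix would check here whether 'path' is itself a
            // reference file before peeling the next level.
        }
        return path; // null if the name was not a reference at all
    }
}
```

For 7237.94e3.ffa1 the right-to-left walk visits ffa1 then 94e3 and ends at /hbase/t1/94e3/f1/7237, the real store file.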

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira