Posted to issues@hbase.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2011/08/29 06:10:38 UTC

[jira] [Created] (HBASE-4270) IOE ignored during flush-on-close causes dataloss

IOE ignored during flush-on-close causes dataloss
-------------------------------------------------

                 Key: HBASE-4270
                 URL: https://issues.apache.org/jira/browse/HBASE-4270
             Project: HBase
          Issue Type: Bug
          Components: regionserver
    Affects Versions: 0.90.4, 0.92.0
            Reporter: Todd Lipcon
            Priority: Blocker
             Fix For: 0.92.0


If the RS experiences an exception during the flush of a region while closing it, it currently catches the exception, logs a warning, and keeps going. If the exception was a DroppedSnapshotException, this means that it will silently drop any data that was in memstore when the region was closed.

Instead, the RS should do a hard abort so that its logs will be replayed.
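The difference between the two behaviors can be sketched roughly as follows. This is a minimal illustration, not the actual CloseRegionHandler code: the Abortable and Region interfaces and the CloseSketch class are hypothetical stand-ins, and only the control flow (warn and continue versus abort and rethrow) is taken from the description above and from the attached patch's title.

```java
import java.io.IOException;

/** Hypothetical stand-in for the region server's abort hook. */
interface Abortable {
    void abort(String why, Throwable cause);
}

/** Hypothetical stand-in for a region whose close() flushes the memstore. */
interface Region {
    void close() throws IOException;
}

final class CloseSketch {
    /**
     * Behavior described in the issue: the exception is caught, a warning
     * is logged, and the close is treated as successful anyway.
     */
    static boolean closeIgnoringErrors(Region region) {
        try {
            region.close();
        } catch (IOException e) {
            System.err.println("WARN: error closing region, continuing: " + e);
        }
        return true; // reported as closed even if the flush failed
    }

    /**
     * Proposed behavior: abort the server so its logs get replayed,
     * then rethrow so the caller never reports a clean close.
     */
    static void closeOrAbort(Region region, Abortable server) throws IOException {
        try {
            region.close();
        } catch (IOException e) {
            server.abort("Unrecoverable exception while closing region", e);
            throw e;
        }
    }
}
```

With a region whose close() throws, closeIgnoringErrors still returns true, while closeOrAbort invokes the abort hook and propagates the exception.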

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (HBASE-4270) IOE ignored during flush-on-close causes dataloss

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102132#comment-13102132 ] 

Hudson commented on HBASE-4270:
-------------------------------

Integrated in HBase-TRUNK #2198 (See [https://builds.apache.org/job/HBase-TRUNK/2198/])
    HBASE-4270 IOE ignored during flush-on-close causes dataloss
HBASE-4270 IOE ignored during flush-on-close causes dataloss

stack : 
Files : 
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/handler/MockRegionServerServices.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/handler/MockServer.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestCloseRegionHandler.java

stack : 
Files : 
* /hbase/trunk/CHANGES.txt
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestOpenRegionHandler.java


        

[jira] [Commented] (HBASE-4270) IOE ignored during flush-on-close causes dataloss

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093044#comment-13093044 ] 

Jean-Daniel Cryans commented on HBASE-4270:
-------------------------------------------

Ah I thought I read it was happening during abort. That's a pretty big hole then!

        

[jira] [Commented] (HBASE-4270) IOE ignored during flush-on-close causes dataloss

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093033#comment-13093033 ] 

Todd Lipcon commented on HBASE-4270:
------------------------------------

Without the server.abort() call, the logs don't get replayed. It just closes the region, and reports successful close in ZK, at which point some other server picks up the region without the data.
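A toy model of that sequence may make the loss easier to see. Nothing here is an HBase class: ToyRegion and its fields are invented for illustration, and the model simplifies heavily (for example, it replays the whole log rather than only the unflushed edits).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only; none of these names are HBase classes. */
final class ToyRegion {
    final List<String> wal = new ArrayList<>();        // edits logged when written
    final List<String> memstore = new ArrayList<>();   // edits not yet flushed
    final List<String> storeFiles = new ArrayList<>(); // edits flushed to files

    void put(String edit) {
        wal.add(edit);
        memstore.add(edit);
    }

    /** A failed flush-on-close: the in-memory edits never reach the store files. */
    void failedFlushOnClose() {
        memstore.clear();
    }

    /** Close reported as successful: the next server opens the store files only. */
    List<String> reopenAfterCleanClose() {
        return new ArrayList<>(storeFiles);
    }

    /** Server aborted: the next server also gets the edits replayed from the log. */
    List<String> reopenAfterAbort() {
        List<String> recovered = new ArrayList<>(storeFiles);
        recovered.addAll(wal);
        return recovered;
    }
}
```

After a put followed by failedFlushOnClose, reopenAfterCleanClose returns nothing, while reopenAfterAbort still returns the edit because it is recovered from the log.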

        

[jira] [Commented] (HBASE-4270) IOE ignored during flush-on-close causes dataloss

Posted by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092968#comment-13092968 ] 

Jean-Daniel Cryans commented on HBASE-4270:
-------------------------------------------

Like you say, the logs are replayed, so how is that causing data loss?

        

[jira] [Commented] (HBASE-4270) IOE ignored during flush-on-close causes dataloss

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102091#comment-13102091 ] 

stack commented on HBASE-4270:
------------------------------

@Lars Doesn't look like it.  This patch is about how we keep going after an IOE on close of a region rather than aborting the RS to get the WAL replayed.  HBASE-4078 is about validating created storefiles post compaction.

        

[jira] [Updated] (HBASE-4270) IOE ignored during flush-on-close causes dataloss

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-4270:
-------------------------

    Status: Patch Available  (was: Open)

Marking patch available.

        

[jira] [Commented] (HBASE-4270) IOE ignored during flush-on-close causes dataloss

Posted by "Lars Hofhansl (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102079#comment-13102079 ] 

Lars Hofhansl commented on HBASE-4270:
--------------------------------------

Is this related at all to HBASE-4078?

        

[jira] [Commented] (HBASE-4270) IOE ignored during flush-on-close causes dataloss

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102053#comment-13102053 ] 

jiraposter@reviews.apache.org commented on HBASE-4270:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1784/#review1849
-----------------------------------------------------------

Ship it!



src/test/java/org/apache/hadoop/hbase/regionserver/handler/MockServer.java
<https://reviews.apache.org/r/1784/#comment4226>

    Maybe this can be removed.



src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestCloseRegionHandler.java
<https://reviews.apache.org/r/1784/#comment4227>

    Nice.


- Ted


On 2011-09-09 23:08:12, Michael Stack wrote:
bq.  
bq.  -----------------------------------------------------------
bq.  This is an automatically generated e-mail. To reply, visit:
bq.  https://reviews.apache.org/r/1784/
bq.  -----------------------------------------------------------
bq.  
bq.  (Updated 2011-09-09 23:08:12)
bq.  
bq.  
bq.  Review request for hbase.
bq.  
bq.  
bq.  Summary
bq.  -------
bq.  
bq.  Todd wrote the patch for this issue.  What's posted here is his patch plus a unit test.  The diff is pretty big because I refactored the TestOpenRegionHandler so I could share bits of it when creating this new TestCloseRegionHandler; the bulk of the patch is making shared mock server and shared mock regionserverservice files.
bq.  
bq.  
bq.  This addresses bug hbase-4270.
bq.      https://issues.apache.org/jira/browse/hbase-4270
bq.  
bq.  
bq.  Diffs
bq.  -----
bq.  
bq.    src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java b684af2 
bq.    src/test/java/org/apache/hadoop/hbase/regionserver/handler/MockRegionServerServices.java PRE-CREATION 
bq.    src/test/java/org/apache/hadoop/hbase/regionserver/handler/MockServer.java PRE-CREATION 
bq.    src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestCloseRegionHandler.java PRE-CREATION 
bq.    src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestOpenRegionHandler.java ab12968 
bq.  
bq.  Diff: https://reviews.apache.org/r/1784/diff
bq.  
bq.  
bq.  Testing
bq.  -------
bq.  
bq.  I ran the new TestCloseRegionHandler test.
bq.  
bq.  
bq.  Thanks,
bq.  
bq.  Michael
bq.  
bq.



        

[jira] [Commented] (HBASE-4270) IOE ignored during flush-on-close causes dataloss

Posted by "jiraposter@reviews.apache.org (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-4270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101630#comment-13101630 ] 

jiraposter@reviews.apache.org commented on HBASE-4270:
------------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/1784/
-----------------------------------------------------------

Review request for hbase.


Summary
-------

Todd wrote the patch for this issue.  What's posted here is his patch plus a unit test.  The diff is pretty big because I refactored the TestOpenRegionHandler so I could share bits of it when creating this new TestCloseRegionHandler; the bulk of the patch is making shared mock server and shared mock regionserverservice files.


This addresses bug hbase-4270.
    https://issues.apache.org/jira/browse/hbase-4270


Diffs
-----

  src/main/java/org/apache/hadoop/hbase/regionserver/handler/CloseRegionHandler.java b684af2 
  src/test/java/org/apache/hadoop/hbase/regionserver/handler/MockRegionServerServices.java PRE-CREATION 
  src/test/java/org/apache/hadoop/hbase/regionserver/handler/MockServer.java PRE-CREATION 
  src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestCloseRegionHandler.java PRE-CREATION 
  src/test/java/org/apache/hadoop/hbase/regionserver/handler/TestOpenRegionHandler.java ab12968 

Diff: https://reviews.apache.org/r/1784/diff


Testing
-------

I ran the new TestCloseRegionHandler test.


Thanks,

Michael



        

[jira] [Updated] (HBASE-4270) IOE ignored during flush-on-close causes dataloss

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HBASE-4270:
-------------------------------

    Attachment: 0001-HBASE-4270.-Abort-and-rethrow-errors-on-close-failur.patch

Basic fix. I've verified this fixes the data loss. Needs a unit test.

        

[jira] [Updated] (HBASE-4270) IOE ignored during flush-on-close causes dataloss

Posted by "stack (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-4270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-4270:
-------------------------

      Resolution: Fixed
        Assignee: Todd Lipcon
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

Committed to TRUNK.  Committed only the fix to the 0.90 branch (no unit tests -- too hard to redo for 0.90 because of ServerName).  Thanks for the review, Ted.  Made your suggested removal on commit.
