Posted to common-dev@hadoop.apache.org by "Jim Kellerman (JIRA)" <ji...@apache.org> on 2007/10/19 18:52:51 UTC

[jira] Created: (HADOOP-2079) [hbase] HLog generates incorrect file name when splitting a log, race condition also contributes

[hbase] HLog generates incorrect file name when splitting a log, race  condition also contributes
-------------------------------------------------------------------------------------------------

                 Key: HADOOP-2079
                 URL: https://issues.apache.org/jira/browse/HADOOP-2079
             Project: Hadoop
          Issue Type: Bug
          Components: contrib/hbase
    Affects Versions: 0.16.0
            Reporter: Jim Kellerman
            Assignee: Jim Kellerman
             Fix For: 0.16.0


In Hadoop-Nightly #277 TestRegionServerExit failed with a timeout.

The cause was a race in the Master: checkAssigned (run from either the root or meta scanner) immediately tries to split the log and then assign a region that has invalid server info.

The scenario went something like this:

1. region server aborted
2. root region was written on optional cache flush
3. lease timed out on aborted server, which removes it from serversToServerInfo and queues a PendingServerShutdown operation
4. root scanner runs and finds server info incorrect (it is in the root region but the server is not in serversToServerInfo)
5. checkAssigned starts splitting the log but because the log name is incorrect it can't finish
6. PendingServerShutdown fires and really gums up the works.

So there are two problems:

1. HLog.splitLog needs to generate the correct log file name.
2. PendingServerShutdown and/or leaseExpired need to cooperate with checkAssigned so that there are not two concurrent attempts to recover the log.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-2079) [hbase] HLog generates incorrect file name when splitting a log, race condition also contributes

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Kellerman updated HADOOP-2079:
----------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

This issue was mostly about the incorrect HLog name generation and the race condition in the master when splitting the HLog after a region server dies. That part has been fixed. Resolving this issue.


[jira] Commented: (HADOOP-2079) [hbase] HLog generates incorrect file name when splitting a log, race condition also contributes

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536405 ] 

Hadoop QA commented on HADOOP-2079:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
http://issues.apache.org/jira/secure/attachment/12368061/patch.txt
against trunk revision r586264.

    @author +1.  The patch does not contain any @author tags.

    javadoc +1.  The javadoc tool did not generate any warning messages.

    javac +1.  The applied patch does not generate any new compiler warnings.

    findbugs +1.  The patch does not introduce any new Findbugs warnings.

    core tests +1.  The patch passed core unit tests.

    contrib tests +1.  The patch passed contrib unit tests.

Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/974/testReport/
Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/974/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/974/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/974/console

This message is automatically generated.


[jira] Updated: (HADOOP-2079) [hbase] HLog generates incorrect file name when splitting a log, race condition also contributes

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Kellerman updated HADOOP-2079:
----------------------------------

    Status: Patch Available  (was: Open)


[jira] Commented: (HADOOP-2079) [hbase] HLog generates incorrect file name when splitting a log, race condition also contributes

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536427 ] 

Hudson commented on HADOOP-2079:
--------------------------------

Integrated in Hadoop-Nightly #278 (See [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/278/])


[jira] Updated: (HADOOP-2079) [hbase] HLog generates incorrect file name when splitting a log, race condition also contributes

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Kellerman updated HADOOP-2079:
----------------------------------

    Attachment: patch.txt

This patch addresses both HADOOP-2079 and HADOOP-2056.

Changes:

Because row keys are essentially unbounded in length and can potentially contain characters that are unfriendly to file names, the start key is now SHA1 encoded and inserted into the file name as a long decimal number. This is better than using Base64 encoding of the row key because the Base64 encoding is longer than the row key and can consequently cause the file name to be too long. This approach makes file names unique for practical purposes (a SHA1 collision is extremely unlikely) while keeping them short enough for modern file systems. HRegionInfo now supports this encoding method, and since SHA1 encoding is not reversible, the decode method has been removed.
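
To illustrate the idea, here is a minimal sketch of hashing a start key with SHA-1 and rendering the digest as a decimal string. It is not the code from patch.txt: the class name EncodedNameSketch and the method encodeStartKey are hypothetical, and the sketch assumes the full 160-bit digest is printed in base 10, whereas the patch may derive its number differently.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration only; not the HRegionInfo implementation.
final class EncodedNameSketch {

    // Hashes the raw start key with SHA-1 and returns the digest as a
    // non-negative decimal string, so the result contains only digits
    // regardless of which bytes appear in the key.
    static String encodeStartKey(byte[] startKey) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest(startKey);
            // Signum 1 treats the 20 digest bytes as an unsigned magnitude.
            return new BigInteger(1, digest).toString(10);
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-1.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] key = "row/with:unfriendly chars".getBytes(StandardCharsets.UTF_8);
        System.out.println(encodeStartKey(key));
    }
}
```

Under these assumptions the encoded name is at most 49 digits long no matter how long the row key is, which is the property that makes it safe to embed in a file name.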

HStore uses both the raw region name and the encoded region name.

HLog, HMaster, and HRegion now use the new static method HRegionInfo.encodeRegionName.

HMaster avoids race conditions on log splitting by splitting logs in PendingServerShutdown only if the server's lease expires while the master is running. If the master is just starting up, then the root and meta scanners invoke log splitting if they find stale server data.

HStoreFile now uses ArrayList and List instead of Vector and Collection.
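
The ownership rule above can be modeled roughly as follows. This is a simplified sketch, not HMaster's code: the class LogSplitPolicySketch and its methods are hypothetical, and it assumes that a lease expiry observed by the running master is the only signal distinguishing the two cases.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the rule; names do not correspond to HMaster members.
final class LogSplitPolicySketch {

    // Servers whose lease expired while this master was running. Their logs
    // are left to the queued shutdown operation.
    private final Set<String> pendingShutdown = ConcurrentHashMap.newKeySet();

    // Called when a server's lease expires while the master is running.
    void leaseExpired(String server) {
        pendingShutdown.add(server);
    }

    // Called by a root or meta scanner that found stale server info.
    // The scanner splits the log only if no shutdown operation is queued,
    // which is the case when the master has just started.
    boolean scannerShouldSplit(String server) {
        return !pendingShutdown.contains(server);
    }

    // Called by the shutdown operation. Returns true exactly once per
    // recorded lease expiry. The entry is dropped afterwards; the sketch
    // assumes the stale server info has been cleaned up by then.
    boolean shutdownShouldSplit(String server) {
        return pendingShutdown.remove(server);
    }
}
```

In this model, for any dead server either the scanner or the shutdown operation performs the split, never both at once.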


