Posted to dev@hbase.apache.org by "Jim Kellerman (JIRA)" <ji...@apache.org> on 2009/01/27 03:56:59 UTC

[jira] Created: (HBASE-1157) If we do not take start code as a part of region server recovery, we could inadvertently try to reassign regions assigned to a restarted server with a different start code

If we do not take start code as a part of region server recovery, we could inadvertently try to reassign regions assigned to a restarted server with a different start code
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------

                 Key: HBASE-1157
                 URL: https://issues.apache.org/jira/browse/HBASE-1157
             Project: Hadoop HBase
          Issue Type: Bug
          Components: master, regionserver
    Affects Versions: 0.20.0
            Reporter: Jim Kellerman
            Assignee: Jim Kellerman
             Fix For: 0.20.0




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HBASE-1157) If we do not take start code as a part of region server recovery, we could inadvertently try to reassign regions assigned to a restarted server with a different start code

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683284#action_12683284 ] 

ryan rawson commented on HBASE-1157:
------------------------------------

The problem seems to be the new serversToServerInfo map format.

Previously the key was 'server:port'. Now it is 'server_startcode_port'.

The catch is that the startcode is only available once you already have the HServerInfo, and if you had that, you would not need the serversToServerInfo map anyway.
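
The lookup difficulty described above can be sketched as follows. This is a minimal illustration, not the HBase code: the class ServerKeyLookupSketch and both method names are hypothetical, and the key layout simply follows the 'server_startcode_port' format quoted in the comment. With the startcode in the middle of the key, a caller that knows only host and port has to scan every key.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch; none of these names come from HBase.
final class ServerKeyLookupSketch {
    /** New-style key, following the 'server_startcode_port' layout quoted above. */
    static String serverName(String host, long startCode, int port) {
        return host + "_" + startCode + "_" + port;
    }

    /**
     * Finds an entry knowing only host and port. Because the startcode sits in
     * the middle of the key, every key has to be inspected (O(n)).
     */
    static <V> Optional<V> findByHostPort(Map<String, V> servers, String host, int port) {
        String prefix = host + "_";
        String suffix = "_" + port;
        for (Map.Entry<String, V> e : servers.entrySet()) {
            String key = e.getKey();
            if (key.length() > prefix.length() + suffix.length()
                    && key.startsWith(prefix) && key.endsWith(suffix)) {
                // The part between host and port must be a numeric startcode.
                String middle = key.substring(prefix.length(), key.length() - suffix.length());
                if (middle.chars().allMatch(Character::isDigit)) {
                    return Optional.of(e.getValue());
                }
            }
        }
        return Optional.empty();
    }
}
```

A direct get("host:port") on such a map finds nothing, which is the situation the web UI code was in after the key format changed.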



[jira] Commented: (HBASE-1157) If we do not take start code as a part of region server recovery, we could inadvertently try to reassign regions assigned to a restarted server with a different start code

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689820#action_12689820 ] 

ryan rawson commented on HBASE-1157:
------------------------------------

Have a look at the work I patched in on HBASE-1290; I added a second map to provide data for the JSP.
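
A rough sketch of the "second map" idea, assuming the two maps are kept in step under one lock. The class DualIndexSketch and its methods are hypothetical and are not taken from the HBASE-1290 patch; the String values stand in for HServerInfo.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of two indexes over the same servers; not the actual patch.
final class DualIndexSketch {
    private final Map<String, String> byServerName = new HashMap<>(); // host_startcode_port -> info
    private final Map<String, String> byAddress = new HashMap<>();    // host:port -> info (for the UI)

    synchronized void add(String host, int port, long startCode, String info) {
        byServerName.put(host + "_" + startCode + "_" + port, info);
        byAddress.put(host + ":" + port, info);
    }

    synchronized void remove(String host, int port, long startCode) {
        String info = byServerName.remove(host + "_" + startCode + "_" + port);
        if (info != null) {
            // Two-argument remove: only drops the address entry if it still maps to
            // an equal info value, so a newer instance at the same address (with a
            // different info value) is not clobbered.
            byAddress.remove(host + ":" + port, info);
        }
    }

    synchronized String getByAddress(String hostPort) {
        return byAddress.get(hostPort);
    }

    synchronized String getByServerName(String serverName) {
        return byServerName.get(serverName);
    }
}
```

The trade-off is the usual one for a parallel index: lookups by host:port stay cheap, at the cost of keeping both maps consistent on every add and remove.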



[jira] Commented: (HBASE-1157) If we do not take start code as a part of region server recovery, we could inadvertently try to reassign regions assigned to a restarted server with a different start code

Posted by "stack (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681994#action_12681994 ] 

stack commented on HBASE-1157:
------------------------------

Just wondering, did you talk w/ the ZKBoys about whether or not ZK already has a mechanism differentiating different regionservers in ZK?

Above you say "While we might use a different mechanism after ZK integration, the ZK integration still needs to account for what instance of the regionserver it is dealing with."  How does this patch help the ZK integration effort? 

(I ask because you just made a fairly large commit without general notes on what is in the commit).



[jira] Updated: (HBASE-1157) If we do not take start code as a part of region server recovery, we could inadvertently try to reassign regions assigned to a restarted server with a different start code

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ryan rawson updated HBASE-1157:
-------------------------------

    Attachment: HBASE-1157.patch

Here is a proposed patch. The other option is to add a parallel map in addition to serversToServerInfo. Considering that the major purpose of this map is to feed the JSPs, I went with this approach instead.



[jira] Commented: (HBASE-1157) If we do not take start code as a part of region server recovery, we could inadvertently try to reassign regions assigned to a restarted server with a different start code

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689817#action_12689817 ] 

Jim Kellerman commented on HBASE-1157:
--------------------------------------

> ryan rawson added a comment - 18/Mar/09 09:37 PM
> here is a proposed patch - the other option is to add a parallel map in addition to serversToServerInfo.
> Considering the major purpose of this map is to feed into the JSPs i went with this approach instead.

The major purpose of serversToServerInfo is to keep track of which servers are alive.
This will change as ZooKeeper integration expands.



[jira] Commented: (HBASE-1157) If we do not take start code as a part of region server recovery, we could inadvertently try to reassign regions assigned to a restarted server with a different start code

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682083#action_12682083 ] 

Jim Kellerman commented on HBASE-1157:
--------------------------------------

Sorry, meant to include the list of changes (which I had written down), but forgot. Here they are:

Summary of change:

Server name is now hostname_startcode_port, so it is unique among multiple instances of a region server running at the same address.

Files changed:

HServerInfo
- Cache the server name once it has been computed so that subsequent gets of the server name are a simple dereference. This required setServerAddress and setStartCode to be synchronized, as well as getServerName.
- Return a copy of the embedded HServerAddress, since an HRegionInfo can get reused and the returned reference can change out from under other methods that have stored it.
- Changed hashCode and compareTo to use just the serverName, as that is now unique.
- New static methods to compute serverName: one given an HServerInfo; one given a hostName (name:port String) and startCode; and one that takes an HServerAddress and startCode.
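
The caching scheme in the first and third items might look roughly like the sketch below. It is an illustration under assumptions, not the HServerInfo source: the class CachedServerNameSketch, its fields and its setAddress method are hypothetical, and only the hostname_startcode_port layout comes from the summary above.

```java
// Hypothetical sketch of a lazily cached, invalidate-on-set server name.
final class CachedServerNameSketch implements Comparable<CachedServerNameSketch> {
    private String hostName;
    private int port;
    private long startCode;
    private String cachedServerName; // null until computed, or after a setter runs

    CachedServerNameSketch(String hostName, int port, long startCode) {
        this.hostName = hostName;
        this.port = port;
        this.startCode = startCode;
    }

    /** Static helper so callers without an instance can build the same name. */
    static String serverName(String hostName, int port, long startCode) {
        return hostName + "_" + startCode + "_" + port;
    }

    // Setters and the getter share one lock so a stale cached name is never returned.
    synchronized void setStartCode(long startCode) {
        this.startCode = startCode;
        this.cachedServerName = null;
    }

    synchronized void setAddress(String hostName, int port) {
        this.hostName = hostName;
        this.port = port;
        this.cachedServerName = null;
    }

    synchronized String getServerName() {
        if (cachedServerName == null) {
            cachedServerName = serverName(hostName, port, startCode);
        }
        return cachedServerName;
    }

    // Identity is the server name alone, since it is unique per instance.
    @Override
    public int hashCode() {
        return getServerName().hashCode();
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CachedServerNameSketch
                && getServerName().equals(((CachedServerNameSketch) o).getServerName());
    }

    @Override
    public int compareTo(CachedServerNameSketch other) {
        return getServerName().compareTo(other.getServerName());
    }
}
```

Note that equals and compareTo take each object's lock one after the other rather than nesting them, which avoids a lock-ordering problem between two instances.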

HRegionInfo
- Not actually needed for this issue, but at one point I thought I needed to change it, and when I discarded that change, I left the following in:
- make regionName transient as it is not serialized
- fix javadoc for parseRegionName, shouldSplit

HMaster
- regionServerStartup can now throw an exception which comes out of ServerManager.regionServerStartup

ServerManager
- remove unused import
- the key for serversToServerInfo is still a string but is the server name as described above, and not host:port as it was previously. This allows us to distinguish between different instances of a region server at the same address. Zookeeper still uses a ServerExpirer object, but the name is now the server name as described above rather than a host:port pair.
- removed Boolean from deadServers Map. It was not used and is no longer needed. deadServers is now a set. deadServers is still used to detect a server that has not restarted but reports in after its lease has expired (perhaps due to a network partitioning). It cannot be given work until its existing logs have been recovered. 
- removed checkForGhostReferences. deadServers now handles servers that report back in after lease expiration but have not restarted. Servers that have restarted will have a new server name and can consequently be assigned work immediately.
- regionServerStartup can now throw Leases.LeaseStillHeldException. Changed String 's' to String 'serverName'.
- regionServerReport can now throw Leases.LeaseStillHeldException.
- processRegionServerExit, processRegionServerAllsWell, processMsgs, processRegionOpen now take a HServerInfo rather than a string server name
- processSplitRegion no longer takes either a server name or a serverInfo because they were unused.
- getServersToServerInfo, getServersToLoad now return a Collections.unmodifiableMap instead of a new HashMap
- getLoadToServers removed. Unreferenced.
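
The deadServers behaviour and the read-only map view described in this list can be sketched as below. This is a simplified model, not ServerManager itself: the class ServerManagerSketch, its methods, and StaleInstanceException (an unchecked stand-in for Leases.LeaseStillHeldException) are all hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the bookkeeping described above.
final class ServerManagerSketch {
    /** Unchecked stand-in for Leases.LeaseStillHeldException (an assumption of this sketch). */
    static final class StaleInstanceException extends IllegalStateException {
        StaleInstanceException(String serverName) {
            super("logs not yet recovered for " + serverName);
        }
    }

    private final Map<String, Long> liveServers = new HashMap<>(); // server name -> last report time
    private final Set<String> deadServers = new HashSet<>();       // expired, logs not yet recovered

    /** A server checks in; an expired instance that never restarted is rejected. */
    synchronized void report(String serverName, long now) {
        if (deadServers.contains(serverName)) {
            throw new StaleInstanceException(serverName);
        }
        liveServers.put(serverName, now);
    }

    synchronized void leaseExpired(String serverName) {
        liveServers.remove(serverName);
        deadServers.add(serverName);
    }

    synchronized void logsRecovered(String serverName) {
        deadServers.remove(serverName);
    }

    synchronized boolean canAssign(String serverName) {
        return liveServers.containsKey(serverName) && !deadServers.contains(serverName);
    }

    /** Read-only view instead of a fresh copy; callers cannot modify the live map. */
    synchronized Map<String, Long> getServers() {
        return Collections.unmodifiableMap(liveServers);
    }
}
```

Because a restarted server reports under a new name (new start code), it is never in deadServers and can be given work immediately, while the old name stays blocked until its logs are recovered. One caveat of returning a view rather than a copy: the caller sees later changes to the underlying map and must not iterate it while another thread mutates it.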

RegionManager
- regionsInTransition is now a Map<String, RegionState> instead of a Map<byte[], RegionState>; the key is the server name described above
- assignRegions now takes just an HServerInfo instead of (HServerInfo, String), as the String was redundant and the server name can be obtained from the HServerInfo
- unassignSomeRegions, assignRegionsToOneServer, assignRegionsToMultipleServers take an HServerInfo rather than String
- regionIsInTransition, regionIsOpening, isPendingOpen, setOpen, isOfflined, setPendingClose, setClosed take String as argument instead of byte[] due to change in regionsInTransition map
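
As a general Java aside on byte[] versus String map keys: arrays inherit identity-based equals and hashCode from Object, so a hash map keyed by byte[] cannot be probed with a fresh array holding the same bytes, whereas String keys compare by content. The sketch below only illustrates that language behaviour; the class RegionKeySketch is hypothetical and it makes no claim about which Map implementation RegionManager used.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo of identity-based array keys versus content-based String keys.
final class RegionKeySketch {
    /** Looks up with a second array holding the same bytes; arrays compare by identity. */
    static boolean foundWithByteArrayKey(String name) {
        Map<byte[], String> states = new HashMap<>();
        states.put(name.getBytes(StandardCharsets.UTF_8), "PENDING_OPEN");
        return states.containsKey(name.getBytes(StandardCharsets.UTF_8));
    }

    /** Looks up with a distinct but equal String; Strings compare by content. */
    static boolean foundWithStringKey(String name) {
        Map<String, String> states = new HashMap<>();
        states.put(new String(name.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8),
                "PENDING_OPEN");
        return states.containsKey(name);
    }
}
```

With byte[] keys the usual workaround is a sorted map with an explicit byte-wise comparator; String keys avoid the need for one.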

BaseScanner
- checkAssigned now composes a server name as described above from serverAddress and startCode. It also no longer needs to compare start codes: ServerManager.getServerInfo looks up by server name, and because the startcode is part of the name, nothing is found unless the names match exactly, so a separate start code comparison is redundant. Finally, it calls the new static function HLog.getHLogDirectoryName instead of duplicating the code.
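
Why the separate start code comparison becomes redundant can be shown with a small sketch. The class CheckAssignedSketch and its methods are hypothetical and only mimic the idea: compose the name from the address and start code read from meta, then do an exact-match lookup against the live server names.

```java
import java.util.Set;

// Hypothetical sketch; not the BaseScanner code.
final class CheckAssignedSketch {
    /** Composes host_startcode_port from a "host:port" string and a start code. */
    static String serverName(String hostPort, long startCode) {
        int colon = hostPort.lastIndexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("expected host:port, got " + hostPort);
        }
        return hostPort.substring(0, colon) + "_" + startCode + "_" + hostPort.substring(colon + 1);
    }

    /**
     * True only if the exact instance recorded in meta is still live. A restarted
     * server at the same address has a different start code, hence a different
     * name, so the lookup misses without any explicit start code comparison.
     */
    static boolean isServedByLiveInstance(Set<String> liveServerNames,
                                          String hostPortFromMeta, long startCodeFromMeta) {
        return liveServerNames.contains(serverName(hostPortFromMeta, startCodeFromMeta));
    }
}
```

For example, if only rs1_200_60020 is live, a meta row recording rs1:60020 with start code 100 is treated as unassigned, even though a server is running at that address.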

TableOperation
- now builds a server name as described above to pass to isBeingServed and processScanItem
- isBeingServed, processScanItem now take a server name as described above rather than host:port and start code

ChangeTableState
- processScanItem now takes a server name as described above rather than host:port and start code

ModifyTableMeta
- added @SuppressWarnings("unused") for server name argument in processScanItem

TableDelete
- added @SuppressWarnings("unused") for server name argument in processScanItem

ColumnOperation
- added @SuppressWarnings("unused") for server name argument in processScanItem

ProcessRegionOpen
- now stores the HServerInfo passed to the constructor rather than the HServerAddress and start code contained therein
- added if (LOG.isDebugEnabled()) { around LOG.debug calls
- other changes are related to extracting the HServerAddress and start code fields from the HServerInfo
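
The point of wrapping LOG.debug calls in an isDebugEnabled check is to skip building the message when debug output is off. The sketch below shows the same pattern with java.util.logging (isLoggable(Level.FINE)) so that it runs without extra libraries; that substitution is an assumption of this example and not what the HBase code uses, and the class GuardedDebugSketch is hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch of a guarded debug call, using java.util.logging.
final class GuardedDebugSketch {
    static final Logger LOG = Logger.getLogger(GuardedDebugSketch.class.getName());
    static int describeCalls = 0; // counts how often the message was actually built

    /** Stands in for an expensive message construction. */
    static String describe(String regionName) {
        describeCalls++;
        return "region " + regionName + " opened";
    }

    static void logOpen(String regionName) {
        // Guard first: when FINE is disabled, describe() is never invoked.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine(describe(regionName));
        }
    }
}
```

Without the guard, the argument expression is evaluated before the logger can decide to discard it, which adds up on hot paths.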

ProcessServerShutdown
- no longer needs the HServerAddress, as it constructs a server name as described above for comparison with entries scanned from the meta region.

HLog
- Added new static getHLogDirectoryName methods that take an HServerInfo, a host:port string and start code, or a string that is a server name as described above




[jira] Updated: (HBASE-1157) If we do not take start code as a part of region server recovery, we could inadvertently try to reassign regions assigned to a restarted server with a different start code

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ryan rawson updated HBASE-1157:
-------------------------------

    Attachment: HBASE-1267-3.patch

OK, the previous approach does not work. Here is a new patch with the other approach.



[jira] Resolved: (HBASE-1157) If we do not take start code as a part of region server recovery, we could inadvertently try to reassign regions assigned to a restarted server with a different start code

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ryan rawson resolved HBASE-1157.
--------------------------------

    Resolution: Fixed

fixed in HBASE-1290



[jira] Commented: (HBASE-1157) If we do not take start code as a part of region server recovery, we could inadvertently try to reassign regions assigned to a restarted server with a different start code

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HBASE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667570#action_12667570 ] 

Jim Kellerman commented on HBASE-1157:
--------------------------------------

As 0.19.x does not allow a server with the same ip:port number to join the cluster until the previous instance has been recovered, this is not an issue for 0.19.x. However, if we intend to allow a new instance of the same server to start serving regions, we need to be able to differentiate between the dead instance and the new one.

This is my primary reason for making the startcode a part of the regionserver identification.

While we might use a different mechanism after ZK integration, the ZK integration still needs to account for what instance of the regionserver it is dealing with.




[jira] Work started: (HBASE-1157) If we do not take start code as a part of region server recovery, we could inadvertently try to reassign regions assigned to a restarted server with a different start code

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on HBASE-1157 started by Jim Kellerman.



[jira] Resolved: (HBASE-1157) If we do not take start code as a part of region server recovery, we could inadvertently try to reassign regions assigned to a restarted server with a different start code

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Kellerman resolved HBASE-1157.
----------------------------------

    Resolution: Fixed

Passes tests. Committed.



[jira] Reopened: (HBASE-1157) If we do not take start code as a part of region server recovery, we could inadvertently try to reassign regions assigned to a restarted server with a different start code

Posted by "ryan rawson (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ryan rawson reopened HBASE-1157:
--------------------------------


The web UI is broken now. It depends on the serversToServerInfo map: the key format has changed, but the JSP code has not.
