You are viewing a plain text version of this content. The canonical link for it is here.
Posted to yarn-issues@hadoop.apache.org by "zhihai xu (JIRA)" <ji...@apache.org> on 2015/02/03 10:59:37 UTC

[jira] [Commented] (YARN-1778) TestFSRMStateStore fails on trunk

    [ https://issues.apache.org/jira/browse/YARN-1778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14303027#comment-14303027 ] 

zhihai xu commented on YARN-1778:
---------------------------------

This test failed randomly. When it failed, the exception is "java.io.IOException: NameNode still not started"
 I find out the reason for this random error is
The NameNode constructor [set started flag|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java#L780] at the end.
{code}
    this.started.set(true);
{code}
And it starts [NameNodeRpcServer|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java#L765] by calling function initialize before started flag is set.
initialize=>startCommonServices=>[rpcServer.start|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java#L639]
{code}
    rpcServer.start();
{code}
If the client (which try to call mkdirs) connects to NameNode server before started flag is set,
the java.io.IOException: "[NameNode still not started|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeRpcServer.java#L1875]" will happen, then the test will fail.
If the client connects to NameNode server after started flag is set, the test will succeed.

Although we can fix this error by reordering the code in NameNode constructor: move rpcServer.start to the end before started flag is set,
It looks like it is the normal behavior for the name node. It will give exception when the clients connect before started flag is set. since the gap is very small, it will have little side effect and clients can also retry if exception happens.
I submitted a patch which add delay before call storeApplicationStateInternal and also not treat this exception(NameNode still not started) as an error.

The proof is given in the following logs:
{code}
2015-02-03 00:09:18,991 INFO  [Thread-2] namenode.NameNode (NameNode.java:<init>(754)) - create NameNode
2015-02-03 00:09:18,991 INFO  [Thread-2] namenode.NameNode (NameNode.java:setClientNamenodeAddress(352)) - fs.defaultFS is hdfs://localhost:57792
2015-02-03 00:09:18,991 INFO  [Thread-2] namenode.NameNode (NameNode.java:setClientNamenodeAddress(372)) - Clients are to use localhost:57792 to access this namenode/service.
2015-02-03 00:09:18,992 INFO  [Thread-2] namenode.NameNode (NameNode.java:<init>(766)) - create NameNode initialize
2015-02-03 00:09:18,996 INFO  [Thread-2] hdfs.DFSUtil (DFSUtil.java:httpServerTemplateForNNAndJN(1760)) - Starting Web-server for hdfs at: http://localhost:57791
2015-02-03 00:09:18,997 INFO  [Thread-2] http.HttpRequestLog (HttpRequestLog.java:getRequestLog(80)) - Http request log for http.requests.namenode is not defined
2015-02-03 00:09:18,997 INFO  [Thread-2] http.HttpServer2 (HttpServer2.java:addGlobalFilter(621)) - Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
2015-02-03 00:09:18,998 INFO  [Thread-2] http.HttpServer2 (HttpServer2.java:addFilter(599)) - Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context hdfs
2015-02-03 00:09:18,998 INFO  [Thread-2] http.HttpServer2 (HttpServer2.java:addFilter(606)) - Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
2015-02-03 00:09:18,999 INFO  [Thread-2] http.HttpServer2 (NameNodeHttpServer.java:initWebHdfs(86)) - Added filter 'org.apache.hadoop.hdfs.web.AuthFilter' (class=org.apache.hadoop.hdfs.web.AuthFilter)
2015-02-03 00:09:18,999 INFO  [Thread-2] http.HttpServer2 (HttpServer2.java:addJerseyResourcePackage(525)) - addJerseyResourcePackage: packageName=org.apache.hadoop.hdfs.server.namenode.web.resources;org.apache.hadoop.hdfs.web.resources, pathSpec=/webhdfs/v1/*
2015-02-03 00:09:19,000 INFO  [Thread-2] http.HttpServer2 (HttpServer2.java:openListeners(808)) - Jetty bound to port 57791
2015-02-03 00:09:19,000 INFO  [Thread-2] mortbay.log (Slf4jLog.java:info(67)) - jetty-6.1.26
2015-02-03 00:09:19,019 INFO  [Thread-2] mortbay.log (Slf4jLog.java:info(67)) - Started SelectChannelConnector@localhost:57791
2015-02-03 00:09:19,021 INFO  [Thread-2] namenode.FSNamesystem (FSNamesystem.java:<init>(721)) - No KeyProvider found.
2015-02-03 00:09:19,021 INFO  [Thread-2] namenode.FSNamesystem (FSNamesystem.java:<init>(731)) - fsLock is fair:true
2015-02-03 00:09:19,022 INFO  [Thread-2] blockmanagement.DatanodeManager (DatanodeManager.java:<init>(239)) - dfs.block.invalidate.limit=1000
2015-02-03 00:09:19,022 INFO  [Thread-2] blockmanagement.DatanodeManager (DatanodeManager.java:<init>(245)) - dfs.namenode.datanode.registration.ip-hostname-check=true
2015-02-03 00:09:19,022 INFO  [Thread-2] blockmanagement.BlockManager (InvalidateBlocks.java:printBlockDeletionTime(71)) - dfs.namenode.startup.delay.block.deletion.sec is set to 000:00:00:00.000
2015-02-03 00:09:19,023 INFO  [Thread-2] blockmanagement.BlockManager (InvalidateBlocks.java:printBlockDeletionTime(76)) - The block deletion will start around 2015 Feb 03 00:09:19
2015-02-03 00:09:19,023 INFO  [Thread-2] util.GSet (LightWeightGSet.java:computeCapacity(354)) - Computing capacity for map BlocksMap
2015-02-03 00:09:19,023 INFO  [Thread-2] util.GSet (LightWeightGSet.java:computeCapacity(355)) - VM type       = 64-bit
2015-02-03 00:09:19,023 INFO  [Thread-2] util.GSet (LightWeightGSet.java:computeCapacity(356)) - 2.0% max memory 3.6 GB = 72.8 MB
2015-02-03 00:09:19,023 INFO  [Thread-2] util.GSet (LightWeightGSet.java:computeCapacity(361)) - capacity      = 2^23 = 8388608 entries
2015-02-03 00:09:19,027 INFO  [Thread-2] blockmanagement.BlockManager (BlockManager.java:createBlockTokenSecretManager(365)) - dfs.block.access.token.enable=false
2015-02-03 00:09:19,027 INFO  [Thread-2] blockmanagement.BlockManager (BlockManager.java:<init>(350)) - defaultReplication         = 2
2015-02-03 00:09:19,027 INFO  [Thread-2] blockmanagement.BlockManager (BlockManager.java:<init>(351)) - maxReplication             = 512
2015-02-03 00:09:19,027 INFO  [Thread-2] blockmanagement.BlockManager (BlockManager.java:<init>(352)) - minReplication             = 1
2015-02-03 00:09:19,027 INFO  [Thread-2] blockmanagement.BlockManager (BlockManager.java:<init>(353)) - maxReplicationStreams      = 2
2015-02-03 00:09:19,027 INFO  [Thread-2] blockmanagement.BlockManager (BlockManager.java:<init>(354)) - shouldCheckForEnoughRacks  = false
2015-02-03 00:09:19,028 INFO  [Thread-2] blockmanagement.BlockManager (BlockManager.java:<init>(355)) - replicationRecheckInterval = 3000
2015-02-03 00:09:19,028 INFO  [Thread-2] blockmanagement.BlockManager (BlockManager.java:<init>(356)) - encryptDataTransfer        = false
2015-02-03 00:09:19,028 INFO  [Thread-2] blockmanagement.BlockManager (BlockManager.java:<init>(357)) - maxNumBlocksToLog          = 1000
2015-02-03 00:09:19,028 INFO  [Thread-2] namenode.FSNamesystem (FSNamesystem.java:<init>(751)) - fsOwner             = zxu (auth:SIMPLE)
2015-02-03 00:09:19,028 INFO  [Thread-2] namenode.FSNamesystem (FSNamesystem.java:<init>(752)) - supergroup          = supergroup
2015-02-03 00:09:19,028 INFO  [Thread-2] namenode.FSNamesystem (FSNamesystem.java:<init>(753)) - isPermissionEnabled = true
2015-02-03 00:09:19,028 INFO  [Thread-2] namenode.FSNamesystem (FSNamesystem.java:<init>(764)) - HA Enabled: false
2015-02-03 00:09:19,029 INFO  [Thread-2] util.GSet (LightWeightGSet.java:computeCapacity(354)) - Computing capacity for map INodeMap
2015-02-03 00:09:19,029 INFO  [Thread-2] util.GSet (LightWeightGSet.java:computeCapacity(355)) - VM type       = 64-bit
2015-02-03 00:09:19,029 INFO  [Thread-2] util.GSet (LightWeightGSet.java:computeCapacity(356)) - 1.0% max memory 3.6 GB = 36.4 MB
2015-02-03 00:09:19,029 INFO  [Thread-2] util.GSet (LightWeightGSet.java:computeCapacity(361)) - capacity      = 2^22 = 4194304 entries
2015-02-03 00:09:19,032 INFO  [Thread-2] namenode.FSDirectory (FSDirectory.java:<init>(217)) - ACLs enabled? false
2015-02-03 00:09:19,033 INFO  [Thread-2] namenode.FSDirectory (FSDirectory.java:<init>(221)) - XAttrs enabled? true
2015-02-03 00:09:19,033 INFO  [Thread-2] namenode.FSDirectory (FSDirectory.java:<init>(229)) - Maximum size of an xattr: 16384
2015-02-03 00:09:19,033 INFO  [Thread-2] namenode.NameNode (FSDirectory.java:<init>(273)) - Caching file names occuring more than 10 times
2015-02-03 00:09:19,033 INFO  [Thread-2] util.GSet (LightWeightGSet.java:computeCapacity(354)) - Computing capacity for map cachedBlocks
2015-02-03 00:09:19,033 INFO  [Thread-2] util.GSet (LightWeightGSet.java:computeCapacity(355)) - VM type       = 64-bit
2015-02-03 00:09:19,034 INFO  [Thread-2] util.GSet (LightWeightGSet.java:computeCapacity(356)) - 0.25% max memory 3.6 GB = 9.1 MB
2015-02-03 00:09:19,034 INFO  [Thread-2] util.GSet (LightWeightGSet.java:computeCapacity(361)) - capacity      = 2^20 = 1048576 entries
2015-02-03 00:09:19,035 INFO  [Thread-2] namenode.FSNamesystem (FSNamesystem.java:<init>(5010)) - dfs.namenode.safemode.threshold-pct = 0.9990000128746033
2015-02-03 00:09:19,035 INFO  [Thread-2] namenode.FSNamesystem (FSNamesystem.java:<init>(5011)) - dfs.namenode.safemode.min.datanodes = 0
2015-02-03 00:09:19,035 INFO  [Thread-2] namenode.FSNamesystem (FSNamesystem.java:<init>(5012)) - dfs.namenode.safemode.extension     = 0
2015-02-03 00:09:19,035 INFO  [Thread-2] metrics.TopMetrics (TopMetrics.java:logConf(65)) - NNTop conf: dfs.namenode.top.window.num.buckets = 10
2015-02-03 00:09:19,035 INFO  [Thread-2] metrics.TopMetrics (TopMetrics.java:logConf(67)) - NNTop conf: dfs.namenode.top.num.users = 10
2015-02-03 00:09:19,035 INFO  [Thread-2] metrics.TopMetrics (TopMetrics.java:logConf(69)) - NNTop conf: dfs.namenode.top.windows.minutes = 1,5,25
2015-02-03 00:09:19,036 INFO  [Thread-2] namenode.FSNamesystem (FSNamesystem.java:initRetryCache(901)) - Retry cache on namenode is enabled
2015-02-03 00:09:19,036 INFO  [Thread-2] namenode.FSNamesystem (FSNamesystem.java:initRetryCache(909)) - Retry cache will use 0.03 of total heap and retry cache entry expiry time is 600000 millis
2015-02-03 00:09:19,036 INFO  [Thread-2] util.GSet (LightWeightGSet.java:computeCapacity(354)) - Computing capacity for map NameNodeRetryCache
2015-02-03 00:09:19,036 INFO  [Thread-2] util.GSet (LightWeightGSet.java:computeCapacity(355)) - VM type       = 64-bit
2015-02-03 00:09:19,037 INFO  [Thread-2] util.GSet (LightWeightGSet.java:computeCapacity(356)) - 0.029999999329447746% max memory 3.6 GB = 1.1 MB
2015-02-03 00:09:19,037 INFO  [Thread-2] util.GSet (LightWeightGSet.java:computeCapacity(361)) - capacity      = 2^17 = 131072 entries
2015-02-03 00:09:19,039 INFO  [Thread-2] common.Storage (Storage.java:tryLock(715)) - Lock on /Users/zxu/upstream/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-dir/dfs/name1/in_use.lock acquired by nodename 50468@zxu-MBP.local
2015-02-03 00:09:19,040 INFO  [Thread-2] common.Storage (Storage.java:tryLock(715)) - Lock on /Users/zxu/upstream/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-dir/dfs/name2/in_use.lock acquired by nodename 50468@zxu-MBP.local
2015-02-03 00:09:19,042 INFO  [Thread-2] namenode.FileJournalManager (FileJournalManager.java:recoverUnfinalizedSegments(363)) - Recovering unfinalized segments in /Users/zxu/upstream/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-dir/dfs/name1/current
2015-02-03 00:09:19,043 INFO  [Thread-2] namenode.FileJournalManager (FileJournalManager.java:recoverUnfinalizedSegments(363)) - Recovering unfinalized segments in /Users/zxu/upstream/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-dir/dfs/name2/current
2015-02-03 00:09:19,050 INFO  [Thread-2] namenode.FSImageFormatPBINode (FSImageFormatPBINode.java:loadINodeSection(216)) - Loading 1 INodes.
2015-02-03 00:09:19,051 INFO  [Thread-2] namenode.FSImageFormatProtobuf (FSImageFormatProtobuf.java:load(181)) - Loaded FSImage in 0 seconds.
2015-02-03 00:09:19,051 INFO  [Thread-2] namenode.FSImage (FSImage.java:loadFSImage(943)) - Loaded image for txid 0 from /Users/zxu/upstream/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-dir/dfs/name1/current/fsimage_0000000000000000000
2015-02-03 00:09:19,051 INFO  [Thread-2] namenode.FSImage (FSImage.java:loadEdits(827)) - Reading org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream@6d3deca1 expecting start txid #1
2015-02-03 00:09:19,051 INFO  [Thread-2] namenode.FSImage (FSEditLogLoader.java:loadFSEdits(140)) - Start loading edits file /Users/zxu/upstream/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-dir/dfs/name1/current/edits_0000000000000000001-0000000000000000007, /Users/zxu/upstream/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-dir/dfs/name2/current/edits_0000000000000000001-0000000000000000007
2015-02-03 00:09:19,052 INFO  [Thread-2] namenode.EditLogInputStream (RedundantEditLogInputStream.java:nextOp(176)) - Fast-forwarding stream '/Users/zxu/upstream/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-dir/dfs/name1/current/edits_0000000000000000001-0000000000000000007' to transaction ID 1
2015-02-03 00:09:19,056 INFO  [Thread-2] namenode.FSImage (FSEditLogLoader.java:loadFSEdits(143)) - Edits file /Users/zxu/upstream/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-dir/dfs/name1/current/edits_0000000000000000001-0000000000000000007, /Users/zxu/upstream/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/target/test-dir/dfs/name2/current/edits_0000000000000000001-0000000000000000007 of size 507 edits # 7 loaded in 0 seconds
2015-02-03 00:09:19,056 INFO  [Thread-2] namenode.FSNamesystem (FSNamesystem.java:loadFSImage(978)) - Need to save fs image? false (staleImage=false, haEnabled=false, isRollingUpgrade=false)
2015-02-03 00:09:19,057 INFO  [Thread-2] namenode.FSEditLog (FSEditLog.java:startLogSegment(1234)) - Starting log segment at 8
2015-02-03 00:09:19,059 INFO  [Thread-2] namenode.NameCache (NameCache.java:initialized(143)) - initialized with 0 entries 0 lookups
2015-02-03 00:09:19,060 INFO  [Thread-2] namenode.FSNamesystem (FSNamesystem.java:loadFromDisk(693)) - Finished loading FSImage in 22 msecs
2015-02-03 00:09:19,060 INFO  [Thread-2] namenode.NameNode (NameNodeRpcServer.java:<init>(339)) - RPC server is binding to localhost:57792
2015-02-03 00:09:19,060 INFO  [Thread-2] ipc.CallQueueManager (CallQueueManager.java:<init>(53)) - Using callQueue class java.util.concurrent.LinkedBlockingQueue
2015-02-03 00:09:19,061 INFO  [Socket Reader #1 for port 57792] ipc.Server (Server.java:run(609)) - Starting Socket Reader #1 for port 57792
2015-02-03 00:09:19,063 INFO  [Thread-2] namenode.FSNamesystem (FSNamesystem.java:registerMBean(5872)) - Registered FSNamesystemState MBean
2015-02-03 00:09:19,075 INFO  [Thread-2] namenode.LeaseManager (LeaseManager.java:getNumUnderConstructionBlocks(132)) - Number of blocks under construction: 0
2015-02-03 00:09:19,075 INFO  [Thread-2] namenode.LeaseManager (LeaseManager.java:getNumUnderConstructionBlocks(132)) - Number of blocks under construction: 0
2015-02-03 00:09:19,075 INFO  [Thread-2] namenode.FSNamesystem (FSNamesystem.java:initializeReplQueues(1175)) - initializing replication queues
2015-02-03 00:09:19,076 INFO  [Thread-2] hdfs.StateChange (FSNamesystem.java:leave(5084)) - STATE* Leaving safe mode after 0 secs
2015-02-03 00:09:19,076 INFO  [Thread-2] hdfs.StateChange (FSNamesystem.java:leave(5095)) - STATE* Network topology has 0 racks and 0 datanodes
2015-02-03 00:09:19,076 INFO  [Thread-2] hdfs.StateChange (FSNamesystem.java:leave(5098)) - STATE* UnderReplicatedBlocks has 0 blocks
2015-02-03 00:09:19,081 INFO  [Replication Queue Initializer] blockmanagement.BlockManager (BlockManager.java:processMisReplicatesAsync(2660)) - Total number of blocks            = 0
2015-02-03 00:09:19,081 INFO  [Replication Queue Initializer] blockmanagement.BlockManager (BlockManager.java:processMisReplicatesAsync(2661)) - Number of invalid blocks          = 0
2015-02-03 00:09:19,081 INFO  [Replication Queue Initializer] blockmanagement.BlockManager (BlockManager.java:processMisReplicatesAsync(2662)) - Number of under-replicated blocks = 0
2015-02-03 00:09:19,081 INFO  [Replication Queue Initializer] blockmanagement.BlockManager (BlockManager.java:processMisReplicatesAsync(2663)) - Number of  over-replicated blocks = 0
2015-02-03 00:09:19,081 INFO  [Replication Queue Initializer] blockmanagement.BlockManager (BlockManager.java:processMisReplicatesAsync(2665)) - Number of blocks being written    = 0
2015-02-03 00:09:19,081 INFO  [Replication Queue Initializer] hdfs.StateChange (BlockManager.java:processMisReplicatesAsync(2666)) - STATE* Replication Queue initialization scan for invalid, over- and under-replicated blocks completed in 5 msec
2015-02-03 00:09:19,086 INFO  [IPC Server Responder] ipc.Server (Server.java:run(838)) - IPC Server Responder: starting
2015-02-03 00:09:19,086 INFO  [IPC Server listener on 57792] ipc.Server (Server.java:run(678)) - IPC Server listener on 57792: starting
2015-02-03 00:09:19,087 INFO  [Thread-2] namenode.NameNode (NameNode.java:startCommonServices(649)) - NameNode RPC up at: localhost/127.0.0.1:57792
2015-02-03 00:09:19,088 INFO  [Thread-2] namenode.NameNode (NameNode.java:<init>(768)) - create NameNode haContext lock
2015-02-03 00:09:19,088 INFO  [Thread-2] namenode.FSNamesystem (FSNamesystem.java:startActiveServices(1091)) - Starting services required for active state
2015-02-03 00:09:19,088 INFO  [IPC Server handler 0 on 57792] namenode.NameNode (NameNode.java:isStarted(1758)) - isStarted false
2015-02-03 00:09:19,089 INFO  [Thread-2] namenode.NameNode (NameNode.java:<init>(784)) - create NameNode end true
2015-02-03 00:09:19,089 INFO  [CacheReplicationMonitor(32532357)] blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(160)) - Starting CacheReplicationMonitor with interval 30000 milliseconds
2015-02-03 00:09:19,089 WARN  [Thread-2] hdfs.MiniDFSCluster (MiniDFSCluster.java:waitClusterUp(1189)) - Waiting for the Mini HDFS Cluster to start...
2015-02-03 00:09:19,089 INFO  [IPC Server handler 0 on 57792] ipc.Server (Server.java:run(2155)) - IPC Server handler 0 on 57792, call org.apache.hadoop.hdfs.protocol.ClientProtocol.mkdirs from 127.0.0.1:57805 Call#14 Retry#1
java.io.IOException: NameNode still not started
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkNNStartup(NameNodeRpcServer.java:1876)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:971)
{code}


> TestFSRMStateStore fails on trunk
> ---------------------------------
>
>                 Key: YARN-1778
>                 URL: https://issues.apache.org/jira/browse/YARN-1778
>             Project: Hadoop YARN
>          Issue Type: Test
>            Reporter: Xuan Gong
>            Assignee: zhihai xu
>         Attachments: YARN-1778.000.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)