Posted to common-dev@hadoop.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2008/10/28 14:04:44 UTC

[jira] Created: (HADOOP-4532) Interrupting the namenode thread triggers System.exit()

Interrupting the namenode thread triggers System.exit()
-------------------------------------------------------

                 Key: HADOOP-4532
                 URL: https://issues.apache.org/jira/browse/HADOOP-4532
             Project: Hadoop Core
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.20.0
            Reporter: Steve Loughran
            Priority: Minor


My service setup/teardown tests are managing to trigger system exits in the namenode, which seems overkill.

1. Interrupting the thread that is starting the namesystem up raises a java.nio.channels.ClosedByInterruptException.
2. This is caught in FSImage.rollFSImage, and handed off to processIOError
3. This triggers a call to Runtime.getRuntime().exit(-1) with the message "All storage directories are inaccessible."

Stack trace to follow. Exiting the JVM is somewhat overkill; if someone has interrupted the thread, it is (presumably) because they want to stop the namenode, which does not necessarily mean they want to kill the JVM at the same time. Certainly JUnit does not expect it.

Some possibilities:
 - ClosedByInterruptException could be handled differently, as some form of shutdown request.
 - Calls to System.exit() could be factored out into something whose behaviour can be changed by policy options, e.g. to throw a RuntimeException instead.
Hosting the Namenode in a security manager that blocks System.exit() is the simplest workaround, but it means that what would have been a straight exit now turns into an exception, so callers may be surprised by what happens.
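The second possibility listed above (factoring out the exit calls) could look something like the following sketch. The class and method names (Exiter, ExitException, terminate) are illustrative only, not existing Hadoop classes:

```java
// Hypothetical sketch: route all JVM-exit requests through one helper whose
// policy can be switched for tests. Not actual Hadoop code.
public final class Exiter {

    /** Thrown instead of exiting when exits are disabled (e.g. under JUnit). */
    public static class ExitException extends RuntimeException {
        public final int status;
        public ExitException(int status, String msg) {
            super(msg);
            this.status = status;
        }
    }

    private static volatile boolean exitDisabled = false;

    /** Tests call this so exit attempts raise an exception instead of killing the JVM. */
    public static void disableSystemExit() {
        exitDisabled = true;
    }

    /** Replacement for direct Runtime.getRuntime().exit(status) calls. */
    public static void terminate(int status, String msg) {
        if (exitDisabled) {
            throw new ExitException(status, msg);
        }
        Runtime.getRuntime().exit(status);
    }
}
```

Production code would call Exiter.terminate(-1, "All storage directories are inaccessible.") where it currently calls Runtime.getRuntime().exit(-1); a test harness would call disableSystemExit() first and catch ExitException.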

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4532) Interrupting the namenode thread triggers System.exit()

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643197#action_12643197 ] 

Steve Loughran commented on HADOOP-4532:
----------------------------------------

This is highly race-condition dependent; with a small change in machine timing, the interrupt is handled gracefully by reporting failure to the caller.

[sf-startdaemon-debug] 08/10/28 13:15:07 [Thread-6] WARN util.Shell : Interrupted while reading the error stream
[sf-startdaemon-debug] java.lang.InterruptedException
[sf-startdaemon-debug]  at java.lang.Object.wait(Native Method)
[sf-startdaemon-debug]  at java.lang.Thread.join(Thread.java:1143)
[sf-startdaemon-debug]  at java.lang.Thread.join(Thread.java:1196)
[sf-startdaemon-debug]  at org.apache.hadoop.util.Shell.runCommand(Shell.java:189)
[sf-startdaemon-debug]  at org.apache.hadoop.util.Shell.run(Shell.java:134)
[sf-startdaemon-debug]  at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)
[sf-startdaemon-debug]  at org.apache.hadoop.util.Shell.execCommand(Shell.java:338)
[sf-startdaemon-debug]  at org.apache.hadoop.security.UnixUserGroupInformation.executeShellCommand(UnixUserGroupInformation.java:326)
[sf-startdaemon-debug]  at org.apache.hadoop.security.UnixUserGroupInformation.getUnixUserName(UnixUserGroupInformation.java:305)
[sf-startdaemon-debug]  at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:232)
[sf-startdaemon-debug]  at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
[sf-startdaemon-debug]  at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
[sf-startdaemon-debug]  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setConfigurationParameters(FSNamesystem.java:426)
[sf-startdaemon-debug]  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:304)
[sf-startdaemon-debug]  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:290)
[sf-startdaemon-debug]  at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:165)
[sf-startdaemon-debug]  at org.apache.hadoop.hdfs.server.namenode.NameNode.innerStart(NameNode.java:226)

Given that I am interrupting threads during their initialization, I shouldn't expect things to work. But stopping the entire JVM is probably inappropriate.
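The first possibility from the issue description (treating ClosedByInterruptException as a shutdown request rather than a disk failure) could be sketched as a small wrapper around interruptible I/O. This is an illustration, not the actual FSImage code; the class and method names are invented:

```java
import java.io.IOException;
import java.nio.channels.ClosedByInterruptException;
import java.util.concurrent.Callable;

// Hypothetical helper: run an interruptible I/O action and surface an
// interrupt as an ordinary failure the caller can handle, instead of
// letting it escalate to processIOError and a JVM exit.
public final class InterruptAwareIO {

    public static <T> T run(Callable<T> action) throws IOException {
        try {
            return action.call();
        } catch (ClosedByInterruptException e) {
            // The thread was interrupted mid-write: restore the interrupt
            // flag and report failure so the caller can shut down cleanly.
            Thread.currentThread().interrupt();
            throw new IOException("shutdown requested during I/O", e);
        } catch (IOException e) {
            throw e;
        } catch (Exception e) {
            throw new IOException(e);
        }
    }
}
```

Note that ClosedByInterruptException must be caught before IOException, since it is a subclass; the catch order above matters.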


[jira] Commented: (HADOOP-4532) Interrupting the namenode thread triggers System.exit()

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643361#action_12643361 ] 

dhruba borthakur commented on HADOOP-4532:
------------------------------------------

> ClosedByInterruptException get handled differently as some form of shutdown request
  This is closest to what we currently have in the NameNode.

> Hosting a Namenode in a security manager that blocks off System.exit() 
  A Security Manager imposes additional performance overhead, doesn't it?


[jira] Commented: (HADOOP-4532) Interrupting the namenode thread triggers System.exit()

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643463#action_12643463 ] 

Steve Loughran commented on HADOOP-4532:
----------------------------------------

>> Hosting a Namenode in a security manager that blocks off System.exit()
>A Security Manager imposes additional performance overhead, doesn't it?

I've moved to the security manager hosting; it doesn't impose much overhead, since only the permission check for System.exit() is intercepted.

It's still a bit odd to kill the process when startup fails, however: if you run <junit fork="true"/>, your forked test process gets killed. In Ant, <junit fork="false"/> runs tests under a security manager purely to stop anyone calling System.exit() in their code.
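A minimal sketch of such a security manager is below. It assumes nothing about the actual SmartFrog implementation; the class names are illustrative. (SecurityManager was current in the Java 6 era of this report; it is deprecated in modern JDKs.)

```java
// Illustrative sketch: a SecurityManager that permits everything except
// JVM exit, turning System.exit() into a catchable exception.
public class NoExitSecurityManager extends SecurityManager {

    public static class ExitTrappedException extends SecurityException {
        public final int status;
        public ExitTrappedException(int status) {
            super("System.exit(" + status + ") intercepted");
            this.status = status;
        }
    }

    @Override
    public void checkExit(int status) {
        // Turn the exit attempt into an exception the caller can catch.
        throw new ExitTrappedException(status);
    }

    @Override
    public void checkPermission(java.security.Permission perm) {
        // Allow everything else; only exit is intercepted, which is why
        // the runtime overhead of hosting under this manager is negligible.
    }
}
```

Installed via System.setSecurityManager(new NoExitSecurityManager()), any subsequent System.exit() call raises ExitTrappedException instead of terminating the JVM.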


[jira] Commented: (HADOOP-4532) Interrupting the namenode thread triggers System.exit()

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643194#action_12643194 ] 

Steve Loughran commented on HADOOP-4532:
----------------------------------------

Stack trace: FSImage does not like being interrupted.

[sf-startdaemon-debug] 08/10/28 12:50:22 [Thread-305] ERROR common.Storage : Cannot write file /tmp/hadoop/dfs/name
[sf-startdaemon-debug] java.nio.channels.ClosedByInterruptException
[sf-startdaemon-debug]  at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184)
[sf-startdaemon-debug]  at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:271)
[sf-startdaemon-debug]  at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.write(Storage.java:268)
[sf-startdaemon-debug]  at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.write(Storage.java:244)
[sf-startdaemon-debug]  at org.apache.hadoop.hdfs.server.namenode.FSImage.rollFSImage(FSImage.java:1316)
[sf-startdaemon-debug]  at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImage(FSImage.java:1034)
[sf-startdaemon-debug]  at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:88)
[sf-startdaemon-debug]  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
[sf-startdaemon-debug]  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:290)
[sf-startdaemon-debug]  at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:165)
[sf-startdaemon-debug]  at org.apache.hadoop.hdfs.server.namenode.NameNode.innerStart(NameNode.java:226)
[sf-startdaemon-debug]  at org.apache.hadoop.util.Service.start(Service.java:188)
[sf-startdaemon-debug]  at org.smartfrog.services.hadoop.components.cluster.HadoopServiceImpl.innerDeploy(HadoopServiceImpl.java:479)
[sf-startdaemon-debug]  at org.smartfrog.services.hadoop.components.cluster.HadoopServiceImpl.access$000(HadoopServiceImpl.java:46)
[sf-startdaemon-debug]  at org.smartfrog.services.hadoop.components.cluster.HadoopServiceImpl$ServiceDeployerThread.execute(HadoopServiceImpl.java:628)



[jira] Commented: (HADOOP-4532) Interrupting the namenode thread triggers System.exit()

Posted by "Steve Loughran (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12643246#action_12643246 ] 

Steve Loughran commented on HADOOP-4532:
----------------------------------------

Intercepting System.exit() calls stops this; instead of exiting, the failure now becomes a transition to the FAILED state.
http://jira.smartfrog.org/jira/browse/SFOS-1016

[sf-startdaemon-debug] 08/10/28 16:18:21 [Thread-305] INFO common.Storage : Image file of size 93 saved in 0 seconds.
[sf-startdaemon-debug] 08/10/28 16:18:21 [Thread-305] ERROR namenode.FSNamesystem : FSNamesystem initialization failed.
[sf-startdaemon-debug] java.nio.channels.ClosedByInterruptException
[sf-startdaemon-debug]  at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184)
[sf-startdaemon-debug]  at sun.nio.ch.FileChannelImpl.size(FileChannelImpl.java:317)
[sf-startdaemon-debug]  at org.apache.hadoop.hdfs.server.namenode.FSEditLog$EditLogFileOutputStream.<init>(FSEditLog.java:128)
[sf-startdaemon-debug]  at org.apache.hadoop.hdfs.server.namenode.FSEditLog.createEditLogFile(FSEditLog.java:343)
[sf-startdaemon-debug]  at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImage(FSImage.java:1030)
[sf-startdaemon-debug]  at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:88)
[sf-startdaemon-debug]  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
[sf-startdaemon-debug]  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:290)
[sf-startdaemon-debug]  at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:165)
[sf-startdaemon-debug]  at org.apache.hadoop.hdfs.server.namenode.NameNode.innerStart(NameNode.java:226)
[sf-startdaemon-debug]  at org.apache.hadoop.util.Service.start(Service.java:188)
[sf-startdaemon-debug]  at org.smartfrog.services.hadoop.components.cluster.HadoopServiceImpl.innerDeploy(HadoopServiceImpl.java:479)
[sf-startdaemon-debug]  at org.smartfrog.services.hadoop.components.cluster.HadoopServiceImpl.access$000(HadoopServiceImpl.java:46)
[sf-startdaemon-debug]  at org.smartfrog.services.hadoop.components.cluster.HadoopServiceImpl$ServiceDeployerThread.execute(HadoopServiceImpl.java:628)
[sf-startdaemon-debug]  at org.smartfrog.sfcore.utils.SmartFrogThread.run(SmartFrogThread.java:279)
[sf-startdaemon-debug]  at org.smartfrog.sfcore.utils.WorkflowThread.run(WorkflowThread.java:117)
[sf-startdaemon-debug] 08/10/28 16:18:21 [Thread-305] INFO namenode.NameNode : State change: NameNode is now FAILED

