Posted to yarn-issues@hadoop.apache.org by "Jim Brennan (JIRA)" <ji...@apache.org> on 2018/06/20 14:21:00 UTC

[jira] [Updated] (YARN-8444) NodeResourceMonitor crashes on bad swapFree value

     [ https://issues.apache.org/jira/browse/YARN-8444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Brennan updated YARN-8444:
------------------------------
    Description: 
Saw this on a node that was running out of memory. Can't have NodeResourceMonitor exiting. System was above 99% memory used at the time, so this is not a common occurrence, but we should fix it, since this monitor is critical to the health of the node.

 
{noformat}
2018-06-04 14:28:08,539 [Container Monitor] DEBUG ContainersMonitorImpl.audit: Memory usage of ProcessTree 110564 for container-id container_e24_1526662705797_129647_01_004791: 2.1 GB of 3.5 GB physical memory used; 5.0 GB of 7.3 GB virtual memory used
2018-06-04 14:28:10,622 [Node Resource Monitor] ERROR yarn.YarnUncaughtExceptionHandler: Thread Thread[Node Resource Monitor,5,main] threw an Exception.
java.lang.NumberFormatException: For input string: "18446744073709551596"
 at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
 at java.lang.Long.parseLong(Long.java:592)
 at java.lang.Long.parseLong(Long.java:631)
 at org.apache.hadoop.util.SysInfoLinux.readProcMemInfoFile(SysInfoLinux.java:257)
 at org.apache.hadoop.util.SysInfoLinux.getAvailablePhysicalMemorySize(SysInfoLinux.java:591)
 at org.apache.hadoop.util.SysInfoLinux.getAvailableVirtualMemorySize(SysInfoLinux.java:601)
 at org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.getAvailableVirtualMemorySize(ResourceCalculatorPlugin.java:74)
 at org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl$MonitoringThread.run(NodeResourceMonitorImpl.java:193)
2018-06-04 14:28:30,747 [org.apache.hadoop.util.JvmPauseMonitor$Monitor@226eba67] INFO util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 9330ms
{noformat}
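
To illustrate the failure in the trace above: the input string is larger than Long.MAX_VALUE, so Long.parseLong throws. Read as an unsigned 64-bit number, the same value is -20 as a signed long, which may suggest a small negative count that was printed as unsigned. The following is only a minimal sketch of one defensive option, not the change made for this issue; the class name SafeMemInfoParse, the helper parseOrDefault, and the fallback of 0 are all hypothetical and do not come from SysInfoLinux.

```java
import java.math.BigInteger;

public class SafeMemInfoParse {

    /**
     * Hypothetical helper (not part of Hadoop): parse a decimal string as a
     * signed long, returning the fallback when the text does not fit.
     */
    public static long parseOrDefault(String value, long fallback) {
        try {
            return Long.parseLong(value);
        } catch (NumberFormatException e) {
            // Out-of-range or malformed input: use the fallback rather than
            // letting the exception escape the calling thread.
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Input string taken from the NumberFormatException in the log.
        String bad = "18446744073709551596";

        // The value exceeds Long.MAX_VALUE, which is why Long.parseLong rejects it.
        BigInteger v = new BigInteger(bad);
        System.out.println(v.compareTo(BigInteger.valueOf(Long.MAX_VALUE)) > 0); // true

        // Interpreted as unsigned 64-bit, the bits equal -20 as a signed long.
        System.out.println(Long.parseUnsignedLong(bad)); // -20

        // With the hypothetical helper, the bad value becomes the fallback.
        System.out.println(parseOrDefault(bad, 0L));    // 0
        System.out.println(parseOrDefault("1024", 0L)); // 1024
    }
}
```

Whether a fallback of 0 is the right choice for a field like swapFree is a judgment call; the point of the sketch is only that the parse error can be contained so the monitoring thread keeps running.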

  was:
Saw this on a node that was having difficulty preempting containers. Can't have NodeResourceMonitor exiting. System was above 99% memory used at the time, so it may only be something that happens when normal preemption isn't working right, but we should fix it, since this monitor is critical to the health of the node.

 

{noformat}
2018-06-04 14:28:08,539 [Container Monitor] DEBUG ContainersMonitorImpl.audit: Memory usage of ProcessTree 110564 for container-id container_e24_1526662705797_129647_01_004791: 2.1 GB of 3.5 GB physical memory used; 5.0 GB of 7.3 GB virtual memory used
2018-06-04 14:28:10,622 [Node Resource Monitor] ERROR yarn.YarnUncaughtExceptionHandler: Thread Thread[Node Resource Monitor,5,main] threw an Exception.
java.lang.NumberFormatException: For input string: "18446744073709551596"
 at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
 at java.lang.Long.parseLong(Long.java:592)
 at java.lang.Long.parseLong(Long.java:631)
 at org.apache.hadoop.util.SysInfoLinux.readProcMemInfoFile(SysInfoLinux.java:257)
 at org.apache.hadoop.util.SysInfoLinux.getAvailablePhysicalMemorySize(SysInfoLinux.java:591)
 at org.apache.hadoop.util.SysInfoLinux.getAvailableVirtualMemorySize(SysInfoLinux.java:601)
 at org.apache.hadoop.yarn.util.ResourceCalculatorPlugin.getAvailableVirtualMemorySize(ResourceCalculatorPlugin.java:74)
 at org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl$MonitoringThread.run(NodeResourceMonitorImpl.java:193)
2018-06-04 14:28:30,747 [org.apache.hadoop.util.JvmPauseMonitor$Monitor@226eba67] INFO util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 9330ms
{noformat}


> NodeResourceMonitor crashes on bad swapFree value
> -------------------------------------------------
>
>                 Key: YARN-8444
>                 URL: https://issues.apache.org/jira/browse/YARN-8444
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.8.3, 3.0.2
>            Reporter: Jim Brennan
>            Assignee: Jim Brennan
>            Priority: Major


