Posted to mapreduce-dev@hadoop.apache.org by "Chris Riccomini (JIRA)" <ji...@apache.org> on 2011/09/22 18:31:26 UTC

[jira] [Resolved] (MAPREDUCE-3065) ApplicationMaster killed by NodeManager due to excessive virtual memory consumption

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Riccomini resolved MAPREDUCE-3065.
----------------------------------------

    Resolution: Fixed

Per Vinod, new ticket to track this: https://issues.apache.org/jira/browse/MAPREDUCE-3068

> ApplicationMaster killed by NodeManager due to excessive virtual memory consumption
> -----------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3065
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3065
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.24.0
>            Reporter: Chris Riccomini
>
> > Hey Vinod,
> > 
> > OK, so I have a little more clarity on this.
> > 
> > When I bump my resource request for my AM to 4096, it runs. The important line in the NM logs is:
> > 
> > 2011-09-21 13:43:44,366 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(402)) - Memory usage of ProcessTree 25656 for container-id container_1316637655278_0001_01_000001 : Virtual 2260938752 bytes, limit : 4294967296 bytes; Physical 120860672 bytes, limit -1 bytes
> > 
> > The thing to note is the virtual memory, which is off the charts at about 2.1
> > gigs (which is why it fits under the 4096 limit but got killed under 2048),
> > even though my physical memory is almost nothing (~115 megs). I'm still poking
> > around the code, but I notice that there are two checks in the NM: one for
> > virtual mem and one for physical mem. The virtual memory check appears to be
> > toggle-able, but presumably defaults to on.
> > 
> > At this point I'm trying to figure out exactly what the VMEM check is for, why YARN thinks my app is taking 2 gigs, and how to fix this.
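> > 
> > If the check is indeed toggle-able, I'd expect NM-side configuration along
> > these lines. The property names are my guess, so treat this as a sketch:
> > 
> >   // hypothetical NM-side settings (read from yarn-site.xml), not something the AM can set
> >   conf.setBoolean("yarn.nodemanager.vmem-check-enabled", false) // disable the vmem check entirely
> >   conf.setFloat("yarn.nodemanager.vmem-pmem-ratio", 4.0f)       // or allow more vmem per unit of pmem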
> > 
> > Cheers,
> > Chris
> > ________________________________________
> > From: Chris Riccomini [criccomini@linkedin.com]
> > Sent: Wednesday, September 21, 2011 1:42 PM
> > To: mapreduce-dev@hadoop.apache.org
> > Subject: Re: ApplicationMaster Memory Usage
> > 
> > For the record, I bumped the memory resource request to 4096, and it works.
> > :(
> > 
> > 
> > On 9/21/11 1:32 PM, "Chris Riccomini" <cr...@linkedin.com> wrote:
> > 
> >> Hey Vinod,
> >> 
> >> So, I ran my application master directly from the CLI. I commented out the
> >> YARN-specific code. It runs fine without leaking memory.
> >> 
> >> I then ran it from YARN, with all YARN-specific code commented out. It again
> >> ran fine.
> >> 
> >> I then uncommented JUST my registerWithResourceManager call. It then fails
> >> with OOM after a few seconds. I call registerWithResourceManager, and then go
> >> into a while (true) { println("yeh"); Thread.sleep(1000) }. Doing this prints:
> >> 
> >> yeh
> >> yeh
> >> yeh
> >> yeh
> >> yeh
> >> 
> >> At which point it dies, and in the NodeManager I see:
> >> 
> >> 2011-09-21 13:24:51,036 WARN  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:isProcessTreeOverLimit(289)) - Process tree for container: container_1316626117280_0005_01_000001 has processes older than 1 iteration running over the configured limit. Limit=2147483648, current usage = 2192773120
> >> 2011-09-21 13:24:51,037 WARN  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(453)) - Container [pid=23852,containerID=container_1316626117280_0005_01_000001] is running beyond memory-limits. Current usage : 2192773120bytes. Limit : 2147483648bytes. Killing container.
> >> Dump of the process-tree for container_1316626117280_0005_01_000001 :
> >> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> >> |- 23852 20570 23852 23852 (bash) 0 0 108638208 303 /bin/bash -c java -Xmx512M -cp './package/*' kafka.yarn.ApplicationMaster /home/criccomi/git/kafka-yarn/dist/kafka-streamer.tgz 5 1 1316626117280 com.linkedin.TODO 1 1>/tmp/logs/application_1316626117280_0005/container_1316626117280_0005_01_000001/stdout 2>/tmp/logs/application_1316626117280_0005/container_1316626117280_0005_01_000001/stderr
> >> |- 23855 23852 23852 23852 (java) 81 4 2084134912 14772 java -Xmx512M -cp ./package/* kafka.yarn.ApplicationMaster /home/criccomi/git/kafka-yarn/dist/kafka-streamer.tgz 5 1 1316626117280 com.linkedin.TODO 1
> >> 2011-09-21 13:24:51,037 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(463)) - Removed ProcessTree with root 23852
> >> 
> >> Either something is leaking in YARN, or my registerWithResourceManager code
> >> (see below) is doing something funky.
> >> 
> >> I'm trying to avoid going through all the pain of attaching a remote debugger.
> >> Presumably things aren't leaking in YARN, which means it's likely that I'm
> >> doing something wrong in my registration code.
> >> 
> >> Incidentally, my NodeManager is running with 1000 megs. My application master
> >> memory is set to 2048, and my -Xmx setting is 512M.
> >> 
> >> Cheers,
> >> Chris
> >> ________________________________________
> >> From: Vinod Kumar Vavilapalli [vinodkv@hortonworks.com]
> >> Sent: Wednesday, September 21, 2011 11:52 AM
> >> To: mapreduce-dev@hadoop.apache.org
> >> Subject: Re: ApplicationMaster Memory Usage
> >> 
> >> Actually, MAPREDUCE-2998 only applies to MRv2 itself, so it isn't related here.
> >> 
> >> Somehow, your JVM itself is taking up that much virtual memory. Are you
> >> loading some native libs?
> >> 
> >> And how many containers have already been allocated by the time the AM
> >> crashes? Maybe you are accumulating some per-container data. You can try
> >> dumping the heap via hprof.
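> >> 
> >> For example, with the stock HPROF agent (adjust the options to taste):
> >> 
> >>   java -agentlib:hprof=heap=sites,depth=10 -Xmx512M -cp './package/*' kafka.yarn.ApplicationMaster ...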
> >> 
> >> +Vinod
> >> 
> >> 
> >> On Wed, Sep 21, 2011 at 11:21 PM, Chris Riccomini
> >> <cr...@linkedin.com> wrote:
> >> 
> >>> Hey Vinod,
> >>> 
> >>> I svn up'd, and rebuilt. My application's task (container) now runs!
> >>> 
> >>> Unfortunately, my application master eventually gets killed by the
> >>> NodeManager anyway, and I'm still not clear as to why. The AM is just
> >>> running a loop, asking for a container, and executing a command in the
> >>> container. It keeps doing this over and over again. After a few iterations,
> >>> it gets killed with something like:
> >>> 
> >>> 2011-09-21 10:42:40,869 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(402)) - Memory usage of ProcessTree 21666 for container-id container_1316626117280_0002_01_000001 : Virtual 2260938752 bytes, limit : 2147483648 bytes; Physical 77398016 bytes, limit -1 bytes
> >>> 2011-09-21 10:42:40,869 WARN  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:isProcessTreeOverLimit(289)) - Process tree for container: container_1316626117280_0002_01_000001 has processes older than 1 iteration running over the configured limit. Limit=2147483648, current usage = 2260938752
> >>> 2011-09-21 10:42:40,870 WARN  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(453)) - Container [pid=21666,containerID=container_1316626117280_0002_01_000001] is running beyond memory-limits. Current usage : 2260938752bytes. Limit : 2147483648bytes. Killing container.
> >>> Dump of the process-tree for container_1316626117280_0002_01_000001 :
> >>>        |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> >>>        |- 21669 21666 21666 21666 (java) 105 4 2152300544 18593 java -Xmx512M -cp ./package/* kafka.yarn.ApplicationMaster /home/criccomi/git/kafka-yarn/dist/kafka-streamer.tgz 2 1 1316626117280 com.linkedin.TODO 1
> >>>        |- 21666 20570 21666 21666 (bash) 0 0 108638208 303 /bin/bash -c java -Xmx512M -cp './package/*' kafka.yarn.ApplicationMaster /home/criccomi/git/kafka-yarn/dist/kafka-streamer.tgz 2 1 1316626117280 com.linkedin.TODO 1 1>/tmp/logs/application_1316626117280_0002/container_1316626117280_0002_01_000001/stdout 2>/tmp/logs/application_1316626117280_0002/container_1316626117280_0002_01_000001/stderr
> >>> 
> >>> 2011-09-21 10:42:40,870 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(463)) - Removed ProcessTree with root 21666
> >>> 
> >>> I don't think that my AM is leaking memory. Full code paste after the break.
> >>> 
> >>> 1. Do I need to release a container in my AM even if the AM receives it as
> >>> a finished container in the resource request response? (My guess is sketched
> >>> below.)
> >>> 2. Do I need to free any other resources after a resource request (e.g.
> >>> ResourceRequest, AllocateRequest, etc)?
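> >>> 
> >>> For question 1, if an explicit release is needed, my guess (unverified) is
> >>> something like this in the loop:
> >>> 
> >>>   // hypothetical: hand completed container ids back to the RM on the next allocate
> >>>   // (assumes scala.collection.JavaConversions is in scope, as in the loop below)
> >>>   val toRelease = response.getCompletedContainersStatuses.map(_.getContainerId).toList
> >>>   applicationMasterHelper.sendResourceRequest(nonRunningTasks, toRelease)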
> >>> 
> >>> Cheers,
> >>> Chris
> >>> 
> >>> 
> >>> def main(args: Array[String]) {
> >>>   // YARN will always give our ApplicationMaster the first four
> >>>   // parameters as: <package> <app id> <attempt id> <timestamp>
> >>>   val packagePath = args(0)
> >>>   val appId = args(1).toInt
> >>>   val attemptId = args(2).toInt
> >>>   val timestamp = args(3).toLong
> >>> 
> >>>   // these are our application master's parameters
> >>>   val streamerClass = args(4)
> >>>   val tasks = args(5).toInt
> >>> 
> >>>   // TODO log params here
> >>> 
> >>>   // start the application master helper
> >>>   val conf = new Configuration
> >>>   val applicationMasterHelper =
> >>>     new ApplicationMasterHelper(appId, attemptId, timestamp, conf)
> >>>       .registerWithResourceManager
> >>> 
> >>>   // start and manage the slaves
> >>>   val noReleases = List[ContainerId]()
> >>>   var runningContainers = 0
> >>> 
> >>>   // keep going forever
> >>>   while (true) {
> >>>     val nonRunningTasks = tasks - runningContainers
> >>>     val response = applicationMasterHelper.sendResourceRequest(nonRunningTasks, noReleases)
> >>> 
> >>>     response.getAllocatedContainers.foreach(container => {
> >>>       new ContainerExecutor(packagePath, container)
> >>>         .addCommand("java -Xmx256M -cp './package/*' kafka.yarn.StreamingTask " + streamerClass + " "
> >>>           + "1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout "
> >>>           + "2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr").execute(conf)
> >>>     })
> >>> 
> >>>     runningContainers += response.getAllocatedContainers.length
> >>>     runningContainers -= response.getCompletedContainersStatuses.length
> >>> 
> >>>     Thread.sleep(1000)
> >>>   }
> >>> 
> >>>   applicationMasterHelper.unregisterWithResourceManager("SUCCESS")
> >>> }
> >>> 
> >>> 
> >>> class ApplicationMasterHelper(iAppId: Int, iAppAttemptId: Int, lTimestamp: Long, conf: Configuration) {
> >>>   val rpc = YarnRPC.create(conf)
> >>>   val appId = Records.newRecord(classOf[ApplicationId])
> >>>   val appAttemptId = Records.newRecord(classOf[ApplicationAttemptId])
> >>>   val rmAddress = NetUtils.createSocketAddr(conf.get(YarnConfiguration.RM_SCHEDULER_ADDRESS,
> >>>     YarnConfiguration.DEFAULT_RM_SCHEDULER_ADDRESS))
> >>>   val resourceManager = rpc.getProxy(classOf[AMRMProtocol], rmAddress, conf).asInstanceOf[AMRMProtocol]
> >>>   var requestId = 0
> >>> 
> >>>   appId.setClusterTimestamp(lTimestamp)
> >>>   appId.setId(iAppId)
> >>>   appAttemptId.setApplicationId(appId)
> >>>   appAttemptId.setAttemptId(iAppAttemptId)
> >>> 
> >>>   def registerWithResourceManager(): ApplicationMasterHelper = {
> >>>     val req = Records.newRecord(classOf[RegisterApplicationMasterRequest])
> >>>     req.setApplicationAttemptId(appAttemptId)
> >>>     // TODO not sure why these are blank - this is how Spark does it
> >>>     req.setHost("")
> >>>     req.setRpcPort(1)
> >>>     req.setTrackingUrl("")
> >>>     resourceManager.registerApplicationMaster(req)
> >>>     this
> >>>   }
> >>> 
> >>>   def unregisterWithResourceManager(state: String): ApplicationMasterHelper = {
> >>>     val finReq = Records.newRecord(classOf[FinishApplicationMasterRequest])
> >>>     finReq.setAppAttemptId(appAttemptId)
> >>>     finReq.setFinalState(state)
> >>>     resourceManager.finishApplicationMaster(finReq)
> >>>     this
> >>>   }
> >>> 
> >>>   def sendResourceRequest(containers: Int, release: List[ContainerId]): AMResponse = {
> >>>     // TODO will need to make this more flexible for hostname requests, etc
> >>>     val request = Records.newRecord(classOf[ResourceRequest])
> >>>     val pri = Records.newRecord(classOf[Priority])
> >>>     val capability = Records.newRecord(classOf[Resource])
> >>>     val req = Records.newRecord(classOf[AllocateRequest])
> >>>     request.setHostName("*")
> >>>     request.setNumContainers(containers)
> >>>     pri.setPriority(1)
> >>>     request.setPriority(pri)
> >>>     capability.setMemory(128)
> >>>     request.setCapability(capability)
> >>>     req.setResponseId(requestId)
> >>>     req.setApplicationAttemptId(appAttemptId)
> >>>     req.addAllAsks(Lists.newArrayList(request))
> >>>     req.addAllReleases(release)
> >>>     requestId += 1
> >>>     // TODO we might want to return a list of container executors here instead of AMResponses
> >>>     resourceManager.allocate(req).getAMResponse
> >>>   }
> >>> }
> >>> 
> >>> 
> >>> ________________________________________
> >>> From: Vinod Kumar Vavilapalli [vinodkv@hortonworks.com]
> >>> Sent: Wednesday, September 21, 2011 10:08 AM
> >>> To: mapreduce-dev@hadoop.apache.org
> >>> Subject: Re: ApplicationMaster Memory Usage
> >>> 
> >>> Yes, the process dump clearly shows that this is MAPREDUCE-2998.
> >>> 
> >>> +Vinod
> >>> (With a smirk to see his container-memory-monitoring code in action)
> >>> 
> >>> 
> >>> On Wed, Sep 21, 2011 at 10:26 PM, Arun C Murthy <ac...@hortonworks.com>
> >>> wrote:
> >>> 
> >>>> I'll bet you are hitting MR-2998.
> >>>> 
> >>>> From the changelog:
> >>>> 
> >>>>   MAPREDUCE-2998. Fixed a bug in TaskAttemptImpl which caused it to fork
> >>>> bin/mapred too many times. Contributed by Vinod K V.
> >>>> 
> >>>> Arun
> >>>> 
> >>>> On Sep 21, 2011, at 9:52 AM, Chris Riccomini wrote:
> >>>> 
> >>>>> Hey Guys,
> >>>>> 
> >>>>> My ApplicationMaster is being killed by the NodeManager because of memory
> >>>>> consumption, and I don't understand why. I'm using -Xmx512M, and setting my
> >>>>> resource request to 2048.
> >>>>> 
> >>>>>   .addCommand("java -Xmx512M -cp './package/*' kafka.yarn.ApplicationMaster " ...
> >>>>> 
> >>>>>   ...
> >>>>> 
> >>>>>   private var memory = 2048
> >>>>> 
> >>>>>   resource.setMemory(memory)
> >>>>>   containerCtx.setResource(resource)
> >>>>>   containerCtx.setCommands(cmds.toList)
> >>>>>   containerCtx.setLocalResources(Collections.singletonMap("package", packageResource))
> >>>>>   appCtx.setApplicationId(appId)
> >>>>>   appCtx.setUser(user.getShortUserName)
> >>>>>   appCtx.setAMContainerSpec(containerCtx)
> >>>>>   request.setApplicationSubmissionContext(appCtx)
> >>>>>   applicationsManager.submitApplication(request)
> >>>>> 
> >>>>> When this runs, I see (in my NodeManager's logs):
> >>>>> 
> >>>>> 
> >>>>> 2011-09-21 09:35:19,112 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(402)) - Memory usage of ProcessTree 28134 for container-id container_1316559026783_0003_01_000001 : Virtual 2260938752 bytes, limit : 2147483648 bytes; Physical 71540736 bytes, limit -1 bytes
> >>>>> 2011-09-21 09:35:19,112 WARN  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:isProcessTreeOverLimit(289)) - Process tree for container: container_1316559026783_0003_01_000001 has processes older than 1 iteration running over the configured limit. Limit=2147483648, current usage = 2260938752
> >>>>> 2011-09-21 09:35:19,113 WARN  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(453)) - Container [pid=28134,containerID=container_1316559026783_0003_01_000001] is running beyond memory-limits. Current usage : 2260938752bytes. Limit : 2147483648bytes. Killing container.
> >>>>> Dump of the process-tree for container_1316559026783_0003_01_000001 :
> >>>>>      |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> >>>>>      |- 28134 25886 28134 28134 (bash) 0 0 108638208 303 /bin/bash -c java -Xmx512M -cp './package/*' kafka.yarn.ApplicationMaster 3 1 1316559026783 com.linkedin.TODO 1 1>/tmp/logs/application_1316559026783_0003/container_1316559026783_0003_01_000001/stdout 2>/tmp/logs/application_1316559026783_0003/container_1316559026783_0003_01_000001/stderr
> >>>>>      |- 28137 28134 28134 28134 (java) 92 3 2152300544 17163 java -Xmx512M -cp ./package/* kafka.yarn.ApplicationMaster 3 1 1316559026783 com.linkedin.TODO 1
> >>>>> 
> >>>>> 2011-09-21 09:35:19,113 INFO  monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(463)) - Removed ProcessTree with root 28134
> >>>>> 
> >>>>> It appears that YARN is honoring my 2048 MB resource request, yet my process
> >>>>> is somehow taking 2260938752 bytes of virtual memory. I don't think that I'm
> >>>>> using nearly that much in permgen, and my heap is limited to 512. I don't
> >>>>> have any JNI stuff running (that I know of), so it's unclear to me what's
> >>>>> going on here. The only thing that I can think of is that Java's Runtime
> >>>>> exec is forking, and copying its entire JVM memory footprint for the fork.
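> >>>>> 
> >>>>> If the fork theory is right, copy-on-write fork semantics mean the child
> >>>>> briefly mirrors the parent JVM's entire address space. A hypothetical
> >>>>> illustration, not my actual code:
> >>>>> 
> >>>>>   // fork+exec: the forked child momentarily reports the JVM's full virtual size
> >>>>>   Runtime.getRuntime.exec(Array("/bin/bash", "-c", "true"))
> >>>>> 
> >>>>> so the monitor could see roughly double the vmem for an instant, even though
> >>>>> physical usage barely moves.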
> >>>>> 
> >>>>> Has anyone seen this? Am I doing something dumb?
> >>>>> 
> >>>>> Thanks!
> >>>>> Chris
> >>>> 
> >>>> 
> >>> 
> > 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira