Posted to user@hadoop.apache.org by Reed Villanueva <rv...@ucera.org> on 2019/08/08 01:38:54 UTC

YARN job appears to have access to fewer resources than Ambari YARN manager reports

I am getting confused when trying to run a YARN process and hitting errors.
Looking in the Ambari UI YARN section, I see the following (screenshots:
https://i.stack.imgur.com/0Fohu.png and https://i.stack.imgur.com/nHpX3.png);
note that it says 60GB available. Yet, when trying to run a YARN process, I get
errors indicating that fewer resources are available than Ambari reports, see...

➜  h2o-3.26.0.2-hdp3.1 hadoop jar h2odriver.jar -nodes 4 -mapperXmx 5g -output /home/ml1/hdfsOutputDir
Determining driver host interface for mapper->driver callback...
    [Possible callback IP address: 192.168.122.1]
    [Possible callback IP address: 172.18.4.49]
    [Possible callback IP address: 127.0.0.1]
Using mapper->driver callback IP address and port: 172.18.4.49:46721
(You can override these with -driverif and -driverport/-driverportrange and/or specify external IP using -extdriverif.)
Memory Settings:
    mapreduce.map.java.opts:     -Xms5g -Xmx5g -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Dlog4j.defaultInitOverride=true
    Extra memory percent:        10
    mapreduce.map.memory.mb:     5632
Hive driver not present, not generating token.
19/08/07 12:37:19 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/08/07 12:37:19 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200
19/08/07 12:37:19 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for path: /user/ml1/.staging/job_1565057088651_0007
19/08/07 12:37:21 INFO mapreduce.JobSubmitter: number of splits:4
19/08/07 12:37:21 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1565057088651_0007
19/08/07 12:37:21 INFO mapreduce.JobSubmitter: Executing with tokens: []
19/08/07 12:37:21 INFO conf.Configuration: found resource resource-types.xml at file:/etc/hadoop/3.1.0.0-78/0/resource-types.xml
19/08/07 12:37:21 INFO impl.YarnClientImpl: Submitted application application_1565057088651_0007
19/08/07 12:37:21 INFO mapreduce.Job: The url to track the job: http://HW01.ucera.local:8088/proxy/application_1565057088651_0007/
Job name 'H2O_80092' submitted
JobTracker job ID is 'job_1565057088651_0007'
For YARN users, logs command is 'yarn logs -applicationId application_1565057088651_0007'
Waiting for H2O cluster to come up...
19/08/07 12:37:38 INFO client.RMProxy: Connecting to ResourceManager at hw01.ucera.local/172.18.4.46:8050
19/08/07 12:37:38 INFO client.AHSProxy: Connecting to Application History server at hw02.ucera.local/172.18.4.47:10200

----- YARN cluster metrics -----
Number of YARN worker nodes: 4

----- Nodes -----
Node: http://HW03.ucera.local:8042 Rack: /default-rack, RUNNING, 1 containers used, 5.0 / 15.0 GB used, 1 / 3 vcores used
Node: http://HW04.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://hw05.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used
Node: http://HW02.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used

----- Queues -----
Queue name:            default
    Queue state:       RUNNING
    Current capacity:  0.08
    Capacity:          1.00
    Maximum capacity:  1.00
    Application count: 1
    ----- Applications in this queue -----
    Application ID:                  application_1565057088651_0007 (H2O_80092)
        Started:                     ml1 (Wed Aug 07 12:37:21 HST 2019)
        Application state:           FINISHED
        Tracking URL:
http://HW01.ucera.local:8088/proxy/application_1565057088651_0007/
        Queue name:                  default
        Used/Reserved containers:    1 / 0
        Needed/Used/Reserved memory: 5.0 GB / 5.0 GB / 0.0 GB
        Needed/Used/Reserved vcores: 1 / 1 / 0
Queue 'default' approximate utilization: 5.0 / 60.0 GB used, 1 / 12 vcores used
----------------------------------------------------------------------

ERROR: Unable to start any H2O nodes; please contact your YARN administrator.

       A common cause for this is the requested container size (5.5 GB)
       exceeds the following YARN settings:

           yarn.nodemanager.resource.memory-mb
           yarn.scheduler.maximum-allocation-mb
----------------------------------------------------------------------
For YARN users, logs command is 'yarn logs -applicationId
application_1565057088651_0007'
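
For reference, the driver's suggested log command and a basic node listing (standard YARN CLI, with the application id copied from the output above) would look like:

    # save the full YARN application log for the failed attempt
    yarn logs -applicationId application_1565057088651_0007 > h2o_app_0007.log

    # list all NodeManagers and their state / running container counts
    yarn node -list -all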

Note the error at the end:

ERROR: Unable to start any H2O nodes; please contact your YARN
administrator.

A common cause for this is the requested container size (5.5 GB) exceeds
the following YARN settings:

  yarn.nodemanager.resource.memory-mb
  yarn.scheduler.maximum-allocation-mb

Yet, I have YARN configured with

yarn.scheduler.maximum-allocation-vcores=3
yarn.nodemanager.resource.cpu-vcores=3
yarn.nodemanager.resource.memory-mb=15GB
yarn.scheduler.maximum-allocation-mb=15GB
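
As far as I know, the values for these two memory properties in yarn-site.xml are plain integers in MB (so a 15GB limit would be stored as 15360; Ambari just displays it in GB). To double-check what the ResourceManager actually loaded, something like the following should work (RM host/port taken from the log above):

    # yarn-site.xml stores these limits in MB, e.g. for 15GB:
    #   yarn.nodemanager.resource.memory-mb  = 15360
    #   yarn.scheduler.maximum-allocation-mb = 15360

    # dump the effective configuration from the ResourceManager and find the two properties
    curl -s http://HW01.ucera.local:8088/conf | grep -iE -C2 'maximum-allocation-mb|resource\.memory-mb'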

and we can see that both the per-container and per-node resource limits are
higher than the requested container size.
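
To spell out the numbers (all taken from the driver output above):

    requested mapper heap:     -mapperXmx 5g           = 5120 MB
    extra memory percent:      10%                     -> 5120 MB * 1.10 = 5632 MB
    requested container size:  mapreduce.map.memory.mb = 5632 MB ~= 5.5 GB   (< 15GB per-node limit)
    total for 4 H2O nodes:     4 * 5.5 GB              = 22 GB               (< 60GB cluster total)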

So there are some things about this that I don't understand:

   1. Queue 'default' approximate utilization: 5.0 / 60.0 GB used, 1 / 12 vcores used

      I would like to use the full 60GB that YARN can ostensibly provide (or at
      least have the option to, rather than having errors thrown). I would think
      that there should be enough resources for each of the 4 nodes to give 15GB
      (more than the requested 4 x 5.5GB = 22GB) to the process. Am I missing
      something here? Note that I only have the default root queue set up for YARN.
   2. ----- Nodes -----

      Node: http://HW03.ucera.local:8042 Rack: /default-rack, RUNNING, 1 containers used, 5.0 / 15.0 GB used, 1 / 3 vcores used

      Node: http://HW04.ucera.local:8042 Rack: /default-rack, RUNNING, 0 containers used, 0.0 / 15.0 GB used, 0 / 3 vcores used

      ....

      Why is only a single node being used before erroring out?

From these two things, it seems that neither the 15GB per-node limit nor the
60GB cluster limit is being exceeded, so why are these errors being thrown?
What am I misinterpreting about this situation? And what can be done to fix it
(again, I would like to be able to use all of the apparent 60GB of YARN
resources for the job without error)? Any debugging suggestions or fixes?
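
If it helps with debugging, the ResourceManager's own view of cluster and queue capacity can also be pulled from its standard REST endpoints (host taken from the log above), e.g.:

    # cluster-wide totals (totalMB, availableMB, allocatedMB, ...)
    curl -s http://HW01.ucera.local:8088/ws/v1/cluster/metrics

    # per-queue view from the scheduler
    curl -s http://HW01.ucera.local:8088/ws/v1/cluster/scheduler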
